7/23/2019 Sphinx-4 Application Programmer's Guide - CMUSphinx Wiki
1/15
4/1/12 Sphinx-4 Application Programmer's Guide - CMUSphinx Wiki
1/15cmusphinx.sourceforge.net/wiki/tutorialsphinx4
CMUSphinx
Open Source Toolkit For Speech Recognition
Project by Carnegie Mellon University
Sphinx-4 Application Programmer's Guide
This tutorial shows you how to write Sphinx-4 applications. We will use the HelloWorld demo as an example to show how a simple application can be written. We will then proceed to a more complex example.
Simple Example - HelloWorld
We will look at a very simple Sphinx-4 speech application, the HelloWorld demo. This application recognizes a very restricted type of speech: greetings. As you will see, the code is very simple. The harder part is understanding the configuration, but we will guide you through every step of it. Let's look at the code first.
Code Walk - HelloWorld.java
All the source code of the HelloWorld demo is in one short file
sphinx4/src/apps/edu/cmu/sphinx/demo/helloworld/HelloWorld.java:
package edu.cmu.sphinx.demo.helloworld;

import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

/**
 * A simple HelloWorld demo showing a simple speech application built using Sphinx-4. This application uses the Sphinx-4
 * endpointer, which automatically segments incoming audio into utterances and silences.
 */
public class HelloWorld {

    public static void main(String[] args) {
        // use the configuration file given on the command line, or the bundled default
        ConfigurationManager cm;
        if (args.length > 0) {
            cm = new ConfigurationManager(args[0]);
        } else {
            cm = new ConfigurationManager(HelloWorld.class.getResource("helloworld.config.xml"));
        }

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // start the microphone, or exit the program if this is not possible
        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (!microphone.startRecording()) {
            System.out.println("Cannot start microphone.");
            recognizer.deallocate();
            System.exit(1);
        }

        System.out.println("Say: (Good morning | Hello) ( Bhiksha | Evandro | Paul | Philip | Rita | Will )");

        // loop the recognition until the program exits
        while (true) {
            System.out.println("Start speaking. Press Ctrl-C to quit.\n");

            Result result = recognizer.recognize();

            if (result != null) {
                String resultText = result.getBestFinalResultNoFiller();
                System.out.println("You said: " + resultText + '\n');
            } else {
                System.out.println("I can't hear what you said.\n");
            }
        }
    }
}
This demo imports several important classes in Sphinx-4:
edu.cmu.sphinx.recognizer.Recognizer [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/recognizer/Recognizer.html]
edu.cmu.sphinx.result.Result [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/result/Result.html]
edu.cmu.sphinx.util.props.ConfigurationManager [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/util/props/ConfigurationManager.html]
The Recognizer is the main class any application should interact with. The Result is returned by the Recognizer to the application after recognition completes. The ConfigurationManager creates the entire Sphinx-4 system according to the configuration specified by the user.
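Let's look at the main() method. The first few lines determine the location of the XML-based configuration file: either the path given as the first command-line argument or, by default, the URL of the helloworld.config.xml resource.
A ConfigurationManager is then created using that location.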
The ConfigurationManager then reads in the file internally. Since the configuration file specifies the components recognizer and microphone (we will look at the configuration file next), we perform a lookup() [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/util/props/ConfigurationManager.html#looku] in the ConfigurationManager to obtain these components.
The allocate() [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/recognizer/Recognizer.html#allocate()] method of the Recognizer is then called to allocate the resources needed for the recognizer.
The Microphone class is used for capturing live audio from the system audio device. Both the Recognizer and the Microphone are configured as specified in the configuration file.
Once all the necessary components are created, we can start running the demo. The program first turns on the Microphone (microphone.startRecording() [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/frontend/util/Microphone.html#startRecordi]).
After the microphone is turned on successfully, the program enters a loop that repeats the following: it tries to recognize what the user is saying, using the Recognizer.recognize() [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/recognizer/Recognizer.html#recognize()] method.
Recognition stops when the user stops speaking, which is detected by the endpointer built into the front end by configuration.
Once an utterance is recognized, the recognized text, which is returned by the method Result.getBestResultNoFiller() [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/result/Result.html#getBestResultNoFiller()], is printed out. If the Recognizer recognized nothing (i.e., result is null), then it will print out a message saying that.
Finally, if the demo program cannot turn on the microphone in the first place, the Recognizer will be deallocated, and the program exits. It is generally good practice to call the method deallocate() [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/recognizer/Recognizer.html#deallocate()] after the work is done to release all the resources.
Note that several exceptions are thrown. These exceptions should be caught and handled appropriately.
Hopefully, by this point, you will have some idea of how to write a simple Sphinx-4 application. We will now turn to the harder part, understanding the various components necessary to create a grammar-based recognizer. These components are specified in the configuration file, which we will now explain in depth.
Configuration File Walk - helloworld.config.xml
In this section, we will explain the various Sphinx-4 components that are used for the HelloWorld demo, as specified in the configuration file. We will look at each section of the config file in depth. If you want to learn about the format of these configuration files, please refer to the document Sphinx-4 Configuration Management [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/util/props/doc-files/ConfigurationManagement.html].
The lines below define the frequently tuned properties. They are located at the top of the configuration file so that they can be edited quickly.
Recognizer
The lines below define the recognizer component that performs speech recognition. They define the name and class of the recognizer, Recognizer. This is the class that any application should interact with. If you look at the javadoc of the Recognizer class, you will see that it has two properties, 'decoder' and 'monitors'. This configuration file is where the values of these properties are defined.
Monitors listed in the configuration: accuracyTracker, speedTracker, memoryTracker
We will explain the monitors later. For now, let's look at the decoder.
Decoder
The 'decoder' property of the recognizer is set to the component called 'decoder', which is defined as:
The decoder component is of class edu.cmu.sphinx.decoder.Decoder [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/Decoder.html]. Its property 'searchManager' is set to the component 'searchManager', defined as:
The searchManager is of class edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/search/SimpleBreadthFirstSearchMan]. This class performs a simple breadth-first search through the search graph during the decoding process to find the best path. This search manager is suitable for small to medium sized vocabulary decoding.
The logMath property is the log math that is used for calculation of scores during the search process. It is defined as having the log base of 1.0001. Note that typically the same log base should be used throughout all components, and therefore there should only be one logMath definition in a configuration file:
The linguist of the searchManager is set to the component 'flatLinguist' (which we will look at later), which again is suitable for small to medium sized vocabulary decoding. The pruner is set to the 'trivialPruner':
which is of class edu.cmu.sphinx.decoder.pruner.SimplePruner [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/pruner/SimplePruner.html]. This pruner performs simple absolute beam and relative beam pruning based on the scores of the tokens.
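To make the two kinds of beam concrete, here is a minimal stand-alone sketch. It is not the SimplePruner source: the class name BeamPruningSketch, the method prune, and the convention that the relative beam is a margin subtracted from the best log score are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: not the Sphinx-4 SimplePruner implementation.
class BeamPruningSketch {

    /**
     * Keeps at most absoluteBeamWidth scores (absolute beam) and, among those,
     * only scores no more than relativeBeam below the best score (relative beam).
     * Scores are log-domain values, so higher means better.
     */
    static List<Double> prune(List<Double> logScores, int absoluteBeamWidth, double relativeBeam) {
        List<Double> sorted = new ArrayList<>(logScores);
        sorted.sort(Comparator.reverseOrder());            // best score first
        List<Double> kept = new ArrayList<>();
        if (sorted.isEmpty()) {
            return kept;
        }
        double threshold = sorted.get(0) - relativeBeam;   // relative cut-off
        for (double score : sorted) {
            if (kept.size() >= absoluteBeamWidth || score < threshold) {
                break;                                     // everything after this is worse
            }
            kept.add(score);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Double> scores = Arrays.asList(-10.0, -12.0, -50.0, -11.0, -13.0);
        System.out.println(prune(scores, 3, 5.0));         // prints [-10.0, -11.0, -12.0]
    }
}
```

A wider beam keeps more tokens alive, which tends to trade speed for accuracy; a narrower beam does the opposite.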
The scorer of the searchManager is set to the component 'threadedScorer', which is of class edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/scorer/ThreadedAcousticScorer.html]. It can use multiple threads (usually one per CPU) to score the tokens in the active list. Scoring is one of the most time-consuming steps of the decoding process. Tokens can be scored independently of each other, so using multiple CPUs can speed up decoding. The threadedScorer is defined as follows:
The 'frontend' property is the front end from which features are obtained. For details about the other properties of the threadedScorer, please refer to the javadoc for ThreadedAcousticScorer [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/scorer/ThreadedAcousticScorer.html].
Finally, the activeListFactory property of the searchManager is set to the component 'activeList', which is defined as follows:
Linguist
Now let's look at the flatLinguist component (a component inside the searchManager). The linguist is the component that generates the search graph using the guidance from the grammar, and knowledge from the dictionary, acoustic model, and language model.
It also uses the logMath that we've seen already. The grammar used is the component called 'jsgfGrammar', which is a BNF-style grammar:
JSGF grammars are defined in JSAPI [http://java.sun.com/products/java-media/speech/]. The class that translates JSGF into a form that Sphinx-4 understands is edu.cmu.sphinx.jsapi.JSGFGrammar [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/jsgf/JSGFGrammar.html]. (Note that this link to the javadoc also describes the limitations of the current implementation.)
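The prompt printed by HelloWorld, (Good morning | Hello) ( Bhiksha | Evandro | Paul | Philip | Rita | Will ), gives a feel for how small the space of accepted sentences is. The sketch below simply enumerates that space by hand. It does not parse JSGF and does not use JSGFGrammar; the class name GreetingGrammarSketch is hypothetical, and the real hello.gram may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: expands a two-slot "greeting name" pattern by hand.
class GreetingGrammarSketch {

    // Alternatives taken from the prompt string printed by the demo.
    static final String[] GREETINGS = {"Good morning", "Hello"};
    static final String[] NAMES = {"Bhiksha", "Evandro", "Paul", "Philip", "Rita", "Will"};

    /** Returns every sentence formed by one greeting followed by one name. */
    static List<String> sentences() {
        List<String> result = new ArrayList<>();
        for (String greeting : GREETINGS) {
            for (String name : NAMES) {
                result.add(greeting + " " + name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> all = sentences();
        // prints: 12 sentences, e.g. "Good morning Bhiksha"
        System.out.println(all.size() + " sentences, e.g. \"" + all.get(0) + "\"");
    }
}
```

With only twelve possible sentences, the search graph stays tiny, which is why a flat linguist and a simple search manager are sufficient here.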
The property 'grammarLocation' can take two kinds of values. If it is a URL, it specifies the URL of the directory where JSGF grammar files are to be found. Otherwise, it is interpreted as a resource locator. In our example, the HelloWorld demo is being deployed as a JAR file. The 'grammarLocation' property is therefore used to specify the location of the resource hello.gram [http://cmusphinx.sourceforge.net/sphinx4/src/apps/edu/cmu/sphinx/demo/helloworld/hello.gram] within the JAR file. Note that it is not necessary to specify the JAR file within which to search.
The 'grammarName' property specifies the grammar to use when creating the search graph.
'logMath' is the same log math as the other components.
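As an illustration of why a shared log base matters, the following sketch converts probabilities to and from base 1.0001 using only java.lang.Math. It is a stand-alone helper, not the Sphinx-4 LogMath class; the names LogBaseSketch, linearToLog and logToLinear are chosen here for illustration.

```java
// Illustrative only: a stand-alone helper, not the Sphinx-4 LogMath class.
class LogBaseSketch {

    static final double LOG_BASE = 1.0001;             // base quoted in the text
    static final double LN_BASE = Math.log(LOG_BASE);  // natural log of the base

    /** Converts a linear probability to its logarithm in base 1.0001. */
    static double linearToLog(double probability) {
        return Math.log(probability) / LN_BASE;
    }

    /** Converts a base-1.0001 log value back to a linear probability. */
    static double logToLinear(double logValue) {
        return Math.exp(logValue * LN_BASE);
    }

    public static void main(String[] args) {
        double a = linearToLog(0.5);
        double b = linearToLog(0.25);
        // Multiplying probabilities becomes adding log values.
        System.out.println(logToLinear(a + b));         // approximately 0.125
    }
}
```

Adding two log values only makes sense when both were produced with the same base, which is the reason for keeping a single logMath definition.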
The 'dictionary' is the component that maps words to their phonemes. It is almost always the dictionary of the acoustic model, which lists all the words that were used to train the acoustic model:
The locations of these dictionary files are specified using the Sphinx-4 resource mechanism. The dictionary for filler words like BREATH and LIP_SMACK is the file fillerdict.
For details about the other possible properties, please refer to the javadoc for FastDictionary [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/linguist/dictionary/FastDictionary.html].
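The word-to-phoneme mapping can be pictured with a small parser. This sketch assumes the usual plain-text layout of one entry per line, the word first and its phonemes after it, separated by whitespace. The sample lines are made up for illustration rather than copied from the dictionary file, and DictionarySketch is not the FastDictionary implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not the Sphinx-4 FastDictionary implementation.
class DictionarySketch {

    /** Parses lines of the assumed form "WORD PH1 PH2 ..." into a word-to-phonemes map. */
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;                               // skip blank lines
            }
            String[] fields = trimmed.split("\\s+");
            entries.put(fields[0], Arrays.asList(fields).subList(1, fields.length));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical sample entries.
        List<String> lines = Arrays.asList("HELLO HH AH L OW", "WILL W IH L");
        System.out.println(parse(lines).get("HELLO"));  // prints [HH, AH, L, OW]
    }
}
```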
Acoustic Model
The next important property of the flatLinguist is the acoustic model which describes sounds of the language. It is defined as:
'wsj' stands for the Wall Street Journal acoustic models.
Sphinx-4 can load acoustic models trained by Sphinxtrain. Common models are packed into JAR files during the build and placed in the lib folder. The Sphinx3Loader class [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/linguist/acoustic/tiedstate/Sphinx3Loader.html] is used to load them. The JAR needs to be included in the classpath.
The JAR file for the WSJ models is called WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar, and is in the sphinx4/lib
directory.
As a programmer, all you need to do is to specify the class of the AcousticModel and the loader of the AcousticModel, as shown above (note that if you are using the WSJ model in other applications, these lines should be the same, except that you might have called your 'logMath' component something else). The acoustic model could also be located in the filesystem or in any other resource; in that case, you need to specify the model location in the 'location' property.
The next properties of the flatLinguist are the 'wordInsertionProbability' and 'languageWeight'. These properties are usually for fine-tuning the system. Below are the default values we used for the various tasks. You can tune your system accordingly:
Vocabulary Size                 Word Insertion Probability   Language Weight
Digits (11 words - TIDIGITS)    1E-36                        8
Small (80 words - AN4)          1E-26                        7
Medium (1000 words - RM1)       1E-10                        7
Large (64000 words - HUB4)      0.2                          10.5
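To see how these two knobs act on a hypothesis, the sketch below uses a common formulation in which the language model log probability is scaled by the language weight and the log of the word insertion probability is added once per word. This formulation is an assumption stated for illustration; the class ScoreTuningSketch and its method are hypothetical and do not reproduce the exact Sphinx-4 computation.

```java
// Illustrative only: a common way of combining scores, assumed here for illustration.
class ScoreTuningSketch {

    /**
     * Combines the scores for one word, all in the natural-log domain:
     * acoustic score + languageWeight * language score + log(wordInsertionProbability).
     */
    static double wordScore(double acousticLog, double languageLog,
                            double languageWeight, double wordInsertionProbability) {
        return acousticLog
                + languageWeight * languageLog
                + Math.log(wordInsertionProbability);
    }

    public static void main(String[] args) {
        double acousticLog = -100.0;                    // made-up acoustic score
        double languageLog = Math.log(0.1);             // made-up language model probability
        // Values from the "Small" and "Large" rows of the table above.
        double small = wordScore(acousticLog, languageLog, 7.0, 1E-26);
        double large = wordScore(acousticLog, languageLog, 10.5, 0.2);
        System.out.println("small-vocabulary settings: " + small);
        System.out.println("large-vocabulary settings: " + large);
    }
}
```

Under this formulation a tiny word insertion probability penalizes every additional word heavily, while a larger language weight makes the recognizer trust the grammar or language model more than the acoustics.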
Front End
The last big piece in the configuration file is the front end.
There are two different front ends listed in the configuration file: 'frontend' and 'epFrontEnd'. The 'frontend' is good for batch mode decoding (or decoding without endpointing), while 'epFrontEnd' is good for live mode decoding with endpointing.
Note that you can also perform live mode decoding with the 'frontend' (i.e., without endpointing), but that you need to explicitly signal the start and end of speech (e.g., by asking the user to explicitly turn on/off the microphone). The definitions for these front ends are:
frontend pipeline: microphone, premphasizer, windower, fft, melFilterBank, dct, liveCMN, featureExtraction
epFrontEnd pipeline: microphone, speechClassifier, speechMarker, nonSpeechDataFilter, premphasizer, windower, fft, melFilterBank, dct, liveCMN, featureExtraction
As you might notice, the only difference between these two front ends is that the live front end (epFrontEnd) has the additional components speechClassifier, speechMarker and nonSpeechDataFilter. These three components make up the default endpointer of Sphinx-4.
Below is a listing of all the components of both front ends, and those properties which have values different from the default:
Let's explain some of the properties set here that have values different from the default.
The property 'threshold' [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/frontend/endpoint/SpeechClassifier.html#PROP_THRESHOLD] of the SpeechClassifier specifies the minimum difference between the input signal level and the background signal level required for the input signal to be classified as speech. Therefore, the smaller this number, the more sensitive the endpointer, and vice versa.
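The effect of the threshold can be sketched in a few lines. This is a simplified reading of the description above, not the SpeechClassifier source: the class SpeechThresholdSketch, the method isSpeech, and the level values are hypothetical, and the real classifier also estimates the background level itself over time.

```java
// Illustrative only: a simplified reading of the 'threshold' property.
class SpeechThresholdSketch {

    /** A frame counts as speech when it exceeds the background level by at least threshold. */
    static boolean isSpeech(double signalLevel, double backgroundLevel, double threshold) {
        return signalLevel - backgroundLevel >= threshold;
    }

    public static void main(String[] args) {
        double background = 40.0;                       // made-up background level
        double frame = 52.0;                            // made-up frame level
        System.out.println(isSpeech(frame, background, 10.0));  // prints true
        System.out.println(isSpeech(frame, background, 13.0));  // prints false
    }
}
```

The same frame is accepted with the smaller threshold and rejected with the larger one, which is the sense in which a smaller value makes the endpointer more sensitive.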
The property 'speechTrailer' [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/frontend/endpoint/SpeechMarker.html#PROP_SPEECH_TRAILER] of the SpeechMarker specifies the length of non-speech signal to be included after the end of speech to make sure that no speech signal is lost. Here, it is set at 50 milliseconds.
The property 'msecPerRead' [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/frontend/util/Microphone.html#PROP_MSEC_PER_READ] of the Microphone specifies the number of milliseconds of data to read at a time from the system audio device. The value specified here is 10 ms.
The property 'closeBetweenUtterances' [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/frontend/util/Microphone.html#PROP_CLOSE_BETWEEN_UTTERANCES] specifies whether the system audio device should be released between utterances. It is set to false here, meaning that the system audio device will not be released between utterances. It is set this way because on certain systems (Linux for one), closing and reopening the audio device does not work well.
Instrumentation
Finally, we will explain the various monitors which make up the instrumentation package. These monitors are components of the
recognizer (see above). They are responsible for tracking the accuracy, speed and memory usage of Sphinx-4.
The various knobs of these monitors mainly control whether statistical information about accuracy, speed and memory usage should be printed out. Moreover, the monitors monitor the behavior of a recognizer, so they need a reference to the recognizer that they are monitoring.
More Complex Example - HelloNGram
HelloWorld uses a very small vocabulary and a guided grammar. What if you want to use a larger vocabulary, and there is no guided grammar for your application? One way to do it would be to use what is known as a language model, which describes the probability of occurrence of a series of words. The HelloNGram demo shows you how to do this with Sphinx-4.
Code Walk - HelloNGram.java
The source code for the HelloNGram demo is exactly the same as that of the HelloWorld demo, except for the name of the demo class. The demo runs exactly the same way: it keeps listening and recognizing what you say, and when it detects the end of an utterance, it shows the recognition result.
N-Gram Language Model
Sphinx-4 supports the n-gram language model (both ASCII and binary versions) generated by the Carnegie Mellon University Statistical Language Modeling toolkit.
The input file is a long list of sample utterances. Using the occurrence of words and sequences of words in this input file, a language model can be trained. The resulting trigram language model file is hellongram.trigram.lm.
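The idea of training from word sequences can be sketched as plain counting. The example below estimates a trigram probability as count(w1 w2 w3) / count(w1 w2) from a few made-up utterances. This is an unsmoothed maximum-likelihood estimate chosen for brevity; the CMU toolkit additionally applies discounting and back-off, which the sketch omits, and the class TrigramCountSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: unsmoothed counting, not the CMU SLM toolkit algorithm.
class TrigramCountSketch {

    /** Counts every run of n consecutive words in the utterances. */
    static Map<String, Integer> countNGrams(List<String> utterances, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String utterance : utterances) {
            String[] words = utterance.trim().split("\\s+");
            for (int i = 0; i + n <= words.length; i++) {
                String key = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Estimates P(w3 | w1 w2) as count(w1 w2 w3) / count(w1 w2), or 0 if the history is unseen. */
    static double trigramProbability(List<String> utterances, String w1, String w2, String w3) {
        int history = countNGrams(utterances, 2).getOrDefault(w1 + " " + w2, 0);
        int trigram = countNGrams(utterances, 3).getOrDefault(w1 + " " + w2 + " " + w3, 0);
        return history == 0 ? 0.0 : (double) trigram / history;
    }

    public static void main(String[] args) {
        // Made-up sample utterances.
        List<String> samples = Arrays.asList("good morning will", "good morning rita", "hello will");
        System.out.println(trigramProbability(samples, "good", "morning", "will"));  // prints 0.5
    }
}
```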
Configuration File Walk - hellongram.config.xml
In this section, we will explain the various Sphinx-4 components that are used for the HelloNGram demo, as specified in the configuration file. We will look at each section of the config file in depth. If you want to learn about the format of these configuration files, please refer to the document Sphinx-4 Configuration Management [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/util/props/doc-files/ConfigurationManagement.html].
The above lines define the frequently tuned properties. They are located at the top of the configuration file so that they can be edited quickly.
Recognizer
Monitors listed in the configuration: accuracyTracker, speedTracker, memoryTracker, recognizerMonitor
The above lines define the recognizer component that performs speech recognition. They define the name and class of the recognizer. This is the class that any application should interact with.
If you look at the javadoc of the Recognizer class [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/recognizer/Recognizer.html], you will see that it has two properties, 'decoder' and 'monitors'. This configuration file is where the values of these properties are defined.
Decoder
The 'decoder' property of the recognizer is set to the component called 'decoder':
The decoder component is defined to be of class edu.cmu.sphinx.decoder.Decoder [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/Decoder.html]. Its property 'searchManager' is set to the component 'wordPruningSearchManager':
The searchManager is of class edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/search/WordPruningBreadthFirstSearchManager.html]. It is better than the SimpleBreadthFirstSearchManager for larger vocabulary recognition. This class also performs a simple breadth-first search through the search graph, but at each frame it also prunes the different types of states separately.
The logMath property is the log math that is used for calculation of scores during the search process. It is defined as having the log base of 1.0001. Note that typically the same log base should be used throughout all components, and therefore there should only be one logMath definition:
The linguist of the searchManager is set to the component 'lexTreeLinguist' (which we will look at later), which again is suitable for large vocabulary recognition. The pruner is set to the 'trivialPruner':
which is of class edu.cmu.sphinx.decoder.pruner.SimplePruner [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/pruner/SimplePruner.html]. This pruner performs simple absolute beam and relative beam pruning based on the scores of the tokens.
The scorer of the searchManager is set to the component 'threadedScorer', which is of class edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/scorer/ThreadedAcousticScorer.html]. It can use multiple threads (usually one per CPU) to score the tokens in the active list. Scoring is one of the most time-consuming steps of the decoding process. Tokens can be scored independently of each other, so using multiple CPUs can speed up decoding.
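Because each token is scored independently, the work splits naturally across a thread pool. The sketch below shows that pattern with java.util.concurrent. It is not the ThreadedAcousticScorer source: the class ParallelScoringSketch is hypothetical and the score function is a made-up stand-in for a real acoustic score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: not the Sphinx-4 ThreadedAcousticScorer implementation.
class ParallelScoringSketch {

    /** Made-up stand-in for an acoustic score; each call is independent of the others. */
    static double score(double feature) {
        return -(feature * feature);
    }

    /** Scores all features on a fixed-size thread pool, preserving input order. */
    static List<Double> scoreAll(List<Double> features, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (final Double feature : features) {
                Callable<Double> task = () -> score(feature);
                futures.add(pool.submit(task));
            }
            List<Double> scores = new ArrayList<>();
            for (Future<Double> future : futures) {
                scores.add(future.get());               // waits for that task to finish
            }
            return scores;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("scoring failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();   // one per CPU
        System.out.println(scoreAll(Arrays.asList(1.0, 2.0, 3.0), threads));
    }
}
```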
The threadedScorer is defined as follows:
The 'frontend' property is the front end from which features are obtained.
For details about the other properties of the threadedScorer, please refer to the javadoc for ThreadedAcousticScorer [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/scorer/ThreadedAcousticScorer.html].
Finally, the 'activeListManager' property of the wordPruningSearchManager is set to the component 'activeListManager', which is defined as follows:
Active list factories, in order: standardActiveListFactory, wordActiveListFactory, wordActiveListFactory, standardActiveListFactory, standardActiveListFactory, standardActiveListFactory
The SimpleActiveListManager is of class edu.cmu.sphinx.decoder.search.SimpleActiveListManager [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/search/SimpleActiveListManager.html].
Since the word-pruning search manager performs pruning on different search state types separately, we need a different active list for each state type. Therefore, you see different active list factories being listed in the SimpleActiveListManager, one for each type. So how do we know which active list factory is for which state type? It depends on the 'search order' as returned by the search graph (which in this case is generated by the LexTreeLinguist).
The search state order and active list factory used here are:
State Type                    ActiveListFactory
LexTreeNonEmittingHMMState    standardActiveListFactory
LexTreeWordState              wordActiveListFactory
LexTreeEndWordState           wordActiveListFactory
LexTreeEndUnitState           standardActiveListFactory
LexTreeUnitState              standardActiveListFactory
LexTreeHMMState               standardActiveListFactory
There are two types of active list factories used here, the standard and the word. If you look at the 'frequently tuned properties' above, you will find that the word active list has a much smaller beam size than the standard active list.
The beam size for the word active list is set by 'absoluteWordBeamWidth' and 'relativeWordBeamWidth', while the beam size for the standard active list is set by 'absoluteBeamWidth' and 'relativeBeamWidth'.
The SimpleActiveListManager allows us to control the beam size of different types of states.
Linguist
Let's look at the 'lexTreeLinguist' (a component inside the wordPruningSearchManager). The linguist is the component that generates the search graph using the guidance from the grammar, and knowledge from the dictionary, acoustic model, and language model.
For details about the LexTreeLinguist, please refer to the Javadocs of the LexTreeLinguist [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/linguist/lextree/LexTreeLinguist.html].
In general, the LexTreeLinguist is the one to use for large vocabulary speech recognition, and the FlatLinguist is the one to use for small vocabulary speech recognition.
The LexTreeLinguist has a lot of properties that can be set, but the ones that must be set are the 'logMath', the 'acousticModel', the 'languageModel', and the 'dictionary'. These properties are the necessary sources of information for the LexTreeLinguist to build the search graph. The rest of the properties are for controlling the speed and accuracy performance of the linguist, and you can read more about them in the Javadocs of the LexTreeLinguist [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/linguist/lextree/LexTreeLinguist.html].
Acoustic Model
The 'acousticModel' is where the LexTreeLinguist obtains the HMM for the words or units. For the HelloNGram demo it's the same wsj model as for HelloDigits:
Language Model
The 'languageModel' component of the lexTreeLinguist is called the 'trigramModel', because it is a trigram language model. It is defined as
follows:
The language model is generated by the CMU Statistical Language Modeling Toolkit. It is in text format, which can be loaded by the SimpleNGramModel [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/linguist/language/ngram/SimpleNGramModel.html] class.
For this class, you also need to specify the dictionary that you are using, which is the same as the one used by the lexTreeLinguist.
Same for 'logMath' (note that the same logMath component should be used throughout the system).
The 'maxDepth' property is 3, since this is a trigram language model.
The 'unigramWeight' should normally be set to 0.7.
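One common interpretation of a unigram weight, assumed here for illustration, is a linear interpolation between the unigram probability from the model and a uniform distribution over the vocabulary. The sketch below shows that arithmetic only; the class UnigramWeightSketch is hypothetical, and the SimpleNGramModel javadoc remains the authority on what the property actually does.

```java
// Illustrative only: assumes the unigram weight interpolates with a uniform distribution.
class UnigramWeightSketch {

    /** Returns unigramWeight * P(word) + (1 - unigramWeight) * (1 / vocabularySize). */
    static double smooth(double unigramProbability, double unigramWeight, int vocabularySize) {
        return unigramWeight * unigramProbability
                + (1.0 - unigramWeight) / vocabularySize;
    }

    public static void main(String[] args) {
        // Made-up numbers: a word with model probability 0.5 in a 10-word vocabulary.
        System.out.println(smooth(0.5, 0.7, 10));       // approximately 0.38
    }
}
```

Under this reading, a weight of 1.0 leaves the model probabilities untouched, and smaller weights move rare and frequent words closer together.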
Dictionary
The last important component of the LexTreeLinguist is the 'dictionary', which is defined as follows:
As you might realize, it is using the dictionary inside the JAR file of the Wall Street Journal acoustic model. The main dictionary for words is the WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d file inside the JAR file, and the dictionary for filler words like BREATH and LIP_SMACK is WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict. You can inspect the contents of a JAR file by (assuming your JAR file is called myJar.jar)
jar tvf myJar.jar
You can see the contents of the WSJ JAR file by:
sphinx4> jar tvf lib/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar
0 Fri Feb 12 15:01:22 MSK 2010 META-INF/
106 Fri Feb 12 15:01:20 MSK 2010 META-INF/MANIFEST.MF
0 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/
0 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/cd_continuous_8gau/
0 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/
0 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/etc/
1492 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/README
5175518 Fri Feb 12 15:01:18 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/cd_continuous_8gau/means
132762 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/cd_continuous_8gau/mixture_weights
2410 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/cd_continuous_8gau/transition_matrices
5175518 Fri Feb 12 15:01:18 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/cd_continuous_8gau/variances
354 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/alpha.dict
4718935 Fri Feb 12 15:01:16 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d
373 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/digits.dict
204 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict
5654967 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.4000.mdef
2641 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.ci.mdef
375 Fri Feb 12 15:01:22 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/etc/variables.def
1797 Fri Feb 12 15:01:16 MSK 2010 WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/license.terms
The locations of the dictionary files within the JAR file are specified using the Sphinx-4 resource mechanism. In short, this mechanism searches all JAR files for the specified path to the resource. The general syntax is:
resource:/{location in the JAR file of the desired resource}
Take the 'dictionaryPath' property, for example. The location in the JAR file of the desired resource is WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d. This gives the string: resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d.
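The string construction above, and a lookup in the same spirit, can be sketched with the standard class loader. This is not the Sphinx-4 resource mechanism itself: ResourceLocatorSketch and its methods are hypothetical, and using ClassLoader.getSystemResource to search the class path is an assumption made for illustration.

```java
import java.net.URL;

// Illustrative only: mimics the "resource:/" convention with the standard class loader.
class ResourceLocatorSketch {

    static final String PREFIX = "resource:/";

    /** Builds the resource string from a location inside a JAR file. */
    static String toResourceString(String locationInJar) {
        return PREFIX + locationInJar;
    }

    /** Strips the prefix again, leaving the name to look up on the class path. */
    static String toClassPathName(String resourceString) {
        if (!resourceString.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a resource string: " + resourceString);
        }
        return resourceString.substring(PREFIX.length());
    }

    /** Searches the class path for the resource; returns null if it is not found. */
    static URL locate(String resourceString) {
        return ClassLoader.getSystemResource(toClassPathName(resourceString));
    }

    public static void main(String[] args) {
        String resource = toResourceString("WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d");
        System.out.println(resource);
        // Prints a jar: URL when the WSJ JAR is on the class path, otherwise null.
        System.out.println(locate(resource));
    }
}
```

Note that the JAR file itself is never named; whichever class path entry contains the path supplies the resource.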
For details about the other properties, please refer to the javadoc for FastDictionary [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/linguist/dictionary/FastDictionary.html].
The rest of the configuration file, which includes the front end configuration and the configuration of the monitors, is the same as in the HelloWorld demo. Therefore, please refer to those sections for explanations. This concludes the walk-through of the simple HelloNGram example.
Two ways of configuring Sphinx4
There are two options for configuring Sphinx4. Both methods work in many types of applications, and the choice is really just a question of taste.
Configuration Management
In ConfigurationManagement the configuration is described by an XML file which is interpreted when the application initializes. ConfigurationManagement is described in detail in Sphinx-4 Configuration Management [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/util/props/doc-files/ConfigurationManagement.html].
ConfigurationManagement offers the advantage of keeping the configuration and the code separate. With this choice one can alter the configuration without touching the application code. Here is an example of the front end configuration described by XML.
Front end pipeline components, in order: audioFileDataSource, dataBlocker, speechClassifier, speechMarker, nonSpeechDataFilter, preemphasizer, windower, fft, melFilterBank, dct, liveCMN, featureExtraction
Raw Configuration
The other configuration option is to call the constructors directly. This is referred to as raw configuration.
Raw configuration is useful when the configuration is not easily described by a static XML structure. This occurs in applications that require extremely complex or dynamic configuration.
Raw configuration is also preferred when writing scripts. In this case it is not desirable to separate the configuration and the code.
Here is an example of raw configuration:
protected void initFrontEnd() {
    this.dataBlocker = new DataBlocker(
            10 // blockSizeMs
    );
    this.speechClassifier = new SpeechClassifier(
            10, // frameLengthMs,
            0.003, // adjustment,
            10, // threshold,
            0 // minSignal
    );
    this.speechMarker = new SpeechMarker(
            200, // startSpeechTime,
            500, // endSilenceTime,
            100, // speechLeader,
            50, // speechLeaderFrames
            100 // speechTrailer
    );
    this.nonSpeechDataFilter = new NonSpeechDataFilter();
    this.premphasizer = new Preemphasizer(
            0.97 // preemphasisFactor
    );
    this.windower = new RaisedCosineWindower(
            0.46, // double alpha
            25.625f, // windowSizeInMs
            10.0f // windowShiftInMs
    );
    this.fft = new DiscreteFourierTransform(
            -1, // numberFftPoints
            false // invert
    );
    this.melFilterBank = new MelFrequencyFilterBank(
            130.0, // minFreq,
            6800.0, // maxFreq,
            40 // numberFilters
    );
    this.dct = new DiscreteCosineTransform(
            40, // numberMelFilters,
            13 // cepstrumSize
    );
    this.cmn = new LiveCMN(
            12.0, // initialMean,
            100, // cmnWindow,
            160 // cmnShiftWindow
    );
    this.featureExtraction = new DeltasFeatureExtractor(
            3 // window
    );
    // Assemble the processors in the order the audio flows through them.
    ArrayList pipeline = new ArrayList();
    pipeline.add(audioDataSource);
    pipeline.add(dataBlocker);
    pipeline.add(speechClassifier);
    pipeline.add(speechMarker);
    pipeline.add(nonSpeechDataFilter);
    pipeline.add(premphasizer);
    pipeline.add(windower);
    pipeline.add(fft);
    pipeline.add(melFilterBank);
    pipeline.add(dct);
    pipeline.add(cmn);
    pipeline.add(featureExtraction);
    this.frontend = new FrontEnd(pipeline);
}
This example was taken from the RawTranscriber demo.
RawTranscriber.java
TranscriberConfiguration.java
CommonConfiguration.java
Interpreting the Recognition Result
As you can see from the above examples, the Recognizer returns a Result object which provides the recognition results. The Result object essentially contains all the paths during the recognition search that have reached the final state (or end of sentence, usually denoted by </s>). They are ranked by the ending score of the path, and the one with the highest score is the best hypothesis. Moreover, the Result also contains all the active paths (those that have not reached the final state) at the end of the recognition.
Usually, one would call the Result.getBestResultNoFiller [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/result/Result.html#getBestResultNoFiller()] method to obtain a string of the best result that has no filler words like ++SMACK++. This method first attempts to return the best path that has reached the final state. If no paths have reached the final state, it returns the best path out of the paths that have not reached the final state.
If you only want to return those paths that have reached the final state, you should call the method Result.getBestFinalResultNoFiller [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/result/Result.html#getBestFinalResultNoFil]. For example, the HelloWorld demo uses this method to avoid treating any partial sentence in the grammar as the result.
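The difference between the two methods can be sketched in plain Java. This is only an illustration of the selection rule described above, not Sphinx-4 code: SketchPath and ResultSelectionSketch are hypothetical stand-ins, the "++...++" filler test is an assumption based on the ++SMACK++ example, and returning an empty string when no path is final is also an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one search path; NOT a Sphinx-4 class.
final class SketchPath {
    final List<String> words;
    final double score;          // higher score means a better hypothesis
    final boolean reachedFinal;  // true if the path reached the final state

    SketchPath(List<String> words, double score, boolean reachedFinal) {
        this.words = words;
        this.score = score;
        this.reachedFinal = reachedFinal;
    }
}

final class ResultSelectionSketch {
    // Assumption: fillers are written like ++SMACK++.
    static boolean isFiller(String word) {
        return word.length() > 4 && word.startsWith("++") && word.endsWith("++");
    }

    // Highest-scoring path, optionally restricted to paths that reached the final state.
    static SketchPath best(List<SketchPath> paths, boolean finalOnly) {
        SketchPath best = null;
        for (SketchPath p : paths) {
            if (finalOnly && !p.reachedFinal) {
                continue;
            }
            if (best == null || p.score > best.score) {
                best = p;
            }
        }
        return best;
    }

    // Joins the words of a path, skipping fillers; empty string for a missing path.
    static String noFiller(SketchPath path) {
        if (path == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (String w : path.words) {
            if (isFiller(w)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(w);
        }
        return sb.toString();
    }

    // Final paths only, as described for getBestFinalResultNoFiller.
    static String bestFinalResultNoFiller(List<SketchPath> paths) {
        return noFiller(best(paths, true));
    }

    // Prefer final paths, fall back to active ones, as described for getBestResultNoFiller.
    static String bestResultNoFiller(List<SketchPath> paths) {
        SketchPath p = best(paths, true);
        if (p == null) {
            p = best(paths, false);
        }
        return noFiller(p);
    }

    public static void main(String[] args) {
        List<SketchPath> partialOnly = Arrays.asList(
                new SketchPath(Arrays.asList("hello", "++SMACK++", "will"), -90.0, false));
        System.out.println("[" + bestResultNoFiller(partialOnly) + "]");
        System.out.println("[" + bestFinalResultNoFiller(partialOnly) + "]");
    }
}
```

With only a partial path available, the first call still yields its words, while the second yields nothing. That is the behaviour a grammar-driven application such as HelloWorld relies on to ignore incomplete sentences.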
There are other methods in the Result object that can give you more information, e.g., the N-best results.
You will also notice that there are a number of methods that return Tokens. Tokens are objects along a search path that record where we are in the search, and the various scores at that particular location.
For example, the Token object has a getWord method that tells you which word the search is in. For details about the Token object, please refer to the javadoc for Token [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/decoder/search/Token.html]. For details about the Result object, please refer to the javadoc for Result [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/result/Result.html].
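To make the idea of tokens along a path concrete, here is a minimal sketch that assumes each token keeps a reference to the token before it (the Sphinx-4 Token class exposes getPredecessor for this purpose). SketchToken and TokenWalkSketch are hypothetical stand-ins written for this illustration. They are not the library classes, and the scores are made-up values.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a search token; NOT the Sphinx-4 Token class.
final class SketchToken {
    final SketchToken predecessor; // null at the start of the path
    final String word;             // null when the token is not at a word
    final double score;            // score of the path up to this token

    SketchToken(SketchToken predecessor, String word, double score) {
        this.predecessor = predecessor;
        this.word = word;
        this.score = score;
    }
}

final class TokenWalkSketch {
    // Walks from the last token back to the start and returns the words in spoken order.
    static String wordPath(SketchToken last) {
        Deque<String> words = new ArrayDeque<String>();
        for (SketchToken t = last; t != null; t = t.predecessor) {
            if (t.word != null) {
                words.addFirst(t.word); // prepend, because we are walking backwards
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        SketchToken start = new SketchToken(null, null, 0.0);
        SketchToken hello = new SketchToken(start, "hello", -40.5);
        SketchToken between = new SketchToken(hello, null, -55.0);
        SketchToken world = new SketchToken(between, "world", -98.2);
        System.out.println(wordPath(world));
    }
}
```

Walking backwards from the final token is the natural direction here, since each token only knows its predecessor; the words are prepended so they come out in the order they were spoken.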
Writing Scripts
One of the huge advantages of working in Java is the wealth of scripting options. These options include Groovy, Ruby, Python, Clojure and many other choices to suit every programming taste and philosophy. All these languages run on the JVM and can call Java code directly. Hence Sphinx4 can be scripted in any of these popular languages.
While the XML configuration files can be used with scripting languages, it is generally more elegant and readable to call Java constructors directly. Compare the following front end setup from the Groovy, Python and Clojure examples to the XML and raw configurations described above.
Groovy Example
GroovyTranscriber.groovy
// init audio data
def audioSource = new AudioFileDataSource(3200, null)
def audioURL = (args.length > 1) ?
new File(args[0]).toURI().toURL() :
new URL("file:" + root + "/src/apps/edu/cmu/sphinx/demo/transcriber/10001-90210-01803.wav")
audioSource.setAudioFile(audioURL, null)
// init front end
def dataBlocker = new DataBlocker(
10 // blockSizeMs
)
def speechClassifier = new SpeechClassifier(
10, // frameLengthMs,
0.003, // adjustment,
10, // threshold,
0 // minSignal
)
def speechMarker = new SpeechMarker(
200, // startSpeechTime,
500, // endSilenceTime,
100, // speechLeader,
50, // speechLeaderFrames
100 // speechTrailer
)
def nonSpeechDataFilter = new NonSpeechDataFilter()
def premphasizer = new Preemphasizer(
0.97 // preemphasisFactor
)
def windower = new RaisedCosineWindower(
0.46, // double alpha
25.625f, // windowSizeInMs
10.0f // windowShiftInMs
)
def fft = new DiscreteFourierTransform(
-1, // numberFftPoints
false // invert
)
def melFilterBank = new MelFrequencyFilterBank(
130.0, // minFreq,
6800.0, // maxFreq,
40 // numberFilters
)
def dct = new DiscreteCosineTransform(
40, // numberMelFilters,
13 // cepstrumSize
)
def cmn = new LiveCMN(
12.0, // initialMean,
100, // cmnWindow,
160 // cmnShiftWindow
)
def featureExtraction = new DeltasFeatureExtractor(
3 // window
)
def pipeline = [
audioSource,
dataBlocker,
speechClassifier,
speechMarker,
nonSpeechDataFilter,
premphasizer,
windower,
fft,
melFilterBank,
dct,
cmn,
featureExtraction
]
def frontend = new FrontEnd(pipeline)
Python Example
PythonTranscriber.py
# init audio data
audioSource = AudioFileDataSource(3200, None)
audioURL = URL("file:" + root + "/src/apps/edu/cmu/sphinx/demo/transcriber/10001-90210-01803.wav")
audioSource.setAudioFile(audioURL, None)
# init front end
dataBlocker = DataBlocker(
10 # blockSizeMs
)
speechClassifier = SpeechClassifier(
10, # frameLengthMs,
0.003, # adjustment,
10, # threshold,
0 # minSignal
)
speechMarker = SpeechMarker(
200, # startSpeechTime,
500, # endSilenceTime,
100, # speechLeader,
50, # speechLeaderFrames
100 # speechTrailer
)
nonSpeechDataFilter = NonSpeechDataFilter()
premphasizer = Preemphasizer(
0.97 # preemphasisFactor
)
windower = RaisedCosineWindower(
0.46, # double alpha
25.625, # windowSizeInMs
10.0 # windowShiftInMs
)
fft = DiscreteFourierTransform(
-1, # numberFftPoints
False # invert
)
melFilterBank = MelFrequencyFilterBank(
130.0, # minFreq,
6800.0, # maxFreq,
40 # numberFilters
)
dct = DiscreteCosineTransform(
40, # numberMelFilters,
13 # cepstrumSize
)
cmn = LiveCMN(
12.0, # initialMean,
100, # cmnWindow,
160 # cmnShiftWindow
)
featureExtraction = DeltasFeatureExtractor(
3 # window
)
pipeline = [
audioSource,
dataBlocker,
speechClassifier,
speechMarker,
nonSpeechDataFilter,
premphasizer,
windower,
fft,
melFilterBank,
dct,
cmn,
featureExtraction
]
frontend = FrontEnd(pipeline)
Clojure Example
ClojureTranscriber.clj
; init audio data
(def audioSource (new AudioFileDataSource 3200 nil))
(def audioURL (new URL (str "file:" root "/src/apps/edu/cmu/sphinx/demo/transcriber/10001-90210-01803.wav")))
(.setAudioFile audioSource audioURL nil)
; init front end
(def dataBlocker (new DataBlocker
  10)) ; blockSizeMs
(def speechClassifier (new SpeechClassifier
  10 ; frameLengthMs
  0.003 ; adjustment
  10 ; threshold
  0)) ; minSignal
(def speechMarker (new SpeechMarker
  200 ; startSpeechTime
  500 ; endSilenceTime
  100 ; speechLeader
  50 ; speechLeaderFrames
  100)) ; speechTrailer
(def nonSpeechDataFilter (new NonSpeechDataFilter))
(def premphasizer (new Preemphasizer
  0.97)) ; preemphasisFactor
(def windower (new RaisedCosineWindower
  0.46 ; double alpha
  25.625 ; windowSizeInMs
  10.0)) ; windowShiftInMs
(def fft (new DiscreteFourierTransform
  -1 ; numberFftPoints
  false)) ; invert
(def melFilterBank (new MelFrequencyFilterBank
  130.0 ; minFreq
  6800.0 ; maxFreq
  40)) ; numberFilters
(def dct (new DiscreteCosineTransform
  40 ; numberMelFilters
  13)) ; cepstrumSize
(def cmn (new LiveCMN
  12.0 ; initialMean
  100 ; cmnWindow
  160)) ; cmnShiftWindow
(def featureExtraction (new DeltasFeatureExtractor
  3)) ; window
(def pipeline [
  audioSource
  dataBlocker
  speechClassifier
  speechMarker
  nonSpeechDataFilter
  premphasizer
  windower
  fft
  melFilterBank
  dct
  cmn
  featureExtraction])
(def frontend (new FrontEnd pipeline))
Additional Information
Non-wiki Sphinx-4 Documentation home page [http://cmusphinx.sourceforge.net/sphinx4/index.html]
Sphinx4 Configuration management [http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/util/props/doc-files/ConfigurationManagement.html]
Sphinx-4 Frequently Asked Questions [http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4-faq.html]
Sphinx-4 Transcriber Demo [http://cmusphinx.sourceforge.net/sphinx4/src/apps/edu/cmu/sphinx/demo/transcriber/README.html]
tutorialsphinx4.txt Last modified: 2011/09/30 07:37 by admin
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported [http://creativecommons.org/licenses/by-nc-sa/3.0/]