RICHARD P. LIPPMANN
Neural Network Classifiers for Speech Recognition
Neural nets offer an approach to computation that mimics biological nervous systems. Algorithms based on neural nets have been proposed to address speech recognition tasks which humans perform with little apparent effort. In this paper, neural net classifiers are described and compared with conventional classification algorithms. Perceptron classifiers trained with a new algorithm, called back propagation, were tested and found to perform roughly as well as conventional classifiers on digit and vowel classification tasks. A new net architecture, called a Viterbi net, which recognizes time-varying input patterns, provided an accuracy of better than 99% on a large speech data base. Perceptrons and another neural net, the feature map, were implemented in a very large-scale integration (VLSI) device.
Neural nets are highly interconnected networks of relatively simple processing elements, or nodes, that operate in parallel. They are designed to mimic the function of neurobiological networks. Recent work on neural networks raises the possibility of new approaches to the speech recognition problem. Neural nets offer two potential advantages over existing approaches. First, their use of many processors operating in parallel may provide the computational power required for continuous-speech recognition. Second, new neural net algorithms, which could self-organize and build an internal speech model that maximizes performance, would perform even better than existing algorithms. These new algorithms could mimic the type of learning used by a child who is mastering new words and phrases.
The best existing non-neural speech recognizers perform well only in highly constrained tasks. For example, to achieve 99% accurate recognition on 105 words, a typical word recognizer must restrict the input speech to talker-dependent isolated words. That is, the recognizer must be trained with the speaker's voice prior to the speech recognition task and individual words must be spoken with pauses interjected between words. Accuracy can be as high as 95% for 20,000 words - when sentence context can be used as an aid in word recognition and when a pause is interjected between words. Even at 95% accuracy for individual
The Lincoln Laboratory Journal, Volume 1, Number 1 (1988)
words, sentence accuracy only amounts to 50%. Performance for recognizers that don't restrict speech to isolated-word talker-dependent speech is much worse. The best current algorithms for speech recognition use hidden Markov models (HMM) and therefore require computation rates of >100 million instructions per second (MIPS) for large vocabulary tasks. But computation rates of this magnitude are not available on conventional single-processor computers.
Building internal speech models and self-organization requires the development of algorithms that use highly parallel neuron-like architectures. Our approach to this problem is to design neural net architectures to implement existing algorithms and, then, to extend the algorithms and add advanced learning and model-building capabilities.
COMPONENTS OF AN ISOLATED-WORD SPEECH RECOGNIZER
An isolated-word speech recognizer must perform four major tasks. The structure of the speech recognizer shown in Fig. 1 reflects these tasks. The recognizer first includes a preprocessing section that extracts important information from the speech waveform. Typically, the preprocessor breaks the input waveform into 10-ms frames and outputs spectral pat-
[Figure 1 diagram: speech input signal → preprocessor → spectral patterns (100 per second) → pattern matching and time alignment, using stored word models (patterns and sequences) → word scores → classification → selected word.]
Fig. 1 - An isolated-word speech recognition system consists of a preprocessor, pattern matching section, time alignment section, and pattern sequence classifier.
terns that represent the input waveform. The spectral patterns are then compared with stored reference spectral patterns. This comparison yields the local distance between the static pattern from the preprocessor and the stored patterns, which is a measure of the match or similarity between these patterns.
The sequence of spectral patterns from the preprocessor must be time aligned to the nodes in word models. Time alignment compensates for variations in talking rate and pronunciation. Word models contain the sequence of patterns expected for each word and use internal nodes to time align input pattern sequences to expected pattern sequences. Following the temporal alignment and matching functions, a decision logic section selects the word model with the best matching score as the recognizer output.
Neural nets have been applied to the four tasks required of a speech recognizer. Two of these applications - pattern matching or classification, and time alignment - are the focus of this paper.
NEURAL NETS AS CLASSIFIERS
A taxonomy of eight neural nets that can be used to classify static patterns is presented in Fig. 2. These classifiers are grouped on the basis of whether they accept continuous or binary inputs, and on the basis of whether they employ supervised or unsupervised training.
Nets trained with supervision, such as perceptrons [1], are used as classifiers (see box, "Supervised Training"). These nets require labeled training data. During training, the number of classes desired and the class identification of each training sample are known. This information is used to determine the desired net output and to compute an error signal. The error signal shows the discrepancy between the actual and the desired outputs; it is used to determine how much the weights should change to improve the performance for subsequent inputs.
Nets trained without supervision [see box, "Unsupervised Training (Self-Organization)"], such as Kohonen's feature map-forming nets [2], can be used as vector quantizers. No information concerning correct classes is given to these nets. They self-organize training samples into classes or clusters without intervention from a "teacher." Self-organized, or unsupervised, training offers the advantage of being able to use any large body of available unlabeled data for training.
Classical algorithms that most closely resemble equivalent neural net algorithms are listed
Supervised Training
Decision regions after 50, 100, 150 and 200 trials generated by a two-layer perceptron classifier trained with the back propagation training algorithm.
Supervised training requires labeled training data and a "teacher" that presents examples and provides the desired output. In typical operation, examples are presented and classified by the network; then error information is fed back and weights are adapted, until weights converge to final values.
Multilayer perceptrons use supervised training and overcome many of the limitations of single-layer perceptrons. But they were generally not used in the past because effective training algorithms were not available. The development of a new training algorithm, called back propagation, has permitted multilayer perceptrons to come into increasing use [3]. The illustration shows decision regions formed, using back propagation, after 50, 100, 150 and 200 training examples were presented to a two-layer perceptron with eight hidden nodes and two outputs [4]. The shaded areas in the plots are the actual decision regions for Class A and the colored circular regions are the desired minimum-error decision regions. The circular decision region for Class A formed after 200 iterations. Further iterations form a more nearly perfect circle.
Back propagation training begins with setting node weights and offsets to small random values. Then a training example and desired outputs are presented to the net. The actual outputs of the net are calculated, error signals are formed, and an iterative algorithm is used to adapt the node weights - starting at the output nodes and working back to the first hidden layer. An example and desired outputs are again presented to the net and the entire process is repeated until weights converge.
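The training loop the box describes can be sketched in ordinary code. The sketch below is a minimal two-layer perceptron trained with back propagation on the exclusive-or problem; the network size, learning rate, and logistic sigmoid are illustrative assumptions, not the configuration used in the experiments reported here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TwoLayerPerceptron:
    """Minimal two-layer perceptron trained with back propagation.
    Sizes and learning rate are illustrative, not those of the paper."""

    def __init__(self, n_in, n_hidden, lr=0.5, seed=0):
        rng = random.Random(seed)
        # Small random initial weights, as the box describes; the last
        # slot in each weight list holds the node's offset.
        self.w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                      for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        self.lr = lr

    def forward(self, x):
        self.hid = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + ws[-1])
                    for ws in self.w_hid]
        self.out = sigmoid(sum(w * h for w, h in zip(self.w_out, self.hid))
                           + self.w_out[-1])
        return self.out

    def train_step(self, x, target):
        y = self.forward(x)
        # Error signal at the output node, including the sigmoid derivative.
        delta_out = (target - y) * y * (1.0 - y)
        # Work back from the output node to the hidden layer.
        delta_hid = [self.w_out[j] * delta_out * h * (1.0 - h)
                     for j, h in enumerate(self.hid)]
        # Adapt output weights, then hidden weights.
        for j, h in enumerate(self.hid):
            self.w_out[j] += self.lr * delta_out * h
        self.w_out[-1] += self.lr * delta_out
        for j, ws in enumerate(self.w_hid):
            for i, xi in enumerate(x):
                ws[i] += self.lr * delta_hid[j] * xi
            ws[-1] += self.lr * delta_hid[j]

# Repeated cyclical presentation of the entire training set (here, XOR).
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
net = TwoLayerPerceptron(n_in=2, n_hidden=8)
for _ in range(20000):
    for x, t in data:
        net.train_step(x, t)
```

The many cyclical presentations in the final loop reflect the convergence behavior discussed in this box.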
If the decision regions or desired mappings are complex, back propagation can require many repeated cyclical presentations of the entire training data to converge. This difficulty increases training time, but it does not affect the response time of the network during use. It does, however, set a practical limit to the size of nets that can be trained. For single-layer nets or for multilayer nets with restricted connectivity, this restriction is not a severe problem.
at the bottom of Fig. 2. The Hopfield, Hamming, and adaptive resonance theory classifiers are binary-input nets: they are most appropriate when exact binary representations of input data, such as ASCII text or pixel values, are available. (These nets can also be used to solve minimization problems, in data compression applications, and as associative memories.) The single and multilayer perceptrons, feature
map classifiers, and reduced Coulomb energy (RCE) nets are classifiers that can be used with continuous-value inputs. The perceptron and RCE classifiers require supervised training; the feature map classifier does not (see box, "Supervised Training").
Block diagrams of conventional and of neural net classifiers are presented in Fig. 3. Both types of classifiers determine which one of M
classes best represents an unknown static input pattern containing N input elements.
The conventional classifier contains two stages. The first stage computes matching scores, which indicate how closely an input pattern matches the exemplar pattern of each class. (The exemplar is the pattern that is most representative of a class.) The second stage of the classifier selects the class with the best matching score and thus finds the class that is closest to the input pattern.
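This two-stage structure can be sketched as a minimal minimum-distance classifier. The use of a negated Euclidean distance as the matching score, and the exemplars themselves, are illustrative assumptions.

```python
import math

def classify(pattern, exemplars):
    """Two-stage conventional classifier: (1) compute a matching score
    between the input and each class exemplar, (2) select the class with
    the best score. Here the score is a negated Euclidean distance, so
    higher means a closer match."""
    scores = {label: -math.dist(pattern, ex) for label, ex in exemplars.items()}
    return max(scores, key=scores.get)

# Hypothetical exemplars for two classes of 2-element patterns.
exemplars = {"A": (0.0, 1.0), "B": (1.0, 0.0)}
print(classify((0.2, 0.9), exemplars))  # prints "A": closest to exemplar A
```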
In contrast with the preceding method of classification, the neural network classifier (Fig. 3b) accepts the N input values through N parallel input connections. The first stage of the classifier computes matching scores, in parallel. It then transmits the scores, again in parallel, to the next stage over M analog lines. Here the maximum of these values is selected and enhanced. At the end of the classification process, only one of the M class outputs is high. Information about classification errors can be fed back to the first stage of the classifier, and connection weights between nodes adapted, thereby adjusting the algorithm and making a correct identification in the future more likely. Adaptation is typically incremental and requires minimal computation and storage.
The classifiers displayed in the taxonomy in Fig. 2 can identify which internal class best represents an input pattern. This capability can be used, for instance, to identify speech sounds that can vary substantially between utterances. Similarly, the classifiers can identify sounds that have been corrupted by noise or by interference from other, competing, talkers.
Neural classifiers can also be used as associative, or content-addressable, memories. An associative memory is one that, when presented with a fragment of a desired pattern, returns the entire pattern. Classifiers can also serve in speech and image recognition applications to vector-quantize analog inputs. Vector quantization compresses data without losing important information, and can be used to reduce storage and computation requirements.
Neural nets differ from conventional classifiers in four important ways. The hardware has a
[Figure 2 taxonomy diagram: neural net pattern classifiers, divided by continuous-value vs. binary input and by supervised vs. unsupervised training. Continuous-value input - supervised: perceptron, multilayer perceptron, reduced Coulomb energy; unsupervised: self-organizing feature map, feature map classifier. Binary input - supervised: Hopfield net, Hamming net; unsupervised: adaptive resonance theory. The most similar classical algorithms appear beneath the nets: Gaussian classifier, k-nearest neighbor/mixture, k-means clustering algorithm, leader clustering algorithm, optimum classifier.]
Fig. 2 - A taxonomy of eight neural net classifiers; the conventional pattern classification algorithms most similar to corresponding neural net algorithms are listed at the bottom.
Unsupervised Training (Self-Organization)
Unsupervised training, sometimes called self-organization, uses unlabeled training data; it requires no external teacher to feed error signals back to the network during training. Data are entered into the network, which then forms internal categories or clusters. The clusters compress the amount of input data that must be processed without losing important information. Unsupervised clustering can be used in such applications as speech and image recognition to perform data compression and to reduce the amount of computation required for classification, as well as the space required for storage.

Self-organized training is well represented by self-organizing feature maps. Kohonen's self-organizing feature maps [2] use an algorithm that creates a vector quantizer by adjusting weights from common input nodes to M output nodes, arranged in a two-dimensional grid (Fig. A). Output nodes are extensively interconnected with many local connections. During training, continuous-value input vectors are presented sequentially in time, without specifying the desired output. After input vectors have been presented to the net, weights will specify cluster or vector centers that sample the input space so that the point density function of the vector centers tends to approximate the probability density function of the input vectors. This algorithm requires a neighborhood to be defined around each node that decreases in size with time.

Net training begins with setting random values for the weights from the N inputs to the M output nodes. Then a new input is presented to the net and the distance between the input and all nodes is computed. The output node with the least distance is then selected and the weights to that node and all nodes in its neighborhood are updated, which makes these nodes more responsive to the current input. This process is repeated for further inputs until the weights converge and are fixed.

The self-organized training process is illustrated in Fig. B. Crosspoints of the lines in Fig. B represent values of weights between two inputs and each of 100 nodes in a feature map net. Lines connect points for physically nearest neighbors in the grid. The crosspoints grow with time and cover the rectangular region representing the distribution from which the input training examples were drawn. Lines don't cross, indicating that nearby nodes are sensitive to similar inputs and that a topological mapping between inputs and nodes has been achieved.

A VLSI chip that implements the feature map algorithm has been fabricated (see box, "Implementing Neural Nets in VLSI").

Fig. A - Using variable connection weights, this feature map connects every input node to every output node.

Fig. B - The vector weights to 100 output nodes from two input nodes generated by Kohonen's feature map algorithm [2]. The horizontal axis represents the value of the weight from input x0 and the vertical axis represents the value from input x1. Line intersections specify two weights for each node. Lines connect weights for nearest-neighbor nodes. An orderly grid indicates that topologically close nodes are responsive to similar inputs. Inputs were random, independent, and uniformly distributed over the area shown. (Snapshots at t = 25, 100, 500, 1000, and 5000 training inputs.)
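The training procedure this box describes can be sketched as follows. The grid size, gain schedule, and square neighborhood are illustrative assumptions; Kohonen's algorithm [2] leaves these as design choices.

```python
import math
import random

def train_feature_map(inputs, grid=10, epochs=2000, seed=0):
    """Kohonen feature-map training as the box describes: find the node
    nearest the input, then pull that node and its grid neighborhood
    toward the input. Neighborhood size and gain shrink with time."""
    rng = random.Random(seed)
    # Random initial weights from the 2 inputs to grid*grid output nodes.
    w = [[rng.random(), rng.random()] for _ in range(grid * grid)]
    for t in range(epochs):
        x = rng.choice(inputs)
        # Select the output node with the least distance to the input.
        best = min(range(len(w)), key=lambda n: math.dist(w[n], x))
        br, bc = divmod(best, grid)
        radius = max(1, int(grid / 2 * (1 - t / epochs)))
        gain = 0.5 * (1 - t / epochs)
        # Update the winning node and all nodes in its neighborhood.
        for n in range(len(w)):
            r, c = divmod(n, grid)
            if abs(r - br) <= radius and abs(c - bc) <= radius:
                w[n][0] += gain * (x[0] - w[n][0])
                w[n][1] += gain * (x[1] - w[n][1])
    return w

# Uniformly distributed 2-D training inputs, as in Fig. B.
rng = random.Random(1)
samples = [(rng.random(), rng.random()) for _ in range(500)]
weights = train_feature_map(samples)
```

After training, the 100 weight vectors spread out over the unit square from which the samples were drawn, as the crosspoints do in Fig. B.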
[Figure 3a diagram: input symbols → compute matching score → store values and select maximum → output symbol for most likely class.]

[Figure 3b diagram: N analog inputs → compute matching score (parallel intermediate scores y0 through yM-1) → select and enhance maximum → only one of the M outputs high; when both the outputs and the correct class are given, the weights are adapted.]
Fig. 3 - a) The inputs and outputs of conventional classifiers are passed serially and processing is serial. b) Inputs and outputs of neural net classifiers are in parallel and processing is in parallel.
parallel architecture; inputs and outputs are analog and in parallel; computations are performed in parallel; and connection weights are adapted incrementally over time, which improves performance yet uses simple calculations and limited storage.
CLASSIFICATION WITH MULTILAYER PERCEPTRONS
A single-layer perceptron (Fig. 4) is the simplest possible feed-forward neural net classifier. It classifies an input pattern applied at the input into one of two classes, denoted A and B in Fig. 4. This net forms half-plane decision regions by dividing the space that the input spans, using a hyperplane decision boundary. Decision regions are used to determine which class label should be attached to an unknown input pattern. If the values of the analog inputs for a pattern fall within a certain decision region, then the input is considered to be a member of the class represented by that decision region.
An example of the evolution of a single-layer perceptron's decision boundaries during 80 trials of training is shown on the right side of Fig. 4. This example uses a training algorithm, called the perceptron convergence procedure, which modifies weights only on trials where an error occurs.
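The perceptron convergence procedure can be sketched directly from this description; the training data below are illustrative linearly separable points, not speech data.

```python
def train_perceptron(samples, epochs=100):
    """Perceptron convergence procedure: weights (with the threshold
    folded in as w[2]) change only on trials where the hard-limited
    output is wrong."""
    w = [0.0, 0.0, 0.0]  # two input weights plus a threshold term
    for _ in range(epochs):
        for x, target in samples:  # target is +1 (Class A) or -1 (Class B)
            y = 1 if w[0] * x[0] + w[1] * x[1] + w[2] >= 0 else -1
            if y != target:  # adapt only when an error occurs
                w[0] += target * x[0]
                w[1] += target * x[1]
                w[2] += target
    return w

# Two illustrative classes separable by the hyperplane (here, a line).
samples = [((2.0, 0.0), 1), ((1.5, 0.5), 1), ((0.0, 2.0), -1), ((0.5, 1.5), -1)]
w = train_perceptron(samples)
```

Because the classes are linearly separable, the procedure stops changing the weights once every trial is classified correctly.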
Multilayer perceptrons [3] are nets that include layers of hidden nodes not directly connected to both inputs and outputs. The capabilities of perceptrons with one, two and three layers and step nonlinearities are summarized in Fig. 5. Single-layer perceptrons cannot sepa-
rate inputs from classes A and B in the exclusive-or problem shown in the third column of Fig. 5. This problem includes two classes with disjoint decision regions that cannot be separated by a single straight line. Two-layer perceptrons typically form open- or closed-convex decision regions and thus can separate inputs for the exclusive-or problem, but they typically cannot separate inputs when classes are meshed.
Three-layer perceptrons with enough nodes in the first layer of hidden nodes can not only separate inputs when classes are meshed, they can form decision regions as complex as required by any classification algorithm [4]. Three-layer perceptrons with sigmoidal nonlinearities can also form arbitrary continuous nonlinear mappings between continuous-valued analog inputs and outputs [4].
SPOKEN DIGIT CLASSIFICATION
A new training algorithm, called back propagation, was recently developed for multilayer perceptrons [3]. Multilayer perceptron classifiers trained with back propagation were compared with conventional classifiers using a digit-classification task involving speech from a variety of speakers. This comparison was designed to gain experience with the performance of multilayer perceptrons trained with back propagation on a difficult classification problem - not to develop a digit recognizer.
Classifiers were evaluated using the first seven single-syllable digits ("one," "two," "three," "four," "five," "six" and "eight") of the Texas Instruments (TI) Isolated Word Data Base [5]. A version of this database that had been sampled at 12 kHz was processed to extract 11 mel cepstra from every 10-ms frame. Mel cepstral parameters are derived from a spectral analysis of a short segment of the input speech waveform; they have been found to be useful for speech recognition [6]. Each classifier was presented with 22 input coefficients for every test or training utterance word token. Eleven of these coefficients were obtained from the highest-energy frame for the word and 11 from the frame that preceded the highest-energy frame by 30 ms (Fig. 6). The additional coefficients from the earlier frame provided information about spec-
[Figure 4 diagram: inputs x0 through xN-1 feed a single output node through weights wj with internal offset θ. The node computes y = f_h( Σ_{j=0}^{N-1} w_j x_j - θ ), where f_h is a hard-limiting nonlinearity; y = +1 indicates Class A and y = -1 indicates Class B. The accompanying plot shows the decision boundary at successive training trials over inputs ranging from -50 to 100.]
Fig. 4 - A single-layer perceptron and the decision regions it forms for two classes, A and B, using as many as 80 trials.
Implementing Neural Nets in VLSI
One of the most important applications of neural network architectures is in the realization of compact real-time hardware for speech, vision, and robotics applications. The chips shown here are initial implementations of two neural network architectures in VLSI. Both chips used a standard MOSIS (MOS implementation system) process that combines digital and analog circuitry on one chip. Both the perceptron chip, shown on the left [8], and the feature map chip, shown on the right [18], store weights (connection coefficients) in digital form. The feature map chip adapts the weights internally; in contrast, the weights are adapted externally and downloaded to the perceptron chip. Analog circuits on each chip perform the multiplications and additions required by the respective algorithms.
The perceptron chip was tested in a speech recognition application and achieved a level of performance comparable to the performance of a digital computer simulation [8]. The multilayer perceptron has 32 inputs, 16 nodes, and 512 weights. The feature map chip has 7 inputs, 16 nodes, and 112 weights. As indicated by the relative simplicity of the designs, these are first-generation chips; VLSI implementations of much greater density, with corresponding increases in capability, are being explored.
tral changes over time and improved the classifiers' performance.
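The construction of the 22-coefficient input vector can be sketched as follows. The toy cepstral values and energies below are illustrative, and the mel-cepstrum computation itself [6] is not reproduced.

```python
def build_input_vector(frames, energies, frames_back=3):
    """Select the highest-energy 10-ms frame and the frame three frames
    (30 ms) earlier, and concatenate their 11 cepstral coefficients into
    the 22-element classifier input."""
    peak = max(range(len(energies)), key=energies.__getitem__)
    earlier = max(0, peak - frames_back)
    return frames[earlier] + frames[peak]

# Hypothetical utterance: one 11-coefficient cepstral vector per 10-ms frame.
frames = [[0.1 * f + 0.01 * c for c in range(11)] for f in range(20)]
energies = [1.0, 2.0, 5.0, 9.0, 7.0, 3.0] + [1.0] * 14  # peak at frame 3
vec = build_input_vector(frames, energies)
assert len(vec) == 22
```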
Gaussian and k-nearest neighbor classifiers [7], which are often used for speech recognition, were compared with multilayer perceptron classifiers with different numbers of nodes and layers. All classifiers were trained and tested separately for each of the 16 talkers in the TI data base. Classifiers were trained using 10 training examples per digit (70 total) and then tested on a separate 16 tokens per digit (112 total). The Gaussian classifier used a diagonal covariance matrix with pooled variance estimates. This technique produced better performance than a similar classifier with variances estimated separately for each digit.
The Gaussian classifier can be implemented with a single-layer perceptron [4]. In fact, when
it was implemented in VLSI hardware and tested, it was found to provide the same performance on this digit classification problem as a digital computer-based Gaussian classifier [8] (see box, "Implementing Neural Nets in VLSI"). A k-nearest neighbor classifier was also used in this comparison because it performs well with non-unimodal distributions and becomes optimal as training data are increased [7]. The number of nearest neighbors (k) for the classifier was set to "one" because it provided the best performance.
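For k = 1, the k-nearest neighbor rule reduces to finding the single closest training token. A minimal sketch with illustrative data:

```python
import math
from collections import Counter

def knn_classify(x, train, k=1):
    """Classify x as the majority label among its k nearest training
    examples (k = 1 gave the best performance in the digit experiment)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative labeled training tokens (2-element patterns for brevity;
# the digit experiment used 22 cepstral coefficients per token).
train = [((0.0, 0.0), "one"), ((0.1, 0.2), "one"),
         ((1.0, 1.0), "two"), ((0.9, 1.1), "two")]
print(knn_classify((0.05, 0.1), train))  # prints "one"
```

Note that this rule stores every training token and computes a distance to each at classification time, which is why its decision regions can be far more complex than a Gaussian classifier's.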
All multilayer perceptron classifiers had 22 inputs, seven outputs, and used logistic sigmoidal nonlinearities along with the back propagation training algorithm (see box, "Supervised Training"). The class selected during testing was the one corresponding to the out-
put node with the largest output value. During training, the desired output was greater than 0.9 for the output node corresponding to the correct class and less than 0.1 for the other output nodes. A single-layer perceptron was compared with five two-layer perceptrons that possessed 16 to 256 hidden nodes, and to four three-layer perceptrons that possessed 32 to 256 nodes in the first hidden layer and 16 in the second hidden layer.
Results averaged over all talkers are shown in Figs. 7 and 8. Figure 7 shows the percentage error for all classifiers. The k-nearest neighbor classifier provided the lowest error rate (6.0%) and the single-layer perceptron the highest (14.4%). The Gaussian classifier was intermediate (8.7%) and slightly worse than the best two-layer (7.6%) and three-layer (7.7%) perceptrons.
The number of training examples that the perceptrons required for convergence is of special significance. These results are illustrated in Fig. 8. Perceptron networks were trained by repeatedly presenting them with the 70 training examples from each talker. The convergence time for the multilayer perceptrons was defined as the time required for the error rate to reach zero on the training data. For the single-layer perceptrons, this occurred infrequently, so the convergence time was defined as the time required for the error rate to fall and remain below 5%. The average number of examples presented for the single-layer perceptron (29,000) excludes data from two talkers when this criterion wasn't met even after 100,000 presentations.
There is a marked difference in convergencetime during training between the single-layer
[Figure 5 chart: perceptron structure versus the types of decision regions formed. One layer: half-plane regions. Two layers: typically convex regions. Three layers: arbitrary regions. Columns show each structure applied to the exclusive-or problem, to classes with meshed regions, and to general region shapes.]
Fig. 5 - Decision regions can be formed by single-layer and multilayer perceptrons with one and two layers of hidden nodes and two inputs. Shading denotes decision regions for Class A. Smooth closed contours bound input distributions for Classes A and B. Nodes in all nets use hard-limiting step nonlinearities.
"One" "Three" "Five" "Eight"
"Two" "Four" "Six"
----Input}
~""",-.J Hidden Nodes
y
11 CepstralCoefficients
• •• •'----------..'Y---------'
11 CepstralCoefficients
>OJI-Ol·e::
UJ
Frame Number •
Fig. 6 - Cepstral data from the maximal-energy frame, and from the frame that was three frames before the maximal-energy frame in each word, were presented to multilayer perceptrons. Their recognition performance was compared with the performance of conventional classifiers for spoken digits.
and multilayer perceptrons. The two largest three-layer perceptrons (128 and 256 nodes in the first hidden layer) converge in the shortest time (557 and 722 examples, respectively), and the single-layer perceptron takes the longest time (> 29,000 examples).
These results demonstrate the potential of multilayer perceptron classifiers. The best multilayer perceptron classifiers performed better than a reference Gaussian classifier and nearly as well as a k-nearest neighbor classifier, yet they converged during training with fewer than 1,000 training examples. These results also demonstrate the necessity of matching the structure of a multilayer perceptron classifier to the problem. For example, the two-layer perceptron with the most hidden nodes took much longer to train than other two- and three-layer perceptron classifiers but the performance was no better. The poor performance of the Gaussian classifier (8.7%) relative to the k-nearest
neighbor classifier (6.0%) suggests that complex decision regions are required for this problem. These complex decision regions couldn't be provided by the single-layer perceptron. As a result, the single-layer perceptron classifier provided the worst error rate (14.4%), required thousands of presentations for convergence, and never converged for two of the talkers.
Multilayer perceptrons produce complex decision regions via intersections of half-plane regions formed in each layer (Fig. 5). The three-layer perceptrons converged fastest in the digit-classification experiment, presumably because there were more half-plane decision boundaries close to required decision region boundaries when training began. Since these boundaries were closer to the desired decision regions, they only had to be shifted slightly to produce the required regions. Other boundaries, more distant from required decision re-
gions, moved only slightly because of a derivative term in the back propagation training algorithm. These results suggest that three-layer perceptrons produce the best performance even when only simple convex decision regions are required. Although two-layer perceptrons can form such regions, convergence times with these nets may increase greatly if too many hidden nodes are provided.
the perceptron trained with back propagationrequired more than 50,000 labeled examples.The first stage ofthe feature map classifier andthe multilayer perceptron were trained by presenting them repeatedly with randomly selected examples from the available 338 trainingexamples. Although this set of 338 examplesprovided labels for training, the labels were
OL-..._.....L.__"--_.....L.__"--_--1..---J
Hidden Nodes in First Layer
25664
I I
32 128
in First Layer
Three Layers
~256
Two Layers
6416
One Layer(> 29.000)
o I I i I I I
KNN Gauss 0 32 128
Classifier Hidden Nodes
8000,..--1....---------------,
6000
2000
Fig. 7 - Percentage error in the spoken digit classification experiment for a Gaussian classifier, a k-nearestneighbor classifier, a single-layer perceptron, a twolayer perceptron with 16 to 256 hidden nodes, and athree-layer perceptron with 32 to 256 nodes in the firsthidden layer and 16 nodes in the second hidden layer.KNN =k-nearest neighbor.
Fig. 8 - Number of iterations before convergence in thespoken digit classification experiment, for a single-layerperceptron and for multilayer perceptrons as a functionof the number of nodes in the first layer.
20Ref. l Perceptrons
16 l-I- •0I- One LayerI-
l.LJ 12 .........
~cQ) • ~u 8l- I- GaussQ)a.. • Two Layers Three Layers
4 I- KNN
U)
co.~ 4000I-Q).....
Unsupervised training with unlabeled training data can often substantially reduce theamountofsupervisedtrainingrequired (9). Abundant unlabeled training data are frequentlyavailable for many speech and image classification problems, but little labeled data are available. The feature map classifier in Fig. 9 usescombined supervised/unsupervised training tocircumvent this dearth of labeled training data(10). This feature map classifier is similar tohistogram classifiers used in discrete observation HMM speech recognizers (11). The firstlayer of the classifier forms a feature map byusing a self-organizing clusteringalgorithm (2).The second layer is trained using back propagation or maximum likelihood training (10).
Fig. 9 - This feature map classifier combines supervised and unsupervised training, which reduces the required amount of supervised training. (Block diagram, bottom to top: input; calculate correlation to stored exemplars, trained with unsupervised Kohonen feature map learning; select class with majority in top k, trained with supervised associative learning; output, with only one output high.)

The Lincoln Laboratory Journal, Volume I, Number 1 (1988)

VOWEL CLASSIFICATION

Figure 10 compares two conventional (k-nearest neighbor and Gaussian) and two neural net (two-layer perceptron and feature map) classifiers on vowel formant data [12]. These data were obtained by spectrographic analysis of vowels in words formed by "h," followed by a vowel, followed by a "d." The words were spoken by 67 persons, including men, women, and children. First and second formant data of 10 vowels were split into two sets, resulting in 338 training examples and 333 testing examples. These formants are the two lowest resonant frequencies of a talker's vocal tract.

Lines in Fig. 10 represent decision region boundaries formed by the two-layer perceptron. These boundaries are very near those typically drawn by trained phonologists. All classifiers had similar error rates, but there were dramatic differences in the number of supervised training examples required. The feature map classifier with only 100 nodes required fewer than 50 labeled examples for convergence, while the two-layer perceptron required roughly 50,000 training trials. Labels were stripped before training to test the performance of the feature map classifier trained without supervision. These results demonstrate how a neural net can provide the rapid, single-trial learning characteristic of children who are learning new words. The RCE classifier [13] listed in Fig. 2 also can, with supervised training data, provide this type of rapid learning.
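As a rough illustration of classification in formant space, the sketch below applies a k-nearest-neighbor rule to synthetic (F1, F2) points. The three vowel classes and their center frequencies are illustrative stand-ins, not the measured data used in the experiment, and `knn_classify` is a name introduced here for the example.

```python
import numpy as np

# Toy stand-in for formant data: (F1, F2) pairs in Hz for three
# hypothetical vowel classes (values are illustrative approximations).
rng = np.random.default_rng(1)
centers = {"iy": (270, 2290), "aa": (730, 1090), "uw": (300, 870)}
train_x, train_y = [], []
for label, (f1, f2) in centers.items():
    train_x.append(rng.normal((f1, f2), (40, 100), (30, 2)))
    train_y += [label] * 30
train_x = np.vstack(train_x)
train_y = np.array(train_y)

def knn_classify(x, k=5):
    """k-nearest-neighbor vote in (F1, F2) space."""
    d = np.linalg.norm(train_x - x, axis=1)       # distance to every example
    nearest = train_y[np.argsort(d)[:k]]          # labels of the k nearest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[counts.argmax()]                # majority vote

print(knn_classify(np.array([280.0, 2250.0])))    # a query near the "iy" cluster
```

Note that a k-nearest-neighbor classifier "converges" simply by storing its training set, which is why its training requirement in Fig. 10 equals the number of training examples.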
TEMPORAL ALIGNMENT
The digit and vowel classification tasks considered above used static input patterns. Speech, however, consists of temporal pattern sequences that vary across utterances, even for the same talker, because of variations in talking rate and pronunciation. A temporal alignment algorithm is required to cope with this variability. Such an algorithm requires a stored word model for each vocabulary word. Nodes in a word model represent expected sequences of input patterns for that word. Time alignment typically associates each input pattern at a given time with one node in each word model. Nodes in a word model compute the distance between the unknown input pattern sequence and the expected pattern sequence for a word.
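The alignment idea described above can be sketched with a simple dynamic-programming recursion. The following is a generic illustration, assuming a Euclidean local distance and stay/advance/skip transitions between reference nodes; it is not the specific HMM formulation discussed later in the article, and `dtw_distance` is a name introduced for the example.

```python
import numpy as np

def dtw_distance(inp, ref):
    """Dynamic-programming time alignment: total distance between an
    input pattern sequence and a stored reference pattern sequence."""
    T, M = len(inp), len(ref)
    D = np.full((T + 1, M + 1), np.inf)   # D[t, m]: best cost aligning t frames to m nodes
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for m in range(1, M + 1):
            local = np.linalg.norm(inp[t - 1] - ref[m - 1])   # node matching distance
            # Stay in the same node, advance one node, or skip one node.
            prev = min(D[t - 1, m], D[t - 1, m - 1],
                       D[t - 1, m - 2] if m >= 2 else np.inf)
            D[t, m] = local + prev
    return D[T, M]

ref = np.array([[0.], [1.], [2.], [3.]])                        # stored word model
slow = np.array([[0.], [0.], [1.], [1.], [2.], [2.], [3.], [3.]])  # slowed-down token
other = np.array([[3.], [3.], [2.], [2.], [1.], [1.], [0.], [0.]])  # different token
print(dtw_distance(slow, ref), dtw_distance(other, ref))
```

The slowed-down token aligns to the reference with zero total distance because each input frame can dwell in a node, while the mismatched token accumulates a large distance; a recognizer would pick the word model with the smallest distance.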
Figure 11 shows an example of the time
Classifier               Error (%)   Time Until Convergence (training examples)
k-Nearest Neighbor         18.0         338
Gaussian                   20.4         338
Two-Layer Perceptron       19.8         50,000
Feature Map Classifier     22.8         <50

Fig. 10 - The four classifiers had similar error rates for the vowel classification task, but the required training varied dramatically. The feature map classifier needed less than 50 training examples; the two-layer perceptron required 50,000. (The figure plots the vowel tokens and decision region boundaries against first formant frequency, up to 1400 Hz, and second formant frequency, up to 4000 Hz; the table above lists each classifier's error rate and training requirement.)
Viterbi Nets
Fig. A - One Viterbi net is required for each word. The net's accuracy is greater than 99%, equivalent to the most robust HMM system.
A Viterbi net is a neural network architecture that uses analog parallel processing to implement the temporal alignment and matching score computation performed in conventional HMM recognizers. A block diagram of a simple Viterbi net for one word is shown in Fig. A. Nodes represented by large open triangles correspond to nodes in a left-to-right HMM word model. Connections to these nodes are adjusted to provide maximal output when there is a match between the input spectral pattern sequence and the sequence expected for the word that the net is designed to recognize. A new input pattern is applied at the bottom of the net every 10 ms; the output represents the matching score between the input sequence and the sequence expected for the word used to adjust connection weights. The output is greatest when the input sequence matches the expected sequence.
An example of the outputs of four Viterbi nets - designed to match the words "go," "no," "hello," and "thirty" for one talker in the Lincoln Stress-Speech Data Base - is presented in Fig. B. A test token for "go" was input to each of these nets. The curves plotted are the output of the next-to-last classifier node of each of the nets, not the last node. The last node, or anchor node, matches background noise.
Time in the plot is relative to the beginning of the file containing the word "go." The file contains a background noise interval, followed by "go," followed by an additional background noise interval. The word "go" begins at roughly 160 ms and ends at approximately 580 ms.
The output from the Viterbi net designed to recognize the word "go" is, as expected, the highest. The "no" and "hello" nets provide the next highest outputs, because "hello" and "no" are acoustically similar to "go." The output of the Viterbi net designed for the word "thirty" is much lower than that of any of the others, demonstrating the effectiveness of the nets' discrimination capability.
Fig. B - Four Viterbi nets, each designed for the word labeled on the output chart, were presented with the input word "go." Those designed for words similar to "go" responded with strongly positive outputs; the net designed for the word "thirty" responded with a weak output. (The chart plots net outputs, on a scale of 0 to 0.8, against time from 200 to 800 ms.)
Fig. 11 - Time alignment is the process of matching an input pattern sequence to the stored expected pattern sequence of a vocabulary word. A word model is used to time-align patterns derived from frames in an unknown input with stored expected patterns. (The figure shows an input spectral pattern sequence, frequency versus time, aligned to the stored reference pattern sequence for the word "histogram," with reference patterns numbered 0 through 7.)
alignment process for the word "histogram." The upper part of this figure shows a speech pressure waveform and a speech spectrogram. The spectrogram demonstrates the amount of spectral energy at different frequency regions plotted over time. Each region in the speech waveform has been associated or aligned with one node in the stored word model of the word "histogram," shown at the bottom of Fig. 11. Temporal alignment algorithms perform the alignment and compute an overall word-matching score for each word contained in the vocabulary. The overall score is computed by summing the local matching scores computed by individual nodes in stored word models. This score, in turn, determines the word that the recognizer identifies.
Classical approaches to time alignment include the use of a Viterbi decoder in conjunction with HMM recognizers, dynamic programming for time warping recognizers, and the matched or chirp filters that are used in radar applications. A number of neural net approaches have been suggested for use in time alignment, but few have been tested with large
speech data bases. A new net that has been tested on such a data base is described in the box, "Viterbi Nets." A Viterbi net uses the Viterbi algorithm to time-align input patterns and stored-word patterns for recognizers with continuous-value input parameters [11]. The process of time alignment is illustrated in Fig. 12. The solid line in this figure indicates how input patterns are aligned to stored reference patterns; the large dots represent the possible points of alignment. The Viterbi net differs from other neural net approaches because it implements a slightly modified version of a proven algorithm that provides good recognition performance. In addition, the Viterbi net's weights can be computed by using the forward-backward training algorithm [11]. It has a massively parallel architecture that could be used to implement many current HMM word recognizers in hardware.
The recursion performed in the Viterbi net is similar to the recursion required by the Viterbi algorithm when the node outputs before the first input, y_i(0) for 0 ≤ i ≤ M − 1, are allowed to decay to zero. The recursion is different, though,
because node outputs can never become negative and because initial conditions do not force the first input frame to be aligned with the first classifier node. But a study performed with the Lincoln Stress-Speech Data Base of 35 words with difficult, acoustically similar subsets showed that these differences caused no degradation in performance [14,15]. Weights in Viterbi nets with 15 classifier nodes were adjusted according to means, variances, and transition probabilities obtained from the forward-backward algorithm with five training examples per word for each of the 35 words in the data base. The first and last nodes in these nets served as anchors and were designed to match background noise input. Talker-dependent experiments were performed for each of the nine talkers in the data base, using 13 normally spoken tokens per word for a total of 4,095 test tokens over all talkers.
An HMM recognizer [15] was compared with a modified recognizer that implemented the above recursion. Inputs to both recognizers consisted of 12 mel cepstra and 13 differential mel cepstra that were updated every 10 ms. The HMM word models were trained using multistyle training and grand variance estimates [15]. Performance was almost identical with both algorithms. The error rate with a normal Viterbi decoder was 0.54% (22 errors per 4,095 tokens) and the error rate with the Viterbi net algorithm was 0.56% (23 errors per 4,095 tokens).
A block diagram of the Viterbi net is shown in the box, "Viterbi Nets," Fig. A. Classifier nodes in this net, represented by large open triangles, correspond to nodes in a left-to-right HMM word model. Each classifier node contains a threshold logic node followed by a fixed delay. Threshold logic nodes set the output to zero if the sum of the inputs is less than zero. Otherwise, they output the positive sum. Nodes positioned above classifier nodes are threshold logic nodes, and nodes below the classifier nodes simply output the sum of all inputs. A temporal sequence of T input vectors is presented at the bottom of the net and the desired output is y_{M−1}(T), which is the output of the last classifier node sampled at the last time step T. Here M is
Fig. 12 - Viterbi alignment between the reference pattern sequence for a word model and the unknown input pattern sequence. Frames from the input pattern sequence are time-aligned to a stored pattern sequence for a word model and, in the process, the distance from the input sequence to the word model is computed.
the number of classifier nodes in the Viterbi net. The output is monotonically related to the probability P, which is normally calculated with a Viterbi decoder. P is the probability of producing the applied sequence by traversing the HMM word model corresponding to the net over the best path that maximizes this probability. Although the Viterbi net produces a score that is related to this probability, it does not store the best path.
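The Viterbi net computation described in the box can be sketched in code. The following is a minimal illustration, not the article's implementation: it uses a toy two-node word model, one-dimensional inputs, uniform transition probabilities, and a fixed offset of 1.0 in the Gaussian matching score to stand in for the constant terms a real implementation would fold into the weights. All names (`viterbi_net`, `threshold_logic`, `logA`) are introduced here for the example.

```python
import numpy as np

def threshold_logic(a):
    """f(a) = a for a > 0, else 0 (the node nonlinearity)."""
    return np.maximum(a, 0.0)

def viterbi_net(X, means, variances, logA):
    """Sketch of the Viterbi-net recursion: M left-to-right classifier
    nodes score a T-frame input sequence X (T x N).  Matching scores come
    from a diagonal Gaussian model; logA[k, i] holds ln(a_ki) for the
    allowed transitions k in {i - 1, i}."""
    T, _ = X.shape
    M = len(means)
    y = np.zeros(M)          # node outputs decay to zero before the first input
    for t in range(T):
        # Gaussian matching score per node; the 1.0 offset is an assumption
        # standing in for dropped constant terms.
        s = np.array([1.0 - 0.5 * np.sum((X[t] - means[i]) ** 2 / variances[i])
                      for i in range(M)])
        new_y = np.empty(M)
        new_y[0] = threshold_logic(s[0] + y[0] + logA[0, 0])
        for i in range(1, M):
            stay = y[i] + logA[i, i]            # remain in node i
            advance = y[i - 1] + logA[i - 1, i]  # advance from node i - 1
            new_y[i] = threshold_logic(s[i] + max(stay, advance))
        y = new_y
    return y[-1]             # score from the last classifier node at time T

# Two-node word model expecting pattern 0 followed by pattern 1.
means = np.array([[0.0], [1.0]])
variances = np.ones((2, 1))
logA = np.log(np.full((2, 2), 0.5))             # uniform transition probabilities
match = viterbi_net(np.array([[0.], [0.], [1.], [1.]]), means, variances, logA)
mismatch = viterbi_net(np.array([[5.], [5.], [5.], [5.]]), means, variances, logA)
print(match, mismatch)
```

The matching input builds up a positive output from left to right, while the threshold logic clips the mismatched input's score to zero at every step, mirroring the behavior shown in Fig. B.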
Delays used in the classifier nodes must equal the time interval between changes in the input vectors. Inputs at time step t, denoted x_j(t) (0 ≤ j ≤ N − 1, 1 ≤ t ≤ T), are supplemented by a fixed input x_N(t) equal to 1, which provides offsets to the summing nodes. The values denoted s_i(t) (0 ≤ i ≤ M − 1) represent a matching score between the current input vector and the exemplar input expected in classifier node i. When inputs are assumed to be jointly Gaussian and independent, exemplar classifier node i is defined by mean values m_ij and variances σ²_ij. Weights w_ij can then be assigned to form a Gaussian classifier and compute the matching score required by the Viterbi algorithm:

s_i(t) = Σ_{j=0}^{N−1} 2 m_ij x_j(t) / (2σ²_ij) − Σ_{j=0}^{N−1} m²_ij / (2σ²_ij)

where 0 ≤ i ≤ M − 1. The score is strongly positive if the current input matches the exemplar that is specified by weights on links to classifier node i; the score is slightly positive or negative otherwise.

The Viterbi net uses the following recursion to update node outputs:

y_0(t + 1) = f[s_0(t + 1) + y_0(t) + ln(a_00)]

where 0 ≤ t ≤ T − 1, and

y_i(t + 1) = f[s_i(t + 1) + max_{i−1 ≤ k ≤ i} (y_k(t) + ln(a_ki))]

where 0 ≤ t ≤ T − 1 and 1 ≤ i ≤ M − 1. In these equations, the a_ij terms represent transition probabilities from node i to node j in the HMM word model corresponding to the Viterbi net. The expression f(α) is a threshold logic function [f(α) = 0 for α ≤ 0; f(α) = α for α > 0]. The maximum-picking operation required in this recursion is performed by the five-node subnets above the two right classifier nodes (see box, "Viterbi Nets," Fig. A). These nets output the maximum of two inputs [4]. To improve clarity, the addition of the ln(a_ij) terms to the classifier node outputs is not shown in the figure. This extra term requires adding fixed inputs to all fed-back classifier node outputs.

The recursion increases classifier node outputs successively from left to right. It builds up node outputs when the correct word is presented and the input patterns match exemplar patterns stored in successive classifier nodes. Outputs build up only slightly or remain near zero for acoustically dissimilar words.

The above description models only a simple Viterbi net and the simplest recursion. The recursion and the Viterbi net can be generalized to allow more arbitrary transition matrices than those used by the left-to-right model. In addition, the input data rate to the Viterbi net can be increased, which provides fine temporal resolution but no change in net structure. This feature is a potential advantage of the Viterbi architecture over other approaches.

SUMMARY

Neural nets offer massive parallelism for real-time operation and adaptation, which has the potential of helping to solve difficult speech recognition tasks. Speech recognition, however, will require different nets for different tasks. Different neural net classifiers were reviewed, and it was shown that three-layer perceptrons can form arbitrarily shaped decision regions. These neural net classifiers performed better than Gaussian classifiers for a digit classification problem. Three-layer perceptrons also performed well for a vowel classification task. A new net, called a feature map classifier, provided rapid single-trial learning in the course of completing this task.

Another new net, the Viterbi net, implemented a temporal decoding algorithm found in current word recognizers by using a parallel neural net architecture and analog processing. This net uses threshold logic nodes and fixed delays. Talker-dependent isolated-word tests using a difficult 35-word vocabulary demonstrated good performance (greater than 99% correct) with this net, similar to that of current HMM word recognizers. Current research efforts are focusing on developing sequential adaptive training algorithms that do not rely on the forward-backward training algorithm, and on exploring new neural net temporal decoding algorithms.
For Further Information

A more extensive introduction to neural net classifiers and a more complete set of references is contained in "An Introduction to Computing with Neural Nets" [4]. Various conference papers [10,14,16] describe the experiments reviewed in this paper in greater detail. Papers describing recent work in the field of neural nets are available in the proceedings of the IEEE First International Conference on Neural Networks (San Diego, June 1987), in the proceedings of the IEEE Conference on Neural Information Processing Systems - Natural and Synthetic (Denver, November 1987), and in the journal Neural Networks, which is published by the newly formed International Neural Network Society. The Winter 1988 issue of Daedalus contains many articles that discuss the relationships between neural networks and artificial intelligence, and another recent article provides a survey of neural net models [17].
RICHARD P. LIPPMANN received the BS degree in electrical engineering from the Polytechnic Institute of Brooklyn in 1970 and the SM and PhD degrees from MIT in 1973 and 1978, respectively. His SM thesis dealt with the psychoacoustics of intensity perception and his PhD thesis with signal processing for the hearing impaired. From 1978 to 1981 he was the Director of the Communication Engineering Laboratory at the Boys Town Institute for Communication Disorders in Children in Omaha. Rich worked on speech recognition, speech training aids for the deaf, and signal processing for hearing aids. In 1981 he joined Lincoln Laboratory, where he has worked on speech recognition, speech input/output systems, and routing and system control of circuit-switched networks. Rich's current interests include speech recognition; neural net algorithms; statistics; and human physiology, memory, and learning.
REFERENCES
1. F. Rosenblatt, Principles of Neurodynamics, Spartan Books, New York, NY (1959).
2. T. Kohonen, K. Makisara, and T. Saramaki, "Phonotopic Maps: Insightful Representation of Phonologic Features for Speech Recognition," Proc. 7th Int. Conf. on Pattern Recognition, IEEE, New York, NY (1984).
3. D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing (D.E. Rumelhart and J.L. McClelland, eds.), MIT Press, Cambridge, MA (1986).
4. R.P. Lippmann, "An Introduction to Computing with Neural Nets," IEEE ASSP Mag. 4, 4 (1987).
5. G.R. Doddington and T.B. Schalk, "Speech Recognition: Turning Theory into Practice," IEEE Spectrum 18, 26 (1981).
6. D.B. Paul, "A Speaker-Stress Resistant HMM Isolated Word Recognizer," Int. Conf. on Acoustics, Speech, and Signal Processing (1987).
7. R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley, New York, NY (1973).
8. J. Raffel, J. Mann, R. Berger, A. Soares, and S. Gilbert, "A Generic Architecture for Wafer-Scale Neuromorphic Systems," Proc. First Annual IEEE Int. Conf. on Neural Nets, IEEE, New York, NY (1987).
9. D.B. Cooper and J.H. Freeman, "On the Asymptotic Improvement in the Outcome of Supervised Learning Provided by Additional Nonsupervised Learning," IEEE Trans. Comput. C-19, 1055 (1970).
10. W.Y. Huang and R.P. Lippmann, "Neural Net and Traditional Classifiers," Proc. IEEE Conf. on Neural Information Processing Systems - Natural and Synthetic, IEEE, New York, NY (1987).
11. L.R. Rabiner and B.H. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Mag. 3, 4 (1986).
12. G.E. Peterson and H.L. Barney, "Control Methods Used in a Study of Vowels," J. Acoust. Soc. Am. 24, 175 (1952).
13. D.L. Reilly, C. Scofield, C. Elbaum, and L.N. Cooper, "Learning System Architectures Composed of Multiple Learning Modules," First Int. IEEE Conf. on Neural Networks, IEEE, New York, NY (1987).
14. R.P. Lippmann, W.Y. Huang, and B. Gold, "Neural Net Classifiers Useful for Speech Recognition," Proc. First Annual IEEE Int. Conf. on Neural Nets, IEEE, New York, NY (1987).
15. R.P. Lippmann, E.A. Martin, and D.B. Paul, "Multi-Style Training for Robust Isolated-Word Recognition," Int. Conf. on Acoustics, Speech, and Signal Processing (1987).
16. W. Huang and R.P. Lippmann, "Comparisons Between Neural Net and Conventional Classifiers," Proc. IEEE First Int. Conf. on Neural Networks, IEEE, New York, NY (1987).
17. R.P. Lippmann, "A Survey of Neural Network Models," 3rd Int. Conf. on Supercomputing, International Supercomputing Institute (1988).
18. J. Mann, R.P. Lippmann, R. Berger, and J. Raffel, "A Self-Organizing Neural Net Chip," Neural Networks for Computing Conf. (1988).