FTL REPORT R87-2
AUTOMATED SPEECH RECOGNITION IN AIR TRAFFIC CONTROL
Thanassis Trikas
January 1987
Abstract
Over the past few years, the technology and performance of Automated Speech Recognition (ASR) systems have been improving steadily. This has resulted in their successful use in a number of industrial applications. Motivated by this success, a look was taken at the application of ASR to Air Traffic Control, a task whose primary means of communication is verbal.
In particular, ASR and audio playback were incorporated into an Air Traffic Control simulation task in order to replace blip-drivers, people responsible for manually keying in verbal commands and simulating pilot responses. This was done through the use of a VOTAN VPC2000 continuous speech recognition system, which also possessed a digital recording capability.
Parsing systems were designed that utilized the syntax of ATC commands, as defined in the controller's handbook, in order to detect and correct recognition errors. As well, techniques whereby the user could correct any recognition errors himself were included.
Finally, some desirable features of ASR systems to be used in this environment were formulated based on the experience gained in the ATC simulation task and parser design. These predominantly include continuous speech recognition, a simple training procedure, and an open architecture to allow for the customization of the speech recognition to the particular task at hand as required by the parser.
Acknowledgements
I would like to express my sincere gratitude and appreciation to the following people: Prof. R. W. Simpson, my thesis advisor, for his suggestions and guidance; Dr. John Pararas, for all of his help and encouragement with every stage of this work; my fellow graduate students, Dave Weissbein, Mark Kolb, Ron LaJoie and Jim Butler, for their help and friendship, both in and out of the classroom; and finally, my parents and brothers for their encouragement.
As well, I would also like to thank the NASA/FAA TRI-University Program for funding this research.
Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Application Areas
    1.2.1 ATC Command Recognition: Operational Environment
    1.2.2 ATC Command Recognition: Simulation Environment
  1.3 Outline

2 Automatic Speech Recognition
  2.1 Introduction
  2.2 How ASR Systems Work
  2.3 Recognition Errors
    2.3.1 Categorization
    2.3.2 Factors Affecting Recognition

3 ASR Systems Selected for Experimentation
  3.1 LIS'NER 1000
    3.1.1 Description
    3.1.2 Evaluation and Testing
    3.1.3 Recommendations
  3.2 VOTAN VPC2000 System
    3.2.1 Description
    3.2.2 Evaluation and Testing

4 ATC Simulation Environment: Command Recognition System Design
  4.1 ATC Simulation and Display
  4.2 Speech Input Interface
    4.2.1 ASR System
    4.2.2 User Feedback and Prompting
    4.2.3 Speech Input Parser
    4.2.4 Discussion
  4.3 Pseudo-Pilot Responses
  4.4 Discussion

5 Air Traffic Control Command Recognition: Operational Applications
  5.1 General Difficulties
    5.1.1 Recognition of Aircraft Names
    5.1.2 Issuance of Non-Standard Commands
  5.2 Application Specific Difficulties
    5.2.1 Digitized Command Transmission - Voice Channel Offloading
    5.2.2 Command Prestoring

6 Conclusions and Recommendations
  6.1 Summary
  6.2 Recommendations
  6.3 Future Work
List of Figures

2.1 Block Diagram of Generic ASR system
3.1 LIS'NER 1000 Functional Block Diagram
3.2 Commands used in preliminary ASR system evaluation
3.3 VAX Display for Command entry feedback
3.4 Example of word boundary mis-alignment due to misrecognition errors
4.1 Configuration of ATC Simulation Hardware
4.2 Icon used for display of fixes in the simulation display
4.3 Icon used for display of airports in the simulation display
4.4 Icon used for display of aircraft in the simulation display
4.5 Sample of the ATC Simulation Display on the TI Explorer
4.6 Display format including feedback for spoken commands
4.7 Example of the Finite State Machine logic for the specification of a heading
4.8 Superblock structure of the FSM implemented
4.9 Internal structure of the Aircraft Name Superblock
4.10 Internal structure of the Heading Command Superblock
4.11 Internal structure of the Altitude Command Superblock
4.12 Internal structure of the Airspeed Command Superblock
4.13 Airspeed Command Superblock maintaining original ATC syntax
4.14 Table of ATC Commands used in Pattern Matcher database
4.15 Flowchart of sequencing of VPC2000 functions
5.1 Duplex voice channel
List of Tables

1.1 Table of ICAO Phonetics
3.1 Table of typical words used for ASR evaluation
4.1 Table of discrete messages recorded for Pseudo-pilot response formulation
Chapter 1
Introduction
1.1 Motivation
Since airline deregulation, the amount of commercial air traffic has been steadily increas-
ing. This increase has had two major repercussions.
First, as the amount of air traffic increases, the Air Traffic Control (ATC) system is
rapidly approaching its saturation capacity. Thus, an ever increasing number of aircraft
are being delayed, either on the ground at their originating airport, or in the air at their
destination, until they can be accommodated by the ATC system. These delays, apart
from being annoying from a traveler's point of view, are also the cause of increased fuel
consumption and operating costs of aircraft waiting for take-off clearance or waiting for
landing clearance. Since current air traffic growth trends are expected to continue, a great
deal of study is being made into techniques for increasing the capacity of the ATC system as
well as utilizing the existing capacity more efficiently. These techniques, although they often
only involve procedural changes, almost always introduce a heavy reliance on computers and
automation. Thus the Air Traffic Controller will more and more be forced to interface with
computers in the execution of his everyday tasks in an increasingly automated system[1].
Second, the amount of air traffic for which controllers are responsible is also increasing.
This, in conjunction with the loss of skilled personnel arising from the PATCO strike of
1981¹, means that air traffic controllers are working harder now than ever before. It is an
¹Many people feel that only now is the ATC system beginning to return to the level of expertise and staffing that was prevalent before the PATCO strike.
issue of great concern since this increase in workload could possibly translate directly into a
decrease in safety. In order to help alleviate this increase in workload, some airports increase
the number of active controllers on duty during busy periods and give them each smaller
sectors to control. Still, there are practical limits to this subdivision of sectors or the number
of controllers on duty and for this reason, a greater and greater emphasis is being placed on
automation in the ATC environment in order to reduce workload.
While the number and scope of automation strategies is fairly broad, all of these have
one common factor; the dissemination of information from a human operator, typically the
controller, to a computer. It is here where the increase in automation places the greatest
strain. Speech recognition is a means of alleviating this by providing a simpler controller-
computer interface as well as performance improvements not possible with more conventional
interfaces.
Current input modalities such as the keyboard, special function keys, or a mouse (with
pull down menus), although sufficient for a great number of tasks, can become somewhat
awkward or clumsy in an ATC environment. This is because the primary means of information
transfer in ATC is verbal. Thus, it is conceivable that in some situations, information would
have to be repeated twice, once through speech for humans, and another time through key-
boards for computers. For example, in today's semiautomated system, changes in flight plans
or cruising altitudes have to be transmitted by the controller through the voice channel to
the pilots as well as entered through the keyboard into the computer in order to maintain
the veracity and integrity of the flight plan database. This type of redundancy will become
even more acute as more automation is introduced into the ATC system, with the obvious
adverse effects on controller workload. The problem is more pronounced if the information
must be entered in real-time in order to, for example, reflect the current state of an aircraft
or number of aircraft in the ATC system.
Even if we ignore these real-time strategies and the requirement of redundant information
transfer, speech still has a large number of benefits over more conventional input modalities
[2,3]. It is easier, simpler, less demanding and more natural than other more conventional
input modalities. Furthermore, it requires almost no training on the part of the user in its
use². It also allows the controller to use his eyes or hands simultaneously for other tasks,
thus potentially allowing for multi-modal communication strategies (i.e., simultaneous use of
more than one input modality such as keyboard and speech, or trackball and speech). The
consequences of these factors are that the task may be performed faster or more accurately
or that an extra operator may no longer be required.
These however are not the only possible benefits. Some studies indicate that under certain
circumstances, memory retention in tasks performed using speech is often better than that
using other input modalities. As well, speech is the highest capacity output channel of
humans[4,5,6], yielding roughly a threefold (or more) improvement in data entry rates over
a keyboard in problem solving tasks that require thinking and typing[2]. Thus, there are
significant benefits to utilizing this channel in terms of operator workload reductions.
The goal of the work reported here is not to design an ASR system but instead to use an
off the shelf system, applying it in the context of an ATC environment, in order to explore
the potential benefits and problems in applying ASR to this environment. A secondary goal
is to determine desirable features and requirements for an ASR system designed specifically
for ATC. It is often lamented that one of the problems facing designers of ASR systems is
that they do not have any specific criteria for their design (other than the obvious ones of
low recognition error rates and delays)[2]. Granted, the required or desirable features may be
dependent upon the exact ATC application, but it appears that there are some generalities
that can still be made which could lead to an ASR system well suited for ATC applications
as a whole.
1.2 Application Areas
Technically termed Automated Speech Recognition3 or ASR, the recognition of human
speech by computers is a technology that is widely acclaimed as being "here". Although a
²In practice however, some restrictions to the natural flexibility of speech must still be applied, as shall be
shown later.
³The term speech recognition should not be confused with the term voice recognition. The latter deals with the
recognition of a particular speaker based on his or her vocal patterns while the former deals with the actual
recognition of what the speaker is saying.
lot of work is being and still remains to be done, ASR has already moved out of the realm
of pure research and is being used successfully in industry, where significant operational
benefits have been accrued [7,8,9]. Thus, although each particular application should be
analyzed in its own right in order to determine its specific benefits and pitfalls, it appears
that the significant amount of real-world practical experience and success with using this
type of technology indicate that it is feasible. It is these successes and the rapidly advancing
state-of-the-art technology that have motivated interest in ASR systems and how they can
be used in an ATC environment⁴.
In general, tasks for which ASR should be considered for use are those which either cannot
be accomplished using conventional methods such as the keyboard or trackball, or which in
some way are being inadequately performed currently.
Initial applications of this type of technology in ATC would be in existing data-entry
tasks. These tasks entail replacing or complementing more conventional input modalities,
principally keyboards, with ASR in areas where the sole function is the straightforward entry
of data into the computer[3,14].
There have, for example, already been studies into the use of speech input to replace
keyboards in the flight-strip entry and updating functions [15]. It was found that under
certain traffic conditions at even moderate traffic densities, it was possible for the controller
responsible for maintaining this information to become overloaded. Thus, it would seem
possible that data entry rates could be improved by using speech recognition. Although
these studies demonstrated no significant difference in data entry rates over keyboards, ASR
technology has advanced significantly since 1977 when the study took place and as such it
is likely that improvements are now possible. Of primary significance is the fact that the
recognition system used was a discrete speech system, which is inherently slower than
a continuous speech recognition system. Therefore, it seems plausible that a continuous
speech recognition system would provide improvements in data entry rates. Regardless of
this, it was found that the error rate for entering flight data using ASR was lower than
⁴Although some mention will be given to ASR in the aircraft cockpit, this work deals primarily with applications of ASR to the controller's task. The reader wishing further information on ASR in the cockpit is urged to consult, amongst others, the related articles [10,11,12,13].
that using a keyboard, indicating that there are indeed possible payoffs. In addition, even
if no significant performance improvements can be realized, there is still the issue of which
modality is preferred by controllers.
What will be covered in this work is a broad range of applications that involve the
recognition, by computer, of verbal controller commands currently directed towards aircraft
pilots. This for two reasons. First, this information, as shall be shown later, can be very
useful when made available to automation systems and second, ATC commands, by design,
contain features that are similar to those that yield optimum ASR performance. These
features are as follows.
First, since there is only one user of the ASR system at each ATC sector, a speaker de-
pendent recognition system (to be explained in Chapter 2) can be used. This is advantageous
because speaker dependent systems are inherently more accurate than speaker independent
systems which must recognize speech from a number of different speakers. Different con-
trollers for any given sector can still be readily accommodated with this system simply by
storing their speech data on a floppy or cassette and calling it up when they report onto the
sector.
Second, the procedures used for communication between pilots and controllers are de-
signed to reduce recognition errors made by communication over a possibly noisy radio chan-
nel. Thus, similar sounding words that are easily confused by humans (and thus even more
likely to be confused by an ASR system) have been eliminated. This is exemplified by the
use of the word "niner" instead of "nine" in order to reduce the likelihood of confusion with
the word "five". As well, short words such as the letters of the alphabet, which are also very
difficult to recognize correctly, have been replaced by the "zulu" or phonetic alphabet[16]
(see Table 1.1).
Finally and most importantly, the overall structure of ATC commands, in terms of their
distinctness from one another and the rigid syntax that is used[16], coupled with the fact
that the task is to recognize entire commands as opposed to individual words implies that
there is a lot of additional information that can be brought to bear to aid in the recognition
process.
Table 1.1: Table of ICAO Phonetics
Thus, for example, the ATC command syntax can be used to constrain the input to only
those words that are syntactically valid and thereby reduce errors. This however is not of
help in detecting errors between two syntactically valid words, such as two different numbers.
Although it might seem over-restrictive to rigidly enforce this command syntax, this is not so.
In fact, during training, controllers are forced to adhere fairly well to it, and most continue
to do so throughout their careers.
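The idea of restricting the recognizer to syntactically valid words can be sketched in a few lines. The grammar fragment, word scores, and function below are invented for illustration; they are not the parser described later in this report:

```python
# Hypothetical sketch: a finite-state grammar of ATC phraseology that
# restricts which words the recognizer may accept at each position.
GRAMMAR = {
    "start":      {"turn": "direction"},
    "direction":  {"left": "heading_kw", "right": "heading_kw"},
    "heading_kw": {"heading": "digit"},
    "digit":      {w: "digit" for w in
                   ("zero", "one", "two", "three", "four",
                    "five", "six", "seven", "eight", "niner")},
}

def best_valid_word(scored_candidates, state):
    """Pick the highest-scoring candidate that the grammar allows here.

    scored_candidates: list of (word, score) pairs from the recognizer.
    Returns (word, next_state), or (None, state) if nothing is valid.
    """
    allowed = GRAMMAR.get(state, {})
    valid = [(w, s) for w, s in scored_candidates if w in allowed]
    if not valid:
        return None, state
    word = max(valid, key=lambda ws: ws[1])[0]
    return word, allowed[word]

# After "turn left heading", only digit words are syntactically valid,
# so "left" is rejected here even though it scored higher acoustically.
word, state = best_valid_word([("left", 0.9), ("five", 0.8)], "digit")
```

As the text notes, this filters out syntactically invalid confusions but cannot distinguish between two valid alternatives such as two different digits.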
There is however additional information contained in the rest of a command that provides
further capability for error detection and correction. For example, consider the vectoring
command "TWA turn left heading zero one zero". If the recognized command is "TWA
turn left heading zero five zero" and the aircraft's heading is 040, then the "turn left"
would signal, to the pilot for example, that a mistake has been made somewhere and that
clarification should be requested. This same information that is used by the pilot is also
potentially usable by an ASR system.
Another example of this occurs if the word "descend" is not recognized in a "descend
and maintain five thousand" command. Here, it is still clear, based on the rest of the input,
what the desired action is and this can potentially be used to infer what the un-recognized
word was.
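A consistency check of this kind is simple to state in code. The following is only a sketch (headings in degrees, function name invented), not part of the system built in this work:

```python
def turn_inconsistent(current_hdg, direction, target_hdg):
    """Flag a left/right turn instruction that contradicts the target heading.

    Headings are in degrees (0-359).  A "turn left" whose target heading
    is actually reached by the shorter turn to the right suggests that a
    digit of the heading was misrecognized.
    """
    diff = (target_hdg - current_hdg) % 360   # clockwise distance, 0..359
    if diff == 0 or diff == 180:
        return False                          # direction genuinely ambiguous
    actual = "right" if diff < 180 else "left"
    return actual != direction

# Aircraft heading 040: "turn left heading 010" is consistent,
# but a misrecognized "turn left heading 050" is not.
turn_inconsistent(40, "left", 10)   # consistent -> False
turn_inconsistent(40, "left", 50)   # contradiction -> True
```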
A  Alpha      N  November
B  Bravo      O  Oscar
C  Charlie    P  Papa
D  Delta      Q  Quebec
E  Echo       R  Romeo
F  Foxtrot    S  Sierra
G  Golf       T  Tango
H  Hotel      U  Uniform
I  India      V  Victor
J  Juliett    W  Whiskey
K  Kilo       X  X-ray
L  Lima       Y  Yankee
M  Mike       Z  Zulu
In fact, the human listener uses this same type of information to aid in the recognition
process. This was demonstrated by Pollack and Pickett [17] who showed that roughly 20 to
30 percent of the words from tape-recorded conversations cannot be understood by a given
listener when they are played back individually in random order, even though every word
was perfectly understood in the original conversation.
The actual ways in which the information made available through the recognition of Air
Traffic Controller commands can be used are numerous and will be discussed in the following
sections. They are loosely grouped into two classes called "Operational Applications" and
"Simulation Applications". These involve applications in the current or projected operational
environment and in the simulation environment respectively.
1.2.1 ATC Command Recognition: Operational Environment
By far the simplest application of Air Traffic Control Command Recognition, or ATCCR,
would be to use it in order to provide a memory aid to controllers. Here, ATCCR could be
used to recognize controller commands issued to pilots directly (without the need to type
them in) and display them on a scrolling history of issued commands. This could possibly
be used by the controller in order to determine which commands have already been issued
since, during high workload situations, it is possible to forget these. Because the ASR system
would not be a direct part of ATC operations in this application, recognition errors would
not have any significant effects on the controller's performance of his duties. Thus, this type
of system could be used in order to generate data on the recognition accuracy and error rates
in a real-world ATC environment, as a precursor to the implementation of ATCCR for other
tasks. As well, it would create a database of controller commands issued in a format readily
readable by a computer and would thus allow for computer analysis of different aspects of
controller operating procedure.
Once some practical experience has been gained with ATCCR, a far more ambitious
application can be undertaken. This would be to use speech recognition to allow the computer
system to listen in to the commands issued by the controller and responses issued by pilots.
This information, in conjunction with the other information available to the computer (such
as radar tracks, minimum safe altitudes, restricted zones, etc.) could be used to provide a
backup controller to catch any potentially dangerous situations that might be missed by the
controller. The system would in effect provide for conformance monitoring and conflict alert.
This application however is extremely difficult since it requires not only very good speech
recognizer performance, but also the integration of a large number of different technologies
including such fields as AI and Natural Language Understanding. This task is also greatly
complicated by the difficulties involved in the recognition of pilot transmissions arising not
only from the noise and low bandwidth of the radio channel, but also from the great variability
possible in pilot speech.
As mentioned previously, one of the primary motivating factors for the use of ASR relates
to the increase in automation in the ATC environment. This increase in automation is not
only occurring on the ground, but in the air as well. As more and more new aircraft progress
to "digital" cockpits, it is becoming increasingly obvious that there would be significant
benefits to linking these two systems together digitally, in a format that would allow direct
communication between ground controller, ground computer, airborne computer and pilot, as
opposed to verbally from controller to pilot as is the case now. This could be accomplished
using by using ATCCR to first recognize the controller's commands and then transmit a
representation of these to the specified aircraft⁵. Once received by the aircraft, they could
then be reproduced, for presentation to the pilot, either aurally, using, for example a speech
synthesizer, or visually using a standard display.
With this configuration, one can easily envision a future system where the flight director
in an aircraft would receive commands directly from the controller and his computer and
then, pending acknowledgment and verification from the pilot, execute them[10].
The benefits that can be accrued with this digital link between air and ground are nu-
merous. Most importantly, message intelligibility could be enhanced significantly. Currently,
commands transmitted over a noisy and often over-used radio channel are somewhat difficult
to make out and often result in errors made by pilots. Messages transmitted in a digital
⁵Although the exact method by which this transmission would take place remains to be seen, it could be accomplished using the digital communication capability made possible by Mode S.
format however are less likely to be corrupted by noise. Even if they are, checks can easily
be made as to their integrity. Furthermore, since these commands are now in a format where
they can be readily manipulated by computer, they can be made available for recall by the
pilot in order to avoid the need for the controller to repeat or re-issue commands should the
pilot forget them.
The increased use of this digital link would also greatly off-load the voice channel. This,
would improve the intelligibility of any verbal communications made using it, as well as
increase the effective bandwidth of the controller-pilot communication channel as a whole.
Although more conventional technology in the form of vocoders⁶ could also be used for
command digitization, again, without requiring the controller to key in his command, ASR
possesses significant advantages over these. First, current vocoders operate at a minimum
of about 1200 baud. ASR systems however can reduce this to something on the order of
200 baud if straight recognized text is transmitted, and even lower if this text is further
compressed. Thus, much less of a strain on the bandwidth of the digital link is incurred
using an ASR system.
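The rough bandwidth figures quoted above can be checked with back-of-the-envelope arithmetic. The command length, three-second speaking time, and ten bits per character (serial-line style framing) below are illustrative assumptions, not measurements from this work:

```python
# Back-of-the-envelope comparison of channel rates for one ATC command.
# All figures are illustrative assumptions, not measured values.
command = "twa six three one turn left heading zero one zero"
duration_s = 3.0         # assumed time to speak the command
bits_per_char = 10       # 8 data bits plus start/stop bits

text_rate = len(command) * bits_per_char / duration_s   # about 160 baud
vocoder_rate = 1200      # minimum vocoder rate cited in the text

# Transmitting recognized text needs well under 200 baud here, several
# times less than the vocoder's 1200 baud, before any text compression.
```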
Second, since vocoders simply sample the speech waveform, the only way of recreating
and presenting the digitized information transmitted is to reproduce the original audio signal.
This however results in a format that is unusable by a computer and as such, the issued
command could not be displayed visually to the pilot, or used in any future automation
functions unless it was also keyed into the computer.
In a command digitization and transmission application scenario such as the one de-
scribed earlier, another problem, arising from the mix of "digital" and conventional aircraft,
is created. In particular, it would be necessary for controllers to keep track of which aircraft
must be "spoken" to and which can have keyed commands/information sent to them. If
ASR is used, possibly in conjunction with keyboards and/or other input modalities, then
it would be possible, if he so desires, for the controller to issue commands verbally to all
of the aircraft. It would then be up to the computer to determine the capabilities of the
aircraft being referred to. If it possesses "digital" capability, then the verbal message
⁶Systems that sample a speech waveform and compress it for more efficient transmission.
would be sent digitally. If not, the message could be sent verbally over the radio link, either
through reproduction by a speech synthesizer of some sort (advantages in terms of a distinct
voice over a possibly cluttered radio channel, disadvantages in terms of intelligibility) or by
replaying a recording of the message made as it was previously said by the controller.
A further application of ATCCR, although one for which ATCCR is not essential, is to
allow for the prestoring and automatic (or semi-automatic) issuance of clearances to aircraft.
In practice, the controller can often anticipate what clearances should be issued to aircraft
often minutes in advance. Thus, with an ATCCR system, he could pre-store these clearances
for issue later by the computer and divert some of his attention to other tasks. These
clearances could be transmitted, as described earlier for the digital cockpit scenario, either
digitally or verbally depending on aircraft capabilities.
The actual issuance of these clearances could be accomplished either by simply recalling
the clearance from the computer when it is desired, or by having the computer automat-
ically monitor the specified aircraft to determine when it should be issued. Granted, the
controller might want to validate or acknowledge every command sent to the aircraft by the
computer, but this could be done by simply prompting the user whenever it is time to trans-
mit a command and asking if the command should still be issued. In this way, the ultimate
responsibility still lies with the ATC controller.
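One way the prestoring scheme just described might be organized is sketched below. The class, trigger conditions, and aircraft identifiers are invented for illustration; the confirmation callback stands in for prompting the controller so that responsibility remains with him:

```python
# Hypothetical sketch of a prestored-clearance queue.  The controller
# stores a clearance with a trigger condition; when the condition is met,
# the system asks for confirmation before the clearance is issued.
class ClearanceStore:
    def __init__(self, confirm):
        self.pending = []        # list of (trigger, clearance) pairs
        self.confirm = confirm   # callback that prompts the controller

    def prestore(self, trigger, clearance):
        self.pending.append((trigger, clearance))

    def poll(self, aircraft_state):
        """Check triggers against current state; return clearances issued."""
        issued, remaining = [], []
        for trigger, clearance in self.pending:
            if trigger(aircraft_state) and self.confirm(clearance):
                issued.append(clearance)           # approved: issue now
            else:
                remaining.append((trigger, clearance))
        self.pending = remaining
        return issued

# Example: issue a descent once TWA631 is within 30 nm of the airport.
store = ClearanceStore(confirm=lambda c: True)     # auto-approve for the sketch
store.prestore(lambda s: s["TWA631"]["dist_nm"] < 30,
               "TWA six three one descend and maintain five thousand")
store.poll({"TWA631": {"dist_nm": 42}})   # not yet triggered: nothing issued
store.poll({"TWA631": {"dist_nm": 25}})   # triggered: descent clearance issued
```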
1.2.2 ATC Command Recognition: Simulation Environment
Before any of the aforementioned real-world applications of Air Traffic Controller command
recognition can be studied and evaluated, a facility to demonstrate them in an envi-
ronment typical of Air Traffic Control should be available. For this reason, one of the initial
goals of this work was the incorporation of ASR into an existing real-time ATC simulation.
In investigating the configuration of this simulation, it was readily apparent that it, in
itself, was an ideal application of ATC command recognition. In particular, current real time
ATC simulations use what are called pseudo-pilots or blip drivers. These are people whose
task it is to translate verbal commands from the subject controller involved in the simulation
into typed commands that are keyed into a computer. The use of these people adds to the
cost and complexity of the experiments using the simulation. ASR would allow these people
to be replaced by a direct data path from the subject controller to the simulation computer.
Granted, this is not likely to have the flexibility that is available with a human blip driver,
but the advantages in terms of cost and manpower requirements could possibly outweigh
this.
The other function often performed by blip-drivers is the simulation of pilot responses to
the controller in order to add to the fidelity of the simulation for certain ATC research. This
function can also be replaced by the computer by using computer generated verbal responses.
This technology is much more mature and does not pose the same types of technological
problems posed by ASR. Computer generated responses can be produced in basically two
ways. The first is through a rule-based text-to-speech synthesizer. This takes written text
and through a series of often empirically derived algorithms or rules, specifies the output
of an electronic sound synthesis circuit in order to generate an imitation of human speech.
The major drawbacks of this system are perception or intelligibility problems that arise due
to the flat monotone and "robot" like quality of the speech output. Most systems available
however, provide at least some ability for the user to specify stress and intonation patterns,
in order to make the output more intelligible (see related articles [18,19,20,21,22,23]). Such a
system's advantages lie in its flexibility, in that there is no requirement to know beforehand
exactly what the words or phrases that the system will be required to say are.
The second method for simulating pilot responses utilizes pre-recorded messages and plays
them back in a specified order as required by the user. An example of this kind of system is
the response when requesting a telephone number from Information. This technique results
in messages that are more intelligible than text-to-speech because they are in fact, simply
tape or digital recordings of human speech. It is however, far less flexible in that all possible
messages must be recorded beforehand. As well, it requires a lot of memory to store these
pre-recorded messages in the computer although there are a number of techniques to reduce
this [24,25]. Furthermore, there is the difficulty of introducing intonation and emphasis into
the speech output since this requires that all possible occurrences of these can be predicted
beforehand and suitable messages recorded.
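The bookkeeping for this kind of playback amounts to little more than a lookup table mapping vocabulary words to recordings. The following is a minimal illustrative sketch; the clip names and the lookup table are invented for illustration and do not describe the actual system used in this work.

```python
# Hypothetical mapping from vocabulary words to pre-recorded clips.
CLIP_LIBRARY = {
    "united-airlines": "ual.pcm",
    "one": "one.pcm",
    "climb": "climb.pcm",
    "and-maintain": "maint.pcm",
    "five": "five.pcm",
    "thousand": "thousand.pcm",
}

def assemble_playlist(words):
    """Map a response phrase to the ordered list of recordings to play back."""
    missing = [w for w in words if w not in CLIP_LIBRARY]
    if missing:
        raise KeyError("no recording for: %s" % ", ".join(missing))
    return [CLIP_LIBRARY[w] for w in words]
```

Every possible word must already exist in the library, which is precisely the inflexibility noted above.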
Recognition errors can be handled much more easily in the simulation environment. If one
occurs, the computer can simply respond with a verbal "Say again." message thus adding
even more to the realism of the simulation. Note that recognition error rates must be kept
fairly low since a system that responds with "Say Again." too often is not very practical.
1.3 Outline
As has been shown, there is quite a variety of possible applications arising from the use
of ASR to recognize Air Traffic Controller verbal commands. Some of these, although still
potentially useful, might turn out to be impractical especially in light of current technology.
Thus, this work will explore these applications in more detail.
Before commencing on a description of the work performed and the results obtained, it
is important to first define some terms and concepts dealing with ASR systems. These are
covered in Chapter 2. The reader wishing more detailed information is urged to consult the
references.
In Chapter 3, the ASR systems that were purchased and used are described with particular
reference to the features that were found to be desirable.
The work performed was concerned predominantly with the development of both speech
input and output capabilities for an ATC terminal area simulation. In Chapter 4, the imple-
mentation of this task is detailed. In particular, the design of a system for ATC command
recognition (ATCCR) will be presented. Integral to this are such things as the methods
used to incorporate ATC syntax requirements and constraints as well as the handling of
recognition errors, both their detection and correction.
In Chapter 5, some of the operational applications of ATCCR alluded to earlier are
revisited and analyzed in greater detail as to their feasibility and possible shortcomings,
based on the experience gained from the Simulation application.
Finally, Chapter 6 outlines the conclusions of this work, along with suitable recommenda-
tions for further work both with the existing hardware as well as with new technology ASR
systems that are currently appearing on the market.
Chapter 2
Automatic Speech Recognition
2.1 Introduction
The basic purpose of a speech recognition system is to recognize verbal input from some
pre-defined vocabulary and to translate it into a string of alphanumeric characters. Before
beginning a description of how these systems work however, it is first important to present
some general categorizations of the systems that are available. In general, ASR systems can
be categorized based on three different features and capabilities. These are:
1. Speaker Dependence/Independence
2. Discrete/Connected/Continuous Speech Recognition
3. Vocabulary Size
The first of these deals with whether or not the system is designed to be used by only
one speaker at a time. If it is speaker dependent, then it must be trained to a particular
user's vocal patterns, typically by having him repeat to the system all of the words that
are desired for it to recognize. With speaker independent systems, there is no need for this
extensive training procedure because some basic information about how the words in the
vocabulary are spoken is usually incorporated directly into the system. In general however,
the speaker dependence distinction is one that is closely related to accuracy. A speaker
dependent system can be made somewhat speaker independent simply by having multiple
users train the system to their voices. Thus, it would possess data from a spectrum of speakers
and should in theory be able to recognize speech from any speakers with roughly similar vocal
patterns. If it possessed sufficient accuracy, then it could be termed speaker independent. Its
accuracy however, would tend not to be as good as a system that was explicitly designed for
speaker independence.
The second categorization deals with the type of speech input that is allowable. For
discrete speech recognition systems, it is assumed that the words or utterances contained in
the vocabulary will be spoken with a brief period of silence in between. This period of silence
is typically on the order of 150 to 200 msec long and is used to delineate utterances, thereby
allowing a simpler and more accurate recognition algorithm to be implemented.
These utterances are in general not restricted to being single words. In fact, they can be
entire phrases. Individual words contained in these phrases however cannot be recognized
unless they have been trained as such.
Connected word recognition systems, however, impose fewer restrictions on the user in
that these periods of silence need not occur after every word. Thus, the user can run words
together during his speech. Every so often however, a pause must still be included (the actual
recognition of the speech does not commence until this is detected). This is not much of a
problem since normal speech tends to be liberally sprinkled with these pauses.
Continuous speech recognition systems provide the most flexibility in how the user speaks.
With them, there is no requirement for the user to pause anywhere during speech input.
Unlike connected systems, recognition is performed as the words are spoken. Thus, it is
possible for words at the beginning of a stream of continuous speech to be recognized before
the user is finished talking or has paused.
The third categorization deals with how many words can be recognized by the ASR
system. This varies a great deal from system to system. In general, the limiting factor in
vocabulary size is the inherent accuracy of the recognition system. The higher the accuracy,
the larger the vocabulary possible. This is why speaker independent systems which, as a
rule, possess lower recognition accuracies than comparable speaker dependent systems, have
smaller vocabularies.
With some systems, every word to be recognized must be explicitly trained by the user.
This can be very cumbersome and time consuming for large vocabularies and thus limits
the practical size of the vocabulary to roughly 100 words or so. Other systems however,
use training procedures that do not require each word to be explicitly trained. With these
systems, the vocabulary size is often in the hundreds or thousands of words.
In defining vocabulary size, there is another factor to consider. This involves the capability
of some systems to activate only certain sections of the entire vocabulary. Hence, a better
indicator of performance is the size of the active or instantaneous vocabulary. Clearly the
larger the active vocabulary, the greater the likelihood of recognition errors and the larger
the recognition delays since more comparisons must be performed.
2.2 How ASR Systems Work
Although the actual details of how ASR systems work vary a great deal from system
to system, their basic internal structure is very similar. It consists of a Feature Extractor, a
Recognizer and a Vocabulary Database as indicated in Figure 2.1.
The feature extractor is basically responsible for analyzing the incoming speech input
signal and extracting data from it in a format that can be used by the recognition algorithm.
The recognition algorithm then takes this data and compares it to data in the vocabulary
database in order to determine which, if any, word was said.
The vocabulary database contains all of the words that can be recognized by the ASR
system. It is created by having the user train or enroll onto the system or through purely
theoretical means. In the simplest training procedure, the user simply repeats, a given
number of times, all of the words contained in the vocabulary so as to provide the ASR
system with information as to how these words "look" when spoken. This information is
then used to create a set of reference patterns or templates each one describing a particular
word. Other training procedures however, are much simpler and only require the user to
read a few paragraphs of text aloud from which the ASR system extracts information about
how the user articulates his words and in this way, generates the required templates.
The extraction from the input signal of data used to generate these templates is the
Figure 2.1: Block Diagram of Generic ASR system.
primary responsibility of the Feature Extractor. The simplest form of feature extraction
is to sample the incoming speech signal. Since the bandwidth of human speech is roughly
4kHz, this implies a minimum sampling rate of 8 kHz. At this rate, 8 kbytes of data
(assuming 8 bit quantization of the data) are produced for every second of speech. This
creates serious problems both in terms of memory requirements as well as recognition delays
(it takes a long time to process this much data). For this reason, alternative techniques are
used in order to reduce the data rates required.
One of the simplest and most successful of these takes advantage of the fact that the
frequency spectrum of the speech signal, although it varies in time, does not vary quickly.
Thus, if the signal is passed through a bank of bandpass filters in order to determine its
spectrum, these can be sampled at rates much slower than 8 kHz (typically at rates near 100
Hz).
Another successful technique to reduce data rates is to use Linear Predictive Coding or
LPC [26]. Here, an estimate is made of the present value of the input signal based on a linear
combination of the last n values, in conjunction with an all pole model of the vocal tract.
The output of this system is then related to the coefficients that minimize the estimation or
prediction error and again produces data at a rate of roughly 100 Hz.
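The Levinson-Durbin recursion on the autocorrelation sequence is one common way to obtain LPC coefficients of this sort; the sketch below is a modern illustration of the idea and not necessarily the exact algorithm used in the systems discussed here.

```python
import random

def autocorr(x, order):
    """Autocorrelation lags r[0..order] of a signal x."""
    return [sum(x[i] * x[i + k] for i in range(len(x) - k))
            for k in range(order + 1)]

def lpc(x, order):
    """Levinson-Durbin recursion: coefficients a[1..order] such that x[n]
    is predicted by -sum(a[j] * x[n-j]); returns (a, residual energy)."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # prediction error shrinks each step
    return a, err

# Synthetic check: a first-order autoregressive signal
# x[n] = 0.9 * x[n-1] + noise should yield a[1] close to -0.9.
random.seed(1)
x = [0.0]
for _ in range(5000):
    x.append(0.9 * x[-1] + random.gauss(0.0, 1.0))
a, err = lpc(x, 2)
```

In a real recognizer this analysis is repeated on short overlapping windows of speech, so the coefficient stream emerges at roughly the 100 Hz frame rate mentioned above.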
In some systems, this data is further processed to produce a more compact representation.
For example, in vector quantization, each data sample (it is usually a vector of data) is
compared to a set of standard reference frames and replaced by a symbol associated with the
frame that best matches it. Thus, the output of the feature extractor can be transformed
into a sequence of symbols.
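Vector quantization itself is a simple nearest-neighbor lookup; a minimal sketch, with an invented two-dimensional codebook standing in for the set of standard reference frames:

```python
def quantize(frames, codebook):
    """Replace each feature vector by the index (symbol) of the nearest
    codebook entry, using squared Euclidean distance."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda i: d2(f, codebook[i]))
            for f in frames]

# Illustrative reference frames; real codebooks are learned from speech data.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
symbols = quantize([(0.1, 0.1), (0.9, 0.2), (0.2, 1.1)], codebook)
```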
In a slightly more complex system, the incoming speech signal undergoes more extensive
processing in order to recognize the actual phonemes that it contains. Phonemes are the
different sounds that are made during speech (eg; "oo", "ah", and so on) and form a set of
basic building blocks for speech. Both the number and the actual phonemes themselves differ
somewhat from language to language but they are relatively constant for a given language.
There are roughly forty different phonemes contained in the English language [24]. Thus,
the speech signal can be characterized by roughly forty different symbols, and the data
rates generated by the feature extractor are very low, on the order of 50-100 Hz. Feature
extractors of this sort are potentially more accurate in that they try to extract the same
features from the speech signal that the user consciously tries to reproduce when he says a
word.
In any case, no matter what the data output by the feature extractor actually represents,
there is a lot of similarity in how it is subsequently processed.
In a discrete speech recognition system for example, the data output by the feature
extractor is saved in a buffer until the end of a word, as indicated by a short period of silence,
is detected. This yields a matrix of data, assuming that the feature extractor outputs data
in the form of frames or vectors, one axis of which corresponds to time. This matrix
or pattern is then compared to patterns contained in the Vocabulary database that were
generated in a similar manner while the user was training the system. Here, a problem is
readily evident. Because different words and even different vocalizations of the same word
are different lengths, a common reference for comparison must be found. A simple way to
accomplish this is to time-normalize the patterns, so that they are of uniform length, by
merging adjacent frames together or by interpolating between them as required. If this is
done uniformly along the length, or time axis, of the matrix or pattern, then it is termed
linear time-normalization.
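Linear time-normalization can be sketched as follows; the frame format (a list of equal-length feature vectors) is an assumption made for illustration:

```python
def linear_time_normalize(pattern, target_len):
    """Resample a sequence of feature frames to target_len frames by
    linear interpolation along the time axis (assumes target_len >= 2)."""
    n = len(pattern)
    if n == 1:
        return [pattern[0][:] for _ in range(target_len)]
    out = []
    for i in range(target_len):
        t = i * (n - 1) / (target_len - 1)   # position in the original pattern
        lo = int(t)
        hi = min(lo + 1, n - 1)
        frac = t - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(pattern[lo], pattern[hi])])
    return out
```

Whatever their original durations, two utterances normalized this way have the same number of frames and can be compared entry by entry.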
Now that the two patterns are the same length, they can be compared to determine
whether or not they match. The actual method of doing this again varies from system
to system. The simplest involves measuring the norm of the difference between the two
matrices in order to compute the distance between them. The most common norm used
is the Euclidean or 2-norm which is simply the square root of the sum of the squares of
each entry in the difference matrix. There are in addition, other more complicated and
computationally intensive methods of distance measurement, but these are described elsewhere
[26,24,27]. If the distance between these two patterns is lower than a prespecified threshold,
then a match is declared. This threshold test prevents random noises such as doors slamming
and phones ringing from creating false recognitions.
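The Euclidean comparison and the threshold test together amount to a few lines; the template names and threshold value below are illustrative assumptions:

```python
import math

def euclidean_distance(p, q):
    """2-norm of the difference between two equal-length patterns,
    each a list of equal-length feature frames."""
    return math.sqrt(sum((a - b) ** 2
                         for fp, fq in zip(p, q)
                         for a, b in zip(fp, fq)))

def best_match(pattern, templates, threshold):
    """Return the name of the closest template, or None if even the best
    distance fails the threshold test (e.g. a door slam or a phone ring)."""
    best_name, best_d = None, float("inf")
    for name, tmpl in templates.items():
        d = euclidean_distance(pattern, tmpl)
        if d < best_d:
            best_name, best_d = name, d
    return best_name if best_d < threshold else None
```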
The performance of systems utilizing linear time normalization however decreases dra-
matically as the size of the vocabulary increases (they are typically confined to vocabulary
sizes on the order of 20 to 40 words). This is primarily because the time normalization
procedure obscures a lot of the features of the spoken word. Furthermore, if one examines
an utterance closely, it can be seen that when its length changes, it does not do so uniformly
(linearly) along the length of the word. Consider the word "five". If the duration of this
word is increased as it is spoken, it can be seen that most of the stretching occurs in the "i"
sound and not the "f" and "v". In order to more readily account for this phenomenon, a
non-linear time normalization technique is used. With this technique, time normalization is
accomplished by aligning features found in the reference and input patterns in such a way
as to obtain the best match. Since the number of possible non-linear time alignments can
be quite numerous, dynamic programming techniques are used to eliminate some of these
and thereby reduce the computational complexity of this algorithm. For this reason, this
technique is often termed Dynamic Time Warping (DTW).
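The dynamic programming recurrence underlying DTW can be sketched directly; the toy scalar features below are an invented illustration of the "five" example:

```python
def dtw_distance(seq_a, seq_b, dist=lambda a, b: abs(a - b)):
    """Dynamic Time Warping: cost of the best non-linear time alignment
    between two sequences of features."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # stretch seq_b
                              D[i][j - 1],      # stretch seq_a
                              D[i - 1][j - 1])  # advance both
    return D[n][m]

# "five" spoken slowly stretches the "i"; the warped distance stays small,
# while a genuinely different utterance does not align well.
slow = [1, 2, 2, 2, 3]   # f, i-i-i, v (toy scalar features)
fast = [1, 2, 3]
```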
DTW not only yields greatly improved results for discrete speech, but it can readily be
extended to both connected and continuous speech recognition. The process of extending
this procedure to connected speech is fairly straightforward and consists of finding a similar
non-linear alignment, but this time relating the entire spoken input phrase to a super-pattern
consisting of the "best" sequence of reference patterns. This "best" sequence of words is then
the recognized phrase.
This procedure is modified in continuous speech recognition systems so that non-linear
time alignments between reference and input templates are calculated as the speech comes
in. In this way, when the score of the comparisons using this alignment dips below a certain
threshold, a match is declared and the normalization procedure is begun anew from the point
where this current matched pattern ended¹. Thus, the principal difference between this and
connected speech recognition lies in where recognition events occur.
With these systems however, some problems arise due to the differences between words
when spoken in discrete as opposed to continuous or connected speech. First of all, words
tend to be shorter when spoken as part of a continuous stream than when spoken individually.
¹A good explanation of this procedure is given in [28].
The resulting differences in length are sometimes too great for the recognition algorithm
to handle and thus errors result. Furthermore, the actual articulation of adjacent words is
sometimes changed significantly due to the slurring together of words. This phenomenon,
termed co-articulation often results in sounds that were not part of either individual word. A
good example of this occurs when the words "did you" are spoken quickly. The result sounds
more like "dija" than anything else and unless allowances are made for this, it is certain to
cause recognition errors.
In order to attempt to take this into account, some systems use what is termed embedded or
in phrase training. With this, vocabulary words are trained as part of a stream of continuous
speech in order to include co-articulation effects on word boundaries. This however is not
very general since these effects are to a great deal dependent on exactly what the surrounding
words are and it is unrealistic to train for all possible word combinations.
It is in order to account for some of these variations in how words are said that other
procedures are being used as well². These include such techniques as statistically based DTW
[29] as well as a process known as Hidden Markov Modeling (HMM) [30,31].
In a standard Markov Model, the various vocalizations of a word are used in order to con-
struct a finite state machine type structure where each state is associated with a particular
data frame or feature and each branch with the probability of receiving that feature. With
HMM however, this one-to-one correspondence between states and features is eliminated. In-
stead, each state is probabilistically associated with a number of features. Since assumptions
are no longer made about exactly which features are required in the input, and the actual
feature (or state) sequence is hidden, the number of states that are required to represent an
utterance can be reduced without a large degradation in performance. This also allows for
some errors to be made during the feature extraction process. The training procedure for
a system using HMM however is quite time consuming since quite a few repetitions of each
word must be used in order to evaluate the probabilities associated with each of the branches
of the finite state machine.
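The essential computation with such a model is the likelihood of an observation sequence summed over all hidden state paths, the so-called forward algorithm. A minimal sketch, with a toy two-state model whose numbers are purely illustrative:

```python
def forward_likelihood(obs, pi, A, B):
    """Likelihood of an observation (symbol) sequence under a discrete HMM:
    pi[s] initial probabilities, A[s][t] transitions, B[s][o] emissions.
    The state sequence itself stays hidden; we sum over all of them."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return sum(alpha)

# Toy two-state model; probabilities are invented for illustration.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
```

Training amounts to estimating pi, A, and B from many repetitions of each word, which is why the procedure is time consuming.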
²Only the gross variabilities have been presented so far. There are other, somewhat smaller, but potentially equally significant sources of variability and these will be discussed in Section 2.3.2.
Thus, it can be seen that there are quite a few different techniques for recognizing speech.
This section has presented a brief introduction into some of these in order to provide the
reader with some necessary background. If more detailed information is desired, then addi-
tional references should be consulted.
2.3 Recognition Errors
2.3.1 Categorization
Any speech recognition system, even human, is certain to make at least some recognition
errors. The difference between systems lies however in the rate at which these errors occur
and in the inherent capabilities for recovering from them and correcting for them. In general,
recognition errors fall into three major categories.
1. Mis-recognition errors are those in which one word is mistaken for another. This also
implies that the comparison successfully passed any recognition threshold tests.
2. Non-recognition errors are those in which spoken words, that are members of the
current active vocabulary, are not recognized at all. This is usually because the utter-
ance, when spoken, is sufficiently different from that as trained so that the recognition
threshold is not passed. Some reasons why these differences arise will be discussed
later.
3. Spurious-recognition errors are those in which the ASR system indicates that a word
was spoken when in fact none was. These typically arise when extraneous noises are
mistaken for speech.
In general, the occurrence of these types of errors is very dependent on both the particular
recognition system that is being used as well as its operating parameters such as recognition
thresholds.
2.3.2 Factors Affecting Recognition
More important than the types of recognition errors are the questions of how and why
they occur and what affects their frequency. It is by understanding these issues that error
rates can be reduced. In general, since ASR systems are simply pattern matchers, it is
obvious that anything creating differences between these patterns as they are trained and as
they are produced during speech will increase the error rates.
These factors can range from stress, high workload, and nervousness on the part of the
user when speaking to the system to such things as day to day variations in his speech
possibly arising from such things as fatigue and colds. Some systems try to counter some
of these day to day variations by putting the speakers through a short enrollment session
each time they begin to use the system. In this, the user simply reads a short phrase prior
to using the system in order to allow the recognition system to adapt to how he is speaking
that particular day or session.
Other factors affecting recognition accuracy include environmental or background noise.
This can affect recognition accuracy in three basic ways. First, it can lead to spurious
recognitions through the ASR systems' mistaking of these sounds for valid speech input.
Second, users might be forced to change their articulation in order to compensate for and
be heard over background noises, and third, the noise might actually corrupt the speech
signal itself and mask a lot of information. While some systems simply use a noise canceling
microphone to counteract these, this is sometimes not sufficient and techniques to more
directly account for background noise must be incorporated into the recognition algorithm
itself.
The type of microphone used can have even further effects on recognition accuracy [37,38].
In particular, microphones with a poor frequency response or highly non-linear or time vary-
ing response will greatly affect the quality and constancy of the signal made available to the
ASR system. The end result might be that not enough "clean" signal is available to the ASR
system for it to accurately discern between the utterances spoken.
To a great extent, it is the training procedure itself that produces a lot of the differences
between templates and the words as they are usually spoken. This is because users often train
the vocabulary words in a way that is significantly different from the way that they actually
say them. This is termed the training effect and is caused by nervousness or hesitance on
the part of the user. Furthermore, they are often simply reading words off a list and this
results in different pronunciations of words. Granted, one would desire a system that is not
sensitive to variations this small in the way that words are spoken; unfortunately, most
systems, especially speaker dependent continuous speech ones, are.
As mentioned earlier, speaker dependency is closely tied to recognition accuracy. In
general, speaker independent systems are much less sensitive to the types of variations
mentioned earlier, but are consequently also much less accurate than speaker dependent
systems since they must accommodate a much broader range in how words are said. These
variations come not only from pitch and inflection changes from user to user, but also from
dialect and accent. In order to keep error rates reasonable, these systems tend to confine
themselves to small vocabularies. Conversely, speaker dependent systems are trained by the
eventual user and hence know with much greater accuracy, how each of the words said by
the user would appear. This is analogous to a human's ability to recognize more easily the
speech of someone with whom they are familiar.
Chapter 3
ASR Systems Selected for
Experimentation
In selecting an ASR system for this research, a two step approach was taken. First,
an inexpensive, low performance, system was purchased. This was done in order to give
a better insight into ASR technology so that the requirements and desirable features of a
higher performance system to be used in subsequent research and development could be more
accurately defined.
The goal of this work, as stated previously, was to obtain some practical information
about the incorporation of ASR in ATC. It was not to test and document the performance
of a number of ASR systems currently on the market. As such, the evaluation details and
results are presented in an exploratory, qualitative, rather than a quantitative manner. It
was felt that this would give the reader a better idea of the problems typically encountered
using this type of technology, without creating a false sense of confidence in performance
figures which are, after all, highly subject to a number of factors and difficult to duplicate
from test to test.
3.1 LIS'NER 1000
3.1.1 Description
The ASR system purchased for initial evaluation was the LIS'NER 1000 system produced
by Micro Mint of Cedarhurst NY. This system consists of a plug-in card, with appropriate
Figure 3.1: LIS'NER 1000 Functional Block Diagram.
software, for the APPLE II family of home computers and cost $250 at the time of purchase
in May 1985.
The LIS'NER 1000 ASR system is a speaker dependent, discrete word recognition sys-
tem[27,32,33]. Total vocabulary is 64 words or utterances.
A block diagram of the system hardware is shown in Figure 3.1. Its basic operation is as
follows. First, the signal from a headset mounted electret microphone is filtered to prevent
aliasing and remove low frequency biases. It is then digitized by an A-to-D and sampled by
the SP1000 chip at a rate of 6.5 kHz. The SP1000 uses this incoming signal to generate LPC
data. This LPC data is organized in frames, each frame consisting of 8 LPC and one energy
parameter, and is made available to the APPLE at a rate of 50 Hz. It is this data that is
used by the system in the recognition process.
During normal operation, a value for the background noise is constantly being monitored
by looking at the energy level of the incoming signal. A significant increase in the energy of
the incoming signal (6 db) signals the start of an utterance. All subsequent LPC data from
the SP1000 is then saved in a buffer until the end of the word is detected. This is specified
by a period of silence (roughly 200 msec) determined, again, by looking at the energy of the
incoming signal. The resulting data is compressed or time normalized into a block of data
12 frames long to allow for a more uniform means of comparison as well as to minimize the
amount of data that must be stored. This compression is accomplished by merging together
any adjacent frames that are very similar. Thus, any "interesting" features of the waveform
are preserved.
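The compression step can be sketched as a greedy merge of the most similar adjacent frames; this is an illustrative reconstruction of the idea described above, not the LIS'NER 1000's actual code:

```python
def compress_to(frames, target_len):
    """Repeatedly merge (average) the most similar pair of adjacent frames
    until at most target_len frames remain, so that dissimilar,
    'interesting' frames survive the longest."""
    frames = [f[:] for f in frames]
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    while len(frames) > target_len:
        i = min(range(len(frames) - 1),
                key=lambda k: d2(frames[k], frames[k + 1]))
        merged = [(a + b) / 2.0 for a, b in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames
```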
What is then done differs depending upon whether the user is in recognition or training
mode. In training mode, this data is averaged with data from previous vocalizations of the
same utterance to form a template. Currently, the software requires that each utterance
be repeated twice during training. When all of the utterances in the vocabulary have been
repeated twice, the training phase is finished.
In recognition mode however, the resulting data is compared to the vocabulary templates
to find the best match. This comparison is performed using dynamic time warping and a
Euclidean norm as a distance measure. The template that possesses the shortest distance is
the one that is selected as the best match. This distance however, must be less than a
maximum threshold (termed the acceptance threshold). This threshold provides a trade-off
between unrecognized words and false recognitions. If it is too low, then valid utterances
will not be successfully recognized. If it is too high, then utterances not in the vocabulary or
spurious noises will be misrecognized as valid words.
Since the time to recognize a word depends directly on the number of words or templates
in the vocabulary, a small trick is used to reduce the search time. This involves examining
the distance measure as it is computed for each template. If the distance is less than a certain
prespecified threshold (termed the lower threshold), then that template is treated as the best
match and no further computations are performed. As well, if the distance is greater than a
third threshold value (termed the upper threshold), all further comparisons to the utterance
are stopped. This hopefully allows the system to quickly disregard spurious noises such as
doors slamming, phones ringing, and so on since these will likely result in distances that
are greater than the upper threshold in almost all cases. Those cases of spurious noise that
do pass this test however, will still not likely cause recognition errors due to the acceptance
threshold test.
A useful feature incorporated into this system is the ability to divide the entire trained
vocabulary into what are termed groups, by assigning every trained utterance to a particular
group. Using these, an active vocabulary, that is to say, the vocabulary of words or utterances
which is searched through during the recognition algorithm, can be reduced to a subset of the
total trained vocabulary. Once a word is recognized, a search byte for the group containing
the recognized word is used to specify which groups comprise the new active vocabulary.
Since reduction of the size of the active vocabulary reduces the number of comparisons that
must be made by the recognition algorithm it has the potential for decreasing recognition
delays as well as reducing the probability of mis-recognition errors. This grouping structure
however, is not entirely arbitrary as each trained word can occur only in one group and a
search byte can only be specified on a group by group instead of on a word by word, basis.
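With the search byte modeled as a bitmask over groups, the scheme can be sketched as below; the vocabulary, group assignments, and search bytes are invented for illustration:

```python
# Each trained word belongs to exactly one group; each group carries one
# search byte naming the groups to activate after a word from it is heard.
WORD_GROUP = {"United-Airlines": 0, "one": 1, "climb": 2}
GROUP_SEARCH_BYTE = {0: 0b010,   # airline heard -> listen for digits
                     1: 0b100,   # digit heard   -> listen for commands
                     2: 0b010}   # command heard -> listen for digits

def active_vocabulary(mask):
    """Words searched by the recognizer: those whose group bit is set."""
    return [w for w, g in WORD_GROUP.items() if mask & (1 << g)]

def next_mask(word):
    """The recognized word's group selects the new active vocabulary."""
    return GROUP_SEARCH_BYTE[WORD_GROUP[word]]
```

Shrinking the mask shrinks the number of template comparisons, which is the source of both the speed and the accuracy gains noted above.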
3.1.2 Evaluation and Testing
There were basically two modes of testing that were performed on the LIS'NER 1000. The
first involved the straightforward training of a particular vocabulary and subsequent testing
of the recognition accuracy and other parameters on a word by word basis while the second
involved the recognition of entire sequences of words, in sentences typical of ATC commands.
Word Entry
The primary goal of testing on a word by word basis was to study basic parameters
of interest such as recognition speeds, delays, and accuracy. This was accomplished by
having the user talk to the LIS'NER 1000 and having the APPLE display on its screen a
representation of the recognized utterance.
The testing was performed with a value of zero for the lower threshold. This meant that
the distance between any template and an utterance had to be less than zero, clearly an
impossibility, for the template to be declared a match and all further comparisons stopped.
Thus it forced the recognition algorithm to search through the entire active vocabulary. As
United-Airlines one descend Alpha Mike Yankee
TWA two climb Bravo November Zulu
Air-Canada three and-maintain Charlie Oscar
Piedmont four turn Delta Papa
Lufthansa five left Echo Quebec
Republic six right Foxtrot Romeo no
US-Air seven heading Golf Sierra over
eight fly Hotel Tango check
Manjo niner airspeed India Uniform cancel
Celts zero cleared-for-final Juliett Victor delete
Boston hundred feet Kilo Whiskey enter
Revere thousand degrees Lima X-Ray execute
Table 3.1: Table of typical words used for ASR evaluation
well, the "grouping" feature of the LIS'NER 1000, which allows for a reduction in the size
of the active vocabulary, was disabled. This was done in order to get a better idea of the
actual recognition rates with a well-known, constant-size vocabulary.
The testing itself consisted simply of speaking words that were trained beforehand. These
words were selected as being typical of the ATC vocabulary. A list of some of the words used
can be seen in Table 3.1. This list is by no means an exhaustive list of all the words that were
trained and tested, but it is indicative of the types of utterances for which recognition
information was desired. Note that the convention used throughout this report is to treat hyphenated
words as one utterance. That is to say, they are trained as one word and the ASR system
will not recognize them individually unless they are also trained individually. For example,
the utterance "United-Airlines" is trained as one word. Thus, the words "United" and
"Airlines" cannot be recognized as separate words unless they are also trained as such.
The recognition accuracy of this system was found to vary not only with size of the
vocabulary, but also with its content. It was very common for recognition errors to be
made between two words that sounded similar, such as "fly" and "five", but what was not
expected was the large number of errors (mis-recognitions) that occurred between words that
did not sound similar or were not even the same length. This was primarily due to the data
compression algorithms used which tended to obliterate word features.
The recognition accuracies themselves were on the order of 70% to 80%. These values
varied however depending on the actual words contained in the vocabulary and which speakers
were using the system, often dropping to as low as 60% for some users. Best results were
typically obtained with "loud", confident speakers who articulated clearly as opposed to
"quiet", timid speakers. It is also interesting to note that better results were obtained for
multi-syllabic or long words. This is probably a direct result of the time normalization and
compression algorithms and their effect on the data quality. In particular, longer words
possess many more features than do shorter ones and it is less likely for these to be "lost"
during data processing and compression.
These figures do not, however, take into account the significant number of recognitions
triggered by background noise (conversations, telephones ringing, etc.). In fact, background
noise alone was in some cases responsible for the degradation of recognition accuracy to the
neighborhood of 40%. The principal cause of this was that other than placing the microphone
close to the user's mouth where the signal magnitude was likely to be much larger than
the noise magnitude, there was no attempt made by the ASR system to compensate for
external noise. In general, best results were obtained by training the system in very quiet
surroundings and then moving it to somewhat noisier ones for testing and
evaluation. Background noise, however, need not be a problem if noise canceling mikes are
used, as shall be shown in the testing described in later sections.
The system was also found to be very sensitive to differences in the pronunciation of
trained words. This arose in two different contexts. First, the user often spoke differently
when training the system than when testing its recognition performance. Thus, the templates
generated during training were significantly different from those generated during actual
speech input tests. This effect however was reduced, although not eliminated, as the users
became more confident and familiar with the system.
Second, changes in emotional state, health or intent colored words and this resulted in
different vocalizations thus making the utterances unmatchable. In this respect, the greatest
problem occurred when a recognition error was made and the operator had to repeat the word.
The natural tendency was to repeat it in a much slower manner, articulating each syllable
clearly, as would be done when repeating something to a person. Although this may make it
easier for another person to recognize what was said, it changes the acoustical pattern of the
utterance greatly and actually degrades the performance of the speech recognition system.
Thus, the user had to consciously force himself to maintain a consistent enunciation of the
vocabulary both during training and while using the system. This was especially difficult
with this system due to the frustration factor. This is a positive feedback effect, common
with the lower performance recognition systems, which arises from the user becoming more
and more frustrated with the fact that the system will not recognize a particular word and
as such, changing his pronunciation of that word more and more.
As would be expected, recognition speeds were found to be a function of the vocabulary size.
For vocabularies of 32 words, the recognition delay was determined to be about 2.5 seconds.
For 64 words, it was found to be about 5 seconds. The primary reason for this delay was
the fact that the LIS'NER 1000 did not possess its own dedicated processor. Instead, it
relied on the MOS Technology 6502 processor in the host APPLE, a fairly old and slow processor,
to perform the calculations required by the recognition algorithm. These large delays made
it very difficult to use the system.
ATC Command Entry
In order to more properly assess the requirements of an ASR system, some testing had to
be performed in an environment typical of what the application environment would eventually
be. For this reason, the LIS'NER 1000 was connected so as to serve as a speech
input front end for a VAX 11/750, which, at the time of the work, was the computer on
which most of the ATC simulation research in the Flight Transportation Laboratory was
being performed. The inter-connection was implemented using an RS-232 serial link and
the LIS'NER 1000 simply sent an ASCII representation of a word or utterance, as it was
recognized, to the VAX through this link. It was then the responsibility of software in the
VAX to perform any error checking and parsing of the input as required.
Since the primary application to be studied, ATC simulation, involves the entry of entire
commands or phrases as opposed to single words or utterances, a few commands typical of
1. (aircraft) CLIMB/DESCEND AND-MAINTAIN (altitude)
where (aircraft) is the aircraft call sign (eg; air-canada, united-airlines) followed by the digits of the flight number.
eg; "United-Airlines six five zero"
"TWA three five"
and (altitude) is either the word "FLIGHT-LEVEL" followed by the three separate digits of the flight level or the separate digits of the thousands plus the hundreds terminated by the word feet.
eg; "One seven thousand niner hundred feet"
"Flight level one eight zero"
2. (aircraft) TURN-LEFT/TURN-RIGHT HEADING (degrees)
where (degrees) is the three separate digits of the heading omitting the word degrees, with 360 indicating a north heading.
eg; "zero zero five" for 5°
"three two zero" for 320°
Figure 3.2: Commands used in preliminary ASR system evaluation.
ATC were formulated so that some experience could be gained in the operational problems
of command entry. These commands were very simple and were drawn almost directly from
the ATC Handbook. They consisted of the vectoring and altitude change commands and can
be found in Figure 3.2. In general, the procedure was to first identify the particular aircraft
being referred to and then to issue the desired command. The termination of the command
was indicated by the receipt of the syntactically complete command. Thus, for example, once
the third digit in the heading specification was received, the system would be reset, awaiting
the input of another command.
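Termination by syntactic completeness can be sketched with a short example. This is a hypothetical reconstruction in Python, not the actual VAX software; it handles only the heading command, treating a command as complete once three digits follow the word "heading".

```python
# Sketch of command termination by syntactic completeness (illustrative
# only): the parser emits the command and resets itself as soon as the
# third heading digit is received.
DIGITS = {"zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "niner"}

def feed_words(words):
    """Accumulate recognized words into commands; a heading command is
    syntactically complete once three digits follow 'heading'."""
    commands, current = [], []
    counting, n = False, 0
    for w in words:
        current.append(w)
        if w == "heading":
            counting, n = True, 0
        elif counting and w in DIGITS:
            n += 1
            if n == 3:                # complete: emit and reset
                commands.append(" ".join(current))
                current, counting = [], False
    return commands

cmds = feed_words("twa three five turn left heading zero niner zero "
                  "united-airlines six five zero turn right heading "
                  "two seven zero".split())
```

Feeding the word stream above yields two complete commands, the parser resetting itself after the third heading digit of each.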
Feedback to the user was provided through the use of a CRT display. The display consisted
of three lines (see Figure 3.3). The first was used to display system messages such as "Invalid
Command!" if an invalid command was entered, or "Please Repeat!" if the utterance could
not be matched to any of the words in the vocabulary. The second line was used to display
EXECUTING COMMAND
United-Airlines turn left heading zero niner zero F1
(send (fetch-nth '(:aircraft 0) '(:name 'ua)) :fly-heading 90)
Figure 3.3: VAX Display for Command entry feedback.
the current state of the command as recognized so far, allowing the user to detect mistakes
and keep track of where in the command he was. The third line was used to display the final
command in a format (Lisp code) executable by the ATC simulation.
It was obvious that a mechanism for correcting errors was also required. Thus, the
keywords "NO" and "CANCEL" were included in the vocabulary. Upon receipt of the
"NO" keyword, the command line parser would back up past the last utterance, in effect
acting like a delete key for an entire word. When the "CANCEL" keyword was received, the
entire command was canceled and the display cleared and reset to await a new command.
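This word-level correction behavior can be sketched as follows; the function name and data representation are assumptions made for illustration, not the actual parser code.

```python
def process_utterance(command, word):
    """Apply one recognized utterance to the command built so far:
    'no' backs up past the last word, 'cancel' clears the command."""
    if word == "no":
        return command[:-1]
    if word == "cancel":
        return []
    return command + [word]

# "fly" is misrecognized in place of "five"; the user says "no"
# and repeats the intended word.
command = []
for w in ["united-airlines", "six", "fly", "no",
          "five", "zero", "turn", "left"]:
    command = process_utterance(command, w)
print(" ".join(command))
```

The "no" removes the misrecognized "fly", leaving the corrected command "united-airlines six five zero turn left".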
On the whole, the system performed fairly well considering the accuracy and speed of the
LIS'NER 1000 speech recognition system. Many of the problems evidenced during word entry
became even more critical during ATC command entry. In particular, the large recognition
delays (and errors) made it extremely difficult to enter entire command strings. This was
compounded by the fact that the next word could not be spoken until the previous word had
been recognized. This forced the user to be unrealistically and impractically slow in inputting
an entire command; if even a single recognition error was made, a single word often required
six or more seconds to input (2 seconds delay for the initial mis-recognition of the word,
plus 2 seconds delay to recognize the keyword "NO", plus 2 seconds delay to recognize the
word, hopefully correctly, the second time).
Another problem arose from the way command termination was implemented. This was
because there was no allowance made for the user to correct an error in the last utterance,
ie; the one that signaled the end of the command sequence. For example, there would be no
way to correct an error in recognizing the "five" in a "... heading zero four five" command
if another digit were substituted in its place. Thus, command termination was modified
so that the use of the keywords "Enter" or "Execute" indicated to the processor that the
command was both finished and correct. In addition to this, some sort of timeout on the
microphone input could be implemented to indicate that the user was finished speaking and
hence finished entering the command. These and other command termination strategies will
be discussed in more detail in later chapters.
3.1.3 Recommendations
Performance of the LIS'NER 1000 ASR system was found to be lacking in two major
respects. First, its recognition accuracy was very poor, especially in comparison to other,
more expensive, recognizers on the market. Second, the recognition delays were large and
this made command entry very impractical. These delays arose not only from the serial
architecture of the system1, but also from the slow operating speed of the 6502 processor.
It did however demonstrate some potentially very useful features. In particular, the
"grouping" feature was found to be of potentially great use in both reducing recognition
delays and improving accuracy, since it allows the fairly rigid command syntax found in
ATC commands to be exploited to reduce the size of the active vocabulary.
1. The processor must complete the execution of the recognition algorithm before performing other tasks, including the reading of data from the SP1000. Thus the user must not only pause for a sufficient time between words or utterances to delineate them, he must also wait until the previous word was recognized.
The limited size of the vocabulary, 64 words, although it likely would not encompass
the entire vocabulary required for all projected applications, was not found to be overly
restrictive. This is especially true when different groups of 64 words can be switched into
and out of memory, thereby increasing the effective size of the total vocabulary.
In conclusion, some of the features that were found desirable, at this stage of the work, in a
more capable ASR system are listed below. The first three requirements are all very closely
related. Tradeoffs must routinely be made between these during the ASR system design
process based on what the designer feels is most important. Thus, it is difficult to determine
exactly which system will meet user needs without some research and experimentation.
* Continuous Speech Recognition
One of the underlying principles in this work is to impose as few additional constraints
on the user as possible. The requirement to pause between words when entering a com-
mand, thus creating a very halting form of speech, was felt to be too restrictive, and
created a strong preference for continuous or connected speech recognition systems.
These would allow the user to concentrate on his work instead of on his speech. With
either of these systems however, the user still retains the option to revert to discrete
speech if, for example, better recognition accuracy were desired. Furthermore, contin-
uous was preferred over connected speech because it was felt that connected speech
would result in excessive delays since the speech data is not analyzed until a pause is
detected in the stream of incoming speech. Thus in a continuous stream of words, the
first might not be recognized until after the last had been spoken. Note however that
this preference on speech input mode is also affected significantly by the recognition
delays and error rates of the system. In particular, continuous speech recognition sys-
tems with high error rates and large recognition delays would likely not be desirable
over connected or even discrete systems with better performance.
* Short Recognition Delays
This requirement is closely related to the preference for continuous speech ASR and is
very difficult to quantify. Obviously a recognition delay as short as possible is desired
but at what point do the delays become too large? Furthermore, there are some trade-
offs to be made. For example, how much additional delay can be tolerated in order
to gain the benefit of continuous speech? In particular, is the user more willing to
tolerate larger recognition delays as long as he can speak continuously, or reduced
recognition delays with the imposition of discrete speech? This would also depend on
the recognition accuracy of the respective systems and on the exact nature of the task
being performed.
* High Recognition Accuracy
Clearly we want the ASR system to be as accurate as possible. The actual recognition
accuracy required is difficult to quantify exactly but should realistically be a minimum of
roughly 95%. This would imply roughly one recognition error for every three commands
issued if we assume an average command length of eight to nine words. The actual error
rate however is also affected by things such as the size and content of the vocabulary. Thus,
some experimentation is required to determine the accuracies possible with an ATC
vocabulary for each specific ASR system. Also related is the desire for continuous
speech recognition since discrete ASR systems are more accurate than continuous ASR
systems. Thus, another question is "How much, if any, degradation in accuracy can
be tolerated for the acquisition of continuous speech capabilities?". This can only be
answered by experimentation.
* Vocabulary Size of roughly 60 words minimum
Although a total vocabulary size of 64 words would not likely be sufficient for all of
the applications envisioned, it would at least allow their demonstration. When coupled
with the ability to switch in different groups of 64 words however, it was thought that
this would be sufficient to meet the needs of all the tasks envisioned.
* Speaker Dependence
Clearly since we are concerned with speech input from only one controller, there is no
need for speaker independent systems. Different controllers could still be accommo-
dated by storing their templates on floppy disc and recalling these when required.
* Good Noise Immunity
Since the controller operates in an environment where there is a lot of background
noise, good noise immunity is essential. This however, was easily taken care of (as shall
be seen in a later section) through the use of noise canceling microphones.
* Vocabulary Grouping or Set Switching Capability
Initial work indicated that this would have potentially large benefits in terms of both
decreasing recognition delays and increasing recognition accuracy when used to incor-
porate syntactical and grammatical constraints into the recognition process. Thus, it
was very desirable in an ASR system used for ATC command recognition.
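The accuracy requirement listed above can be checked with a short calculation. Assuming independent per-word errors (a simplification), a 95% per-word accuracy over an eight-to-nine word command gives roughly a one-in-three chance that a command contains at least one error:

```python
def command_error_rate(word_accuracy, words_per_command):
    """Probability that at least one word in a command is misrecognized,
    assuming independent per-word recognition errors."""
    return 1.0 - word_accuracy ** words_per_command

for n in (8, 9):
    print(f"{n}-word command: P(at least one error) = "
          f"{command_error_rate(0.95, n):.2f}")
```

This works out to about 0.34 for eight-word and 0.37 for nine-word commands, consistent with the one-error-per-three-commands figure.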
3.2 VOTAN VPC2000 System
3.2.1 Description
After some analysis of the existing technology, the ASR system selected for further re-
search and developmental work was the VOTAN VPC 2000 system. This is a speaker depen-
dent, continuous speech ASR system produced by VOTAN of Fremont, CA[34]. It consists
of a plug in card for an IBM PC or compatible computer and associated driving software.
List price, at the time of purchase, was $1500 ($1200 for the hardware and $300 for the
software)2. As a further benefit, it also had the capability for the digital recording and play-
back of spoken messages. This feature was used in conjunction with the ATC simulation as
will be described in the next chapter.
The internal operation of this system is not documented but it operates in much the
same way as other continuous speech ASR systems. That is to say, it utilizes dynamic time
warping techniques on the data made available by some sort of feature extraction process to
compare templates from a trained vocabulary to those obtained from speech input. If the
comparison meets a prespecified threshold test, then a recognition event is declared.
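The template comparison at the heart of such a system can be illustrated with a minimal dynamic time warping sketch. This toy version matches one-dimensional feature sequences with made-up numbers; a real recognizer compares sequences of multi-dimensional feature vectors, but the warping recursion is the same in spirit.

```python
def dtw_distance(template, utterance):
    """Minimal dynamic time warping distance between two feature
    sequences, allowing non-linear stretching of the time axis."""
    n, m = len(template), len(utterance)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(template[i - 1] - utterance[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch template
                                 d[i][j - 1],      # stretch utterance
                                 d[i - 1][j - 1])  # advance both
    return d[n][m]

THRESHOLD = 2.0  # illustrative value only
template = [1, 3, 4, 3, 1]
utterance = [1, 3, 3, 4, 4, 3, 1]   # same shape, spoken more slowly
if dtw_distance(template, utterance) < THRESHOLD:
    print("recognition event declared")
```

Because the warping path may stretch either sequence, the slower utterance still aligns with the template at zero cost, passing the threshold test.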
It contains its own processor, a Motorola 6809 as well as custom digital signal processing
chips and as such, does not require the host computer's processor to execute the recognition
2. The cost when the decision was made to purchase this system was $2100. Over the month that it took to order, the price dropped to the above amount. This is indicative of the cost trends of ASR technology.
algorithm. This configuration allows for fairly short recognition delays as well as for software
to be run concurrently on the host computer.
Vocabulary is limited to a maximum of 64 words at a time. This, as is the case for
nearly all ASR systems, is due to memory limitations as the VPC board uses its own on-
board memory (only 22K) for template storage. The actual vocabulary size is influenced
greatly by the number of training passes made for each word as well as the length of the
words trained. The figure of 64 words is the nominal vocabulary size for average length
words and two training passes per word. More training passes per word would obviously
reduce this figure. There is also the ability to swap in different vocabularies from main PC
memory, assuming the user can tolerate a brief delay. Thus the effective vocabulary size can
be increased dramatically.
The system operates with two basic software packages. These are VOICEKEY and its
associated utilities, and Voice Programming Language (VPL) and its associated utilities. Al-
though the actual recognition algorithm is identical irrespective of which particular software
package is being used, each incorporates a different user interface and thus provides features
and capabilities not found in the other.
In particular, VOICEKEY operates by making the speech input seem like keyboard input.
Thus, the operation of the VPC is for the most part, transparent to the user and this results
in a reduced capability for control of the VPC functions.
VPL however is an actual programming language. It allows for much more flexibility and
control of the VPC functions. It also has the additional feature that it provides information
on not only the best guess as to which word was spoken, but also the second best guess.
This information can be used to great advantage, as shall be shown in Chapter 4, in order to
correct recognition errors. It however does not provide the same "set switching" or vocabulary
"grouping" capabilities as are found in VOICEKEY.
Training of the vocabulary proceeds much the same as in other speaker dependent sys-
tems, requiring the user to repeat the words contained in the vocabulary a specified number
of times. With this system however, the number of repetitions is left up to the user and can
even be changed from word to word. Thus, more templates could be generated for difficult to
recognize words in order to hopefully improve recognition performance. Furthermore, data
from different training passes are not averaged together to create vocabulary templates as
was the case with the LIS'NER system. Instead, a template is created and saved each time
a word is trained. As well, the templates are not normalized to a constant length. This
hopefully eliminates some of the feature masking that occurred with the LIS'NER system.
Since this system is a continuous speech system, it must be able to accommodate some
of the co-articulation effects common in continuous speech. This is accomplished by allowing
the user to train words or utterances "in phrase". In this type of training, a stream of
continuous speech containing the word for which "in phrase" training is desired is spoken to
the system. This allows the system to generate a template based on what the word would
actually "sound" like in a stream of continuous speech. Granted, this is not entirely general
since the co-articulation effects are highly dependent on the neighboring words and it is
unrealistic to train for all word combinations, but it does at least address the problem.
3.2.2 Evaluation and Testing
Although the majority of testing and evaluation of this system was performed in conjunc-
tion with the development of the ATC simulation and is described in Chapter 4, there was
significant testing of this system in a standalone environment. This testing basically involved
the implementation of a command entry task such as that outlined in Section 3.1.2. This
was performed using both continuous and discrete speech as input in order to get an idea of
the baseline recognition performance and how it would be affected by the use of continuous
speech.
Discrete Speech
To a certain extent, a comparison between continuous and discrete speech input is difficult
to make. Since the VPC2000 is a continuous ASR system, it does not require a pause between
words at all, and as such, it is up to the user to introduce what he feels is a pause of "sufficient
duration". Thus, in instances where the pause is not of sufficient duration and discrete word
recognition systems would fail, the VPC2000 would succeed. As well, discrete speech ASR
systems are tailored to the much simpler task of recognizing discrete speech. Thus, their
performance in this task can, in general, be expected to be much better.
However, the performance of the VPC, even for discrete speech, was much better than that
of the LIS'NER. This was because the VPC represented a significant technological leap in its
architecture and design over that used in the LIS'NER. Improvements in performance were evidenced both
in delay reduction and increased recognition accuracy.
The decrease in delays took on two forms. First, due to the parallel construction of the
system, the user was allowed to speak the next word even before the current one had been
recognized. This reduced inter-word delays to only that required to delineate the words for
discrete speech. This however, is actually quite common with the higher performance discrete
speech recognition systems as well.
Second, the actual time taken to recognize a particular word dropped dramatically. For
a vocabulary of 64 words, this turned out to be on the order of 0.8 seconds or less. Even
though still significant, this did not affect the user as much as might be expected since it
only delayed feedback and did not in any way affect how quickly he could say words. In
general, he would typically enter an entire string of words sequentially, without waiting for
each word to be recognized. This would create a delay between the time he finished speaking
to the time the last word was recognized, but this was still within acceptable levels. For
example, for a string of ten words, this delay was only on the order of 2 to 3 seconds.
For discrete speech input, the recognition accuracy of this system, roughly 97%, was much
higher than that of the LIS'NER 1000. As well, some of the sensitivity to the "vocal coloring"
of words was reduced, although not eliminated. In this respect, it still did not match the
performance of other ASR systems examined which were far less sensitive to these variations
in how a word was said (although these were discrete speech systems). A tendency for
recognition accuracy to degrade slightly if it had been a long time (approximately 3 weeks)
since the vocabulary was trained was also noted with this system. This was not evidenced
with the LIS'NER 1000 system primarily due to its poorer performance and the masking of
this phenomenon by other factors.
Continuous Speech
The greatest benefit of this system over other systems examined was the ability to use
continuous speech. In fact, performance was such that it allowed for true continuous speech.
Thus, not only was there no requirement for pauses between words, there was no need to limit
the duration of a stream of continuous speech as is sometimes the case for connected speech
systems. To illustrate this, numerous but unsuccessful attempts were made to "out-talk"
the system. "Out-talking" a system occurs when a stream of speech is spoken that is of
sufficient duration that the ASR system cannot recognize it fast enough, and words that
would otherwise be recognized are lost. For discrete speech, this can occur, depending on the exact
definition of "out-talking", even with two words if there does not exist a pause between them
of sufficient duration to delineate them3 . With this system, this is accomplished not through
large amounts of memory but with software that processes the speech data as it comes in.
Thus, in a long stream of words, as the first words are recognized, the corresponding speech
data can be ignored, thus freeing up memory, even before the user is finished talking. This is
not the case with connected speech where the speech data is saved in a buffer until a pause
is detected.
The command entry task of Section 3.1.2 was again repeated, this time allowing the user
to enter commands in either continuous or discrete speech as desired. This served to indicate
some problems that were not initially evident while using discrete speech.
The major problem that arose was a decrease in recognition accuracy. This resulted
primarily from co-articulation effects that created a significant difference between the words
as trained and the words as spoken. The "in phrase" training procedure, although it did
help, did not completely solve the problem. The problem was especially acute for short
words such as the digits. These were often not recognized when they were part of a stream of
continuous speech. This was because the co-articulation or slurring effects were so great that there
was relatively little "data" associated with these words to allow for confidence (recognition
3. This is not really a fair criticism of discrete speech recognition systems however. A more realistic test would be to impose the requirement for discrete speech on the user. In this case, the higher performance discrete ASR systems (parallel construction) are impossible to out-talk as well.
threshold test) in the comparison. Granted, the recognition threshold could be lowered but
this would create other problems (spurious recognitions). At first glance, it was thought that
this could be countered with more "in phrase" training using samples of highly co-articulated
speech but this actually created more errors than it eliminated. In particular, since templates
generated in this way became very short, there was a significantly increased likelihood that
they could be matched to spurious noises, such as taking a breath between words, or speech
data "left over" between words (this "left over" data arose principally from the alignment of
words during the recognition algorithm and the fact that different vocalizations of the same
word were different lengths). Thus the rate of spurious recognitions increased dramatically.
This problem was especially acute with the word "eight" since the "t" sound is often omitted
during speech and the resulting sound was very easily confused with "left over" data or the
sound made while taking a breath between words. 4 For this reason, significant care had to
be taken during the training procedure.
When recognition errors were made during a stream of continuous speech, this system
demonstrated very robust performance in its ability to get back "on track" and recognize
succeeding words. That is to say, even though a recognition error would often cause an error
in recognizing the following word, by the second, or at most third word, the system would
be recognizing correctly again. The reason for this error in adjacent words involves the fact
that for continuous speech ASR systems, the recognition algorithm begins re-analyzing the
data at the point where the previously recognized word finished. Thus, if an error is made in
this word, data associated with the next word could be masked. Consider for example, the
sequence of words "fly present heading". Since "five" is longer than "fly", if a recognition
error is made, the recognition algorithm will commence recognition at a point after the word
"present" actually begins (see Figure 3.4). Thus, it is not likely that the word "present"
would be recognized and it is even possible for the remaining "stub" or "left-over" data to be
misrecognized as some other word. This ability to get back on track is somewhat similar to
the word spotting feature of some systems. With this, words contained in the vocabulary can
4. Take for example the sequence of digits "8 2 2". If these are said rapidly, the "t" in the eight is omitted and the pronunciation changes from eight two two to eigh two two.
[Figure: an input waveform shown with two segmentations. The correct recognition segments the speech as "fly | present | heading"; the incorrect recognition matches "five" in place of "fly", consuming part of "present" and mis-aligning the remaining word boundaries.]
Figure 3.4: Example of word boundary mis-alignment due to misrecognition errors.
be recognized in a stream of speech containing both trained and untrained words. The same
can also be accomplished with the VPC system through judicious selection of the recognition
threshold.
The performance results and figures given so far assumed ideal conditions (quiet environ-
ment). These, however, could not be expected in typical operating conditions and were even
difficult to obtain in a laboratory environment. Any time the conditions were less than ideal,
there were significant reductions in recognition accuracy. In fact, even the simple operation of
a fan and the resulting breeze blowing across the microphone were enough to trigger spurious
recognitions at the rate of roughly one every two seconds. Clearly this could not be allowed.
In order to solve this problem, noise canceling microphones were used. These provided
almost ideal noise immunity even allowing people nearby to carry on normal conversations
while the system was being used without significantly affecting recognition performance. Two
noise canceling microphones were tested. These were a Communications Applied Technology
CAT 1 electret mike and a Telex Airman 750. Both of these were headset mounted
microphones but whereas the CAT mike was specifically designed for ASR use, the Telex
was designed for use in the aircraft cockpit. Thus, understandably, performance of the CAT
mike was superior to that of the Telex. Both of these however performed much better than
the gooseneck mike that was standard issue with the VOTAN system. The headset mounted
mikes also had the advantage that the distance between the mouth and the microphone was
kept constant. This greatly reduced some of the variability in the input signal and thus
further improved recognition accuracy over that using the gooseneck microphone.
A CAT throat microphone was also tested but, although it still performed better than
the gooseneck mike, it did not perform nearly as well as either of the headsets. Its advantage
was that it offered noise immunity superior to that of the noise canceling microphones. Noise
levels encountered or expected however were not sufficient to justify its use, especially in
light of its poorer performance.
Chapter 4
ATC Simulation Environment: Command Recognition System Design
In this chapter, the major portion of the work performed will be presented. This, as
mentioned in Chapter 1, involved the development of speech input and output capabilities
for a terminal area ATC simulation. It was in conjunction with this simulation that a system
for recognizing ATC commands was designed and implemented.
The ATC Simulation itself deals with operations in the terminal area airspace of an
airport. Here, it is the controller's responsibility to issue appropriate commands to any
aircraft so as to avoid conflicts and minimize any delays to all aircraft arriving and departing
from the airport. This is accomplished through verbal communications between the controller
and the pilots over a radio link. In order to aid him in determining the position of aircraft,
the controller also possesses a display presenting information, made available by surveillance
radar, about names, positions and velocities of the aircraft being tracked in his airspace.
Functionally, the simulation can be split into three basic components. These are:
1. ATC Simulation and Display: The basic simulation task. This involves simulation
of the terminal area environment including aircraft, surveillance radar, winds, navaids,
and so on, as well as the duplication of the Air Traffic Controller's display used to
present aircraft radar tracks and other information to the controller.
2. Speech Input Interface: A suitable user interface that will allow the controller to
input commands verbally directly into the computer.
3. Pseudo Pilot Response: A system to simulate pilot responses and queries to con-
troller commands in order to create a more realistic environment.
The overall configuration of hardware that was used to complete this task can be seen in
Figure 4.1. It consists principally of a host computer, a Texas Instruments Explorer, with an
inter-connection, via an RS-232 serial link, to the speech recognition (and audio playback)
system, the VPC2000. Since the ASR system required hosting on an IBM PC, this was also
included in the hardware.
In incorporating ASR into this simulation, two conflicting philosophies were apparent. On
the one hand, there was the desire to design a system with which ASR could be incorporated
into existing ATC simulation environments, without the modification of the interface to the
controller, in order to present an environment as similar as possible to that found in the
real world. This implies that there be no operational differences arising from whether the
controller was talking to a pilot, a blip-driver/pseudo-pilot, or an ASR system. Thus, the only
feedback available to him to indicate that command transmission/recognition errors had occurred
would be (simulated) verbal pilot responses and actions in response to his commands. On
the other hand however, the incorporation of ASR should include a good user interface for the
controller. This almost certainly implies that he be presented with some sort of visual display
in order to provide additional feedback required for the detection and possible correction of
recognition errors. These conflicting criteria were resolved by designing both capabilities into
the simulation. In this manner, if one or the other was no longer desired, it could easily be
removed.
4.1 ATC Simulation and Display
Implemented here are the basic simulation functions that must be performed whether or not
speech recognition and playback are used. These entail the modeling
and simulation of the airspace and the aircraft flying in it as well as any additional factors
desired for fidelity.

Figure 4.1: Configuration of ATC Simulation Hardware.
In the current configuration, this task was the responsibility of the simulation computer, a
Texas Instruments Explorer. This is a Lisp Machine using the Common Lisp implementation
of the Lisp programming language. The actual simulation itself was written in this language
using object oriented programming techniques in order to allow for ease of development and
modification.
The details of the actual airspace being simulated are specified by a user defined database.
This database contains items such as airports, VOR's, and fixes. Aircraft flying in this
airspace are separate entities and can be generated/introduced in a number of ways. These
include allowing the user to specify exact scenarios in which aircraft entering the airspace
are defined deterministically or, creating aircraft randomly at prespecified entry-fixes and
prespecified rates and distributions.
The simulated controller's display was presented on the Lisp Machine's screen, a black
and white raster scan display with a resolution of 1024 by 750 pixels. It consisted of a
rectangular window, created using the Lisp Machine's window interface, and gave the user
the ability to perform standard window operations such as resizing and reconfiguration as
well as allowing him to use the mouse. These capabilities were very useful later when format
changes were desired.

Figure 4.2: Icon used for display of fixes in the simulation display

Figure 4.3: Icon used for display of airports in the simulation display

The actual display of items on the simulated controller's display was accomplished using
icons. There were basically three different icons representing navaids, airports, and aircraft.
The icons used to represent navaids consisted of an "X" symbol with the name of the fix
beside it. Airports were represented by circles with a dot at their center and had the name
of the airport beside them. Finally, aircraft were represented by a circle with a cross at
its center. A lot of additional information was also displayed with the aircraft icon. This
consisted of the aircraft name and flight-number, its estimated altitude in hundreds of feet,
if available, and its estimated groundspeed in knots. Examples of these icons can be seen in
Figures 4.2 through 4.4.

There were also other windows included in the display that could be used for
other purposes. One of these was used to display the elapsed time during the simulation.
The uses of the rest will be described shortly.
Figure 4.4: Icon used for display of aircraft in the simulation display ("070" denotes the altitude in hundreds of feet, "250" the estimated speed in knots)

For the current simulation task, the airspace within roughly 50 miles of Boston's Logan
airport was used. A picture of how the display would actually appear can be seen in Figure
4.5. Note the location of Boston's airport (labeled BOS) in the center of the display as well as
the three aircraft (CP123, UA66, and AA151) flying in the simulation. The positions of these
aircraft were updated, as would be the case in the real world where positional information is
made available by surveillance radar, at roughly five second intervals.
4.2 Speech Input Interface
With the simulation configuration as described in the last section, commands issued to
a particular aircraft by a subject controller are relayed to a blip-driver and keyed into the
Lisp Machine manually. It is the replacement of this by the use of speech directly that is the
primary function of the Speech Input Interface (SII).
The SII can be split up into two basic divisions. The first, the ASR system, is responsible
for the actual recognition of the controller's spoken input and the second, the Speech Input
Parser is responsible for performing any error detection and correction as well as translating
the input into executable code. These two systems are described in further detail in the
following sections.
4.2.1 ASR System
The ASR system used for the simulation, as mentioned previously, was the VOTAN
VPC2000. This system was selected primarily because it provided the capability for contin-
uous speech thereby freeing the user from any artificial constraints in how he spoke. This
added to the fidelity of the simulation and allowed for the emphasis on what the controller
was doing as opposed to how he was speaking.
Figure 4.5: Sample of the ATC Simulation Display on the TI Explorer.

The capabilities and use of this system have, for the most part, already been described in
Chapter 3. Its operation and performance in conjunction with the simulation task however
will be described in later sections.
Of significant import to the operation of the VPC system for speech recognition was the
fact that it was also used to generate simulated pseudo-pilot responses. Thus, since it could
not do both at the same time, there were limitations as to when the controller could talk to the
system. Furthermore, a reliable method for switching between these functions was required.
Since primary simulation control was the responsibility of the Explorer, it would ordinarily
have been this that controlled the switching of functions in the VPC. This was not possible
however since no control could be exerted on the VPC when it was in recognition mode by
input made through the serial port. Therefore, the switching of modes was accomplished
through the selection of a keyword, "Over" which, when recognized internally by the VPC,
would switch the function from speech recognition to speech playback. This required that
every command issued be terminated by the word "Over". Once in speech playback mode,
the VPC could be commanded to switch back to speech recognition mode directly by the
Explorer through commands issued through the serial port.
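The mode-switching protocol described above can be summarized in a short sketch. This is an illustrative Python model with hypothetical names, not the actual VPC2000 interface or the original Lisp code.

```python
# Illustrative model of the VPC's two mutually exclusive functions and
# how control of the switch alternates between the spoken "Over"
# keyword and the Explorer's serial-port commands.

class VPCModeSwitch:
    def __init__(self):
        self.mode = "recognition"

    def word_recognized(self, word):
        # While recognizing, the VPC ignores serial-port control input;
        # only the spoken keyword "Over" can switch it to playback.
        if self.mode == "recognition" and word == "over":
            self.mode = "playback"

    def serial_command(self, command):
        # In playback mode the host (the Explorer) regains control and
        # can command a switch back to recognition over the serial link.
        if self.mode == "playback" and command == "resume-recognition":
            self.mode = "recognition"

vpc = VPCModeSwitch()
vpc.serial_command("resume-recognition")   # ignored while recognizing
vpc.word_recognized("over")                # spoken keyword ends the command
assert vpc.mode == "playback"
vpc.serial_command("resume-recognition")   # Explorer switches it back
assert vpc.mode == "recognition"
```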
4.2.2 User Feedback and Prompting
Before beginning an explanation of the actual procedure used in the parsing of speech
input, it is important to discuss the methods that were used to provide feedback to the
controller of the recognized commands as mentioned previously.
In current simulations, the only feedback available to the controller indicating an error in
the recognition and execution of his commands arise from the pseudo-pilot acknowledgments
and actions in response to his commands. Although it would be desirable to utilize only
these for the feedback of errors in the recognition of spoken commands in the simulation
task, these acknowledgments, as shall be discussed in Section 4.3, are somewhat limited in
their error handling capability. Thus, the capability for additional feedback was desired.
The additional feedback was accomplished by displaying the recognized words on the
screen, using one of the auxiliary windows of the Simulation Display mentioned in the previous
section. The recognized words were displayed in two ways. In the first, they were simply
echoed onto the screen as they were received by the Explorer and in the second, they were
displayed as part of the current command string, assuming that they passed the parsing tests
on their validity. When the current command was completed, the display advanced to the
next line and all subsequent input was treated as part of the new command. In this way, a
scrolling list of the commands issued was generated.
If a parsing error was detected however, then a suitable message was displayed informing
the user of the problem. This message was presented in its own window in order to avoid
cluttering the speech recognition feedback display.
An example of what this would look like in operation can be seen in Figure 4.6. The
window on the bottom right hand side is used to echo the recognized words as they are
received from the VPC. Note however that there are two words and two numbers in each
line of this display. This is because the top two matches from the recognition system, along
with their respective scores, are being displayed. Thus, for the example shown, the last word
recognized was either "turn", with a score of 44 or "three", also with a score of 44. Recall
that the lower the score, the better the match.
Above this window can be seen the display used to provide the primary feedback of the
recognized commands. In this example, there are nine completed commands and the user is
currently midway through entering the tenth.
Finally, just below the window displaying the simulation time, there is the error feedback
window. Here, an error message corresponding to an invalid input (neither "turn" nor
"three" are valid inputs at this stage of the command) is being displayed. Thus, the user is
alerted to this and can correct it.
Once a command was terminated, a command terminator character was printed at the
end of the current line. If the command was a valid one, then this command terminator
was a period. If there was any type of error however, then the terminator was a question
mark. The use of this enabled the user to determine when the previous command had been
terminated and the next command could be input as well as whether or not the command
was one that could be executed properly.
Figure 4.6: Sample of the simulation display with the speech input feedback windows.
4.2.3 Speech Input Parser
The basic function of the speech input parser is to translate the spoken commands of
the controller, once they are recognized by the ASR system, into a format suitable for en-
try into the 'computer. In general, this task, although similar to Natural Language (NL)
understanding systems in Artificial Intelligence, possesses some significant differences.
In NL systems, the basic problem lies in understanding what the meaning of a statement
made in everyday conversational speech is. There tend to be very few restrictions as to the
scope of what can be said and even fewer on the syntax that must be used. Although there
can be some ambiguities, with a NL system, it is assumed that the user input is, for the most
part, correct. It is here that the two tasks principally differ.
In the current task, the introduction of a speech recognition system into the data input
process can result in errors in the words arriving at the computer even if the user's input
was correct. Thus the problem now becomes one of determining what was meant by the user
when there exists the possibility of errors having been made in the transcription of his speech.
This is a much more difficult problem. Fortunately, in ATC command recognition a rigid
and well defined syntax can be imposed on the commands input. This syntax is specified in
the Air Traffic Controllers handbook and although it might not be strictly adhered to in the
real world, it is not unreasonable to expect users in the simulation and training environment
to follow it. It is through the restriction in the scope of the conversation to these commands
only and the utilization of the ATC syntax that the problem of understanding what was
said can be greatly simplified. In this way, the real problem becomes one of eliminating, or
otherwise accounting for, errors made during the speech recognition process.
There are basically two ways that this can be done. The first is to improve the accu-
racy of the recognition algorithms in the ASR systems themselves. For most of the higher
performance systems however this would be very difficult as they are operating near peak
performance now. Furthermore, this is not the goal of this work.
The second is to use the additional information made available by syntax and grammar
in conjunction with existing ASR systems in order to improve their recognition accuracy. It
is here that the greatest potential for improvement lies. In fact, for any realistic or non-
trivial application of ASR, especially continuous or connected speech ASR, this information
is essential for resolving ambiguities in the recognition of speech. For this reason, many of
the more successful ASR systems do indeed use this information directly in the recognition
process [2]. The exact methods for utilizing this information, however, are many and diverse
and require, to a great extent, tailoring to the capabilities and performance of the actual
ASR system being used.
Finite State Machine
Since the construction of a parser that would be able to correct recognition errors without
external aid was a very difficult undertaking, at least initially, the approach first attempted
was to construct a parser that would utilize this syntactical information in order to merely
detect recognition errors. It would then be up to the user to correct these himself.
In order to do this, a Finite State Machine (FSM) was used to specify valid word se-
quences. In the FSM approach, a parser transitions through a finite number of states, as
a function of the recognized input words. The valid word sequences, and hence the valid
commands, are thus defined by the possible paths through the FSM as it transitions from
state to state. Errors are indicated by the receipt of a word that is not defined with reference
to the current state of the parser.
For example, consider the FSM shown in Figure 4.7. This FSM is used to specify the
syntax required for the input of a heading azimuth. Here the states are represented by circles
and transitions or branches from one state to another by arrows. The quoted words contained
in the arrows represent the verbal inputs required to transition between the states. Here,
the symbol "<i-j>" is used to represent a branch using only one of the digits i through jinclusive. Thus, for example, "<0-5>" implies that a branch exists for any one of the digits
"zero", "one", "two", "three", "four" or "five".
Figure 4.7: Example of the Finite State Machine logic for the specification of a heading

If we assume that the parser is initially at state S1, then the only valid inputs are "zero",
"one", "two" and "three". Any other input received would indicate an error. If the words
"zero", "one" or "two" are received, then the parser would transition to state S2. From
here, the valid inputs become all of the digits "zero" through "niner". If, on the other hand,
the word "three" is received, then the parser would transition to state S6 where the valid
inputs now become the digits "zero" through "six", and so on.
With this system, detection of errors, whether they arise from recognition errors made
by the ASR system or input errors made by the controller1, is fairly straightforward. This is
because the valid words are defined as a function of the current state of the parser by the
branches to other states. Thus, if a word is received for which a transition branch does not
exist, an error is signaled. For example, if the word "five" were received while the parser
were at state S1, then an error would be signaled since there is no branch defined for this
word.
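The branch-table view of this FSM can be sketched briefly. The following Python fragment is only illustrative: the states S1, S2 and S6 are those named in the text, while S3 and S7 (and the table layout itself) are hypothetical, since the full figure did not reproduce here.

```python
# A minimal branch table for the heading-azimuth FSM of Figure 4.7.
# A (state, word) pair with no entry signals a recognition or input
# error, exactly the detection mechanism described in the text.

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "niner"]

BRANCHES = {("S1", d): "S2" for d in DIGITS[:3]}   # headings 0xx-2xx
BRANCHES[("S1", "three")] = "S6"                   # headings 3xx
BRANCHES.update({("S2", d): "S3" for d in DIGITS})       # any second digit
BRANCHES.update({("S6", d): "S7" for d in DIGITS[:7]})   # second digit <= six

def step(state, word):
    """Advance the parser one transition; raise on an undefined branch."""
    nxt = BRANCHES.get((state, word))
    if nxt is None:
        raise ValueError(f"invalid input {word!r} in state {state}")
    return nxt

assert step("S1", "two") == "S2"
assert step("S1", "three") == "S6"
```

For instance, receiving "five" in state S1 raises an error, mirroring the example in the text where no branch is defined for that word.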
With the FSM structure, it is very easy to add other words or otherwise change the
syntax of commands. This can be done simply by adding, removing or otherwise modifying
any desired branches or states of the FSM. For this reason, as well as the added simplicity, a
reduced subset of the ATC commands was first implemented. These consisted of the vectoring
commands, altitude change commands, and the airspeed control commands, as taken directly
from the ATC handbook.
The resulting FSM was too large to be drawn on a single page and as such was split
into conceptual blocks called Superblocks (see Figure 4.8). Each superblock contains its own
internal FSM for implementing its required subdivision of the syntax (see Figures 4.9 through
4.11) and connects to other Superblocks through branches defined by the to-other-superblocks
and from-other-superblocks labels. With this Superblock representation, it was much easier
to picture the overall structure and flow of the FSM. For the current example, it consisted
of getting an aircraft name, followed by a command (one of altitude, airspeed, or heading),
followed by a terminator (the keyword "Over") which reset the FSM to await the input of
the next command.
In general, the operation of such a system would be as follows. As the user speaks a
command, the words composing it are recognized by the ASR system and transmitted to the
SIP. Upon receipt by the SIP, they are immediately displayed on the initial feedback window
(bottom right hand corner of Figure 4.6) to provide "raw" user feedback to the recognition
process. If they parse correctly, then the parser transitions to the next state and these words
are also displayed in the command feedback window (middle right hand window in Figure
4.6) as part of the current command being input. If an error is detected however, then a
suitable message informing the user is displayed in the error feedback window.

1These two are indistinguishable by the parser since only the controller knows exactly what was said.

Figure 4.8: Superblock structure of the FSM implemented.
The capability to correct this error was implemented, as was the case in the preliminary
evaluation performed in Chapter 3, through the use of the "Delete" and "Cancel" keywords.
Upon the receipt of the "Delete" keyword, the parser would "back-up" one state to the
previous state, thus, in effect, deleting the last recognized utterance from the input stream.
The command feedback display would then be updated to reflect this change. With the
"Cancel" command however, the parser would be reset to its initial state thus deleting the
entire command received so far. In this case, the display of the current command would also
be cleared in order to reflect its cancellation.
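The "Delete" and "Cancel" behaviour amounts to a parser that records its state history. The sketch below is an illustrative Python analogue with hypothetical names, not the original Lisp implementation.

```python
# "Delete" pops one state (removing the last recognized utterance);
# "Cancel" resets the parser and clears the entire command.

class CorrectableParser:
    INITIAL = "S1"

    def __init__(self):
        self.states = [self.INITIAL]   # state history
        self.words = []                # command string shown to the user

    def accept(self, word, next_state):
        self.states.append(next_state)
        self.words.append(word)

    def delete(self):
        # Back up one state, deleting the last utterance from the input.
        if len(self.states) > 1:
            self.states.pop()
            self.words.pop()

    def cancel(self):
        # Reset to the initial state, cancelling the whole command.
        self.states = [self.INITIAL]
        self.words = []

p = CorrectableParser()
p.accept("united-airlines", "n1")
p.accept("turn", "h1")
p.delete()
assert p.words == ["united-airlines"] and p.states[-1] == "n1"
p.cancel()
assert p.words == [] and p.states[-1] == "S1"
```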
In order to allow the user to make any changes or corrections to the recognized command,
the input command was not executed by the simulation until the user had validated it by
saying "Over". In this way, the keyword "Over" served two functions; that of switching
VPC modes and that of command validation or termination.
The compilation of the recognized input into syntactically correct commands however was
only one part of the problem. The other dealt with how these were used in order to implement
the desired action. With this configuration of the SIP, this was accomplished by constructing
an executable Lisp s-expression or statement during the state-to-state transitions of the FSM
parser. Once the "Over" keyword was recognized, this statement, if valid, was evaluated and
an appropriate pseudo-pilot response generated.
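The construction of the executable statement during the transitions can be sketched as follows. The original built a Lisp s-expression evaluated on the Explorer; this Python analogue, with hypothetical function and slot names, only illustrates the idea of filling in an expression transition by transition and evaluating it on "Over".

```python
# Each state-to-state transition contributes one argument; "Over"
# validates and evaluates the accumulated command.

def execute_turn(aircraft, direction, heading):
    # Stand-in for the simulation action invoked by the evaluated command.
    return f"{aircraft}: turning {direction} heading {heading:03d}"

class CommandBuilder:
    def __init__(self):
        self.args = {}

    def on_transition(self, slot, value):
        # A transition of the FSM fills in one slot of the expression.
        self.args[slot] = value

    def on_over(self):
        # "Over" triggers evaluation, much as the Lisp statement was
        # eval'ed by the simulation once the command was validated.
        return execute_turn(self.args["aircraft"],
                            self.args["direction"],
                            self.args["heading"])

b = CommandBuilder()
b.on_transition("aircraft", "UA66")
b.on_transition("direction", "left")
b.on_transition("heading", 270)
assert b.on_over() == "UA66: turning left heading 270"
```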
Although with the FSM structure the possible word sequences, and hence the command
syntax was rigidly defined, there was still the capability of introducing some flexibility into
how the commands were entered. This was accomplished by explicitly incorporating into the
FSM any "short-cuts" or alternative possibilities in how a command might be issued. Take
for example the FSM defining the format required for specifying an aircraft name (Figure
4.9). Here, although the aircraft call sign was deemed essential, the flight number was not
Figure 4.9: Internal structure of the Aircraft Name Superblock

Figure 4.10: Internal structure of the Heading Command Superblock

Figure 4.11: Internal structure of the Altitude Command Superblock

Figure 4.12: Internal structure of the Airspeed Command Superblock

Figure 4.13: Airspeed Command Superblock maintaining original ATC syntax
and could occasionally be omitted. This was a reasonable thing to expect since if only one
flight of a particular air carrier was under the supervision of the controller, it might be likely
that he would address it by the carrier name only. Thus, the section of the FSM dealing
with the receipt of the aircraft name could be exited at any of the states n_1, n_2, n_3 or
n_4. If this resulted in his not uniquely identifying the aircraft to which he was referring,
then contextual error checks on the controller's input, mentioned later in this Chapter, would
come into play to handle this in an appropriate manner.
This same type of flexibility was also incorporated into the vectoring command where
the use of the word "heading" was treated as optional as well as in the altitude command
where "and-maintain" was also not essential. If examination of operating procedures were
to reveal other modifications of the command syntax that were used in practice, then these
could also be incorporated in a similar manner.
An interesting problem was incurred in the definition of the airspeed control commands.
This arose from the fact that the command utilized to issue a change in airspeed, by a
specified increment or decrement, was almost identical to that used to signal a new absolute
airspeed to be flown. In particular, a relative speed change was indicated by the command
"increase/decrease speed (number of knots)" whereas an absolute speed command was of
the form "increase/decrease speed to (speed)" [16]. For example, "increase speed to two
two zero knots" implied that the pilot should now fly at 220 knots whereas "increase speed
two zero knots" implied that he should increase his current speed by 20 knots. Addition of
another word, "to", and hence another branch to the FSM would create problems however
since it would be impossible for the ASR system to distinguish the word "to" from the
number "two"2. In fact, there is no way for even a pilot to know which one was said by the
controller until the entire command is finished.
At first glance, it would seem possible to solve this problem by adding two different
utterances "speed" and "speed-to" to the vocabulary. Thus, when the user was entering a
relative speed change command, the word "speed" would be recognized and the appropriate
2Note that since it is very likely for a controller to use the word "to" in the issuance of this type of command, it cannot realistically be eliminated from the vocabulary.
branch taken in the FSM. Similarly, the word "speed-to" could be used to define a different
path. This method however would not work because it would be very difficult for the ASR
system to distinguish between the utterances "speed two" and "speed-to". Thus, in some
cases, an invalid change in speed would be recognized and in others, an invalid new speed
would occur. As well, it could not realistically be expected that the controller would strictly
adhere to the format of using "speed-to" only for absolute speed changes and "speed" only
for relative speed changes.
There are however, two other ways that this problem can be handled. In the first, the
standard syntax can be modified slightly so that absolute speed changes are specified by
the "increase/decrease speed-to ... " command and relative speed changes by the "in-
crease/decrease speed-by ... " command3 . This would then be incorporated into the FSM
by using the the two utterances "speed-to" and "speed-by" to yield a FSM such as the one
in Figure 4.12.
If, however, the requirement that the command syntax not be modified is imposed, this
problem could still be handled by using the same techniques that a human listener would use.
In particular, the FSM would be modified so that all combinations of input were allowable
(see Figure 4.13). Here, no attempt would be made by the FSM to distinguish between "to"
and "two" and the decision on which was actually entered would be handled internally by
determining what the argument of the command (i.e., the "(speed)") was. If it was less
than say 90 knots, then it would be assumed that the desired command was a relative speed
change command. If not, then it would be assumed to be an absolute airspeed command.
Note that the actual speeds entered can range all the way up to the two thousands. This
because the initial "to" would be recognized as a "two". If this occurs then the leading
"two" would be eliminated from the speed and it would again be assumed that an absolute
airspeed command was issued.
3Since the word "speed" is no longer contained in the vocabulary, there is no longer the possibility of errors being made between "speed-to" and "speed two".
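The internal decision described above can be sketched concisely. The 90-knot cutoff is the one suggested in the text; the function name and digit representation are hypothetical.

```python
# Disambiguating "to" from "two" in the airspeed command of Figure 4.13,
# using only the recognized digits (any spoken "to" arrives as "two").

def interpret_speed(digits):
    """Return ("relative" | "absolute", knots) for the recognized digits."""
    value = int("".join(digits))
    if value < 90:
        # Small values are taken as increments: "increase speed two zero".
        return ("relative", value)
    if value >= 2000:
        # A leading "two" was really the word "to": strip it and treat
        # the remainder as an absolute airspeed.
        value = int("".join(digits[1:]))
    return ("absolute", value)

assert interpret_speed(["2", "0"]) == ("relative", 20)        # +20 knots
assert interpret_speed(["2", "2", "0"]) == ("absolute", 220)  # fly 220 knots
assert interpret_speed(["2", "2", "2", "0"]) == ("absolute", 220)  # "to 220"
```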
Error Detection Using Contextual Information
Oftentimes a recognition or input error occurs that does not create any parsing problems
and can in no way be detected by the parser4. These types of errors create special difficulties
and fall into two general classes. The first of these involves errors that arise from the non-
recognition of words which are optional and need not be included in the command. Examples
of these are the words "and-maintain" or "heading". In most cases, these words are not
critical to the understanding or execution of the command issued and thus their omission
does not create any problems. However in other cases, such as in specification of an aircraft
flight number, non-recognized words can cause problems.
The second class of these errors involves mis-recognition errors made amongst words
that lie on the same state-to-state branch. In this case, the parser transitions to the correct
state but does so based on the wrong input (with respect to what was actually said). Since
the parser still transitions to the proper state, there is no way for these errors to be detected,
at least with the standard FSM mechanism. A good example of this is in the transitions
between states h_4 and h_5 in Figure 4.10. Here, if a mis-recognition is made amongst any of
the digits, for example, mistaking "four" for "five", there is no way for the parser to detect
it. It would however result in an incorrect heading being issued if the user did not detect it
and explicitly correct it.
Since these errors could not be detected by syntax alone, an alternative technique had
to be employed. This technique utilizes information about the airspace and the aircraft
flying in it, termed contextual information, in order to detect some of the discrepancies and
ambiguities in the issued commands.
A good example of its use can be found in the determination of which aircraft was being
referred to by the controller. Here, after the command has been terminated and before the
commanded action has actually been executed, comparisons are made in order to determine
which of the aircraft in the simulation possess names similar to the one recognized. If the
result is unique, then the command can be executed. If however there is more than one
4They can however be detected by the user if he monitors the visual feedback.
possibility, either because not enough of the aircraft's name was issued to make it unique,
or because of a recognition error made in the aircraft name, then a suitable error message is
displayed. This action is readily modifiable to, for example, determine the likeliest candidate
aircraft, based on a user defined measure of merit, and assume that the controller was referring
to this aircraft (if he really wasn't, then the pseudo-pilot response would indicate this error
to him).
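The aircraft-identification check described above can be sketched as follows; the helper name, the lowercase word tuples, and the sample traffic are all hypothetical stand-ins, not the actual SIP code:

```python
def match_aircraft(recognized_words, active_aircraft):
    """Return aircraft whose names begin with the recognized word sequence.

    recognized_words: words heard so far, e.g. ["united", "six", "five"]
    active_aircraft: full names of aircraft in the simulation, as word tuples
    """
    spoken = tuple(w.lower() for w in recognized_words)
    return [name for name in active_aircraft
            if tuple(name)[:len(spoken)] == spoken]

# Hypothetical traffic sample: two aircraft share the "united six" prefix.
traffic = [("united", "six", "five", "zero"),
           ("united", "six", "two", "one"),
           ("twa", "three", "four", "four")]

# "united six" is ambiguous (error message displayed); adding one more
# digit makes the reference unique and the command can be executed.
assert len(match_aircraft(["united", "six"], traffic)) == 2
assert match_aircraft(["united", "six", "five"], traffic) == \
    [("united", "six", "five", "zero")]
```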
This type of contextual error checking could, in a similar manner, be extended to include
other possible error occurrences. For example, before executing a vectoring change command,
the direction to turn onto ("left" or "right") could be used in conjunction with the new
heading and the aircraft's current heading to check for possible errors. The same could be
done with "climb" and "descend" in an altitude change command. These however were not
implemented in this initial version of the SIP.
Evaluation
Although the FSM approach provided a straightforward and simple method of using ATC
command syntax for error detection, its performance when a recognition error was made was
lacking. Since it possessed no innate ability to correct for and/or otherwise compensate for a
recognition error, when an error did occur, the user was forced to stop and correct it before
he could proceed with any verbal input.
The reason for this was the very rigid structure of the FSM. With this, it was very critical
that the parser transition correctly from state to state since the valid vocabulary words are
defined based solely on what the current state is. Thus, if an error caused a transition to the
wrong state, then the valid vocabulary became something other than what it should actually
have been. This implied that subsequent controller input would not be parsed correctly even
if it was validly recognized.
This problem was ironically made even more acute by the capability of the VPC for
continuous speech (ironic because the continuous speech capability was one of the primary
reasons for selecting the VPC system). Using this, users spoke entire commands in a stream
of continuous speech, without stopping after every word to make sure that it was recognized
properly. In this manner, if an error resulting in an incorrect state transition were made, the
rest of the words, having already been spoken, would be, when recognized, either discarded
by the parser as invalid input or parsed incorrectly.
This is perhaps best illustrated by an example. Consider the input of the command
"descend three thousand five hundred over". If "descend" were misrecognized for "turn
left", then the parser would be expecting a heading. The first word arriving, the "three" is
a valid heading digit and would thus parse properly. The next, "thousand", however, would
cause an input error to be indicated. The "five" would again be valid as a heading digit
but the next word, "hundred", would be invalid. Thus, the parser would assume that the
command issued so far was "turn left three five" and would be awaiting the final digit of
the heading specification.
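The mis-parse traced in this example can be reproduced with a toy finite-state parser. The state names and the reduced heading sub-grammar below are invented for illustration and are far simpler than the FSM of Figure 4.10:

```python
# Minimal FSM: state -> {accepted word: next state}. A sketch of the
# heading sub-grammar entered after "turn left": three heading digits.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "niner"]
FSM = {
    "h1": {d: "h2" for d in DIGITS},    # first heading digit expected
    "h2": {d: "h3" for d in DIGITS},    # second heading digit expected
    "h3": {d: "done" for d in DIGITS},  # third heading digit expected
}

def parse(words, state="h1"):
    """Feed words to the FSM; return (accepted words, rejected words)."""
    accepted, rejected = [], []
    for w in words:
        if state != "done" and w in FSM.get(state, {}):
            state = FSM[state][w]
            accepted.append(w)
        else:
            rejected.append(w)  # an input error is indicated to the user
    return accepted, rejected

# "descend" misrecognized as "turn left": the altitude words arrive
# while the parser expects heading digits.
acc, rej = parse(["three", "thousand", "five", "hundred"])
assert acc == ["three", "five"]        # parsed as heading digits
assert rej == ["thousand", "hundred"]  # flagged as input errors
```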
Although this type of problem5 would simply require the user to either correct or repeat
his command, if the stream of speech were terminated by the keyword "Over" the problem
would be complicated further. Under these circumstances, since the VPC was still recognizing
the user's speech (it was the SIP that was treating it as invalid), the "Over" keyword could
be recognized and cause the VPC to switch functions from speech recognition to speech
playback. Thus, it would be impossible for the user to correct the transcribed command
since the VPC was no longer recognizing his verbal input and the error correction keywords
could not be entered. (This same problem would also occur if a mis-recognition of the word
"Over" for some other word, typically the word "four", occurred in the middle of a command
input.)
There were basically three methods of handling this problem. In the first, the receipt of
the word "Over" at any state in the FSM would cause the parser to be reset to the initial state
and an appropriate pseudo-pilot message indicating the error, if possible, to be generated
(see Section 4.3 for a description of how this would take place). The VPC would also be
switched back into speech recognition mode. This procedure however, made no allowances for
the user to correct the errors in the current command and instead forced him to repeat it in
5Although illustrated with an example using a mis-recognition error, this type of problem could occur with any error that resulted in an incorrect state transition, including spurious and non-recognition errors.
its entirety6. The underlying philosophy here was that only pseudo-pilot responses would be
used for error feedback (no feedback display) and that these would indicate to the controller
that an error was made in interpreting his command.
Although this approach solved the VPC mode switching problem, it forced the user to
repeat the entire command just issued. In order to remedy this, a second approach was taken
in which the branches based on the word "Over", at states that were not syntactically valid
states for command termination, were modified so that they branched back to the same state.
way, the command recognized so far was not lost. Furthermore, since the VPC was again
reset to speech recognition mode, the user could correct any errors and continue on with the
command input from this point.
A third, though not as elegant, technique for handling this type of error was to simply
refrain from saying "Over" until the display could be examined in order to determine that no errors
were made. If any were detected, then these could be corrected before the command was
terminated. This approach slowed down the command input process significantly, even in
those cases where there were no errors made, but was successful.
Finite State Machine with Set Switching for Vocabulary Size Reduction
From the testing done with the previous configuration, it was readily apparent that a
means of reducing recognition errors had to be found. One way that it was thought this
could be accomplished was through the use of the set switching capability of the VPC as
outlined in Chapter 3. This could be used in conjunction with the FSM implementation of
the parser in order to specify the active vocabulary of the VPC as a function of the state of
the parser. In this way, the ASR system would only recognize words from a list of valid input
words. Thus, when these recognized words were received by the FSM, they were guaranteed
to be syntactically valid. By reducing vocabulary size in this way, recognition delays could
also be significantly reduced.
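A sketch of how such per-state vocabulary sets might be organized is shown below; the state names and word lists are illustrative, not the actual VPC set definitions:

```python
# Sketch of set switching: the active vocabulary loaded into the
# recognizer is a function of the parser's current state.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "niner"]

ACTIVE_VOCAB = {
    "start":    ["united", "twa", "air-canada", "cp-air"],
    "command":  ["turn", "climb", "descend", "increase", "decrease"],
    "heading":  DIGITS,
    "altitude": DIGITS + ["hundred", "thousand"],
}

def load_vocabulary(state):
    """Words the ASR system should consider while in this parser state."""
    return ACTIVE_VOCAB[state]

# In the heading state only digits can be recognized at all, so a
# syntactically invalid word such as "thousand" can never be reported.
assert "thousand" not in load_vocabulary("heading")
assert "thousand" in load_vocabulary("altitude")
```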
In order to do this, the required vocabulary set switching logic had to be contained on
6Its action was similar to that taken when a "Cancel" command was issued except the feedback display was not cleared. Instead, a "?" was displayed as the command terminator in order to indicate that an error had occurred.
the IBM PC (the delay required to transmit this information from the Explorer would make
implementation intractable). Since this significantly complicated the code operating on the
PC (a FSM complete with all of the error detection and correction logic, would have had
to be constructed), some of the simpler "built-in" set switching features of the VPC system
were used. These however resulted in an active vocabulary that was not as rigorously defined
as it could have been. That is to say, the active vocabulary at any given state contained
words that would not be valid given that state. This however did not alter the validity of
the results obtained using this configuration since recognition errors resulting in syntactically
invalid words were rare.
As would be expected, the recognition accuracy was improved by the reduction of the
active vocabulary size at each state. This configuration however still exhibited the same
sensitivity to errors that occurred with the previous configuration. With the previous config-
uration, an error causing an improper transition would cause subsequent controller input to
be discarded as invalid. This would be determined at the Explorer end, after the words had
been recognized. With this configuration however, the same error would cause the controller's
verbal input to be discarded at the VPC end, before recognition had even occurred (actually,
during the recognition stage itself). This was because the VPC would now be trying to recognize
the controller's input based on the wrong active vocabulary. Thus, it would seem from the
perspective of the Explorer and the SIP, that the controller had stopped talking. Although
this made almost no difference as to what was seen by the user with either configuration (he
still had to stop and correct the error that caused this before he could go on), a lot of
potentially useful information, contained in the rest of the command and of use in correcting
this error, was lost.
FSM with Inferior Choice Words
A technique that yielded similar results to set switching again incorporated the baseline
FSM structure but this time made available to it information about how well all of the words
in the vocabulary matched the current verbal input. Thus, if the first choice word (as specified
by the ASR system) would not parse correctly, then the second could be examined and so
on until one did parse properly. In this way, all of the syntactically invalid words would be
disregarded, thus, in effect, generating the same results as with set switching. Although this
parser did not stifle the ASR system's ability to get back on track after a recognition error was
made, the increase in size of the active vocabulary (it was now the entire vocabulary) created
other problems. In particular, recognition delays were increased since more comparisons had
to be made in the recognition process. The recognition error rate, however, did not increase
because the parser would detect any potential errors arising from words that did not parse
correctly and examine inferior choices to find one that did parse correctly. Errors involving
words that did parse correctly were not detectable even with the previous parsers.
Operationally, there were other factors to consider. First, a threshold test had to be
implemented on inferior choice words in order to prevent words that had very poor scores,
and were thus, not likely said by the user, from being parsed. This prevented random
"garbage" from being parsed into a valid command. Further modifications to this threshold
test could also be implemented using the relative score differences between the best guess
and the one that would parse correctly (if these were different). If this was small, then it was
very likely that the two words could easily be confused. If it was large, then the second choice
word was not likely what had been said and an error had probably been made at a previous
stage in the parsing process. This however did not add significantly to the performance of
the parser and hence was not included.
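The first-choice/second-choice selection with a score threshold might look like the following sketch; the score units, the threshold value, and the function name are all hypothetical (and with the VPC only the top two candidates were actually available):

```python
def select_word(candidates, valid_words, worst_score=200):
    """Pick the best-scoring candidate word that parses correctly.

    candidates: [(word, score), ...] best guess first; lower score = better
                (score units and the threshold are invented for illustration)
    valid_words: words the parser will accept in its current state
    worst_score: reject anything scoring worse than this, so that random
                 "garbage" is not parsed into a valid command
    """
    for word, score in candidates:
        if score > worst_score:
            break          # too poor a match: likely not said at all
        if word in valid_words:
            return word    # first candidate that is syntactically valid
    return None            # indicate an input error

# First choice is syntactically invalid where a heading digit is expected,
# so the second choice is taken -- unless it fails the threshold test.
assert select_word([("over", 80), ("four", 95)], {"four", "five"}) == "four"
assert select_word([("over", 80), ("four", 250)], {"four", "five"}) is None
```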
Second, care had to be taken to accept the word "Over" only when it was the first choice
since its receipt would indicate to the Explorer that the VPC had switched modes. Thus, a
command to switch it back to recognition mode (or a pseudo-pilot message) would be sent
to the VPC. The VPC however only switched to playback mode when the first choice word
was "Over". Thus, an extra reset command would be contained in the VPC's input queue
and would result in speech playback synchronization problems.
Finally, the VPC system only made information about the top two candidate words
available. Although in most cases this was sufficient, there were instances where neither of
the top two choices were correct and the third was not available. Thus, an input error was
indicated when there was the potential to recover from it. This was simply a limitation of
the VOTAN software interface since, in order to determine the best match, all of the words in
the vocabulary had to be scored anyway. The interface provided, however, only allowed access
to the top two.
Even if the second choice word was the correct selection, it was not possible to realign
the recognition algorithm to commence at the end of this word. Thus, there were time mis-
alignment problems, as mentioned in Section 2.2 and in Figure 3.4, which often resulted in
recognition errors in the succeeding words as well.
For these reasons, the performance of this system, although comparable, was not as good
as that for the basic FSM with set switching. This however, could likely be remedied to a
great extent if more than the top two choice words were made available by the VPC.
Other Variations
There are further variations possible on this FSM approach. However most of these
become intractable, either because the interrelationships between recognized words in con-
tinuous speech cannot be readily included or because only the top two choices as to the
recognized word are available with the current VPC software.
One of these variations that possesses a great deal of elegance and simplicity and deserves
mention involves the assignment of a confidence (or score) to each branch of the FSM based
on the likelihood that the word contained in that branch was the one that was actually
spoken. The command actually entered would then be defined by the path that possessed
the best score. This system however breaks down due to the extreme difficulty in assigning
scores to all of the branches in this manner. In particular, since in continuous speech the
currently recognized word affects the recognition of the words following it (word boundaries),
the recognition algorithm would have to be run quite a number of times on the same block
of speech data in order to obtain scores for all of the word sequence combinations required
to score all the branches of the FSM. Even if this could be done, which it cannot be with
the VPC, the delays would go up dramatically. This could be modified slightly, however, so
that only the top two or three choices would be used to determine the possible branchings,
or, more precisely, only the top choices within a specified range of recognition scores, thus
performing these evaluations only when confusion between words is likely. However, this still
becomes complex very rapidly and again, could not be implemented with the VPC.
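In the abstract, this branch-scoring variation amounts to a lowest-cost-path search over the FSM, as the following sketch illustrates. The tiny grammar and the per-word costs are invented, and, as noted above, nothing like this could actually be run on the VPC:

```python
import heapq

def best_path(graph, start, goal):
    """Lowest-cost path search over branches scored per word.

    graph: state -> [(word, cost, next_state), ...]; cost is a hypothetical
           per-word recognition score (lower = better match).
    Returns (total cost, word sequence) of the best-scoring path.
    """
    queue = [(0, start, [])]
    seen = set()
    while queue:
        cost, state, words = heapq.heappop(queue)
        if state == goal:
            return cost, words
        if state in seen:
            continue
        seen.add(state)
        for word, w_cost, nxt in graph.get(state, []):
            heapq.heappush(queue, (cost + w_cost, nxt, words + [word]))
    return None

# Two competing interpretations of one utterance, scored branch by branch.
graph = {
    "s0":   [("turn", 10, "s1"), ("descend", 40, "alt1")],
    "s1":   [("left", 5, "h1")],
    "h1":   [("three", 5, "done")],
    "alt1": [("three", 5, "done")],
}
cost, words = best_path(graph, "s0", "done")
assert words == ["turn", "left", "three"]  # path cost 20 beats 45
```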
Pattern Matcher
While the FSM approach used so far was successful in incorporating ATC command
syntax for the detection of errors, it was unable to correct or in any way recover from
them without external aid. Thus, it was left up to the user to correct these before he could
commence inputting data. This requirement for user intervention every time an error occurred
resulted in a very difficult command entry procedure.
In order to remedy this, a different approach was taken in the design of a parser. This
approach, termed the Pattern Matcher, or PM, attempted to make use of the fact that even
though a command might contain one or two errors, its general intent could, at least in some
cases, still be readily inferred. In this way, the user would not be required to correct all of
the recognition errors that occurred.
The general procedure here involved comparing the entire input command to a database
of allowable commands on a word by word basis in order to determine which was the best
match. In this way (in a manner somewhat analogous to the speech recognition process itself),
the input command could be "recognized" and the required command action determined.
Since the explicit enumeration of all possible word sequences constituting the "recog-
nizable" commands was unrealistic, a more compact notation was used to construct the
aforementioned database of allowable commands. In particular, groups of words that are
logical entities are represented by keywords beginning with the symbol "+", in order to
avoid having to list all of the possibilities. For example, "+aircraft" is used to represent
combinations of words that can denote an aircraft. In a similar manner, "+altitude" is used
to represent word groups that specify an altitude, and so on.
An example of how this database of controller commands appears can be seen in Figure
4.14. This database contains the same commands that were incorporated into the FSM in
the previous sections.
Determining which command was spoken when no recognition errors were made was fairly
+aircraft turn right heading +heading
+aircraft turn left heading +heading
+aircraft turn right +heading
+aircraft turn left +heading
+aircraft climb and-maintain +altitude
+aircraft descend and-maintain +altitude
+aircraft climb +altitude
+aircraft descend +altitude
+aircraft increase speed-to +speed
+aircraft decrease speed-to +speed

Figure 4.14: Table of ATC Commands used in Pattern Matcher database
straightforward and consisted simply of stepping through each word in the input command
and comparing it to the corresponding position in the command template.
When recognition errors were made however, the process became much more difficult.
This was because mis-recognition errors could obscure words, and spurious or non-recognition
errors, resulting in word omissions from or insertions into the input command, could create
problems aligning the input and template. In order to allow for these, a simple procedure whereby
adjacent words were examined to determine if and what types of errors had occurred was
used. This procedure is perhaps best explained by example.
Consider for example the input command I1 I2 I3 I4 and the template command T1
T2 T3 T4. The procedure for comparing these would be as follows.
1. If I1 matches T1, then a match is declared and the comparison proceeds with the next
two elements, I2 and T2.
2. If I1 doesn't match T1, then I1 is compared to T2.
(a) If I1 matches T2, then it is assumed that word T1 has been omitted from the input
stream and the matching process proceeds comparing I2 and T3.
(b) If I1 does not match T2, then I2 is compared to T2.
i. If I2 matches T2, then I1 is treated as a mis-match of T1.
ii. If I2 does not match T2, but I2 matches T1, then I1 is treated as a spurious
input.
iii. If I2 does not match T2 and I2 does not match T1, then I1 is again treated as
a mis-match of T1 and the comparison process is repeated commencing with
I2 and T2.
Thus, it can be seen that the result of this comparison is a sequence of either match, mis-
match, spurious, or omitted. These are then used to determine a score for the comparison.
The actual scoring used is, to a certain extent, somewhat arbitrary. In the baseline scoring
used, a match scored 0, a mis-match 0.75, an omission 1.0, and a spurious input 1.0. The scores
that were thus generated for each comparison of template and input were tallied and the
"recognized" command was selected as that with the lowest score.
On the whole, this parser was successful in demonstrating some of the things that it set
out to accomplish. In particular, with this approach, it was possible for errors to be made
and still allow the command to be recognized.
There were quite a few instances however where the errors were such that more than
one command template possessed the minimum score. Thus, it was unclear what the issued
command actually was.
Also, the fact that recognition errors tended to occur in successive words (due to word
boundary misalignment) created significant difficulties for the current algorithm since it only
examined adjacent words in performing the matching process.
Even if errors were such that the command could be recognized, there was still the
possibility that they would make the commanded action unclear. This would occur, for
example, if an error were made in the recognition of a digit in a heading, or altitude, or even
aircraft name.
This developmental system suffered from the fact that no mechanism for interactive error
correction by the user was included. As such, if an unrecoverable error was made, the user
was required to input the entire command again. This was done primarily to simplify the
code.
Still, the pattern matching approach provided benefits over the FSM approach. In partic-
ular, the improved error robustness provided by the innate capability to allow for recognition
errors was very desirable. Furthermore, with this structure, the parser could take advantage
of the VPC system's capability to get back on track after a recognition error was made, and
thus allow the user to keep speaking even if an error was made. This created a much more
realistic simulation environment since the user did not have to pause after every few words
to monitor the feedback display and correct errors as they occurred before he could continue
with his verbal input. It also more accurately duplicated the procedure used by humans to
recognize speech.
Its operation was however very slow. This was because none of the actual pattern match-
ing was attempted until the entire command was received (i.e., terminated by the keyword
"Over"). This was done for two reasons. First of all, it greatly simplified the development
of the Pattern Matcher and second, information as to what the next words in the command
input were was required by the matching algorithm.
Possible modifications and improvements to this system are numerous. Clearly the scoring
could be modified to, for example, reduce the significance of mis-matches involving unim-
portant inputs such as "and", or, to increase the importance of others, more critical to the
message content, such as "descend" or "turn". As well, since with a continuous speech ASR
system a mis-recognition error in one word increases the likelihood of a recognition error
(mis-recognition or non-recognition) in the next7, adjacent errors could be scored slightly less
or the matching algorithm could examine words further down the input in order to make a
decision about errors at the current position in the input sequence.
Further modifications could include the use of inferior choices for each recognized input
word into the comparison process. The score could then be suitably modified to reflect this,
possibly by using the recognition score of the particular word selected.
7Due to the word boundary misalignment problem mentioned earlier.
Similarly, a confusability matrix, a matrix whose elements specify the likelihood of the
ASR system confusing one word for another, could be determined (empirically) and used to
score hypothesized misrecognitions involving any word pairs.
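Such a confusability matrix might be folded into the mis-match scoring as in the sketch below; the probabilities shown are invented, since in practice the matrix would have to be measured empirically:

```python
# Hypothetical confusability matrix: entry [a][b] is an empirically
# measured probability that the recognizer reports b when a was spoken.
# The numbers here are invented for illustration only.
CONFUSABILITY = {
    "four": {"four": 0.90, "over": 0.08, "for": 0.02},
    "five": {"five": 0.95, "niner": 0.05},
}

def mismatch_cost(template_word, recognized_word, base_cost=0.75):
    """Scale the mis-match penalty down when the word pair is easily confused."""
    p = CONFUSABILITY.get(template_word, {}).get(recognized_word, 0.0)
    return base_cost * (1.0 - p)

# Hearing "over" where the template expects "four" is penalized less than
# an unrelated substitution, which costs the full baseline penalty.
assert mismatch_cost("four", "over") < mismatch_cost("four", "turn")
assert mismatch_cost("four", "turn") == 0.75
```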
It is beyond the scope of this work however to examine all of these possibilities but they
are mentioned here for the sake of completeness.
4.2.4 Discussion
From the material covered in the previous sections, it can be seen that the primary
difficulty in the design of the Speech Input Interface (or ATCCR system) is the handling of
recognition errors. Although means were presented whereby these could be corrected by the
user, it was desirable that the SIP system itself be able to correct some of these.
In designing this capability for internal error correction, it became readily apparent that
what was being attempted by the PM, and to some extent, the FSM through some of the
variations on it, was directly analogous to what was being done in connected speech recog-
nition systems. That is to say, attempts were being made to find the best sequence of
words, subject to syntactical constraints, that matched a given "segment" of speech input.
This matching process was such that comparisons went in both directions. Thus, not only
could words recognized early on affect the recognition of words spoken later, but the con-
verse could also occur. The difference however is that the SIP operates on entire commands
whereas connected ASR systems work on short phrases delineated by pauses.
This differed significantly from what goes on in continuous speech recognition. In con-
tinuous speech recognition, a top-down, fore-aft process in which only the past history of
recognized words can affect the recognition of any given word is performed. This is directly
analogous to the approach taken by the Speech Input Parsers based on a FSM.
This is not, in general, what is desired in recognizing entire commands since in a great
number of cases, it is not clear, even if no recognition errors are made, exactly what was said
until most or all of the command has been recognized. Although systems using this fore-aft
approach can still be quite successful in recognizing commands, there is a lot of information
contained in the rest of the input that can be used to great advantage in order to reduce the
error rate. Furthermore, there are instances where these are guaranteed to fail in recognizing
what was said.8
Thus, the question was raised that perhaps a connected speech recognition system was
more amenable to the task at hand (since this is what was in effect, being done by the PM).
On closer reflection however, it was decided that what was really desired was a connected
speech recognition capability in a continuous speech recognition system9. In this way, the
speech input data could be "rewound" to various locations to re-examine it, perhaps under
different operating parameters. Thus, the immediate feedback benefits of continuous speech
recognition are maintained until an error is detected. Then, the system could re-examine the
input using connected speech recognition techniques.
In this light, the capability of the VPC system to get itself back on track after a recog-
nition error was made (assuming, of course, that set switching was not being used in the
recognition algorithm) was thought to be desirable. It could be used in order to continue
generating valid recognized input, even after an error was made and this input could be used
in order to allow the parsers to hypothesize where and if any errors had been made. The
ASR system could then be rewound to this point in order to attempt to verify the error
hypothesis.
Although the FSM could also generate these error hypotheses, the Pattern Matcher was
potentially much better. This was because the pattern matching process determined the exact
location of the differences between the input command and the command template. Thus,
the locations of candidate recognition errors could be hypothesized to allow the ASR system
to be rewound to these points.
For example, if a vectoring command was spoken and only two digits of the three digit
heading were recognized, then the ASR system could be rewound to the section of speech
"As exemplified by the "speed to" command where it is unclear until the actual speed change is recognizedwhether the controller said "to" or "two".
9An interesting variation might be to have two ASR systems, one continuous and the other connected, recognizing verbal input and making it available to the SIP in parallel. In this way, the disadvantages of each ASR system on its own could potentially be masked by the operation of the other.
waveform data where this heading was issued and it could be reprocessed with a relaxed
recognition threshold in order to extract the missing digit.
Furthermore, this could be used to provide a "what if" type of feature in which any word
in the recognized stream of input could be replaced by another (say the second or third choice
word replacing the first choice word) and the resulting changes in the recognized output and
scores could be observed. This could be used to great advantage to resolve errors resulting
from word boundary synchronization problems in following words when a recognition error
is made.
This capability to rewind speech data to an arbitrary point and restart the recogni-
tion algorithm, however, was not possible with the current configuration of the VPC ASR
system10. Furthermore, the additional computations would probably increase delays dra-
matically. Thus, with the current system, although hypotheses can be made as to what
phrase was actually said, there is no means of going back and verifying this hypothesis
through re-analysis of the speech data.
This however was not the only instance where ASR system limitations affected the design
of the SIP and its error correction schemes. For example, recall that the FSM with Inferior
Choice Words SIP was limited since only the top two choices were available. Yet another
example relates to the fact that the only output from the ASR system is recognized words.
Thus, if a non-recognition error occurred, there would be no way for the SIP to know that a
word was actually spoken but not recognized.
In general, these limitations were not related to the actual recognition procedure being
used but are instead directly attributable to the black box design of the ASR system. This
is because ASR system designers do not want to unnecessarily complicate the user interfaces
for their systems. Thus, a lot of the internal operation is hidden from the user. The end
result however is that complex techniques that attempt to improve or correct recognition
errors are powerless due to a lack of information from and control over the ASR system.
10This has, to a certain extent, been remedied through the introduction of a library of C routines that provide a lot more flexibility in what can be done with this system.
4.3 Pseudo-Pilot Responses
One of the useful features of the VOTAN VPC system was the digital recording capability.
This allowed the user to record a number of spoken messages of varying length, subject to
memory limitations, and then play them back in any particular sequence desired. This feature
was used to incorporate pseudo-pilot responses into the Air Traffic Control simulation.
As mentioned in Section 4.2.2 one of the principal uses of the pseudo-pilot in the simu-
lation task is to provide feedback to the user about possible recognition errors. This form of
feedback can be used to indicate four general statuses of command recognition.
In the first, the controller's command was received correctly and there were no apparent
problems in understanding it. In this case, the pseudo-pilot is used to generate a standard
acknowledgment to the controller indicating the command was received.
In the second, an error was made in recognizing the specific command; however, the air-
craft being referred to was known. Here the p-pilot responds with a "Say Again" message
indicating that the message was not received correctly. In the third, it was unclear which
aircraft was being referred to and, as such, there is no p-pilot response. Finally, in the
fourth, an error was made in determining which aircraft was being referred to, and therefore a
response is generated from a random aircraft. The exact handling of these last two cases can
be varied quite a bit but basically consists of either no pseudo-pilot response or a response
from the wrong pseudo-pilot. The other possibility, responses from multiple aircraft, was just
not possible with the current system and might not be tractable even if the capability existed.
In order to obtain the desired flexibility in response message format as well as reduce the
memory requirements, the response messages were constructed by connecting together
shorter, more general messages. These shorter messages consisted of a specified number
of air carrier names (the simulation was required to operate with only these particular air
carriers), followed by the digits and a number of keywords. A table outlining exactly what
the pseudo-pilot's "vocabulary" was has been included in Table 4.1.
Furthermore, in order to simulate different aircraft pilots (the controller does use this
information), a number of speakers were recorded creating a database of pseudo-pilot voices.
Table 4.1: Table of discrete messages recorded for Pseudo-pilot response formulation
Thus, as each aircraft entered the controller's airspace, it was assigned a particular pseudo-
pilot voice. All responses from this aircraft were then made using this particular voice.
The format of pseudo-pilot responses was kept fairly basic. They consisted of the identi-
fication of which aircraft was responding followed by a "Roger" message implying message
received and acknowledged, or a "Say Again" message implying that the message was not
understood. With the inherent flexibility however, these could later be expanded to in-
clude a much broader range of message formats. For example, a typical response message
acknowledging a command would be the message "United Airlines six five zero Roger.".
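Composing a response from the discrete recorded messages of Table 4.1 can be sketched as follows; the playback-index scheme is hypothetical:

```python
# Sketch of building a pseudo-pilot response from the short recorded
# messages of Table 4.1. Playback indices are invented for illustration;
# the actual VPC was driven by message numbers sent over the serial port.
MESSAGES = ["United Airlines", "TWA", "CP Air", "Air Canada",
            "Say Again", "Roger", "heading", "altitude", "hundred",
            "thousand", "zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "niner"]

def acknowledge(carrier, flight_digits):
    """Playback queue for '<carrier> <digits> Roger'."""
    seq = [MESSAGES.index(carrier)]
    seq += [MESSAGES.index(d) for d in flight_digits]
    seq.append(MESSAGES.index("Roger"))
    return seq

# The acknowledgment "United Airlines six five zero Roger" from the text.
seq = acknowledge("United Airlines", ["six", "five", "zero"])
assert [MESSAGES[i] for i in seq] == \
    ["United Airlines", "six", "five", "zero", "Roger"]
```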
The scheduling of the pseudo-pilot functions was performed primarily at the VPC end. The
actual generation of which pseudo-pilot messages were to be played, however, was performed on
the simulation computer (Explorer). A flowchart indicating the actions performed by the VPC
board can be seen in Figure 4.15. Basically, it remained in speech recognition mode until the
keyword "Over" was recognized. It then switched into speech playback mode.1 In this mode,
the first input from the Explorer (via the RS-232 Serial Port) specified which pseudo-pilot
voice dataset was to be used and the subsequent input specified which particular message
was to be played. Upon receipt of the "end-of-message" flag, the VPC was switched back
into recognition mode to repeat the process over again.
"In order to indicate these mode changes to the user so that he would know when the system was listeningto speech, a beep was sounded before entering and after exiting recognition mode.
Message Recorded        Message Recorded
United Airlines         zero
TWA                     one
CP Air                  two
Air Canada              three
Say Again               four
Roger                   five
heading                 six
altitude                seven
hundred                 eight
thousand                niner
The critical and determining factor in this sequencing strategy is the use of the word
"Over" in the speech input stream as the command terminator that would cause the VPC to
switch functions. Although it was desirable that pauses in the controller's speech also be used
to indicate command termination in conjunction with this, the only way that this detection
and timing of pauses in the user's speech could be implemented caused the VPC to switch
into and out of recognition mode on a regular basis if there was no verbal input. Thus, there
was the possibility that when the controller did speak, the VPC would not be in recognition
mode and would miss part of his spoken input. For this reason, this was not implemented.
Ideally of course, there should be two different systems, one for speech recognition and one
for speech output. In this way, there would not be the same problems with sequencing.
Pseudo-pilot responses were generated by the simulation computer upon receipt of the
keyword "Over" from the speech recognizer. If the verbal command parsed correctly, then an
appropriate acknowledgment message would be generated. If the command was ambiguous
but the aircraft being referred to was not, then a "Say Again" message would be generated.
If, however, the aircraft specification was ambiguous, then any action could be taken. In the
current configuration, the user was prompted on the display that the aircraft specification was
ambiguous and a null message was played. Note that in any case, the "end-of-message"
flag must be sent to the VPC in order to switch it back into speech recognition mode.
Furthermore, since the VPC was constantly being switched into and out of recognition mode,
it was necessary for the user to know when the recognition system was "listening" to his verbal
input. In order to do this, a beep was sounded whenever the VPC switched modes. This
could easily be modified to include a visual signal on the controller's display since his attention
is fixed there anyway. Experimentation revealed however that it was much simpler to use
the aural cue.
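The response-generation decision just described can be summarized in a small sketch. This is an illustrative reconstruction, with hypothetical names; the source only specifies the three outcomes, not the code.

```python
# Sketch of the simulation computer's response decision upon receipt of
# "Over": a correctly parsed command is acknowledged; an ambiguous
# command with an unambiguous aircraft gets "Say Again"; an ambiguous
# aircraft specification gets a null message plus an on-screen prompt.
def choose_response(parsed_ok, aircraft_unambiguous):
    if parsed_ok:
        return "acknowledge"      # e.g. "United Airlines six five zero, Roger"
    if aircraft_unambiguous:
        return "say_again"        # aircraft known, command unclear
    return "null"                 # user prompted on the display instead
```

In every case the end-of-message flag must still follow, so even the "null" branch involves a round trip to the VPC to restore recognition mode.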
Evaluation
The performance of the pseudo-pilot, although it did add a degree of realism and satisfaction to the ATC simulation, was lacking. The primary problem was directly attributable
to the use of short messages that were concatenated together to form a suitable response
message. This created two problems.
Figure 4.15: Flowchart of sequencing of VPC2000 functions
The first problem was a result of the apparently random changes in pitch and intonation
of the recorded words. These arise from the fact that when the messages were recorded, it
was very difficult to avoid the introduction of inflections and emphasis. Thus, when they
were connected together, these inflections did not mesh together very well and produced very
strange sounding, although still intelligible, pseudo-pilot responses. This could be remedied
through a more careful and iterative recording procedure where such messages are erased and
re-recorded, or, different messages could be recorded with specific intonations for playback
at different positions in the response sequence.
The second problem concerned the discernible pause between the messages as they were
played back. This was directly attributable to the concatenation procedure used to construct
pseudo-pilot response messages. This made for responses that were very slow and often
created a significant delay to the controller since he was not able to issue the next command
until the message was finished playing.
Although the internal operation of the VPC in speech playback mode could not be mod-
ified, there were still a number of ways to reduce the effects of this problem. The first was
to reduce the call-out for identification of the aircraft by omitting the flight number and
using the carrier name only. This would reduce the number of concatenations necessary to
construct the response message and decrease the delays. This, however, created the potential
for confusion between different flights of the same carrier, but the distinct voices used for the
different pseudo-pilots alleviated this to some extent.
The second method used was to record some of the more common response messages as
entire messages and add these to the existing pseudo-pilot vocabulary. By doing this, not only
were delays eliminated, but the responses were much more realistic sounding. In particular,
the "Roger" and "Say Again" messages were recorded in this way. Since it was not possible
to know all of the flight numbers of the aircraft in the simulation beforehand, these messages
were simply recorded with only the carrier name for identification. By using these messages
only when fast responses were desired and the standard message format in other cases,
pseudo-pilot flexibility was still maintained. This scheme however, although addressing the
intonation and concatenation delay problems of the pseudo-pilot responses, still possesses the
problem of potentially ambiguous aircraft specification. This could be remedied by further
customizing these messages to include the aircraft flight number as well but this would entail
knowing beforehand the exact names of the aircraft that would be operating in the simulation
and would greatly limit flexibility. Furthermore, responses for every aircraft would have to
be recorded and this would greatly increase memory requirements.
In light of these findings, the final message format selected was to use the format initially
described for its flexibility in addition to some messages recorded in their entirety for their
realistic qualities.
The different pseudo-pilot voices however still made it possible to determine which aircraft
was responding if more than one flight from that particular carrier was in the air.
In all of these cases however, only a fixed number of recordings are being used so there is
not much variability in the sound of the responses as there would be in the real world. Thus,
although the basic task is accomplished, there are some drawbacks in terms of realism.
4.4 Discussion
In general, the simulation worked fairly well and was a good vehicle for the demonstration
of ATC command recognition. The use of the VPC was successful in eliminating the require-
ment for blip drivers both for speech input and speech output functions while maintaining a
high degree of realism. Extensive testing of the simulation however, served to indicate some
limitations and problems with both ATC command entry in relation to the simulation task
as well as in general. These, in addition to suggestions as to how they can be alleviated, are
discussed in the following section.
Recognition Errors
As would be expected and was amply demonstrated, by far the greatest problem in
incorporating ATC command recognition into the simulation application arose from errors
in the recognition of the controller's verbal input.
These errors resulted primarily from the sensitivity of the VPC to co-articulation effects
and variations in the way that the user spoke. These variations were, in general, thought to be
insignificant to the user (he was in no way trying to "fool" the system or push it to its limits).
However, they did significantly affect the system. If care was taken to maintain consistent
pronunciation between training and use of the system as well as limiting co-articulation
effects through careful articulation of verbal input, then the VPC performance was found to
be more than adequate to accomplish the tasks attempted. If however this was not the case,
then the error rate increased enough to make its use difficult.
The recognition errors that occurred were, for the most part, very similar to those encoun-
tered and described during the initial system evaluation of Chapter 3. There were however
some additional errors incurred when the user paused and said "ummm" or "aahhh" while
entering a command. This often led to word insertions or spurious recognitions since these
sounds were often mis-recognized as valid input. One method of correcting for these was to
train these sounds as they were made by the user and add them to the vocabulary. Thus,
they could hopefully be recognized and eliminated appropriately by the parser. This how-
ever, created more errors than it eliminated since there was a great deal of variability in how
these sounds were made by the user and thus they were rarely recognized. As well, the ad-
dition of these templates into the recognition vocabulary created a lot more mis-recognition
errors. As such, the operator was instead required to try and avoid these expressions and
state commands clearly.
Furthermore, there were instances where the subject controller desired to talk to other
people and could not, since the ASR system was listening in. In order to allow for this, a
microphone cut-off switch was included in the set-up. This also alleviated the problem with
"umm" and "aahhh" by allowing the user to switch the mike off until he had decided which
command he wanted to enter.
Background noise, which it was found could greatly affect the error rate, was successfully
compensated for through the use of a noise canceling, headset mounted, microphone. The use
of this further increased recognition accuracy by reducing the variability in the positioning
of the microphone with respect to the user's mouth and thus the variations in the signal seen
by the ASR system.
Since the vocabulary was already fixed for the ATC environment, words that were eas-
ily confused with others or were otherwise likely to cause recognition errors could not be
eliminated. Instead, problem words were merged together with other words to form longer
utterances whose recognizability, at least with the VPC system, was enhanced. This proce-
dure tended to be empirical in nature, requiring different word combinations to be tested in
order to determine which would solve the problem. It was this that motivated the concatena-
tion of such words as "and" and "maintain" into the single utterance "and-maintain" (the
"and" was a constant source of recognition error difficulties).
Some desirable modifications to the system to reduce error rates would be the addition
of the capability to train words "on the fly" while the simulation was running and then
add them to the vocabulary. In this way problem words could be retrained so as to reduce
recognition errors. Furthermore, commonly used sequences of words, such as aircraft names,
could be trained and added as single utterances. For example, the utterance "Air-Canada-
one-two-three" could be used in addition to the four words "Air-Canada one two three".
This would improve accuracy since longer words tend to be easier for the VPC to recognize.
Even if this longer utterance were not recognized, then the fallback would be to recognize
each individual word and proceed from there as was done originally. Care would have to
be taken to avoid adding similar sounding utterances to the vocabulary however, since these
could be easily confused by the VPC.
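The proposed long-utterance fallback can be sketched as follows. This is a hypothetical illustration of the idea, not anything implemented on the VPC; the template table and function interface are assumptions.

```python
# Sketch of the suggested fallback: first try to match a whole phrase
# trained as a single utterance; if no phrase template matches, fall
# back to recognizing each word individually, as was done originally.
PHRASE_TEMPLATES = {
    "Air-Canada-one-two-three": ["Air-Canada", "one", "two", "three"],
}

def recognize(heard, word_vocab):
    """heard: the words picked up by the recognizer, in order."""
    phrase = "-".join(heard)
    if phrase in PHRASE_TEMPLATES:          # long template matched
        return PHRASE_TEMPLATES[phrase]
    # fall back to per-word recognition against the 64-word vocabulary
    return [w for w in heard if w in word_vocab]
```

The hazard noted in the text applies directly here: every phrase added to `PHRASE_TEMPLATES` is one more template that can be confused with a similar-sounding one.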
Another technique with the potential to improve recognition accuracy is the use of adap-
tive template modification techniques. Using this, templates could be modified, or even
removed, if their recognition performance was not good. This would be indicated through a
large number of corrections made involving these particular words by the user. The problem
here however lies in determining that these were indeed corrections and not simple input
changes. Furthermore, it is unclear exactly when a template should be modified and when
it should remain unchanged. For example, if a recognition error occurred due to an unreasonable variation in how a particular word was said (e.g., a yawn, background noise, excessive
co-articulation, or mumbling), then adding this template to the vocabulary would probably
only degrade the recognition accuracy (as was evidenced with the addition of the highly
co-articulated template for "eight" discussed on page 41). Thus, improper procedures could
very readily lead to even poorer performance than was originally evident.
By far the biggest difficulties with recognition errors occurred when they involved the
keywords "Over", "Delete", or "Cancel". Since these words performed special functions,
recognition errors involving them would significantly alter the state of the parser. Fortunately,
these words were sufficiently distinct from the other words contained in the vocabulary that
recognition errors involving them were rare. With a larger vocabulary, however, this might
not be the case, and such errors could become a real problem.
One way in which this problem could be addressed would be to duplicate some existing
ASR systems which require the activation of an additional switch on the mike to indicate
that the word being spoken is a keyword. In this way, the user can make certain that these
keywords are not confused with standard input.
Error Correction
Of the two error correction strategies implemented, it was, in general, found that if the
error rate was low (this depended to a great extent on the user and on the particular day, as
was indicated in Chapter 3), the "Cancel" strategy was preferable to the "Delete". This is because the
user would often use the full capabilities of the ASR system and enter the entire command as
part of a stream of continuous speech, since this was much easier than pausing after each word
or group of words in order to check for errors.2 In this manner, it was just as straightforward,
and much less demanding, to cancel the current command and repeat it in its entirety rather
than delete back to the error and commence from this point on. If the error rate was high
however, then since the user could almost be certain that an error would be made as the
command was repeated, it was preferable to simply correct the current command rather than
begin again.
Since it was often difficult and time consuming to "Delete" back to the error and correct
it, the need for additional error correction schemes was indicated.
2. This pausing would also tend to increase the recognition accuracy of the ASR system by reducing co-articulation effects.
One such improvement to the error correction procedure would be to provide the capa-
bility to repeat only certain portions of a command, preceded by a word such as "Check"
which would indicate that an error had occurred in the prior words. For example, consider
the input sequence "TWA turn left heading 090 Check 050 Over". Here, it is clear what is
implied. This type of capability however would be very difficult to incorporate for a number
of reasons. First, if there are errors present, both in the original command and possibly the
modification coming after the "Check" then it would be very easy for the meaning to become
muddled. Furthermore, because the correction could be a correction of any part of the command, the benefits of syntax for error reduction would be eliminated. Thus, this technique
was not implemented in the current configuration, although it does deserve mention.
A much better technique for the correction of input errors would be the incorporation
of multi-modal techniques (mouse and keyboard as well as speech) into the command entry
and correction process. In this way for example, the mouse could be used to select any
recognized word presented in the feedback display. The user would then have the option
of changing/correcting this word, deleting it, or inserting other words at this position. This
could be done by typing, speech, or even through the use of pull down menus containing mouse
sensitive options as to what the recognized word was. Furthermore, with this capability,
commands could be entered by keyboard alone if desired. Thus, hard to recognize or problem
words could simply be typed in, thereby eliminating any frustration on the part of the user
attempting to enter these verbally.
Scope
One of the major drawbacks of the ATC simulation itself was its limited scope. This
arose primarily because of the limited number of ATC commands that could be understood
by the system (only three different commands were implemented for this particular stage of
the work).
This, however, was not much of a problem for two reasons. First, the structure of the
Speech Input Parsers was such that additional commands could readily be added. Second, in
a simulation environment, it is very simple to restrict the scope of the task to one requiring
only those commands that have been defined. This, as will be discussed in the next chapter,
is not the case in an operational environment.
Note that if the number of commands were to be increased, then the vocabulary would also
likely increase. This increase would however have to be limited so that the total vocabulary
was at most 64 words. The reason for this is that even though the VPC allows different
groups of 64 words (actually, 22K worth of template data) to be switched in from main
memory thereby increasing the effective total vocabulary size, the system is not monitoring
the microphone while this is taking place and as such, all speech made during this period is
lost. Thus, all of the words that can possibly occur in any given command must be part of
the same group of 64 words comprising the current vocabulary.
The scope of the simulation was also limited in that the aircraft names, or at least their
call sign roots, had to be known prior to the operation of the simulation in order to train
these on the ASR system and to define them in the SIP. Again, this did not pose much of
a problem in the simulation environment since the names of any aircraft appearing in the
simulation can be readily controlled. Furthermore, the capability of assigning different flight
numbers to aircraft with the same call sign root or carrier name made it seem to the user
that there was more variety in aircraft names than there actually was.
Speech Input and Output Sequencing
The actual sequencing of the command entry and pseudo-pilot response functions also
left something to be desired. In particular, since the hardware and software limitations of the
VPC forced the use of a keyword ("Over" was selected) to switch between speech recognition
and playback functions, this keyword had to be used to terminate each and every command
issued in order to allow any pseudo-pilot messages generated by the simulation to be played.
This greatly limited the rate at which commands could be entered since the user was forced
to pause after every command was terminated in order to wait for any pseudo-pilot messages
(recall that whether or not a pseudo-pilot message was generated, the VPC still switched
modes and would thus, not be monitoring the speech of the user). Furthermore, it resulted in
the inability to issue a number of commands, possibly to different aircraft, in rapid succession
without any intervening pauses to wait for acknowledgments from the pilots as is often done
in the real world, especially during high workload situations.
These last two criticisms are easily addressed, at least with the FSM parsing approach, by
adding a branch from the "bottom" of the FSM to the "top" so that the receipt of another
aircraft name after a syntactically complete message had been input would indicate that
another command was being issued. The chain of commands would still have to be terminated
by "Over" however in order to indicate that they are error free and can be executed. The
PM would require more significant modifications to implement this, however, since the
current version requires each command to be a separate entity for matching purposes
and there is no provision for splitting this command chain into its separate commands.
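The proposed loop-back can be sketched as a command splitter. This is an illustrative sketch under the stated assumption that any new aircraft name after a syntactically complete command begins a new command; the aircraft-name set here is hypothetical.

```python
# Sketch of the suggested FSM loop-back: a recognized aircraft name
# arriving after words have accumulated starts a new command, and the
# keyword "Over" terminates the whole chain.
AIRCRAFT_NAMES = {"TWA", "United", "Air-Canada"}   # hypothetical

def split_commands(words):
    """Split a recognized word stream into per-aircraft commands."""
    commands, current = [], []
    for w in words:
        if w == "Over":                    # chain terminator: execute all
            break
        if w in AIRCRAFT_NAMES and current:
            commands.append(current)       # loop back to the FSM "top"
            current = [w]
        else:
            current.append(w)
    if current:
        commands.append(current)
    return commands
```

A real implementation would also need to check that the command in progress is syntactically complete before treating the new name as a boundary; this sketch assumes it always is.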
However, the problem of delays incurred through the recognition, or mis-recognition, of
the word "Over" and the subsequent switching into speech output mode still exists.
There are two possible means for correcting this. First, the command termination strategy
(required in order to know when the user is finished inputting a command and correcting
any errors made in it as well as to switch operating modes) could be modified to make it
more general by including information about command syntax and periods of silence on the
part of the controller. In this way, for example, a period of silence of sufficient duration (in
conjunction with a syntactically complete message) could be used to indicate the termination
of a command. A good way to do this would be to incorporate a push-to-talk switch on the
controller's mike. This would be monitored by the Speech Input Interface and when released,
it could be used to indicate the termination of the command. Furthermore, it could be used
to disengage the ASR system so that the controller could talk to other people without having
it attempt to recognize what he said.
Second, the speech input and output functions could be performed using different systems
so that they would not be mutually exclusive. In this way, the user would be able to talk
to the system at any time. Furthermore, if he wanted to issue commands to two distinct
aircraft sequentially without pausing in between and waiting for an acknowledgment, he
could. Granted, there would be some scheduling involved so that pseudo-pilot responses
would not be played while he was speaking but this could be easily accomplished by the
use of the aforementioned microphone switch. This could be monitored to determine if the
controller was finished talking and the playback of pseudo-pilot messages suppressed until he
was.
Chapter 5
Air Traffic Control Command Recognition: Operational Applications
Now that some initial experience has been gained in the design of a system for ATCCR, it
is time to re-examine some of the Operational Applications mentioned in Chapter 1 in order
to determine what the practical difficulties in their implementation would be and propose
some solutions. Here, the difficulties to be discussed are those relating to system design and
not to recognition errors which were discussed in the last chapter.
These difficulties can be grouped into two major classes: those that are specific to a
particular application, and those that are generic to any application. Both of these will be
discussed in the following sections.
5.1 General Difficulties
In general, no matter what the particular use to which ATCCR (or ASR for that matter)
is put, there are a number of problems in incorporating it into an everyday operational
environment. These problems basically result from the finite vocabulary of ASR systems,
and the finite number of recognizable commands that can be designed into the ATCCR
system and have, to some extent, already been evidenced in the simulation task of the last
chapter. There, however, the scope and nature of the task could be artificially constrained
and modified in order to minimize these difficulties. This however cannot be done in an
operational environment where the controller does not possess the same control over the
environment.
5.1.1 Recognition of Aircraft Names
For example, in a simulation environment, explicit control can be exercised over the name
of any aircraft appearing in the controller's airspace. In this way, only those aircraft whose
names have been included into the ASR system's vocabulary would appear. In an operational
environment however, this control over which aircraft appear is not possible. Hence, it is
conceivable for an aircraft whose name cannot be recognized by the ASR system to enter a
controller's sector.
One way that this could be remedied is to determine all of the different names of aircraft
that could be expected to enter the sector during some future period of time (actually, only
the carrier names might be required) and explicitly train the ASR system to recognize these
before operations begin. Thus, when a given aircraft entered the sector, the template for its
name could be called up from a database into an "active" list of aircraft names so that it
could be recognized by the ASR system.
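The "active list" idea can be sketched as a small cache in front of the full template database. This is a modern illustrative sketch with hypothetical names; the source describes only the concept of swapping trained name templates in and out as aircraft enter and leave the sector.

```python
# Sketch of an "active" aircraft-name vocabulary: only templates for
# aircraft currently in the sector are loaded on the recognizer,
# keeping the searched vocabulary small.
class ActiveVocabulary:
    def __init__(self, template_db):
        self.db = template_db      # all pre-trained name templates
        self.active = {}           # names currently recognizable

    def aircraft_entered(self, name):
        if name in self.db:        # template exists: make it recognizable
            self.active[name] = self.db[name]
        # else: name was never trained -- the failure case discussed below

    def aircraft_left(self, name):
        self.active.pop(name, None)
```

The `else` branch is exactly the weakness the text goes on to identify: an untrained name (e.g., a military aircraft) simply cannot be activated, which motivates the on-the-fly training proposal.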
This approach however suffers from a number of disadvantages. First, the number of
possible aircraft names is quite large and as such, not only would recognition delays be
increased significantly, but a sizable amount of memory would be required on board the ASR
system in order to hold all of these. This, however, is remedied to a certain extent by the
maintaining of a list of "active" aircraft names (i.e., those which are currently in the ATC
sector).
Second, since there would exist a number of names that would be used only rarely,
recognition performance could be expected to be degraded significantly for these due to
changes in the user's voice and variations in pronunciation, between the time that they were
trained and the time that they were used, if such a long term database were constructed.
Third, the actual training of all of these different names would be quite time consuming
and tedious (at least with a system such as the VPC where each word must be explicitly
repeated a number of times in order to train it). Therefore, frequent re-training of the
vocabulary, for reasons such as the one mentioned above, could not realistically be expected.
Most importantly however, it is unrealistic to expect to be able to foresee the names of
all aircraft that will be encountered. Thus, there will always exist the possibility of aircraft
whose names have not been trained, such as military aircraft, entering the sector. With the
solution mentioned previously, there is no way that these aircraft can be accommodated.
A better solution would be to incorporate the capability to train new words on-the-fly,
during actual operations, into the ATCCR system. In this way, as new aircraft entered the
controller's sector, their names could be trained and added to the vocabulary. The aircraft
to associate with these new names could be indicated to the computer by simply pointing
to the desired aircraft with a mouse while training its name. When these aircraft leave the
sector, their names could then be deleted automatically, or retained in a database for future
recall if necessary.
This on-the-fly training capability could also be used, as was mentioned in the last chapter, to retrain difficult-to-recognize words or to train entire aircraft names (i.e., callsign plus
flight number) as single utterances in the hopes of improving recognition accuracy.
5.1.2 Issuance of Non-Standard Commands
During actual operations, there is also a problem arising from the controller's use of
commands or phrases that have not specifically been included into the standard ATCCR
system (or the ASR system's vocabulary). Input of these would tend to produce "digital
garbage" as the ASR system recognized random words or structures, or, even worse, would
potentially result in the generation of an unintended command.1
It is unrealistic to expect to be able to include all of the commands that can be issued
by the controller to pilots into the ATCCR system and even if this could be done, it would
not allow for the flexibility required for communication during emergencies, or other non-
standard situations, or for idle chatter between the controller and pilot. For this reason, some
1. One of the fundamental assumptions used in ATCCR parser design was that the controller's input was valid and that it was just a case of trying to recognize it. Thus, error correction techniques could make some changes or assumptions as to what the recognized words were that could lead to valid but incorrect and unintended commands.
mechanism whereby the ATCCR system can be easily disengaged from the controller's verbal
input is required. The best method for doing this would be to utilize a push-to-talk switch on
the microphone in much the same way as was indicated in the last chapter. Using this, the
controller could then engage only the radio for non-standard verbal communications, or the
radio and the ATCCR system for standard command issuance. Operationally however, it
remains to be seen exactly how effective this procedure would be since it would now require
the controller to constantly determine which commands (and in what format) are standard
and which aren't, in order to activate the mike switch accordingly.
Although standard commands transmitted at the mike switch setting for non-standard
commands would not create any difficulties for the ATCCR system itself, commands issued
in this manner would not be available to any computer monitoring the controller's input.
This loss of data could have serious ramifications to applications that utilize this information
such as an automated ATC decision support system.
5.2 Application Specific Difficulties
5.2.1 Digitized Command Transmission - Voice Channel Offloading
One of the primary motivating factors for this research, at least initially, was the expected
emergence of a digital communication link between the controller and aircraft based on Mode
S technology. It was thought that this could be used not only to reduce the error rate of
command reception by the pilot, but also to offload the ATC sector's voice channel, and
speed up command transmission, acknowledgment and response.
A typical scenario for realizing this would be as follows. A controller would issue a
verbal command to a particular aircraft in his normal manner. The ASR system listening
in would recognize it and translate it into a message format suitable for transmission. Any
errors made in recognizing the command could be handled by having the controller monitor
the feedback display and correct them before verifying or terminating a command. If the
aircraft possessed digital capability, this command would then be transmitted digitally. Once
received by the aircraft, it could be displayed either visually on a cockpit CRT or aurally using
speech output technology, and be made available for recall if desired. The pilot would then
acknowledge receipt of this command either digitally, by pushing a button on his display
(or perhaps through a similar on-board ASR system), or verbally over the radio link. If
however the aircraft did not possess digital capability, then the controller's command could
be transmitted over the radio channel.
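The routing step of this scenario can be sketched in a few lines. This is a hypothetical illustration of the described behavior, not a real Mode S implementation; all names are assumptions.

```python
# Sketch of the proposed command routing: a recognized command goes over
# the digital (Mode S) link if the addressed aircraft is so equipped,
# and over the voice channel otherwise.
def route_command(aircraft, command, digital_aircraft):
    """Return (channel, command) for a recognized controller command."""
    if aircraft in digital_aircraft:
        return ("datalink", command)   # digital uplink, voice channel stays free
    return ("radio", command)          # replayed over the simplex voice channel
```

Note that the "radio" branch only executes after the aircraft's equipage has been determined, which is the source of the transmission delay, and hence the message-collision risk, analyzed in the following paragraphs.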
The primary difficulty in implementing such a system lies in the management of the voice
channel. In the current environment, the radio link is a simplex channel. Anyone desiring
to transmit a message, be it pilot or controller, must first detect whether or not someone
else is currently using the radio before transmitting. Although this is fairly straightforward
to do with the current system, the addition of an ATCCR system to the channel2 alters the
protocol of this channel slightly and can lead to message collisions.
These message collisions can take on two forms. In the first, the pilot transmits to the
controller through what he perceives to be an open channel when in reality the controller is
currently in the process of voicing a command. This occurs because the controller's voiced
commands are intercepted by the ATCCR system before they are broadcast over the radio so
that they can be transmitted digitally (if the aircraft is so equipped). Thus, if the controller
were issuing a command to a digital aircraft, there would be no indication to any pilot
monitoring the radio channel that the controller was busy talking and he would feel free to
broadcast.
This type of message collision can also arise if the aircraft being addressed by the controller
is not digitally equipped. This is because the controller's command is not broadcast over the
radio link until after it has been determined that the aircraft being referred to does not
possess digital uplink capability. Thus, there is a brief period of time during which any
pilot wishing to use the radio channel would not detect the controller talking. Although
this could be as short as the time required to speak and recognize the aircraft's name, it
could be increased significantly if any recognition errors had to be corrected or if a slow ASR
system were being used. Furthermore, the ATCCR schemes proposed in this work take no
² Recall that there are cases when controller commands would still have to be transmitted verbally. These could
arise from non-standard commands as mentioned in the last section, or from operations involving aircraft
that are not digitally equipped.
action until the entire command has been recognized in order to facilitate error detection
and correction. Thus, this delay in transmission could be even larger.
This leads directly to the second type of message conflict where the computer transmits
a voiced controller command over the radio channel while another pilot is talking. In the
previous scenario, if a pilot did seize the voice channel during the time it took to recognize
the aircraft's name and determine that it was not digitally equipped, then when the ATCCR
system did broadcast the command over the radio, there would already be someone talking
on it even if this were not so when the controller began voicing his command.
There are two basic methods by which these message conflict problems can be handled.
The first requires that the voice channel be re-designed so that it becomes a duplex channel.
A diagram indicating how this would appear can be seen in Figure 5.1. In this, there are
two loops, an "air voice" loop containing the pilots and a "ground voice" loop containing
the controller and his ATCCR system. The interface between these two loops is handled by
computer. It is the responsibility of this computer to detect and buffer any incoming pilot
messages that occur while the controller is talking and replay these when he has finished. It
also has to buffer any outgoing controller messages so that they are transmitted only when
the radio channel is free (no pilots are talking). In this way, no messages are lost due to
channel conflicts.
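The buffering responsibilities of this interfacing computer can be sketched as follows (all names are hypothetical; the thesis does not specify an implementation):

```python
from collections import deque

class GroundInterface:
    """Hypothetical sketch of the computer interfacing the two voice loops.

    Incoming pilot messages are buffered while the controller is talking;
    outgoing controller messages are buffered until no pilot is on the air.
    """

    def __init__(self):
        self.inbound = deque()   # pilot messages awaiting replay to controller
        self.outbound = deque()  # controller messages awaiting transmission
        self.controller_talking = False
        self.air_channel_busy = False

    def pilot_message(self, msg):
        # Buffer if the controller is busy, otherwise pass straight through.
        if self.controller_talking:
            self.inbound.append(msg)
            return None
        return msg

    def controller_message(self, msg):
        # Buffer if a pilot is transmitting, otherwise broadcast immediately.
        if self.air_channel_busy:
            self.outbound.append(msg)
            return None
        return msg

    def channel_freed(self):
        # When the air channel clears, send the oldest buffered command.
        return self.outbound.popleft() if self.outbound else None

    def controller_finished(self):
        # Replay buffered pilot messages once the controller stops talking.
        replay = list(self.inbound)
        self.inbound.clear()
        return replay
```

In this way no message is lost, although, as discussed below, the sequencing and priority of the buffered messages remains an open question.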
This type of approach however, has some inherent difficulties. First, an automated tech-
nique must be developed in order to detect when the radio channel or "air voice" loop is busy.
This is in order to detect any incoming pilot transmissions so that they can be recorded and
buffered as well as to determine when an outgoing controller command can be transmitted.
This task is complicated by the presence of background noise on the radio link, but there
are cues that can aid in this process, such as the reception of a carrier when someone is
transmitting, or the periods of relative silence (relative to the nominal background noise)
that occur when someone has his mike switched on and is not talking.
Second, the same must be done for the "ground voice" loop. This however is fairly
straightforward since a push to talk switch on the controller's microphone can be readily
monitored to determine if he is talking.
Figure 5.1: Duplex voice channel, comprising an "air voice" loop and a "ground voice" loop.
The biggest difficulty however lies in the actual sequencing of outgoing and incoming
buffered messages. Should outgoing controller commands take precedence over incoming pilot
messages? What are the effects on ATC operations arising from lags and delays associated
with the buffering of communications? What are the results when pilots monitoring the
radio hear another pilot's messages before the controller does (due to message buffering at the
ground)? How can the interleaving of conversations arising from this buffering of messages be
avoided? How can emergency communications be distinguished from other communications
in order to allow them to take preference?
Thus, although such a system could probably be designed, some extensive testing and
simulation is required to determine whether it would alleviate controller workload or simply
add to it by unnecessarily complicating his task.
If the benefits of command recognition and digital command transmission without voice
channel offloading are alone sufficient to justify its use, then a second solution to the voice
channel conflict problem is possible. This requires the transmission of the controller's com-
mands in parallel on both the voice channel and the digital channel. In this way, the basic
operations on the verbal channel remain unchanged, except for the potential use of digital
instead of verbal acknowledgments by aircraft. Thus, the difficulties with radio conflicts
mentioned earlier would not occur. (Note however, that with such a system, there would be
a lag between the reception of the verbal and digital commands since the verbal command
would still have to be recognized before it could be digitized and transmitted.)
Furthermore, it would address the deficiency in a pilot's awareness of other air traffic
brought about with the use of digital command transmission since all of the commands
issued to aircraft would be available to anyone listening in on the radio link. These, as any
pilot will attest, are used extensively, especially in crowded airspace, in order to
determine where other aircraft are and what they are doing. Thus, their elimination could
have serious ramifications in terms of safety.
5.2.2 Command Prestoring
The other major application that is envisioned for ATCCR is its use in prestoring controller
commands and clearances for later issue. These prestored commands can be entered
for storage in the computer verbally by the controller, in the anticipation of some future
event (such as an aircraft reaching a waypoint), or they can be contained in a database of
commonly used clearances, such as those used for standard approach or departure patterns.
In either case, when it is determined by the controller (or even by a computer monitoring
the ATC sector) that these clearances should be transmitted, they have already been entered
and are thus available for immediate transmission. Thus, the controller anticipating a period
of high workload can prestore a number of these commands in order to simplify his task.
In order to simplify the use of such a system, the controller would be given a display
containing all of these prestored commands and information about their status (i.e., pending,
transmitted, acknowledged, etc.). Using this, he could examine previously issued commands,
modify existing ones, or add new ones. This display would also allow the computer to request
validation of each specific prestored command before it was actually transmitted or signal to
the controller that an already issued command had not yet been acknowledged.
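A minimal sketch of the status tracking behind such a display might look like this (the states follow the parenthetical list above; all other names are hypothetical):

```python
# Hypothetical sketch of prestored-command status tracking for the display.
PENDING, TRANSMITTED, ACKNOWLEDGED = "pending", "transmitted", "acknowledged"

class PrestoredCommand:
    def __init__(self, aircraft, text):
        self.aircraft = aircraft
        self.text = text
        self.status = PENDING

    def transmit(self):
        # Validation by the controller would happen before this transition.
        if self.status != PENDING:
            raise ValueError("command already transmitted")
        self.status = TRANSMITTED

    def acknowledge(self):
        # Triggered by a digital acknowledgment, or by the controller on
        # hearing a verbal one.
        if self.status != TRANSMITTED:
            raise ValueError("cannot acknowledge an untransmitted command")
        self.status = ACKNOWLEDGED

def unacknowledged(commands):
    """Commands the display should flag as issued but not yet acknowledged."""
    return [c for c in commands if c.status == TRANSMITTED]
```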
The actual transmission of these commands or clearances would take place using the same
procedure as that described in the last section. Thus, if the aircraft being referred to was
digitally equipped, the clearance would be digitally transmitted. If it were not, then it would
be transmitted over the radio link, perhaps using a recording of the controller's own voice.
Pilot acknowledgments to these commands could also be transmitted either verbally or
digitally. If they were verbal, it would be the responsibility of the controller to recognize the
acknowledgment and update the display of the corresponding command to indicate this. If
they were digital, then this could be done by the computer directly.
In general, this system suffers from the same basic problems arising from message conflicts
that were mentioned in the last section. This is because, with both digital and non-digital
aircraft being accommodated, operations on the voice channel are the same as in the last
section. With this application however, the frequency of these message conflicts is increased
because both the controller and computer are now generating outgoing commands.
Although the duplex voice channel modification described in Figure 5.1 addresses this
problem, the difficulties inherent in scheduling the playback of incoming and outgoing
messages that have been buffered are almost certain to result in interleaved communications,
especially during situations of high loading on the voice channel.
This is because outgoing prestored commands transmitted by the computer and incoming
responses and acknowledgments to these are intermingled with controller originated com-
munications on the radio link (if prestored commands are directed towards non-digitally
equipped aircraft). The result is that the controller is likely to hear a seemingly random
sequence of messages and acknowledgments on the radio channel thereby greatly increasing
his workload by forcing him to mentally sift through these to determine what each refers to.
A solution to this is to require that prestored commands and clearances are transmitted
and acknowledged digitally only. In this way, the computer itself can handle the management
and scheduling of prestored command transmission and acknowledgment detection, thereby
freeing the controller to simply perform supervisory functions and concentrate on his own
task at hand. This type of operation emphasizes the need for the prestored command display
mentioned earlier in order to allow the controller to interface with the computer in the
execution of this supervisory function.
Chapter 6
Conclusions and Recommendations
6.1 Summary
The basic goal of this work has been to apply existing ASR technology in an ATC en-
vironment in order to explore not only some of the potential benefits and problems arising
from the practical application of ASR, but also the features and capabilities desirable in an
ASR system to be used in ATC.
This was accomplished by integrating a VOTAN VPC2000 continuous speech recognition
system into an existing ATC simulation so as to provide a means whereby verbal commands
issued by controllers and directed towards aircraft could be entered into the computer directly
thereby eliminating the need for blip drivers or pseudo-pilots.
In general, the potential benefits accrued through the use of ASR in an ATC environment
involve the simplification of the controller-computer interface in an environment where the
primary means of communication is verbal and the use of and reliance on computers is
increasing significantly, both in the air and on the ground.
The major difficulties however lie predominantly in the handling of errors. In order to
address the problem of recognition errors, the syntax for ATC commands was incorporated
into a Speech Input Parser. This was done in two basic ways. The first utilized a Finite
State Machine approach for syntax specification and required active intervention on the part
of the user in order to correct any errors once they were detected. The second, however, used
a pattern matching approach to compare the input command to a list of allowable commands
in order to determine the best match and could hypothesize possible corrections if any errors
were detected as long as these did not critically affect the intelligibility of the commanded
action.
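The Finite State Machine approach can be illustrated with a minimal sketch for a single command form (an illustration only, not the thesis's actual grammar or vocabulary):

```python
# Minimal FSM sketch for one command form: the word "heading" followed by
# exactly three digit words.  Illustrative only; the real ATC syntax from
# the controller's handbook is far richer.
DIGITS = {"zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "niner"}

def accepts_heading(words):
    """Return True if the recognized word sequence fits the heading syntax."""
    state = "start"
    count = 0
    for w in words:
        if state == "start" and w == "heading":
            state = "digits"
        elif state == "digits" and w in DIGITS and count < 3:
            count += 1
        else:
            return False  # word not allowed in this state: a detected error
    return state == "digits" and count == 3
```

A word that is not allowed in the current state signals a recognition error, at which point the user would be asked to intervene.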
The user based techniques developed for correction of recognition errors consisted of
utilizing the verbal channel in order to enter specific keywords that would either delete the
last recognized word, or delete the entire recognized command so far. These were found to
be lacking in terms of speed, flexibility, and ease of use, and suffered from the fact that
errors could even be made in recognizing the keywords themselves.
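The two keyword operations can be sketched as follows (the keyword strings here are illustrative; the thesis does not give the actual words used):

```python
# Illustrative sketch of keyword-driven correction of the recognized command.
DELETE_WORD = "scratch"      # hypothetical keyword: drop the last word
DELETE_COMMAND = "cancel"    # hypothetical keyword: drop the whole command

def apply_word(buffer, word):
    """Update the recognized-command buffer with the next recognized word."""
    if word == DELETE_WORD:
        return buffer[:-1]
    if word == DELETE_COMMAND:
        return []
    return buffer + [word]
```

The weakness noted above is visible in the sketch: if the recognizer mistakes an ordinary word for one of the keywords (or vice versa), the correction mechanism itself corrupts the buffer.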
The automated techniques developed to correct for recognition errors internally were
limited by the capabilities of, and information made available by the VPC system. In many
cases, even though they were successful in hypothesizing the location of these errors, there
was no capability to re-analyze the data and validate these hypotheses. As such, these
automated techniques were more proof of concept vehicles than implementable strategies (at
least with the current configuration of the VPC).
The major drawbacks of the VPC system were its sensitivity to variations in articulation
(co-articulation, intonation) and its inability to rewind in order to re-examine sections of
speech data. The former is for the most part inherent in the particular recognition algorithm
and technique being used and could not readily be changed. The latter however is a result of
the actual packaging of the software. This problem has been addressed with a new software
package (a library of user callable C language subroutines to control the recognition functions
[35]) recently made available. There are however still some limitations in the capability of
the VPC that have not been addressed. In particular, it is still not possible to obtain a
ranking, including scores, of how well each of the words in the active vocabulary matched
the current input, or a pointer to the location in the speech data where each of these words
ends and the next word would therefore begin.
6.2 Recommendations
As a result of the work performed, the requirements and capabilities of an improved
operational ASR system for use in ATC can be more accurately specified. Although these
have, to some extent, already been discussed in the body of the text, they will be summarized
again. The current system was never intended for operational use; it served only for proof of
concept demonstration and system development in order to more accurately define not only
the significant areas of research, but also features and capabilities that would be desirable in
a more advanced, higher performance system that would be used in practical operations.
* Speaker Dependence
For the ATCCR application this was never really an issue since there is only one user
at a time, the controller, and thus a speaker dependent system is adequate.
* Continuous Speech Recognition
As mentioned earlier, the restrictions posed on the user with discrete speech recognition
systems and the delays associated with connected speech recognition systems created
a strong preference for continuous speech systems. Although in retrospect, some of the
error correction strategies hypothesized bear more similarity to connected than to con-
tinuous speech recognition techniques, it was felt that a continuous speech recognition
system with the capability to buffer speech input and go back and re-analyze it (i.e.,
connected speech recognition capability) would result in much better performance. In
this way, the delays associated with connected speech recognition techniques would
only be incurred when ambiguities or errors required their use.
* High Baseline Recognition Accuracy
Here, what is implied is the inherent accuracy of the recognition algorithms themselves,
without the explicit use of syntax or set-switching as an aid to the recognition process.
Although these can be used later to improve the overall performance, the ASR system
must at least possess adequate word recognition performance to allow for a
reduction in the processing required by any error correction schemes, be they user
aided or internal. For continuous speech recognition systems, this almost certainly
implies the use of phoneme based approaches. This is because co-articulation effects (one
of the major causes of recognition errors in continuous speech recognition) can be more
readily and accurately modeled.
The actual recognition accuracy required is difficult to quantify exactly since there
are a large number of variables. These include vocabulary size, vocabulary content,
speaker characteristics, and training procedure. In general however, the recognition rate
should be at least 95% for a vocabulary consisting of all of the words that are required
to implement the required task. This would result in a success rate for command
recognition of about 60% (assuming a command consists of roughly 10 words). Syntax
and other techniques could then be used to improve this.
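The 60% figure follows from compounding the per-word rate over a ten-word command; a quick check:

```python
# Per-word recognition rate compounded over a ten-word command, as assumed
# in the text (independent word errors).
word_rate = 0.95
words_per_command = 10
command_rate = word_rate ** words_per_command
print(round(command_rate, 2))  # prints 0.6
```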
* Simplified Training Procedure
In general, a training procedure such as that used in the VPC, where each word is
trained by repeating it to the system in both discrete and embedded modes, has
serious drawbacks. First, this procedure is highly subject to training effects. Second,
it does not accurately allow for co-articulation effects. Third, the actual training can
become very time consuming for large vocabularies. What would be desired to remedy
some of these difficulties is a procedure where the user would simply talk to the system,
perhaps reading a section of text, in order to train the system to his voice. The handling
of co-articulation effects however is more related to the actual recognition algorithm
being used and as such, cannot be addressed solely through modifications to the training
procedure.
In addition, the capability to add new words to the vocabulary on-the-fly during ac-
tual operations would be desirable. With most systems, this is more related to the
"packaging" of the software than the actual training procedure. However, there are
those systems whose training procedure is so complex and time consuming that this
capability cannot realistically be added.
* Reduced Sensitivity to Variations in Speech
These, as mentioned earlier, can arise from anything from co-articulation effects to a
cold or stress on the part of the user and tend to decrease the recognition accuracy of a
system. These can be accounted for either through the use of more robust recognition
algorithms (by accurately modeling co-articulation effects for example) or through the
use of an enrollment procedure prior to the use of the system. With this, the user
would simply read aloud a brief paragraph in order to allow the recognition algorithm
to adapt to how he sounds that particular day. Additionally, if the actual procedure
for training the vocabulary were short enough, he could even retrain all or parts of it
prior to use.
* Vocabulary Size
The actual vocabulary size required depends greatly on the task being implemented. In
general, since the entire vocabulary of words used in ATC (excluding names of specific
places) is only about two to three hundred words, a vocabulary roughly this size should
be sufficient. Granted, this might be increased depending on the application in order
to allow for a large number of aircraft names or waypoints and fixes.
As mentioned earlier, a more accurate indication of performance is the size of the active
vocabulary. If the system is one in which the only user control of its internal operation
is through the specification of the active vocabulary, then the active vocabulary should
be as large as possible (with a realistic minimum of about 60 words) in order to reduce
the requirement for vocabulary set switching while a command is being input and thus
the type of errors evidenced in Section 4.2.3 with a parser that utilized set switching.
If however more control is available over the internal operation, in particular, if the
capability to rewind the speech data is available, then a smaller active vocabulary
would be acceptable since the added control would allow any errors to be handled.
Note that as a general rule of thumb, the size of the vocabulary of an ASR system is
limited by its recognition accuracy. Therefore, the more accurate the system, the larger
the vocabulary.
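Set switching restricts recognition to an active subset of the full vocabulary; a minimal sketch of the idea (hypothetical names and vocabulary):

```python
# Hypothetical sketch: recognition constrained to the active vocabulary set.
VOCABULARY = {"turn", "left", "right", "heading", "zero", "one", "niner"}

def recognize(candidates_ranked, active_set):
    """Return the best-ranked candidate word that is in the active set.

    candidates_ranked: words ordered best match first, as an ASR front end
    might score them.  A word outside the active set can never be returned,
    which is both the benefit (fewer confusions) and the risk (errors when
    the wrong set is active mid-command) of set switching.
    """
    for word in candidates_ranked:
        if word in active_set:
            return word
    return None
```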
* Short Recognition Delays
This is very difficult to quantify exactly since a number of different factors enter into it.
Clearly, the recognition delays should be as short as possible, in order to decrease the
lag between command and action as well as to more readily allow error correction by the
user without forcing him to wait excessively. Furthermore, the shorter the recognition
delay, the more time available for any post-processing required by the Speech Input
Parser. However, ASR systems with larger delays can offset this by reducing the need
for error correction by the user, or post-processing by the SIP, with higher recognition
accuracy. Thus, this must be analyzed in conjunction with the recognition accuracy of
a system in order to determine if it is excessive.
In general, recognition delays should be at most one second for an individual word or
four seconds for a long stream of speech, so that the user is not forced to wait too long
before action is taken in relation to his input. Systems with poorer accuracy should
naturally be at the lower end of this scale.
* Open Architecture
This is perhaps the most important requirement in an ASR system due to the ex-
ploratory nature of the work that was performed here. With an open architecture, the
user could exert more control over the recognition functions and would not be restricted
in what can be done by the "packaging" of the ASR system. In this way, some of the
parsing strategies and error correction mentioned earlier could be implemented. The
most desirable feature in an open architecture system would be the ability for the user
to call the recognition routines directly on any specified section of the incoming speech
data with any parameters and vocabulary desired. This, in general, is not possible with
a black box approach to the design of the interface to an ASR system where the only
input is speech (and possibly syntax for set switching purposes) and the output is a
recognized word.
For these reasons, what is really desired is a development system in order to allow
for the type of flexibility in configuration and execution that is required for research
purposes.
* Hosting
Although this is not critical, an ASR system that could be hosted on the ATC Simu-
lation Computer itself would possess distinct advantages. This is because a lot of infor-
mation about the environment (airspace, what the controller is currently doing,...) is
available here and transferring it to another computer often results in difficulties.
Candidate ASR Systems
In general, although the VPC was more than adequate for demonstrating the proof of
concept of ATCCR and for performing initial ATCCR developmental work, a more capa-
ble ASR system was desired for testing and development of what might eventually be an
operational ATCCR system.
Based on the experience gleaned through the use of the VPC, the emerging ASR tech-
nology of phoneme based speech recognition was felt to be the most promising. With these
systems, the phonemes contained in the speech input are first recognized and then used to
consult a dictionary of phonemic spellings in order to determine the word spoken. These
types of systems offer a great number of performance improvements over more conventional
technology, such as the VPC, and are currently being used in order to tackle the much
more complex problem (in terms of vocabulary sizes and syntactical flexibility) of recogniz-
ing natural language. Thus, they should be quite successful in the reduced scope of the ATC
environment. Some of the advantages of this approach are listed below:
* There is already a great body of knowledge dealing with phonemes, their characteri-
zation, how they are used to construct speech, and most importantly, rules for their
co-articulation. Thus, degradations in recognition accuracy arising from co-articulation
effects can be reduced to a greater extent than possible simply with embedded train-
ing of the vocabulary words as in the VPC. It is this that is the major advantage of
phoneme based systems.
* The training procedure is much simpler for the user since it typically consists of having
him read, out loud, a paragraph of phonetically rich text in order to determine how he
enunciates phonemes. Thus, training effects are reduced since the training task is not
as artificial as that in other systems. Furthermore, all of the words contained in the
vocabulary need not be explicitly trained. Instead, their phonetic spelling must simply
be contained in the phonemic dictionary. Thus, the addition of words to the vocabulary,
even on-the-fly, is fairly straightforward and consists of simply adding another entry to
the dictionary without the need to specifically train them.
* Since data rates for phoneme recognition systems are only about 100 Hz, less memory
is required to buffer incoming speech data. Thus, it is more reasonable to save large
blocks of data for later re-processing if any errors are detected by the parser.
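The dictionary lookup step described for these systems can be sketched as follows (the phonemic spellings are simplified illustrations, not a real phone set):

```python
# Simplified sketch of word lookup from a recognized phoneme sequence.
# The phonemic "spellings" below are illustrative, not a real phone set.
PHONEMIC_DICTIONARY = {
    ("n", "ay", "n", "er"): "niner",
    ("h", "eh", "d", "ih", "ng"): "heading",
    ("z", "ih", "r", "ow"): "zero",
}

def lookup(phonemes):
    """Map a recognized phoneme sequence to a vocabulary word, if any."""
    return PHONEMIC_DICTIONARY.get(tuple(phonemes))

def add_word(phonemes, word):
    """On-the-fly vocabulary addition: just another dictionary entry,
    with no need to explicitly train the new word."""
    PHONEMIC_DICTIONARY[tuple(phonemes)] = word
```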
6.3 Future Work
Now that the basic tools and procedures for using ASR have been demonstrated and are
in place, there is the potential for a great deal of modification and additional study to be
performed not only in order to improve the current facility, but also to investigate different
application configurations. Areas thought to be of great potential have been summarized
below.
* Incorporate keyboard and mouse in addition to ASR as input modalities. These can
then be used for:
1. command entry
- Mix of input modalities can be used for entering commands. In this way, the
controller is free to use what he is most comfortable with.
- Aircraft or fixes and waypoints being referred to can be selected directly on
the display with the mouse.
- Difficult-to-recognize words can be entered with the keyboard.
2. error correction
- Add the capability to mouse recognized words on the feedback display. The
user can then change these, insert words in front of them, or delete them. This
can be done using either keyboard input, speech input, or pull down menus
with options.
* Investigate the addition of the capability to correct errors by only repeating part of the
issued command (i.e., "heading zero niner zero CHECK zero one zero")
* Simulate the mixed digital/non-digital cockpit environment
* Develop command prestoring functions and evaluate them in a simulated environment
* Eliminate the pseudo-pilot responses (unless these can be separated from the VPC) in
order to allow more flexibility in how command termination is done and to decrease
the operational problems resulting from misrecognitions of the terminating keyword
* Investigate the possibility of using the Explorer for speech generation functions directly
in order to free up the VPC so that it performs only speech recognition.
* Change scope and format of pseudo-pilot messages to include more detail about com-
mands issued or errors detected. For example, evaluate the usefulness of error specific
pseudo-pilot responses such as "Please repeat heading for AA156".
* Investigate the use of other technologies for the generation of pseudo-pilot responses.
* Add a push to talk switch whose state can be monitored by the Explorer in order to
allow for improved sequencing of speech playback functions and command termination
detection.
* Develop the capability to allow for sequential commands without intervening pauses
to be issued. This would entail the modification of the parsers and the command
terminators.
* Modify the Pattern Matcher so that matching is done as each word is recognized as
opposed to waiting until the entire command has been input.
* Investigate the use of alternate scoring strategies for the Pattern Matcher.
* With the added flexibility and user control now available with the VOTAN Library of
C routines,
- Investigate the possibility of recoding the C programs to allow for the scores of
all the words in the vocabulary to be obtained as opposed to just the top two.
Present these in the form of a word vector in order to allow for some of the refinements
to the SIPs mentioned in Chapter 4 to be implemented.
- Investigate the use of the rewind capability made available to implement some of
the error correction strategies alluded to in the refinements to the SIPs.
- Re-design the user interface to present a standardized training procedure where
the new user is explicitly paced through the training process
- Add the capability to add or re-train vocabulary words on the fly, while the
simulation is running, and demonstrate how this could be used to, for example,
handle aircraft entering the airspace whose names have not been trained.
* Investigate the use of ASR for functions other than ATC command entry (e.g., commands
directed towards the computer).
* Determine, by actual monitoring of controller-pilot communications, how strictly the
ATC command syntax is adhered to in practice.
* Investigate the possibility of using two ASR systems in parallel in the ATCCR process
in order to capitalize on differences in performance available with different systems.
* Collect and evaluate detailed statistics on the frequency and type of recognition errors
made during actual operation of the ATC simulation by comparing the recognized input
to that obtained through transcription by a human.
Bibliography
[1] Toong, H. D. and Gupta, A. "Automating Air Traffic Control", Technology Review, Vol
85, No 3, pp. 40-54, April 1982
[2] Lea, Wayne A., "The Value of Speech Recognition Systems", Printed in Lea, Wayne A.,
Trends in Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1980.
[3] Poock, Gary K., "Voice Recognition Boosts Command Terminal Throughput",
Speech Technology, April, 1982
[4] Shannon, C.E. and Weaver, W., The Mathematical Theory of Communication,
University of Illinois Press, Urbana, IL, 1949.
[5] Turn, R. "The Use of Speech for Man-Computer Communication", RAND Report-1386-
ARPA, RAND Corp., Santa Monica, CA.
[6] Lea, W. A., "Establishing the Value of Voice Communications with Computers",
IEEE Trans. Audio and Electroacoustics, Vol AU-16.
[7] Rasmusson, Paul R., "Summary of Several Industrial Voice Data Collection
Applications", Appearing in The Official Proceedings of SPEECH TECH '85, Vol 1,
No 2, Media Dimensions Inc., New York, NY, 1985.
[8] Ashton, Robert, "Voice Input of Warehouse Inventory",
Appearing in The Official Proceedings of SPEECH TECH '85, Vol 1, No 2, Media Di-
mensions Inc., New York, NY, 1985.
[9] Nelson, Donald L., "Use of Voice Recognition to Support the Collection of Product
Quality Data", Appearing in The Official Proceedings of SPEECH TECH '85, Vol 1,
No 2, Media Dimensions Inc., New York, NY, 1985.
[10] Newbery, R. R., "Integration of Advanced Displays, FMS, Speech Recognition and Data
Link", The Journal of Navigation, Vol 38, No 1, January, 1985.
[11] Reed, L. "Military Applications of Voice Technology", Speech Technology,
Feb/Mar, 1985.
[12] Lerner, Eric J., "Talking to Your Aircraft", Aerospace America, January, 1986.
[13] Merrifield, John T., "Boeing Explores Voice Recognition for Future Transport Flight
Deck", Aviation Week and Space Technology, April 21, 1986.
[14] Leggett, John, and Williams, Glen, "An Empirical Investigation of Voice as an Input
Modality for Computer Programming", International Journal of Man-Machine Studies,
pp. 493-520, January 1984.
[15] Connolly, Donald W., "Voice Data Entry in Air Traffic Control", FAA-NA-79-20,
August, 1979.
[16] Air Traffic Control 7110.65C, Air Traffic Service, Federal Aviation Administration, U.S.
Department of Transport, Jan 21, 1982
[17] Pollack, I., and Pickett, J.M., "The Intelligibility of Excerpts from Conventional Speech",
Language and Speech, pp. 165-171, Volume 6, 1963
[18] Pisoni, D.B., Nusbaum, H.C., and Greene, B.G., "Perception of Synthetic Speech Gener-
ated by Rule", Proceedings of the IEEE, Vol. 73, No. 11, November 1985.
[19] Pisoni, D.B., et al, "Some Human Factors Issues in the Perception of Synthetic Speech",
Appearing in The Official Proceedings of SPEECH TECH '85, Vol 1, No 2, Media Di-
mensions Inc., New York, NY, 1985.
[20] McPeters, D.L., and Tharp, A.L., "The Influence of Rule-Generated Stress on Computer-
Synthesized Speech", International Journal of Man-Machine Studies, Vol 20, pp. 215-
226, 1984.
[21] Schwab, E.C., Nusbaum, H.C., and Pisoni, D.B., "Some Effects of Training on the
Perception of Synthetic Speech", Human Factors, pp. 395-408, August 1985.
[22] Simpson, C.A., and Marchionda-Frost, K., "Synthesized Speech Rate and Pitch Effects
on Intelligibility of Warning Messages for Pilots", Human Factors, pp. 509-517, October
1984.
[23] DECtalk A Guide To Voice, Digital Equipment Corporation, July 1985.
[24] Rabiner, L. R., and Schafer, R.W., Digital Processing of Speech Signals, Prentice-Hall
Inc., Englewood Cliffs, NJ, 1978.
[25] Harrison, John A., "Should Speech Input/Output Technology be Applied to ATC Sim-
ulators and Operational Systems", ICAO Bulletin, May 1984.
[26] Schafer, R.W., and Markel, J.D., editors, Speech Analysis, IEEE Press, John Wiley and
Sons, New York, NY, 1979.
[27] SP-1000 Manual, Internal Publication, General Instruments Corp., Hicksville, NY, 1986.
[28] White, G.M., "Speech Recognition: An Idea Whose Time Is Coming", Byte Magazine,
January, 1984.
[29] Russell, M.J., et al, "Some Techniques for Incorporating Local Timescale Variability In-
formation into a Dynamic Time Warping Algorithm for Automatic Speech Recognition",
Proc. IEEE Conference on Acoustics, Speech and Signal Processing, pp. 1037-1040, 1983.
[30] Jelinek, F., et al, "Continuous Speech Recognition: Statistical Methods", Handbook of
Statistics, Vol. 2, Krishnaiah and Kanal, eds., North-Holland, 1982.
[31] Mari, J.F., and Roucos, S., "Speaker Independent Connected Digit Recognition using
Hidden Markov Models", Appearing in The Official Proceedings of SPEECH TECH '85,
Vol 1, No 2, Media Dimensions Inc., New York, NY, 1985.
[32] Ciarcia, S., "The Lis'ner 1000", Byte, pp. 111-124, November, 1984.
[33] Lis'ner 1000 Voice Recognition User and Assembly Manual, Rev 2.0, The Micromint Inc.,
Cedarhurst, NY, Oct 1984.
[34] VOTAN VPC2000 Users Guide, Votan, Fremont, CA, November 1985.
[35] Voice Library Reference Manual, Ver C-07, Votan, Fremont, CA, March 1986.
[36] Smyth, Christopher C., "Automated Voice and Touch Data Entry for the U.S. Army's
Forward Area Alerting Radar (FAAR)", Speech Technology, Feb/Mar, 1985.
[37] Waller, Harry F., "Choosing the Right Microphone for Speech Applications", Appearing
in The Official Proceedings of SPEECH TECH '85, Vol 1, No 2, Media Dimensions Inc.,
New York, NY 1985.
[38] Heline, Ture, "Apply Electret Microphones to Voice-Input Designs", Electronic Design
News, September 2, 1981.