
Institute of Parallel and Distributed Systems

University of Stuttgart
Universitätsstraße 38

D–70569 Stuttgart

Masterarbeit

Natural language understanding and communication for

human-robot collaboration

Marietta Inge Bloch

Course of Study: Softwaretechnik

Examiner: Prof. Dr. Marc Toussaint

Supervisor: Hung Ngo

Commenced: September 15, 2016

Completed: March 17, 2017

CR-Classification: I.2.4, I.2.7


Abstract

Natural language understanding and communication is a key aspect of human-robot collaboration. Concerning natural language understanding, various features need to be covered for creating an overall system. Apart from semantic parsing and mapping, the reusability and extendability of such a system require consideration. In this thesis a system for natural language understanding based on dependency parsing is designed. The focus lies on extracting the meaning of spoken commands. A communication component is added to enable, on the one hand, ambiguity resolving and, on the other hand, task suggestions on the part of the robot. The proposed system is evaluated with respect to its natural language understanding capabilities. The evaluation revealed the high impact of speech recognition on semantic parsing. The significantly enhanced understanding of spoken commands demonstrated the benefit of an ambiguity resolving communication component.


Kurzfassung

(Translated from the German.) The understanding of natural language and communication are central elements of human-machine collaboration. To design an overall system, various properties concerning language understanding must be implemented. Apart from semantic parsing and meaning mapping, the reusability and extendability of such a system also require consideration. In this thesis, a system for natural language understanding based on dependency parsing is designed. The focus lies on extracting the meaning of spoken commands. An added communication component enables, on the one hand, the resolution of ambiguities and, on the other hand, task suggestions on the part of the robot. The proposed system is examined with respect to its ability to understand natural language. The evaluation revealed a strong influence of speech recognition on semantic parsing. Furthermore, the substantially improved understanding of spoken commands made the benefit of an ambiguity resolving communication component clearly apparent.


Contents

1 Introduction

2 Background
2.1 Natural language understanding
2.2 Relational Activity Process

3 Related Work
3.1 Knowledge base
3.2 Semantic parsing
3.3 Human-robot dialog

4 Designing a natural language understanding and communication system
4.1 NLU methods
4.2 Communication methods
4.3 Overview of the design and serration points

5 Evaluation
5.1 Implementation
5.2 Experimental domain setup
5.3 Analysis of the NLU capabilities

6 Conclusions
6.1 Discussion
6.2 Summary and limitations

Bibliography


List of Figures

2.1 Parse tree of a phrase structure grammar.
2.2 Dependency tree built by a dependency grammar.
2.3 Universal Dependencies for the same sentence in English and German.
2.4 Subset of the synset relations for the word table.
2.5 RAP concurrent activities.

4.1 Meaning representation tree.
4.2 Word similarity measurement of an event e1 within WordNet.
4.3 Workflow of the questioning system for ambiguity resolving.
4.4 Workflow of the task suggestion use case triggered by the robot.
4.5 Design and serration of the NLU and communication system.

5.1 Comparison of the speech recognition processing times of Google Cloud Speech and CMU Sphinx in the toolbox domain experiment.
5.2 Box plots of the speech recognition times of the blockworld domain command transfer conducted using Google Cloud Speech.


List of Tables

2.1 Excerpt of the Penn Treebank Project English part-of-speech tags.
2.2 Excerpt of the Universal Dependencies.
2.3 Excerpt of the different meanings of the word give from WordNet [Pri10].

5.1 Recording and transcription times (in seconds) for the commands of the toolbox domain.
5.2 Recording and transcription times (in seconds) for the answers in the toolbox domain.
5.3 Recording and transcription times (in seconds) for the command and answer utterances in the blockworld domain.
5.4 Usage of communication system functions.
5.5 Number of different causes for iterations per command triggered by ask targeted.
5.6 Benefit of the communication system on the command understanding.
5.7 Processing time of semantic parsing per experiment.

6.1 Understanding of commands with and without ambiguity resolving (AR) per experiment in percent.


1 Introduction

Human-robot collaboration aims at humans cooperating with robots to achieve a joint goal. An example is a robot assisting a human in assembling components in single-item production or small batch series, where the size and number of components and even the workplace may vary frequently. For the robot this implies the need to perform miscellaneous and varying tasks.

To be a helpful co-worker and not merely a subordinate, a robot needs reasoning and planning functionality. Moreover, a certain level of communication is necessary to achieve cooperation between humans and robots. On the human side this implies giving instructions and answering questions. On the robot side it is essential to be able to ask for advice when performing autonomous tasks and to clarify ambiguous instructions.

For humans, of course, the easiest way of communicating is by natural language, without the need to read a manual and learn numerous predefined commands. Another advantage of spoken, and therefore hands-free, communication is the ability to use both hands for the task to be done together.

From the robot this requires the capability of also understanding and speaking natural language. This Natural Language Problem is considered an AI-complete problem [Yam13]. The term is derived from the problem classification NP-complete in computational complexity theory. This is an indicator of the scope and the complexity of giving a robot the human ability to use natural language.

There are natural language systems, like IBM's question answering system Watson [Fer+10], which seem to achieve this. But at a closer look they are restricted to a specific use case. If one asked Watson, for example, to hold a workpiece, it would fail even if it could derive the meaning of the sentence: As its only function is to answer questions, it would fail at any task requiring movement. For human-robot collaboration exactly this type of natural language understanding is necessary.

Up to now there is, to the best of my knowledge, no system available reaching the state of a universal cooperative robot communicating in natural language that is able to adapt to any domain and circumstances without modification.


Restricted to a specific application domain, instructing autonomous robots in natural language works quite well, as for example [Kol+13] proves, but modifications need to be made for adapting to a new domain.

The scope of this work is narrowed to natural language understanding and communication with respect to human-robot collaboration. The main focus lies on the translation of a natural language command to its corresponding meaning representation, with an ambiguity resolving questioning system as a secondary objective. To this end, this work aims at identifying suitable methods for a reusable and expandable natural language understanding system which easily adapts to a new domain with as few modifications as possible. Deriving the meaning of a sentence is achieved by using the grammatical dependencies the words have to each other.

Structure

The content of this work is structured as follows: Chapter 2 gives background information about natural language processing and the corresponding terminology, and introduces the human-robot cooperation formalization (RAP) this work is embedded in. In chapter 3, related work in natural language processing with different key aspects is presented; these aspects are knowledge base, semantic parsing and human-robot dialog. In chapter 4, the method selection and the resulting design of a natural language understanding and communication system are described. Next, chapter 5 describes the implementation of the methods from chapter 4 in two different domains. The conducted experiments evaluate the natural language understanding capabilities and the benefit of an ambiguity resolving communication system. Last, chapter 6 summarizes the findings and lessons learned from building a natural language understanding and communication system for human-robot collaboration. A concluding discussion presents the developed system's differences to other approaches, addresses alternative methods in natural language processing and states its expandability.


2 Background

This chapter introduces terms and methods of natural language understanding as well as the embedding model Relational Activity Process (RAP).

2.1 Natural language understanding

Natural language understanding (NLU) is a subfield of natural language processing (NLP). Natural language processing describes the whole process of translating speech to text and text to meaning, and vice versa. Natural language understanding, in contrast, means deriving the meaning of speech or written content. This implies not only the analysis of a single sentence, which is semantic parsing, but also, for example, analyzing the topic of a text or doing a sentiment analysis of movie reviews.

Within this chapter the fundamental syntactic components of semantic parsing are explained. In addition, the projects Universal Dependencies and WordNet are introduced.

2.1.1 Syntactic components of semantic parsing

The mapping between natural language and a logic form is called semantic parsing, as not only syntax but also semantics need to be considered when dealing with the meaning of natural language. An example is the sentence "he ate with Dave and Anna", which uses exactly the same grammar as "he ate with fork and knife", while the meaning is completely different. Other analysis problems are words that can be nouns or verbs, like hand ("hand me the documents" in comparison with "he has the documents in his hand"), and co-references like "The book is in the bag. Please give it to me": Does it refer to the bag or the book?

Parsing itself describes the grammatical analysis of a sentence. A necessary preprocessing step before parsing is tokenizing. Tokenizing is the process of splitting a sentence into tokens; tokens are in this case words and punctuation marks. No syntactic relations are considered in this step; this happens during parsing.


In contrast to programming languages, deriving the grammatical structure and building a parse tree of a sentence in natural language is more complicated. Often more than one grammatical interpretation is possible for a sequence of tokens. There exist different approaches which are based on either phrase structure grammars or dependency grammars. Both approaches involve part-of-speech tagging.

Part-of-speech tagging

According to [Koe15] there are two types of words: Open class words are nouns, verbs, adjectives and adverbs. They hold the content of a sentence. Closed class words are for example pronouns and prepositions. They are mostly functional, but necessary to build a sentence. All of them can be divided into certain categories describing their syntactic properties, called part-of-speech (POS) tags. Part-of-speech tagging is the process of assigning each token its corresponding part-of-speech tag. Table 2.1 gives an excerpt of part-of-speech tags.¹

As mentioned before, the word hand can be a noun or a verb depending on the context. The aim of part-of-speech tagging is finding the most probable tags for the given tokens, considering their sequence as well. Applied to the sentence "hand me the documents" this means: A preposition following a noun is impossible, but a verb followed by a preposition is very likely. Therefore hand is tagged as a verb in this context.
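To make this concrete, the following is a minimal sketch of part-of-speech tagging in Python with NLTK; using NLTK's default tagger here is an illustrative assumption, since this work obtains POS tags from CoreNLP (cf. chapter 5).

import nltk

# requires the NLTK data packages 'punkt' and 'averaged_perceptron_tagger'
tokens = nltk.word_tokenize("hand me the documents")
tagged = nltk.pos_tag(tokens)  # list of (token, Penn Treebank tag) pairs
print(tagged)
# e.g. [('hand', 'VB'), ('me', 'PRP'), ('the', 'DT'), ('documents', 'NNS')],
# with 'hand' resolved to a verb from its context (cf. table 2.1)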

Table 2.1: Excerpt of the Penn Treebank Project English part-of-speech tags.

Tag   Description                          Example
CD    Cardinal number                      24
DT    Determiner                           the
JJ    Adjective                            blue
NN    Noun, singular or mass               ball
NNS   Noun, plural                         objects
PRP   Personal pronoun                     me
VB    Verb, base form                      be
VBZ   Verb, 3rd person singular present    is

¹ The entire list of the English part-of-speech tags from the Penn Treebank Project is available at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html


Figure 2.1: Parse tree of a phrase structure grammar. The part-of-speech tags are listed in table 2.1.

Phrase structure grammar

For phrase structure grammars a parse tree is built by taking part-of-speech tagged tokens and iteratively applying predefined production rules until a parse tree is found which holds all tokens as leaf nodes. The process becomes iterative when at least one token cannot be matched according to the rules; then another iteration starts. Examples for such rules are:

S  → VP | NP VP
NP → NP NP | DT JJ NN | PRP
VP → VB NP

in which S (sentence), NP (noun phrase) and VP (verb phrase) are non-terminal symbols defining the syntax of the sentence. The tokens are terminal symbols containing the semantics [Koe15]. Figure 2.1 shows the parse tree of an example sentence built with these rules.
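As an illustration, the production rules above can be applied with NLTK's chart parser; the terminal productions added at the bottom are assumptions needed to make the toy grammar self-contained.

import nltk

# the rules from the text, extended by terminal productions for one sentence
grammar = nltk.CFG.fromstring("""
    S   -> VP | NP VP
    NP  -> NP NP | DT JJ NN | PRP
    VP  -> VB NP
    VB  -> 'give'
    PRP -> 'me'
    DT  -> 'the'
    JJ  -> 'blue'
    NN  -> 'ball'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("give me the blue ball".split()):
    tree.pretty_print()  # prints a parse tree like the one in figure 2.1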

Dependency grammar

Dependency grammar based approaches map the relations tokens have to each other [CM14b]. The verb of a sentence is always the root of the structure. There are no other symbols needed as in the phrase structure grammar, since the tokens are directly the nodes of a tree. This usually results in a shallower tree. Figure 2.2 shows a dependency tree as an example for dependency mapping.


Figure 2.2: Dependency tree built by a dependency grammar. The part-of-speech tags are listed in table 2.1.

2.1.2 Universal Dependencies

Universal Dependencies (UD) is a multilingual treebank collection [Niv+16]. The project aims at developing the same syntactic dependency annotations for many languages to enable cross-lingual analyses and parsing. Table 2.2 shows six of the 40 grammatical relations which are defined through the UD project.²

The result of annotating a sentence in English and German with Universal Dependencies is shown in figure 2.3.

Table 2.2: Excerpt of the Universal Dependencies.

Relation   Description           Example
amod       Adjectival modifier   ball → blue
det        Determiner            ball → the
dobj       Direct object         give → ball
iobj       Indirect object       give → me
nsubj      Nominal subject       I → give
nummod     Numeric modifier      objects → three

² The entire list of the grammatical relations from the Universal Dependencies project is available at http://universaldependencies.org/docs/u/dep/index.html


Figure 2.3: Universal Dependencies for the same sentence in English and German. The part-of-speech tags are listed in table 2.1. The explanations of the Universal Dependencies annotations of this sentence are given in table 2.2.

2.1.3 WordNet

Semantic analysis aims at deriving the meaning of a sentence. This starts with the analysis of the individual words. Just as a word may be a noun or a verb depending on the syntax, it may have different meanings depending on its context. WordNet lists 44 meanings and examples for the word give as a verb [Pri10]. Six of them are denoted in table 2.3.

Table 2.3: Excerpt of the different meanings of the word give from WordNet [Pri10].

Meaning                                   Synonyms                                                     Example
Be the cause or source of                 Yield, give, afford                                          He gave me a lot of trouble
Organize or be responsible for            Hold, throw, have, make, give                                Give a course
Place into the hands or custody of        Pass, hand, reach, pass on, turn over, give                  Give me the spoon, please
Break down, literally or metaphorically   Collapse, fall in, cave in, give, give way, break, founder   The roof finally gave under the weight of the ice
Offer in good faith                       Give                                                         He gave her his word
Proffer (a body part)                     Give                                                         She gave her hand to her little sister


WordNet is a lexical database containing definitions of English open class words in a semantic network which interconnects them [Fel06]. Words are grouped into synonym sets which are called synsets. A synset holds words related to one specific concept as well as their synonyms. The words of a synset belong to the same word type. The word table as a piece of furniture belongs to another synset than table as a data arrangement in rows and columns, which describes another concept. In its current version WordNet holds 117,659 synsets [Pri10].

Figure 2.4 illustrates common conceptual synset relations for nouns, exemplified by the word table defined as a piece of furniture. Hierarchical relations are hyponym-hypernym relations and meronym-holonym relations [Fri12, pp. 221-222]: Counter is a hyponym of table, therefore table is a hypernym of counter. Leg is a meronym of table, thus table is a holonym of leg. Sister terms share the same direct hypernym within a concept [Pri10]. There are also relations between synsets; for example, good is related to bad by an antonymy relation.

Figure 2.4: Subset of the synset relations for the word table, defined as "a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs" [Pri10].
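The relations described above can be navigated programmatically through NLTK's WordNet interface, as the following sketch shows; the sense identifiers (table.n.02, good.a.01) are illustrative assumptions.

from nltk.corpus import wordnet as wn

table = wn.synset('table.n.02')  # table as a piece of furniture
print(table.definition())

print(table.hypernyms())       # more general concepts, e.g. furniture
print(table.hyponyms())        # more specific tables, e.g. counter
print(table.part_meronyms())   # parts of a table, e.g. leg

# antonymy is a relation between word senses, e.g. good <-> bad
print(wn.lemma('good.a.01.good').antonyms())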

WordNet is a useful resource in NLU for semantic similarity measurements. For this work it is particularly interesting to compare an unknown word's meaning to a known word's meaning to determine their lexical semantic similarity. There are various WordNet-based semantic similarity measures available. [MHG13] groups them into path based measures, information content based measures, feature based measures and hybrid measures, and gives an overview of their principles, features, advantages and disadvantages. One of the information content based measures is the method of Jiang and Conrath (JCN) [JC97].

Jiang and Conrath Semantic Similarity

The method of JCN determines the semantic distance of word meanings within a semantic net by combining a node-based with an edge-based approach. The node-based part relies on the probability of the occurrence of a synset within a semantic net, which denotes its information content. The information content (IC) of a synset is defined as:

IC(s) = \log \frac{1}{P(s)}    (2.1)

where s is a synset and P(s) denotes the probability of occurrence of instances of s in a large text corpus. Consequently, a synset carries less information the more frequently it occurs, and vice versa.

The edge-based part focuses on the distance measurement between child nodes and parent nodes, considering the hierarchical structure of WordNet. The first step towards the link strength calculation is determining the conditional probability of encountering any of its child nodes s_c given a parent node s_p:

P(s_c \mid s_p) = \frac{P(s_c \cap s_p)}{P(s_p)} = \frac{P(s_c)}{P(s_p)}    (2.2)

The probability P(s_c \cap s_p) equals P(s_c), as the instances of s_c form a subset of the instances of s_p.

The link strength (LS) is defined by the IC of the link; it is therefore the negative logarithm of the conditional probability P(s_c \mid s_p):

LS(s_c, s_p) = -\log P(s_c \mid s_p) = IC(s_c) - IC(s_p)    (2.3)

As a result, the link strength between a child synset and a parent synset is given by the difference of their ICs.


The resulting distance measure D_{JCN} sums the link strengths between the two synsets via their closest common parent synset:

D_{JCN}(s_1, s_2) = IC(s_1) + IC(s_2) - 2 \cdot IC(s_3)    (2.4)

where s_1 and s_2 are the synsets which are compared to each other and s_3 is their closest common parent synset.

The resulting similarity score S_{JCN} used in WordNet is the multiplicative inverse of the distance measure:

S_{JCN}(s_1, s_2) = \frac{1}{D_{JCN}(s_1, s_2)}    (2.5)
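In NLTK, which provides the WordNet interface used in this work, the JCN score can be computed directly; the information content file (here precomputed over the Brown corpus) and the chosen synsets are illustrative assumptions.

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# IC(s) values precomputed over the Brown corpus, following equation (2.1);
# requires the 'wordnet_ic' NLTK data package
brown_ic = wordnet_ic.ic('ic-brown.dat')

s1 = wn.synset('cat.n.01')
s2 = wn.synset('dog.n.01')

# S_JCN(s1, s2) = 1 / D_JCN(s1, s2), cf. equations (2.4) and (2.5)
print(s1.jcn_similarity(s2, brown_ic))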

2.2 Relational Activity Process

Relational Activity Process (RAP) is a model for concurrent cooperation domains developed by [Tou+16]. Concurrent domains are domains with multiple tasks running in parallel and sequentially, with diverse starting and ending points. Cooperation domains involve any kind of human-robot cooperation or multi-agent/multi-robot cooperation. The model is called RAP as it uses relational state representations, which means that the relations between objects and agents are modeled. Activities relate to the tasks performed by agents. Two characteristics of the model are [Tou+16]:

1. Concurrent activities are regarded as parts of the state

2. Decisions mark the starting and ending point of activities

Utilizing Monte Carlo planning or reinforcement learning in concurrent cooperation domains is impossible with standard Markov decision processes (SMDPs), as these consider neither duration nor concurrency. RAP enables the transfer of SMDPs to concurrent cooperation domains by considering activities as part of the state and decisions as starting and ending points of these activities [Tou+16]. Figure 2.5 shows a diagram of concurrent activities as defined by RAP.

Within a real-world example of direct policy learning, a command line interface was used to issue commands to the robot.


Figure 2.5: RAP concurrent activities performed by three agents, with decisions marking starting and ending points of these activities.


3 Related Work

This section outlines related work in natural language understanding, by means of how to derive the meaning of a natural language command and translate it to the corresponding logical form. Several individual aspects have to be considered for creating a natural language understanding system. These will be studied in detail:

• Knowledge base

• Semantic parsing

• Human-robot dialog

Speech processing is not studied here, as the focus of this work is natural language understanding. The translation from speech to text and text to speech concerns digital signal processing, which is not a part of natural language understanding. Learning can also be part of a natural language understanding system, but it is out of the scope of this work.

3.1 Knowledge base

According to [Hel06, p. 279] the concept of a knowledge base includes linguistic knowledge and world knowledge. Linguistic knowledge consists of lexical and grammatical knowledge; world knowledge describes the meaning of a word. World knowledge does not imply having average human knowledge about the world (commonsense knowledge) or a specific part of it. It is also stated that there are no sharp boundaries between these different types of knowledge. For this work, however, grammatical knowledge is discussed separately in section 3.2 as a method for language understanding. This section gives insight into creating and using lexical, world and commonsense knowledge.

The knowledge base of [Mat+13] is created by training a parser with examples of natural language commands and the corresponding logical forms. They exemplarily apply this method to the domain of following navigation instructions through a previously unknown indoor environment. Commonsense knowledge was not taken into account. The main disadvantage of this approach is the poor reusability of a learnt parser; the adaptation to another domain requires training the parser with a new set of suitable examples. The approach thus relies heavily on the coverage of the gathered examples, a problem which occurs in any grammar-based approach.

[KTF15] took a different approach, which uses an ontology as commonsense knowledge base. An ontology is a form of semantic knowledge representation that contains objects, their properties, and their relations to each other. Their ontology knowledge base as well as their subsystem, which performs the object grounding, are domain restricted. A finding of a later work on this approach [ETF16] is, amongst others, that domain-independent semantics make the adaptation to new domains easier.

This work uses a semantic network as commonsense knowledge base like [KTF15], with the difference of not setting up an own domain-specific semantic network. The idea is based on the approach of using an external semantic network, which was taken by [GR14] for learning from natural instructions. As this work relies on a general semantic network, it has no domain restriction concerning world knowledge. But to deal with world knowledge it has a domain-specific hand-crafted lexicon. This lexicon connects the domain vocabulary with the semantic network and commonsense knowledge.

3.2 Semantic parsing

Natural language understanding means deriving the content of natural language input. There are various methods which can be applied depending on the desired outcome, whether it is identifying the topic of a text or, as in this work, identifying the instruction to be executed.

If there is only a single sentence and the desired outcome is to map a natural language sentence to the right command, a precise analysis of the sentence is necessary to understand the entire content exactly; this is called semantic parsing. For example, there is a significant difference between transferring something to table number one from place number two, and transferring it to place number two on table number one. It is not enough to understand that the sentence is about transferring an object in order to execute the intended command. As any sentence in natural language follows general grammatical structures, these structures can be utilized to determine which action should be done, which objects shall be involved, and which locations need to be considered and how. But natural language is not as easy to parse as source code, due to its complicated grammar, word ambiguity, and the many possibilities to express subtle differences in meaning.

Various so-called semantic parsing approaches have been used for natural language understanding purposes. They can basically be grouped into two categories: grammar based approaches and data driven approaches.


[KTF15] uses Embodied Construction Grammar (ECG) for semantic parsing of natural language in English and, to show the flexibility of their system, in Spanish. They admit that the learnt rules do not fully cover the English grammar, but the rules work well for moderately complex task instructions. [KTF15] do not use spoken language yet, which will be shown in this work. Spoken language has a large influence on natural language understanding, as it adds factors of uncertainty through noise, and it differs from written commands in the choice of words and grammar. The portability of this work's approach to other languages will also be discussed.

[ALZ15], [EB14] and [Per+15] use Combinatory Categorial Grammars (CCG) for semantic parsing. [Mat+13] uses a probabilistic version of CCG for navigation instructions. They state being able to deal with very complex natural language instructions due to their approach's probabilistic nature. After grounding navigation instructions in [AZ13], the focus of [ALZ15] lies on the reusability aspect of CCG for semantic parsing in various domains. [EB14] applies CCG with a focus on parsing robotic spatial commands with and without the integration of a spatial planner. The topic of [Per+15] is robotic task assignment through dialog handling.

All of the above mentioned grammar based approaches share the same limitations: the necessity to generate large amounts of training data covering many different grammatical aspects, and the limited portability of the learnt grammar rules to other domains. These two points shall be improved in this work by significantly reducing the amount of training data necessary and by using a universally valid parser which is not domain specific.

[Bas+14] compares a grammar based approach, like the above mentioned, with a data driven approach for human-robot interaction on several datasets. The grammar based approach recognizes only a restricted set of commands and performs very well on them. But when complexity rises and speech differences occur, the performance declines. The data driven approach uses a statistical semantic parser on free-form text input, which is more robust and portable to other domains. It uses Frame Semantics [Fil76] as the desired parsing result. In Frame Semantics, a frame is a situation, like an action with all necessary arguments such as the objects and goals involved. The workflow of the data driven approach is similar to the one taken in this work. As frames do not match the abstract domains studied here, the parsing results are different.

[RLS14] introduces a data driven approach for a question answering system for querying Freebase in natural language. Their semantic parser maps the output of a CCG parser to a graph-based representation which is similar to Freebase. By using this practice, it is not necessary to manually annotate question-answer pairs for training like in grammar based approaches. In a later work [Red+16], they outperform CCG based approaches by converting dependency structures to logical forms for querying Freebase.


Involving dependency structures and graph-based representations also seems to be a promising approach for semantic parsing of natural language instructions for robots. A suitable graph-based representation for robot instructions was found in the work of [Tel+11a]. Their meaning representation is called the generalized grounding graph (G3).

3.3 Human-robot dialog

Sophisticated human-robot communication systems have been introduced in recent years for various use cases. [Lis15] compares three dialog management approaches (hand-crafted model, factored statistical model, rule-structured model), with the rule-structured probabilistic approach outperforming the other two. The scope of this work does not take into account conversation, human feelings or non-verbal communication like gaze. The focus of this human-robot communication system is functionality: bidirectionally asking questions and getting answers.

It is necessary for the robot to ask questions if a command is incomplete due to speech recognition inaccuracy, information only implied by the human, or incomplete semantic parsing. Considering autonomously working robots with planning abilities, a dialog system makes these robots co-workers instead of subordinates, as it enables them to suggest tasks.

[Tho+15] achieved good results by using question templates to clarify or confirm robotic commands. The goal of their work was to improve natural language understanding by dialog management. Their evaluation was done with both written communication and speech. In written communication, spelling errors lead to misunderstanding; freedom from errors cannot be guaranteed by using written communication instead of speech recognition.

The inverse semantics approach of [Tel+13], based on inversely using their semantic parser so that an autonomous robot can ask a human for help, was outperformed by question templates. They improved their approach twice in [Tel+14] by using two different metrics, both scoring better than the question templates this time. Both works' application domain is the assembly of furniture by an autonomous robot which asks a human for help, requesting for example to place a certain item at a certain place.

The work of [Dei+13] is also based on the semantic parser and G3 of [Tel+11a], but with the objective of clarifying commands. Different question-asking strategies have been evaluated: yes-or-no questions, targeted questions and asking to rephrase the whole command.


Using template questions matches the needs of this work. The question-asking strategies targeted question and asking to rephrase from [Dei+13] will be used for clarifying input. As the meaning representation of this work is also based on the work of [Tel+11a], they match well. Another template-based question-asking strategy, similar to the confirmation questions of [Tho+15], is used for suggesting tasks: asking the human partner whether performing a specific task is reasonable.


4 Designing a natural language understanding and communication system

The goal of this work is to enable unrestricted natural language communication with robots in cooperative tasks. This chapter presents the set-up and design of an NLU system with a communication system for ambiguity resolving which is suitable for the Relational Activity Process (RAP). The chapter is split into three sections: The NLU methods as well as the key points concerning the communication are studied separately in sections 4.1 and 4.2. The final section 4.3 describes the serration into an overall system and the interconnection to RAP.

4.1 NLU methods

The general idea followed in this work is finding an approach that is as modular, adaptable and portable as possible. Modularity is given by the possibility of exchanging parts like the semantic net or the speech recognition and speech synthesis. Adaptability is considered in terms of using the system in further domains with few adaptations. The portability is a result of the modularity and of the focus on accessing general resources instead of domain specific ones.

To set up an NLU system there are several decisions to be made concerning the following aspects:

1. Syntactic parsing

2. Meaning representation

3. Knowledge base

4. Mapping


Points one to three are key points of semantic parsing. Mapping is related to the overall system, considering the further processing of the semantically parsed output, but it is nevertheless a part of an NLU system.

In this section the input command is considered as already being available in written form. Speech recognition and the resulting issues are covered in section 4.2.

4.1.1 Syntactic parsing

The grammatical structure analysis is the first step of semantic parsing. It can be realized by a parser based either on dependency grammar or on phrase structure grammar. The identification of connections between words is an advantage which dependency grammars have over phrase structure grammars. As the desired outcome of syntactic parsing in this work is revealing the links between words, a dependency grammar based parser is chosen. The links are represented by Universal Dependencies. Considering adaptability, the chosen parser does not need any domain specific training.

4.1.2 Meaning representation

The desired meaning representation needs to be capable of representing commands for a stationary robot. [Tel+11a] introduced a meaning representation for stationary and mobile robots called the generalized grounding graph (G3). The G3 is a probabilistic graphical model which dynamically adapts itself to the particular structure of a command. The resulting structure of the model is a tree. The composition of the model is based on the formalization of spatial description clauses (SDCs) introduced by [Kol+10]. SDCs are defined in four categories which allow dividing the command into certain partial meanings:

• Event: An action to be executed

• Object: A physical object in the world

• Place: A location in the world

• Path: A travel path instruction

Analogously to the work of [Tel+11a], in this work the dependency types are used to assign the parsed outputs to SDCs. However, the SDC path is excluded from this work, as the robot is stationary and instructions specifying a travel path do not occur.

The meaning representation of this work is a tree, as shown in figure 4.1. One tree is built for each sentence. The size of the tree depends on the syntactic parsing result, just as for the G3 [Kol+13]. Usually there are one event, one or more objects and up to two places found per command. Not all of the categories necessarily appear in each command and some might appear more than once: "Release the red square", for example, has just one event and one object; "transfer the blue ball from area one to area two" consists of one event, one object and two places.

Figure 4.1: Meaning representation tree. The command is divided into the SDCs event, objects and places.

Each leaf includes not only the object itself but also its descriptive elements, like its color.

As the structure is a tree without a fixed number of leaf nodes, it is transferable to any domain where the command descriptions are expressible in events, objects and places.
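A possible data structure for such a tree is sketched below in modern Python (the thesis implementation targets Python 2.7); the class layout and field names are illustrative assumptions, not the actual implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SDC:
    category: str                # "event", "object" or "place"
    head: str                    # e.g. the verb or the object noun
    modifiers: List[str] = field(default_factory=list)   # e.g. colors
    children: List["SDC"] = field(default_factory=list)  # subtree

# "transfer the blue ball from area one to area two"
command = SDC("event", "transfer", children=[
    SDC("object", "ball", modifiers=["blue"]),
    SDC("place", "area", modifiers=["one"]),  # fromPlace
    SDC("place", "area", modifiers=["two"]),  # toPlace
])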

4.1.3 Knowledge base

As mentioned in section 3.1, a semantic network is chosen as commonsense knowledge base. A handcrafted lexicon is the connecting element between natural language words and the commonsense knowledge. The handcrafted lexicon contains word meanings as entry points for comparisons within the semantic network.

The semantic network WordNet has been used successfully in several NLU tasks [MF07]. WordNet represents commonsense knowledge as it consists of synsets which categorize, connect and associate the meanings of words. It also represents world knowledge because it is at the same time a lexicon of word meanings. Using a general, proven network has the advantage of an extensive, reliable data basis to depend on. This data basis is used for word meaning purposes and word similarity measurements.


Figure 4.2: Word similarity measurement of an event e1 within WordNet. The lexicon entries x, y and z mark the entry points to WordNet.

The WordNet project started in 1985 and is still under active development [Pri10]. This is important for natural language tasks, as word meanings change over time and new words get invented. Due to its structure and scope, WordNet is the ideal semantic net for the desired word similarity measurements.

The handcrafted lexicon contains word meanings for actions a robot can take as well as for domain specific objects and locations. The lexicon is tailored to the WordNet synsets. With word similarity measurements, the words in the meaning representation get compared against the words in the lexicon. Word similarity measurements indicate the relatedness of two words. As words with the same meaning are merged into the same synset, and synsets with related meanings are connected to each other, the lexicon needs to contain only few words as entry points. The relatedness between words belonging to the same synset is obviously very high. Figure 4.2 exemplarily illustrates the meaning of entry points and word similarity measurements within WordNet.

More precisely, the desired similarity measurement of this work is a concept similarity measurement between word meanings. There exist several options for word meaning similarity measurements in WordNet. The method of Jiang and Conrath (JCN) [JC97], which is based on the information content, is one of them. JCN was named the measure giving the best results overall by [BH06], and among the three most promising for term-to-term similarity by [BWB13]. Hence JCN is chosen as the method in this work. The technical explanation of JCN was presented in section 2.1.3.

With JCN, each noun and verb in the leaves of the meaning representation gets assigned the most similar entry point by comparing the semantic distance between the words and the entry points.
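The following sketch illustrates this assignment; the lexicon contents, the helper name best_entry_point and the fallback behavior are assumptions for illustration, not the thesis' actual code.

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')

# hypothetical domain lexicon: WordNet entry points for known actions
LEXICON = {
    'transfer': wn.synset('transfer.v.01'),
    'hold': wn.synset('hold.v.01'),
}

def best_entry_point(word, pos=wn.VERB):
    """Return the lexicon key whose entry point is most JCN-similar
    to any sense of the given word, together with the score."""
    best_name, best_score = None, 0.0
    for sense in wn.synsets(word, pos=pos):
        for name, entry in LEXICON.items():
            try:
                score = sense.jcn_similarity(entry, brown_ic)
            except Exception:  # senses without a common subsumer
                continue
            if score > best_score:
                best_name, best_score = name, score
    return best_name, best_score

print(best_entry_point('carry'))  # e.g. ('transfer', <score>)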

A key point of the knowledge base creation and adaptation is identifying the words and synsets describing the desired meaning. Objects and places of a new domain need to be identified or learnt whenever new ones appear. The robot's possible courses of action need to be specified only once, which benefits the portability to a new domain.

4.1.4 Mapping

Via the word similarity measurements with the knowledge base, the words in the meaning representation get linked to the application environment. The mapping is the translating step between the meaning representation and the specific robot command.

The knowledge base of this work also contains a list of skeletons of the executable commands. The skeleton for the command of transferring an object from one place to another would be:

[event transfer][object {}][fromPlace {}][toPlace {}]

The result of the mapping is an executable command based on one of the skeletons. For the natural language order "Carry the blue ball from area number one to the second area", the preceding skeleton would be filled as follows:

[event transfer][object {ball_blue}][fromPlace {area1}][toPlace {area2}]
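A minimal sketch of this mapping step is given below; the dictionary-based skeleton and the helper fill_skeleton are illustrative assumptions mirroring the notation above.

TRANSFER_SKELETON = {'event': 'transfer', 'object': None,
                     'fromPlace': None, 'toPlace': None}

def fill_skeleton(skeleton, groundings):
    """Fill a copy of the skeleton with grounded symbols; fields left
    None are the gaps the dialog system later asks about."""
    filled = dict(skeleton)
    filled.update({k: v for k, v in groundings.items() if k in filled})
    return filled

# "Carry the blue ball from area number one to the second area"
command = fill_skeleton(TRANSFER_SKELETON,
                        {'object': 'ball_blue',
                         'fromPlace': 'area1', 'toPlace': 'area2'})
gaps = [k for k, v in command.items() if v is None]  # empty -> executable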

4.2 Communication methods

In this section key points of spoken communication are discussed. This covers on the one hand issues that arise from spoken NLU and on the other hand questioning and answer strategies for ambiguity resolving.


4.2.1 Issues of spoken NLU

The method of language input also needs to be considered when utilizing NLU. As stated previously, this work aims at hands-free and therefore spoken communication. Consequently, spoken communication is discussed here with cross references to written communication.

Spoken communication adds a layer of uncertainty, as speech needs to be translated to text first. Spoken communication also has a higher level of implicit content. On the other hand, in spoken language the chosen words and grammar are generally simpler than in written sentences.

Within the transmission of speech to text, one error-prone topic is the speech production in terms of the chosen microphone, the articulation of the speaker and the background noise. While written communication methods have to deal with spelling mistakes, spoken language recognition faces a broad variety in pronunciation, dialect and colloquial language. Speech recognition quality, the second topic, highly depends on the chosen speech recognition framework. Errors that arise in the transmission of speech to text lead to wrongly recognized or completely unrecognized words.

Apart from word recognition errors, there might be information missing for the robot's understanding which the human did not state but implied. This is a significant difference between human-robot and human-human dialog: Not taking facial expressions or gestures into account increases the amount of information that needs to be expressed through spoken words.

In general, the necessity of ambiguity resolving arises when none of the skeletons can be filled completely during the mapping step. This has its origin in semantic parsing, where the extracted information cannot be parsed correctly. Reasons for this lie in wrongly recognized words, insufficient information or the restricted NLU capabilities of the robot.

Communication is a mutual activity, especially in cooperative tasks: A robot should also have the ability to make itself understood. This can be realized by text-to-speech frameworks.

4.2.2 Questioning and answer strategies

To resolve the issues occurring during the mapping step, questioning strategies are the methods of choice. Clarification through questions is a common method for ambiguity resolving in human-human dialog [Dei+13]. Consequently, it is also a suitable strategy for human-robot communication. There are several ways to build up questions, see section 3.3. Template based questions are a well-tried strategy to enhance command understanding.

The following questioning types, modeled on [Dei+13], will be used:

• Targeted question

• Asking to rephrase

The targeted question of [Dei+13] is adapted to fill the gaps in the skeleton during mapping if fewer gaps occur than a certain threshold. If the number of gaps is above the threshold, the question type asking to rephrase is used. This resets the NLU to the starting point where the human gives a command. This combination seems to be more useful for this work than for [Dei+13], as the targeted questions are applied differently. In [Dei+13] the targeted questions are used to specify an object by asking in the form: "What does the word ... refer to?" In this work the targeted questions are of the form:

"Which action shall I perform?"
"Which object shall I use?"
"Which location shall I transfer ... to?"
"Which location shall I transfer ... from?"

This question form adds the necessity of understanding answers which are not semantically parsable in the same way as a command: The answer consists of only one or two words instead of a sentence. But, as in human-human dialog, the reference from the question to the answer is obvious. Targeted questioning is an iterative task, where the questions are asked successively, targeted at one gap per question, until all gaps in the skeleton are filled and the command can be executed. The functioning is illustrated in figure 4.3.

Apart from the exit strategy asking to rephrase, which is triggered by the robot, there should be an option to stop the question-answering process which can be triggered by the human.

For autonomously working collaborative robots that do not only follow commands, another questioning strategy is necessary to communicate with a human. In this use case the obstacle is not the uncertainty concerning the mapping but the choice of the next task. [Tho+15] uses confirmation questions like "You want me to bring ... to ...?". This is applied to the use case of the robot asking for the reasonableness of a possible task instead of confirming a given command. An example would be to ask: "Shall I carry the blue ball from area number one to area number two?" This opens two options for a response: allow the robot to proceed, or specify the desired command instead. As figure 4.4 shows, this questioning strategy acts like a cover around the common command-to-execution workflow with its ambiguity resolving.
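The ambiguity resolving loop of figure 4.3 can be summarized in Python roughly as follows; the templates, the gap threshold and the I/O helpers ask and parse_answer are illustrative assumptions.

GAP_THRESHOLD = 2  # above this, the robot asks to rephrase the command

TEMPLATES = {
    'event': "Which action shall I perform?",
    'object': "Which object shall I use?",
    'toPlace': "Which location shall I transfer it to?",
    'fromPlace': "Which location shall I transfer it from?",
}

def resolve_ambiguities(skeleton, ask, parse_answer):
    """Fill skeleton gaps via targeted questions; ask() poses a question
    and returns the spoken answer, parse_answer() grounds it."""
    gaps = [k for k, v in skeleton.items() if v is None]
    if len(gaps) > GAP_THRESHOLD:
        ask("Please rephrase the whole command.")
        return None                    # restart at the command level
    for gap in gaps:                   # one targeted question per gap
        answer = ask(TEMPLATES[gap])
        skeleton[gap] = parse_answer(gap, answer)
    return skeleton                    # all gaps filled -> executable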


Figure 4.3: Workflow of the questioning system for ambiguity resolving. The decision which question type is used depends on the chosen threshold for allowed gaps in the skeleton.

Figure 4.4: Workflow of the task suggestion use case triggered by the robot. If the answer is "no", the human has to specify another command. In case ambiguities appear, the grey shaded ambiguity resolving workflow illustrated in figure 4.3 is followed.


4.3 Overview of the design and serration points

For the setup of the NLU system with a communication component, both parts get connected. The connecting point for the clarifying dialog system is situated just after the mapping into the command skeleton. The task suggestion can be seen as an independent component which is placed before the NLU system. The connection to the NLU system is built when the human's reaction to the task suggested by the robot is processed.

The design of the NLU system with the communication component for ambiguity resolving is displayed in figure 4.5. In the middle of the chart the general straightforward NLU workflow is shown without any negotiation. The semantic net, as an external component, is set to the left. The domain lexicon above it is the comparison base between the semantic net and the words in the meaning representation. The dialog system for ambiguity resolving is shown on the right side of the graphic.

With the strategy of allowing not only the human but also the robot to suggest tasks, this NLU and communication design is suitable for human-robot collaboration. The NLU and communication system can be plugged into RAP at any point where NLU or negotiating about a command is necessary.

Adapting to another domain

The constructed design works, with some changes, for any domain. The semantic parsing part is universal for any domain, as long as the desired result is to determine

• the desired action to perform

• the objects to involve

• the places which are concerned.

The same applies to the dialog system. As long as the desired result is the same, the questioning templates are valid.

Concerning the mapping, the command skeletons need to be extended if, for example, the robot is given an additional course of action.

As a connection point, the domain lexicon requires the most adaptation. If learning is not taken into account, as in this work, each object in the used domain is considered to be known. Therefore it needs to be added to the domain lexicon. The same applies to actions the robot can perform.


Figure 4.5: Design and serration of the NLU and communication system.


5 Evaluation

The purpose of this chapter is to evaluate the NLU and communication system presented in section 4. In the course of the evaluation, first the implementation of the NLU and communication system is described in section 5.1. This also includes the implementation of the connection to RAP. Subsequently, two application domains with different semantic parsing requirements are introduced in section 5.2. Finally the system's NLU capabilities are analyzed in section 5.3: This analysis investigates the influence of the input on the understanding, the processing time of individual components and the benefit of the ambiguity resolving dialog system. Findings will be evaluated separately for the two investigated domains and furthermore in comparison with each other.

5.1 Implementation

The implementation of the NLU system and the communication parts is done in Python 2.7. Python was chosen as programming language because connections to the required natural language processing tools are available. To realize the entire system the following external tools and functions are used:

CoreNLP [Sta17]                  Dependency parsing and POS tagging
WordNet [Pri10]                  Semantic net
NLTK [NLT15]                     Connection to WordNet, connection to CoreNLP
CMU Sphinx [Car16]               Offline speech recognition
Google Cloud Speech [Pla17]      Online speech recognition
pyttsx [Par13]                   Text to speech
SpeechRecognition 3.6.0 [Zha17]  Connection to pyttsx, connection to CMU Sphinx

As the goal is to reuse the system for several domains, it is reasonable to use proven tools without the necessity of domain-specific training. This work uses the same syntactic parser and POS tagger from CoreNLP as [Bas+14]. They state that retraining is unnecessary due to the quality which is reached out of the box.
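As an illustration, obtaining a dependency parse through NLTK's CoreNLP interface can be sketched as follows. This is a sketch under two assumptions: a CoreNLP server is running locally on port 9000, and the NLTK version in use provides the CoreNLPDependencyParser class.

    # Illustrative sketch: dependency parsing a command via NLTK's CoreNLP
    # interface (assumes a CoreNLP server running on localhost:9000).
    from nltk.parse.corenlp import CoreNLPDependencyParser

    parser = CoreNLPDependencyParser(url='http://localhost:9000')
    parse, = parser.raw_parse('give me the handle')

    # Each triple is ((head, tag), relation, (dependent, tag)),
    # e.g. (('give', 'VB'), 'obj', ('handle', 'NN')).
    for governor, relation, dependent in parse.triples():
        print((governor, relation, dependent))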


For speech recognition an offline and an online option were implemented. The offline option, CMU Sphinx, has the advantage of not needing an internet connection. The online option, Google Cloud Speech, offers near real-time processing and better recognition accuracy.
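Both options are reached through the SpeechRecognition library. A simplified sketch follows; error handling is omitted, and it is assumed that the Google Cloud credentials are supplied via the environment.

    # Sketch of the two speech recognition options via the
    # SpeechRecognition library; error handling is omitted.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate to background noise
        audio = recognizer.listen(source)            # record one utterance

    # Offline: CMU Sphinx, no internet connection required
    print(recognizer.recognize_sphinx(audio))

    # Online: Google Cloud Speech (assumes credentials in the environment)
    print(recognizer.recognize_google_cloud(audio))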

Concerning WordNet, the entry points need to be defined in the domain lexicon. Defining the correct word meanings and the quantity of meanings for each action and object is done by personal judgment. Choosing the threshold for similarity measurements is also a matter of trial and error. In general, the lower the threshold is set, the higher is the possibility of incorrectly mapping word meanings. On the downside, if the threshold is chosen too high, too few word meanings get considered similar. The choice of the threshold also depends on the similarity measurement. Here the measurement of [JC97] is used, cf. section 2.1.3. The possible range of the similarity measurement lies between zero for absolute dissimilarity and infinity for maximum similarity, e.g. for sister terms.
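Such a thresholded similarity check can be sketched with NLTK's WordNet interface as follows; the Brown information content corpus and the synset names are illustrative assumptions, and the resulting values vary with the corpus choice.

    # Sketch of a JCN similarity check against a threshold, using NLTK's
    # WordNet interface with the Brown information content corpus
    # (corpus choice and synset names are illustrative assumptions).
    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')
    THRESHOLD = 0.1  # values below this count as too loosely related

    def action_is_similar(word, lexicon_synset_name):
        """Check whether any verb meaning of `word` is close enough to
        the meaning defined for a lexicon action."""
        target = wn.synset(lexicon_synset_name)  # e.g. 'release.v.01'
        return any(s.jcn_similarity(target, brown_ic) >= THRESHOLD
                   for s in wn.synsets(word, pos=wn.VERB))

    print(action_is_similar('drop', 'release.v.01'))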

The meaning representation tree, described in section 4.1.2, is realized as a nested dictionary. The command skeletons, cf. section 4.1.4, are implemented as dictionaries as well. The domain lexicon consists of separate dictionaries for actions, objects, locations and descriptors. The respective dictionary entries are chosen on the basis of the domain.
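For the toolbox domain, these structures might look roughly as follows. The concrete key names and WordNet entry points are assumptions for illustration, not the actual implementation.

    # Illustrative sketch of the dictionary-based structures for the
    # toolbox domain; key names and entry points are assumptions.

    # Domain lexicon: separate dictionaries per word class.
    action_lexicon = {
        'give':    'give.v.01',     # illustrative WordNet entry points
        'release': 'release.v.01',
        'attach':  'attach.v.01',
    }
    object_lexicon = {
        'handle':     'handle',
        'screw':      'screw',
        'left side':  'side_left',
        'right side': 'side_right',
    }

    # Command skeleton: slots to be filled during mapping; None marks a gap.
    skeleton = {'action': None, 'object_1': None, 'object_2': None}

    # Meaning representation: a nested dictionary derived from the
    # dependency tree, e.g. for "attach the handle to the left side".
    meaning = {
        'action': 'attach',
        'objects': [
            {'head': 'handle', 'descriptors': []},
            {'head': 'side', 'descriptors': ['left']},
        ],
    }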

RAP is written in C++. The connection between RAP and the NLU processing is done by pure embedding without the aid of external libraries. The execution of the Python implementations is triggered within a C++ function whenever NLU functionality is requested.

5.2 Experimental domain setup

To investigate the NLU capabilities from the natural language input to the filling of the command skeleton, example domains need to be selected. For testing purposes the natural language system was implemented for two RAP domains which have different semantic parsing requirements: The first domain only requires action and object mapping. The second domain extends the scope to location mapping.

5.2.1 Toolbox domain

The toolbox domain is a modification of the concurrent assembly assistance robot domain from [Tou+16]. The domain describes the collaborative assembly of a toolbox by a human and a robot. The toolbox to be assembled consists of three parts (left side, right side and handle). These parts shall be attached to each other with screws. The robot has the ability to give an object to the human, release the object and hold two objects in place for the human to attach. Within this domain the following actions and objects are defined:

actions: give(object)
         release(object)
         attach(object, object)

objects: side_left
         side_right
         handle
         screw

The following questioning templates are implemented:

ask to rephrase: (0) I did not understand you. Please rephrase the command.

ask targeted:    (1) Which action shall I perform?
                 (2) Which object shall I use?
                 (3) Which objects shall I use?
                 (4) Which object shall I use together with OBJECT1 and ... and OBJECTn?

suggest task:    (5) Shall I perform the action ACTION?
                 (6) Shall I perform the action ACTION with object OBJECT?
                 (7) Shall I perform the action ACTION with objects OBJECT1 and ... and OBJECTn?

The ask targeted questioning templates 1-4 need to cover the cases that the action (1), one object (2) or more objects (3) are not recognized. In this domain there are up to two objects which may not have been recognized. As the questioning templates are intended to be reusable, the template is flexible with respect to the number of objects. If more than one object is involved in the questioning process, it is necessary to inform the human if and which objects were already recognized. This is covered by questioning templates 4 and 7, as the sketch below illustrates. The suggest task questioning templates 5-7 cover the use case of task suggestions by the robot which is illustrated in figure 4.4.
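The following sketch shows how such flexible templates can be instantiated for a variable number of objects; the function names and joining logic are assumptions.

    # Sketch of instantiating the flexible questioning templates; the
    # function names and joining logic are assumptions.

    def ask_targeted_object(known_objects, missing):
        """Build question (2), (3) or (4), depending on what is known."""
        if known_objects:   # template (4): report already recognized objects
            return ("Which object shall I use together with %s?"
                    % " and ".join(known_objects))
        if missing > 1:     # template (3)
            return "Which objects shall I use?"
        return "Which object shall I use?"  # template (2)

    def suggest_task(action, objects):
        """Build suggestion question (5), (6) or (7)."""
        if not objects:
            return "Shall I perform the action %s?" % action
        noun = "object" if len(objects) == 1 else "objects"
        return ("Shall I perform the action %s with %s %s?"
                % (action, noun, " and ".join(objects)))

    print(ask_targeted_object(["handle"], 1))
    # -> Which object shall I use together with handle?
    print(suggest_task("attach", ["handle", "left side"]))
    # -> Shall I perform the action attach with objects handle and left side?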


5.2.2 Blockworld domain

The blockworld domain consists of three areas, seven objects and one action. The blocks are specified by their color and shape. The robot as well as the human can transfer objects from one area to another area. The goal of involving this domain is to show not only action and object but also location mapping.

Within this domain the following action, objects and locations are defined:

actions:   transfer(object, location_from, location_to)

objects:   triangle_red
           triangle_green
           triangle_blue
           ball_red
           ball_green
           square_blue
           square_red

locations: area_left
           area_middle
           area_right

The questioning templates from the toolbox domain, cf. section 5.2.1, were extended to also enable asking for locations. In this domain only two locations need to be considered: the location to pick an object from and the location to place an object on. If more than two locations are requested, the question templates involving the locations can be extended in the same way as questioning templates 4 and 7 from section 5.2.1. The following questioning templates are in use:

ask to rephrase: (0) I did not understand you. Please rephrase the command.

ask targeted:    (1) Which action shall I perform?
                 (2) Which object shall I use?
                 (8) Where is the object I shall pick?
                 (9) Where shall I place the object?

suggest task:    (5) Shall I perform the action ACTION?
                 (10) Shall I ACTION the OBJECT from the LOCATIONfrom to the LOCATIONto?

The questioning templates 0, 1, 2 and 5 are identical to the ones in the toolbox domain, whereas the templates 3, 4, 6 and 7 are not in use. The questioning templates 8, 9 and 10 were added for dealing with locations.


5.3 Analysis of the NLU capabilities

Within this section, various aspects of the implemented NLU workflow are analyzed on the basis of the domains described in section 5.2. First, in section 5.3.1, the focus is set on the speech recognition. Second, the communication system is evaluated regarding questioning necessity for both - the offline and the online - speech recognition systems in section 5.3.2. Concluding, the processing time of individual components of the NLU system is examined in section 5.3.3.

5.3.1 Speech recognition analysis

The scope of this part of the evaluation was to analyze the two speech recognition systems. Special consideration was given to the processing time of the transcription from speech to text, the accuracy of the speech recognition and the determination of the necessity for an ambiguity resolving system. The experiment was conducted separately for the domains.

Experiment with the toolbox domain

As the first part of the experiment, the analysis was done for the toolbox domain using CMU Sphinx as well as Google Cloud Speech. The goal was to expose differences between the speech recognition systems. For the analysis, all possible commands from the toolbox domain were expressed ten times for each valid combination of actions and objects. This makes 100 utterances in the following proportions:

• 40 times the command give with the objects handle, left side, right side and screw

• 30 times the command release with the objects handle, left side and right side

• 30 times the command attach with the object pairs handle + left side, handle + right side and left side + right side

Table 5.1 enumerates key figures of the speech recognition analysis. The measured times contain the recording and the transcription times per command. As the transcription using Google Cloud Speech is done gradually during the recording, further processing can start almost simultaneously with the completion of the utterance. The speech recognition time of Google Cloud Speech is thereby an indication of the length of the speech recording per command. Consequently, on average the whole speech recognition with CMU Sphinx takes more than 2.5 times as long as just the recording with Google Cloud Speech. Using Google Cloud Speech thereby saves on average 8.64


Table 5.1: Recording and transcription times (in seconds) for the commands of the toolbox domain.

            Google Cloud Speech             CMU Sphinx
Command     Average  Minimum  Maximum       Average  Minimum  Maximum
Give           4.94     3.85     7.42         12.71     5.22    39.91
Release        5.06     3.64     6.95         13.22     8.17    37.36
Attach         6.32     5.18     8.05         16.30     7.42    44.39
Overall        5.44     3.64     8.05         14.08     5.22    44.39

Table 5.2: Recording and transcription times (in seconds) for the answers in the toolbox domain.

            Google Cloud Speech             CMU Sphinx
            Average  Minimum  Maximum       Average  Minimum  Maximum
Answer         3.56     2.72     6.14          9.44     4.25    42.98

seconds per command. The same trend is visible in the average answer recognition shown in table 5.2. The measured times for the answer recognition are alike for all commands, as all answers consist of single words. For this reason table 5.2 gives the cumulated values only.

The boxplots in figure 5.1 clearly indicate the tendency of the measured times for the online and offline speech recognition implementations. Notably, the measured times of Google Cloud Speech lie in a much smaller range than the ones of CMU Sphinx. Besides, there are also more outliers for CMU Sphinx. For both implementations the command attach, which involves two objects, has a significantly longer speech recording and processing time, whereas shorter commands and answers are recognized faster.

Concerning speech recognition and semantic parsing accuracy, Google Cloud Speech performed better than CMU Sphinx: Out of the 100 utterances per speech recognition implementation, 33 of the utterances with Google Cloud Speech needed ambiguity resolving. In contrast, when using CMU Sphinx, 61 utterances needed ambiguity resolving. Hence ambiguity resolving turned out to be essential for both speech recognition implementations. The reasons are manifold and may also lie in insufficient semantic parsing capabilities. Concerning speech recognition, though, there are two major sources of errors:


Figure 5.1: Comparison of the speech recognition processing times of Google Cloud Speech and CMU Sphinx in the toolbox domain experiment.


For both implementations, the start and stop of the speech recording are controlled by an energy threshold: the recording stops when the microphone input stays below this threshold for a certain period of time. If there is a longer pause between two words, the recognition therefore gets prematurely finalized.
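In the SpeechRecognition library this behavior is governed by, among others, the recognizer's pause_threshold. Raising it, a possible but untested mitigation, makes premature finalization less likely at the cost of a longer wait after each utterance:

    # Possible (untested) mitigation via the SpeechRecognition library:
    # tolerate longer pauses between words before the recording is closed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    recognizer.pause_threshold = 1.2        # seconds of silence ending a phrase (default 0.8)
    recognizer.non_speaking_duration = 0.8  # silence kept at both ends of the recording

    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrates the energy threshold
        audio = recognizer.listen(source)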

Speech transcription errors are the second major source of errors. The word "handle" turned out to be the most wrongly recognized word in this domain, for CMU Sphinx as well as for Google Cloud Speech: Possible transcriptions were handler, hand, handbook, hamptons, handover, handout for CMU Sphinx and hand, panda, honda for Google Cloud Speech. Such errors usually do not propagate into the syntactic parsing and affect the recognition of the remaining command the way the following examples do: "If we the left side" [give me] or "and catch them left side to the right side" [attach the] (CMU Sphinx); "Grasp a screw place" [please] and "screw to handle on the right side" [the] (Google Cloud Speech).

Experiment with the blockworld domain

For the blockworld domain the experiment was conducted using solely Google Cloud Speech. As the previous part of the experiment had already revealed longer processing times for CMU Sphinx and the expected command length is about twice as long, CMU Sphinx was not taken into account. For human-robot collaboration, which is the goal of this work, a long processing time on the robot's part is unsatisfactory. Moreover, a short test resulted in too many transcription errors.

In the blockworld domain there is only one command defined, which needs one object and two location specifications. This command was executed 30 times with diverse object and location combinations. Table 5.3 enumerates key figures of the speech recognition analysis for command and answer utterances. The speech recognition time of the command is about twice the average speech recognition time in the toolbox domain, which is 5.44 seconds. The answer processing times of the blockworld domain, cf. table 5.3, are similar to the times measured in the toolbox domain experiment, cf. table 5.2. Obviously there is a high correlation between the length of the utterance and the speech recognition time. Figure 5.2 demonstrates this effect by showing the speech recognition times of the utterances of this command in box plots alongside the speech recognition times from the toolbox experiment conducted with Google Cloud Speech.

From the results, 18 out of the 30 command utterances required ambiguity resolving, which implies that 40% of the utterances were immediately understood. The rate is worse than in the toolbox domain, but the semantic parsing in this domain has more mapping necessities and possible mapping options. Frequently one of the two locations was not correctly mapped. The most likely reasons are insufficient semantic parsing


Table 5.3: Recording and transcription times (in seconds) for the command and answer utterances in the blockworld domain.

                      Google Cloud Speech
                      Average  Minimum  Maximum
Transfer (command)       9.89     6.19    12.62
Answer                   4.38     3.52     5.93

Figure 5.2: Box plots of the speech recognition times of the blockworld domain command transfer conducted using Google Cloud Speech, in comparison with the speech recognition times from the toolbox domain experiment also conducted with Google Cloud Speech.


capabilities or grammatical errors. Transcription errors that arose were, for example, "carry the red triangle from the left area to the malaria" [middle area] or "pick the green square from area 2 and place it on area ..." [area 0]. The second one is an example of premature finalization of the recognition.

5.3.2 Communication system benefit analysis

The experiment has already shown that ambiguity resolving is essential. The analysis of the usage of the communication system for ambiguity resolving lies within the scope of this section. The speech recognition implementations and application domains were treated jointly to enable comparisons amongst them.

The questioning strategies asking to rephrase and asking targeted were used extensively with both Google Cloud Speech and CMU Sphinx. The threshold for asking to rephrase was set to two gaps in the toolbox domain and three gaps in the blockworld domain.

Table 5.4 gives an overview of the usage of the communication system per conducted experiment. The numbers correlate with the findings from the speech recognition analysis: Within the Google Cloud Speech experiment there is a much higher recognition rate. This results in less than half as many necessary iterations as within the experiment with CMU Sphinx. The eight times higher number of rephrases can be explained by error propagation into the syntactic parsing, which happened very frequently with CMU Sphinx. For the same reason the complementary function, which enables the human to stop and restart the language understanding process, was also used more often in the experiment with CMU Sphinx.

Table 5.5 lists the causes for the necessary iterations enumerated in table 5.4. When the object was the cause for one or more iterations, the object in question was very often the word "handle". In the experiment involving the blockworld domain, the action and objects were transcribed correctly nearly 100% of the time, but insufficient location mapping caused demands for ambiguity resolving in 18 out of 30 utterances (60%). It can be assumed that the recognition accuracy depends highly on the words occurring in the domain. A less abstract domain with common words may have a positive influence on the recognition accuracy. An indication for this assumption is the small number of actions causing ambiguity resolving iterations, compared to the number of objects.

Despite ambiguity resolving, some commands were incorrectly mapped. Table 5.6 enumerates the correct and incorrect mappings per command and experiment. Altogether seven commands were wrongly mapped while using CMU Sphinx and only one while using Google Cloud Speech. On the other hand, this also means 93% correct mappings when using CMU Sphinx and 99% correct mappings when using Google Cloud


Table 5.4: Usage of the communication system functions ask to rephrase, ask targeted (AT) and the human intervention to stop the current mapping and restart from the beginning (Human stop).

Experiment              Command (Count)   Rephrase  Human stop  Iterations of AT
                                             Count       Count  Count  Maximum
CMU Sphinx,             Give (40)               10           1     24        7
Toolbox domain          Release (30)            17           2     32        8
                        Attach (30)              5           2     54        9
                        Overall (100)           32           5    110        9
Google Cloud Speech,    Give (40)                1           0     19        5
Toolbox domain          Release (30)             3           0     12        4
                        Attach (30)              0           1     19        3
                        Overall (100)            4           1     50        5
Google Cloud Speech,    Transfer (30)            0           0     21        2
Blockworld domain

Table 5.5: Number of different causes for iterations per command triggered by ask targeted.

Experiment              Command (Count)              Cause for iteration
                                         Action  Object  Location_from  Location_to
CMU Sphinx,             Give (40)             2      12              -            -
Toolbox domain          Release (30)          3      13              -            -
                        Attach (30)           1      21              -            -
                        Overall (100)         6      46              -            -
Google Cloud Speech,    Give (40)             4       8              -            -
Toolbox domain          Release (30)          1       5              -            -
                        Attach (30)           6       9              -            -
                        Overall (100)        11      22              -            -
Google Cloud Speech,    Transfer (30)         0       1              7           12
Blockworld domain


Table 5.6: Benefit of the communication system on the command understanding.

Experiment              Command (Count)        Mapping         Cause
                                         Correct  Incorrect
CMU Sphinx,             Give (40)             39          1    action
Toolbox domain          Release (30)          28          2    action, object
                        Attach (30)           26          4    action
                        Overall (100)         93          7
Google Cloud Speech,    Give (40)             40          0    -
Toolbox domain          Release (30)          29          1    action
                        Attach (30)           30          0    -
                        Overall (100)         99          1
Google Cloud Speech,    Transfer (30)         30          0    -
Blockworld domain

Speech in the toolbox domain. In the blockworld domain, using Google Cloud Speech even led to 100% correctly mapped commands.

In general, causes for incorrect mapping can be found either in recognition errors or on the human side. Two reasons for the latter are: First, the commanding human makes grammatical errors; second, the human uses word meanings differently than implemented in WordNet. Wrong mappings for the second reason arise when the measured word similarity is above the similarity measurement threshold. For all experiments the similarity measurement threshold was set to 0.1. This is a trade-off between taking less similar words into account and rejecting marginally similar words. The dissimilarity value is 0 in the utilized JCN measurement, cf. section 2.1.3, but words that score below 0.1 are only very loosely related.

An example of such a wrong mapping occurred in the toolbox domain, when the verb "drop" was mapped to "give" instead of "release". The word similarity measurement JCN gave 0.176 as the similarity value between "drop" and "give", while the similarity value between "drop" and "release" was determined as 0.150. To overcome such issues, one option is to revise the domain lexicon. Another option is to take into account learning new words by informing the robot about incorrect mappings.


Table 5.7: Processing time (in seconds) of semantic parsing per experiment.

Experiment                       Syntactic parsing  Mapping   Sum
Toolbox, CMU Sphinx                           1.67     2.28  3.95
Toolbox, Google Cloud Speech                  1.67     2.61  4.28
Blockworld, Google Cloud Speech               2.22     2.74  4.96

5.3.3 Processing time of semantic parsing

In this section the processing time of semantic parsing is the aspect under consideration. Two main functions are evaluated here, the syntactic parsing and the mapping. The mapping includes the filling of the meaning representation and the word similarity measurements. Table 5.7 lists the processing times per experiment. The length of a command has a notable impact on the syntactic parsing time, but the impact on the mapping phase is very small. The translation from the meaning representation to the command skeleton takes, for all commands and in all experiments, on average around 10^-5 seconds. This is negligible for the overall processing time.

Using Google Cloud Speech, the delay until the command is understood is four to five seconds. The whole process from command to execution takes, including the speech utterance, around ten seconds in the toolbox domain and around 15 seconds in the blockworld domain. The experiment with CMU Sphinx in the toolbox domain takes, at 18 seconds, nearly twice as long as the experiment in the same domain with Google Cloud Speech. The reason is the 14 seconds needed on average for speech recognition.


6 Conclusions

This chapter has two parts. First, the discussion gives insight into the evaluation results and compares key aspects of this work to other approaches. The chapter then finishes with a brief summary and limitations section.

6.1 Discussion

This section begins with the discussion of the evaluation results. Subsequently, the evaluation results as well as the entire approach of this work are contemplated with respect to alternatives from the state of the art. Finally, insight is given into general topics, into ideas which were discarded and into new ideas which arose thereof.

6.1.1 Results

The experiments conducted for the evaluation proved the applicability of the designed workflow on several points. The evaluation aimed at the quality of the command understanding, the benefit of ambiguity resolving, the influence of the speech recognition and the processing times.

Considering NLU of spoken commands without ambiguity resolving, only - depending on the domain - 67% respectively 40% of the commands were understood entirely correctly. This revealed the necessity of ambiguity resolving. Adding the questioning system for ambiguity resolving enhanced the understanding to 99% respectively 100% of entirely correct commands, cf. table 6.1.

The reason for the need for ambiguity resolving lies in the natural language input and the speech recognition: Perfect input and transcription cannot be guaranteed, though syntactic parsing relies on the correctness of the grammar. Inadvertently ungrammatical sentences, which occurred frequently in the experiments, led to faulty dependency trees; and these resulted in faulty or incomplete mappings. Enquiries by the robot provided a satisfactory solution for this task.


Table 6.1: Understanding of commands with and without ambiguity resolving (AR) per experiment in percent.

                                  Correct understanding
Experiment                        without AR  with AR
Toolbox, CMU Sphinx                      39%      93%
Toolbox, Google Cloud Speech             67%      99%
Blockworld, Google Cloud Speech          40%     100%

The speech recognition implementation turned out to have a major influence on the accuracy as well as on the processing time. One implementation was clearly superior to the other one tested on the same domain: The share of correctly understood sentences reached with Google Cloud Speech is the above-named 67%. By contrast, the implementation with CMU Sphinx reached 39%, as shown in table 6.1. The former implementation also has, with less than half as many iterations, a lower necessity for ambiguity resolving. Concerning processing time, Google Cloud Speech saves on average nine seconds through simultaneous speech recording and transcription.

The general processing time is, at four to five seconds - depending on the length and complexity of the command -, within a reasonable limit for human-robot communication. But of course speed improvements are possible, as the implementation has not yet been optimized for fast execution. Comparisons to other work concerning processing speed cannot be made, as run time was not reported in any work studied for this thesis.

[Bas+14] is the only work studied here which uses spoken language in the form of audio files instead of written commands for evaluation. They report speech recognition as the major source of error along their workflow, which is observed here too. Their measured word transcription error rate using the Google API lies between 17.6% and 37.9%, depending on the dataset. As a consequence the resulting recognition ranges between 21% and 80%, depending on the dataset. When written commands were used for evaluation, the reported recognition accuracy was higher: [EB14] reports 86% correctly understood sentences and [Tel+11b] reports an average recognition accuracy of 85.2% over all four mapping options (SDCs). Ambiguity resolving was not taken into account in any of the above works. Here the number of correctly understood sentences was significantly improved by ambiguity resolving. It is remarkable that the recognition accuracy per command reaches 99% to 100%, even in consideration of the smaller domains with fewer commands studied here. On that basis the system designed in this work should be competitive with the other studied approaches for human-robot collaboration.


6.1.2 Alternatives

Considered as a whole, the semantic parsing approach with the ambiguity resolving system creates a robust system for human-robot collaboration. Nevertheless there are certain aspects of the designed system which could have been designed differently.

The NLU capabilities concerning syntactic parsing, as they are implemented now, might reach their limits when the commands become more complex. Up to now the results taken from the syntactic parsing are not used to their full extent. The mapping is limited to the recognition of actions, objects and locations. Comparable systems, like [Tel+11b], have broader understanding capabilities; they, for example, include path grounding for moving robots. Reaching or even exceeding their understanding capabilities would lead to more flexibility concerning possible applications. The work of [ETF16] exceeds the NLU capabilities of [Tel+11b] by mapping conditional statements. The described NLU capabilities can be realized using dependency parsing with universal dependencies as output.

By applying learning to extract the right mappings from the universal dependencies, an improvement concerning the flexibility in terms of natural language input could be achieved. Such an approach, on the downside, requires a lot of suitable training data, as in the grammar-based approach of [Mat+13]. The current approach of this work could be adapted to learned rules at a certain point, while keeping its advantage of overcoming domain restriction. The approach has the advantage of separately handling syntactic relations: The desired learning would not affect the construction of the dependency tree, but only the mapping between the dependency tree and the meaning representation.

Concerning ambiguity resolving, there are two appealing alternatives. First, the usage of inverse semantics for answer generation instead of template-based questions, as in [Tel+14]. Second, the usage of recurrent neural networks for slot filling instead of a dialog system, as in [Mes+15]. Slot filling through assumptions based on the world state could improve the recognition rate and reduce the necessity of question-answer cycles.

6.1.3 Insights

During this work, regular expressions on the basis of POS-tagged commands were applied for semantic parsing, cf. [CM14a]. This approach was dismissed and replaced by the better-suited dependency parsing. Regular expressions should nevertheless be kept in mind for applications like the mapping of time specifications.


Compared to the usage of written commands, the usage of spoken commands adds a layer of uncertainty. When designing an NLU system, an important point to consider is the speech recognition and the error propagation into the semantic parsing. The semantic parsing implementation needs to be robust against speech recognition errors.

Generally speaking, there are a lot of open questions to solve until human-robot collaboration works on a similar level as human-human collaboration. Giving a robot the human ability to use natural language is of course a key point. But, inter alia, verbal and non-verbal communication, knowledge handling, reasoning and learning capabilities need to be enhanced to approach this state.

6.2 Summary and limitations

The goal of this thesis was to design a workflow for an NLU system which is supported by a questioning system for ambiguity resolving. The approach should suit the needs of human-robot collaboration in terms of communication facilities and processing time. The constructed model confirmed its suitability and applicability during implementation and experimental evaluation. Overall the approach achieved 99% respectively 100% correctly understood natural language commands in the two sample domains using the more accurate and faster speech processing implementation.

Dependency parsing, with universal dependencies as output, proved to be a promising approach judging by the reached recognition accuracy. Nevertheless this approach has limitations concerning its natural language understanding and communication capabilities. The approach could be enhanced by extending the recognition: The mapping of conditions, path descriptions for moving robots and time specifications would increase the application possibilities substantially. Currently, slot filling is done solely by question answering. Extending slot filling by drawing conclusions on the basis of the world state would enhance the usability through fewer question-answer cycles. A universally applicable natural language understanding and communication system for human-robot collaboration is still a vision for the future. But the design proposed in this work is, despite its limitations, already capable of satisfying essential aspects.



Bibliography

[ALZ15] Y. Artzi, K. Lee, L. Zettlemoyer. "Broad-coverage CCG semantic parsing with AMR." In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 1699–1710 (cit. on p. 27).

[AZ13] Y. Artzi, L. Zettlemoyer. "Weakly supervised learning of semantic parsers for mapping instructions to actions." In: Transactions of the Association for Computational Linguistics 1 (2013), pp. 49–62 (cit. on p. 27).

[Bas+14] E. Bastianelli, G. Castellucci, D. Croce, R. Basili, D. Nardi. "Effective and Robust Natural Language Understanding for Human-Robot Interaction." In: ECAI. 2014, pp. 57–62 (cit. on pp. 27, 41, 56).

[BH06] A. Budanitsky, G. Hirst. "Evaluating WordNet-based measures of lexical semantic relatedness." In: Computational Linguistics 32.1 (2006), pp. 13–47 (cit. on p. 35).

[BWB13] A. Ballatore, D. C. Wilson, M. Bertolotto. "Computing the semantic similarity of geographic terms using volunteered lexical definitions." In: International Journal of Geographical Information Science 27.10 (2013), pp. 2099–2118 (cit. on p. 35).

[CM14a] A. X. Chang, C. D. Manning. TokensRegex: Defining cascaded regular expressions over tokens. Tech. rep. CSTR 2014-02. Department of Computer Science, Stanford University, 2014 (cit. on p. 57).

[CM14b] D. Chen, C. D. Manning. "A Fast and Accurate Dependency Parser using Neural Networks." In: EMNLP. 2014, pp. 740–750 (cit. on p. 17).

[Dei+13] R. Deits, S. Tellex, P. Thaker, D. Simeonov, T. Kollar, N. Roy. "Clarifying commands with information-theoretic human-robot dialog." In: Journal of Human-Robot Interaction 2.2 (2013), pp. 58–79 (cit. on pp. 28, 29, 36, 37).

[EB14] K. Evang, J. Bos. "RoBox: CCG with Structured Perceptron for Supervised Semantic Parsing of Robotic Spatial Commands." In: SemEval 2014 (2014), p. 482 (cit. on pp. 27, 56).


[ETF16] M. Eppe, S. Trott, J. Feldman. "Exploiting deep semantics and compositionality of natural language for Human-Robot-Interaction." In: Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE. 2016, pp. 731–738 (cit. on pp. 26, 57).

[Fel06] C. Fellbaum. "WordNet(s)." In: K. Brown, A. H. Anderson, L. Bauer, M. S. Berns, J. E. Miller, G. Hirst. Encyclopedia of language & linguistics. Second Edition. Vol. 13. Oxford: Elsevier, 2006, pp. 665–670 (cit. on p. 20).

[Fer+10] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. "Building Watson: An overview of the DeepQA project." In: AI Magazine 31.3 (2010), pp. 59–79 (cit. on p. 13).

[Fil76] C. J. Fillmore. "Frame semantics and the nature of language." In: Annals of the New York Academy of Sciences 280.1 (1976), pp. 20–32 (cit. on p. 27).

[Fri12] M. Frické. Logic and the Organization of Information. Springer New York, 2012. ISBN: 9781461430889 (cit. on p. 20).

[GR14] D. Goldwasser, D. Roth. "Learning from natural instructions." In: Machine Learning 94.2 (2014), pp. 205–232. DOI: 10.1007/s10994-013-5407-y (cit. on p. 26).

[Hel06] H. Helbig. Knowledge Representation and the Semantics of Natural Language. Cognitive Technologies. Springer-Verlag Berlin Heidelberg, 2006. ISBN: 978-3-540-29966-0. DOI: 10.1007/3-540-29966-1 (cit. on p. 25).

[JC97] J. J. Jiang, D. W. Conrath. "Semantic similarity based on corpus statistics and lexical taxonomy." In: Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X). Taiwan, 1997, pp. 19–33 (cit. on pp. 20, 34, 42).

[Koe15] P. Koehn. Lecture notes on Natural Language Processing. 2015 (cit. on pp. 16, 17).

[Kol+10] T. Kollar, S. Tellex, D. Roy, N. Roy. "Toward understanding natural language directions." In: 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE. 2010, pp. 259–266 (cit. on p. 32).

[Kol+13] T. Kollar, S. Tellex, M. R. Walter, A. Huang, A. Bachrach, S. Hemachandra, E. Brunskill, A. Banerjee, D. Roy, S. Teller, et al. "Generalized grounding graphs: A probabilistic framework for understanding grounded language." In: (2013) (cit. on pp. 14, 33).

[KTF15] H. Khayrallah, S. Trott, J. Feldman. "Natural Language For Human Robot Interaction." In: Proceedings of the Workshop on Human-Robot Teaming at the 10th ACM/IEEE International Conference on Human-Robot Interaction, Portland, Oregon. 2015 (cit. on pp. 26, 27).


[Lis15] P. Lison. "A hybrid approach to dialogue management based on probabilistic rules." In: Computer Speech & Language 34.1 (2015), pp. 232–255 (cit. on p. 28).

[Mat+13] C. Matuszek, E. Herbst, L. Zettlemoyer, D. Fox. "Learning to parse natural language commands to a robot control system." In: Experimental Robotics. Springer. 2013, pp. 403–415 (cit. on pp. 25, 27, 57).

[Mes+15] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, et al. "Using recurrent neural networks for slot filling in spoken language understanding." In: IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23.3 (2015), pp. 530–539 (cit. on p. 57).

[MF07] G. A. Miller, C. Fellbaum. "WordNet then and now." In: Language Resources and Evaluation 41.2 (2007), pp. 209–214 (cit. on p. 33).

[MHG13] L. Meng, R. Huang, J. Gu. "A review of semantic similarity measures in WordNet." In: International Journal of Hybrid Information Technology 6.1 (2013), pp. 1–12 (cit. on p. 20).

[Niv+16] J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, et al. "Universal dependencies v1: A multilingual treebank collection." In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016, pp. 1659–1666 (cit. on p. 18).

[Par13] P. Parente. pyttsx - Text-to-speech x-platform. 2013. URL: https://pyttsx.readthedocs.io/en/latest/ (cit. on p. 41).

[Per+15] A. Perzylo, S. Griffiths, R. Lafrenz, A. Knoll. "Generating grammars for natural language understanding from knowledge about actions and objects." In: 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE. 2015, pp. 2008–2013 (cit. on p. 27).

[Pla17] Google Cloud Platform. Google Cloud Speech API. 2017. URL: https://cloud.google.com/speech/ (cit. on p. 41).

[Red+16] S. Reddy, O. Täckström, M. Collins, T. Kwiatkowski, D. Das, M. Steedman, M. Lapata. "Transforming Dependency Structures to Logical Forms for Semantic Parsing." In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 127–140 (cit. on p. 27).

[RLS14] S. Reddy, M. Lapata, M. Steedman. "Large-scale semantic parsing without question-answer pairs." In: Transactions of the Association for Computational Linguistics 2 (2014), pp. 377–392 (cit. on p. 27).


[Tel+11a] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. Teller, N. Roy. "Approaching the symbol grounding problem with probabilistic graphical models." In: AI Magazine 32.4 (2011), pp. 64–76 (cit. on pp. 28, 29, 32).

[Tel+11b] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. Teller, N. Roy. "Interpreting Robotic Mobile Manipulation Commands Expressed in Natural Language." In: ICRA 2011 Workshop: Manipulation Under Uncertainty. 2011 (cit. on pp. 56, 57).

[Tel+13] S. Tellex, R. A. Knepper, A. Li, T. M. Howard, D. Rus, N. Roy. "Assembling furniture by asking for help from a human partner." In: 2013 (cit. on p. 28).

[Tel+14] S. Tellex, R. Knepper, A. Li, D. Rus, N. Roy. "Asking for help using inverse semantics." In: Proceedings of Robotics: Science and Systems, Berkeley, USA (2014) (cit. on pp. 28, 57).

[Tho+15] J. Thomason, S. Zhang, R. J. Mooney, P. Stone. "Learning to Interpret Natural Language Commands through Human-Robot Dialog." In: IJCAI. 2015, pp. 1923–1929 (cit. on pp. 28, 29, 37).

[Tou+16] M. Toussaint, T. Munzer, Y. Mollard, L. Y. Wu, N. A. Vien, M. Lopes. "Relational activity processes for modeling concurrent cooperation." In: Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE. 2016, pp. 5505–5511 (cit. on pp. 22, 42).

[Yam13] R. V. Yampolskiy. "Turing test as a defining feature of AI-completeness." In: Artificial intelligence, evolutionary computing and metaheuristics. Springer, 2013, pp. 3–17 (cit. on p. 13).

[Zha17] A. Zhang. SpeechRecognition 3.6.0. 2017. URL: https://pypi.python.org/pypi/SpeechRecognition/ (cit. on p. 41).

[Car16] Carnegie Mellon University. CMU Sphinx. 2016. URL: http://cmusphinx.sourceforge.net/ (cit. on p. 41).

[NLT15] NLTK Project. Natural Language Toolkit. 2015. URL: http://www.nltk.org/ (cit. on p. 41).

[Pri10] Princeton University. About WordNet. 2010. URL: http://wordnet.princeton.edu (cit. on pp. 19–21, 34, 41).

[Sta17] Stanford NLP Group. Stanford CoreNLP – a suite of core NLP tools. 2017. URL: http://stanfordnlp.github.io/CoreNLP/ (cit. on p. 41).

All links were last followed on March 14, 2017.


Declaration

I hereby declare that the work presented in this thesis is entirely my own and that I did not use any other sources and references than the listed ones. I have marked all direct or indirect statements from other sources contained therein as quotations. Neither this work nor significant parts of it were part of another examination procedure. I have not published this work in whole or in part before. The electronic copy is consistent with all submitted copies.

place, date, signature