Transcript of arXiv:2005.02233v2 [cs.CL] 11 Dec 2020

A Survey on Dialog Management: Recent Advances and Challenges

Yinpei Dai†, Huihua Yu‡, Yixuan Jiang‡, Chengguang Tang†, Yongbin Li†, Jian Sun†
†Alibaba Group, Beijing  ‡Cornell University

{yinpei.dyp,chengguang.tcg,shuide.lyb,jian.sun}@alibaba-inc.com

Abstract

Dialog management (DM) is a crucial component in a task-oriented dialog system. Given the dialog history, DM predicts the dialog state and decides the next action that the dialog agent should take. Recently, dialog policy learning has been widely formulated as a Reinforcement Learning (RL) problem, and more work has focused on the applicability of DM. In this paper, we survey recent advances and challenges within three critical topics for DM: (1) improving model scalability to facilitate dialog system modeling in new scenarios, (2) dealing with the data scarcity problem for dialog policy learning, and (3) enhancing the training efficiency to achieve better task-completion performance. We believe that this survey¹ can shed light on future research in dialog management.

1 Introduction

Many efforts have been made to develop highly intelligent human-machine dialog systems since research began on artificial intelligence (AI). Alan Turing proposed the Turing test in 1950 (Turing, 2009). He believed that machines could be considered highly intelligent if they passed the Turing test. To pass this test, the machine had to communicate with a real person so that this person believed they were talking to another person. The first-generation dialog systems were mainly rule-based. For example, the ELIZA system (Weizenbaum, 1966) developed by MIT in 1966 was a psychological medical chatbot that matched user inputs against hand-written templates. The flowchart-based dialog system popular in the

¹This survey is translated by the Alibaba Cloud International Team based on https://developer.aliyun.com/article/741450.

1970s simulates state transitions in the dialog flow based on the finite state automaton (FSA) model. These systems have transparent internal logic and are easy to analyze and debug. However, they are less flexible and scalable due to their high dependency on expert intervention.

Second-generation dialog systems driven by statistical data (hereinafter referred to as statistical dialog systems) emerged with the rise of big data technology. At that time, reinforcement learning was widely studied and applied in dialog systems. A representative example is the statistical dialog system based on the Partially Observable Markov Decision Process (POMDP) proposed by Professor Steve Young of Cambridge University in 2005 (Young et al., 2013). This system is significantly superior to rule-based dialog systems in terms of robustness. It maintains the state of each round of dialog through Bayesian inference based on speech recognition results and then selects a dialog policy based on the dialog state to generate a natural language response. With a reinforcement learning framework, the POMDP-based dialog system constantly interacts with user simulators or real users to detect errors and optimize the dialog policy accordingly. A statistical dialog system is a modular system not highly dependent on expert intervention. However, it is less scalable, and the model is difficult to maintain.

In recent years, with breakthroughs in deep learning in the image, voice, and text fields, third-generation dialog systems built around deep learning have emerged. These systems still adopt the framework of statistical dialog systems, but apply a neural network model in each module. Neural network models have powerful representation and language classification and generation capabilities. Therefore, models based on natural language have shifted from generative models, such as Bayesian networks, to deep discriminative models, such as Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), and Recurrent Neural Networks (RNNs) (Wen et al., 2016). The dialog state is obtained by directly calculating the maximum conditional probability instead of the Bayesian a posteriori probability. Deep reinforcement learning models are also used to optimize the dialog policy (Su et al., 2017). In addition, the success of end-to-end sequence-to-sequence technology in machine translation makes end-to-end dialog systems possible. Facebook researchers proposed a task-oriented dialog system based on memory networks (Bordes et al., 2016), presenting a new way forward in the research of end-to-end task-oriented dialog systems within third-generation dialog systems. In general, third-generation dialog systems are better than second-generation dialog systems, but a large amount of tagged data is required for effective training. Therefore, improving the cross-domain transferability and scalability of models has become an important area of research.

Common dialog systems are divided into three types: chatting systems, task-oriented dialog systems, and QA systems. In a chatting system, the system generates interesting and informative natural responses to allow the human-machine dialog to proceed (Serban et al., 2017).

In a QA system, the system analyzes each question and finds a correct answer from a candidate set (Berant et al., 2013). A task-oriented dialog (hereinafter referred to as a task dialog) is a task-driven multi-round dialog. The machine determines the user's requirements through understanding, active inquiry, and clarification, makes queries by calling an Application Programming Interface (API), and returns the correct results. Generally, a task dialog is a sequential decision-making process. During the dialog, the machine updates and maintains the internal dialog state by understanding user statements and then selects the optimal action based on the current dialog state, such as determining the requirement, querying restrictions, or providing results.

Task-oriented dialog systems are divided by architecture into two categories. One type is a pipeline system with a modular structure (Wen et al., 2016), as shown in Figure 1. It consists of four key modules (a minimal sketch of the full pipeline follows the list):

• Natural Language Understanding (NLU): It identifies and parses a user's text input to obtain semantic tags that can be understood by computers, such as slot-values and intents.

• Dialog State Tracking (DST): It maintains the current dialog state based on the dialog history. The dialog state is the cumulative meaning of the dialog history, which is generally expressed as slot-value pairs.

• Dialog Policy: It outputs the next system action based on the current dialog state. The DST module and the dialog policy module are collectively referred to as the dialog manager (DM).

• Natural Language Generation (NLG): It converts system actions to natural language output.
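To make the division of labor concrete, here is a minimal sketch of the pipeline data flow; every function body is an illustrative stub, and the names (nlu, dst, policy, nlg, DialogState) are ours rather than any specific toolkit's:

```python
from dataclasses import dataclass, field

@dataclass
class DialogState:
    # Cumulative meaning of the dialog so far, stored as slot-value pairs.
    slots: dict = field(default_factory=dict)
    intent: str = ""

def nlu(utterance: str) -> dict:
    """Parse the user utterance into semantic tags (intent, slot-values)."""
    # A trained classifier/tagger would run here; this stub is hard-coded.
    return {"intent": "inform", "slots": {"food": "italian"}}

def dst(state: DialogState, semantics: dict) -> DialogState:
    """Update the dialog state with the newly parsed semantics."""
    state.intent = semantics["intent"]
    state.slots.update(semantics["slots"])
    return state

def policy(state: DialogState) -> str:
    """Map the current dialog state to the next system action."""
    return "request(area)" if "area" not in state.slots else "inform(restaurant)"

def nlg(action: str) -> str:
    """Convert the system action into natural language."""
    return {"request(area)": "Which area are you looking for?"}.get(action, "OK.")

state = dst(DialogState(), nlu("I want italian food"))
print(nlg(policy(state)))  # DST + policy together form the dialog manager (DM)
```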

This modular system structure is highly interpretable, easy to implement, and used in most practical task-oriented dialog systems in the industry. However, this structure is not flexible enough: the modules are independent of each other and difficult to optimize jointly, which makes it hard to adapt to changing application scenarios. Additionally, due to the accumulation of errors across modules, upgrading a single module may require adjusting the whole system.

Another implementation of a task-oriented dialog system is an end-to-end system, which has been a popular field of academic research in recent years (Dhingra et al., 2016; Lei et al., 2018; Madotto et al., 2018) (Figure 2). This type of structure trains an overall mapping from the natural language input on the user side to the natural language output on the machine side. It is highly flexible and scalable, reducing labor costs for

Figure 1: Modular structure of a task-oriented dialog system (Gao et al., 2018).

design and removing the isolation between modules. However, the end-to-end model places high requirements on the quantity and quality of data and does not provide explicit modeling for processes such as slot filling and API calling. This model is still being explored and is as yet rarely applied in the industry. With higher requirements on product experience, actual dialog scenarios become more complex, and DM needs to be further improved. Traditional DM is usually built around a clear dialog script (searching for matching answers, querying the user intent, and then ending the dialog) with a pre-defined system action space, user intent space, and dialog ontology. However, due to unpredictable user behaviors, traditional dialog systems are less responsive and have greater difficulty dealing with undefined situations. In addition, many actual scenarios require a cold start without sufficient tagged dialog data, resulting in high data cleansing and tagging costs. DM based on deep reinforcement learning requires a large amount of data for model training: according to the experiments in many academic papers, hundreds of complete sessions are required to train a dialog model, which hinders the rapid development and iteration of dialog systems.

To address the limitations of traditional DM, researchers in academia and industry have begun to focus on strengthening the usability of DM. Specifically, they are working to address the following shortcomings in DM:

• Poor scalability

• Insufficient tagged data

• Low training efficiency

We will introduce the latest research results in terms of the preceding aspects.

2 Cutting-Edge Research on Dialog Management

2.1 Shortcoming 1: Poor Scalability

As mentioned above, DM consists of the DST and dialog policy modules. The most representative traditional DST is the neural belief tracker (NBT) proposed by scholars from Cambridge University in 2017 (Mrksic et al., 2016). NBT uses neural networks to track the state of complex dialogs in a single domain. By using representation learning, NBT encodes the system actions in the previous round, the user statements in the current round, and candidate slot-value pairs, computes their semantic similarity in a high-dimensional space, and detects the slot value expressed by the user in the current round. Therefore, NBT can identify slot values that are not in the training set but are semantically similar to those in the set by using the word vector representations of slot-value pairs. This avoids the need to create a semantic dictionary and allows the slot values to be extended. Later, Cambridge scholars further improved NBT (Ramadan et al., 2018; Weisz et al., 2018) by changing the input slot-value pair to a domain-slot-value triple. The recognition results of each round are accumulated using model learning instead of manual rules. All data is trained by the same model, and knowledge is shared among different domains, leaving the total number of parameters unchanged as the number of domains increases. Among traditional dialog policy research, the most representative is the ACER-based policy optimization proposed by

Figure 2: End-to-end structure of a task-oriented dialog system (Gao et al., 2018)

Cambridge scholars (Su et al., 2017; Weisz et al., 2018).

By applying the experience replay technique, the authors tried both the trust region actor-critic model and the episodic natural actor-critic model. The results proved that deep AC-based reinforcement learning algorithms were the best in terms of sample efficiency, algorithm convergence, and dialog success rate.

However, traditional DM still needs to be improved in terms of scalability, specifically in the following three respects:

1. How to deal with changing user intents.

2. How to deal with changing slots and values.

3. How to deal with changing system actions.

2.1.1 Changing User Intents

If a system does not take the user intent into account, it will often provide nonsensical answers. As shown in Figure 3, the user's "confirm" intent is not considered. A new dialog script must be added to help the system deal with this problem.

The traditional model outputs a fixed one-hot vector over the old intent categories. Once a new user intent not in the training set appears, the vectors need to be changed to include the new intent category, and the model needs to be retrained. This makes the model less maintainable and scalable. Wang et al. (2018) propose a teacher-student learning framework to solve this problem. In the teacher-student training architecture, the old model and logical rules for new user intents serve as the teacher, and the new model as the student.

Figure 3: Example of a dialog with a new intent (Wang et al., 2018)

This architecture uses knowledge distillation technology. Specifically, for the old intent set, the probability output of the old model directly guides the training of the new model. For the new intent, the logical rules are used as new tagged data to train the new model. In this way, the new model no longer needs to interact with the environment for re-training. The paper presents the results of an experiment performed on the DSTC2 dataset: the confirm intent is deliberately removed and then added back as a new intent to the dialog ontology to verify whether the new model is adaptable. Figure 4 shows the experiment results. The new model (Extended System), the model containing all intents (Contrast System), and the old model are compared. The results show that the new model achieves satisfactory success rates in identifying the extended new intent at different noise levels.
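Under our reading of the paper, such a teacher-student objective can be sketched as follows: a distillation term ties the student to the old model on the old intents, and a cross-entropy term learns the rule-labeled new intent. The function and argument names are illustrative, and details such as the temperature follow common distillation practice rather than the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, rule_labels, old_dim, T=2.0):
    """Sketch of the teacher-student objective for intent extension.

    new_logits : (batch, n_old + n_new) scores of the student (new model)
    old_logits : (batch, n_old) scores of the teacher (old model)
    rule_labels: (batch,) intent indices produced by the logical rules,
                 or -1 when the rule teacher does not fire
    old_dim    : number of old intent categories (n_old)
    """
    # Old intents: the old model's soft output guides the student (distillation).
    soft_teacher = F.softmax(old_logits / T, dim=-1)
    log_student = F.log_softmax(new_logits[:, :old_dim] / T, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T

    # New intents: rule outputs act as hard labels for a cross-entropy term.
    mask = rule_labels >= 0
    ce = F.cross_entropy(new_logits[mask], rule_labels[mask]) if mask.any() else 0.0
    return kd + ce
```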

Of course, systems with this architecture still need to be further trained. CDSSM, a semantic similarity matching model proposed in (Chen et al., 2016), can identify extended user intents without tagged data or model re-training. Based on the natural descriptions of user intents in the training set, CDSSM

Figure 4: Comparison of various models at different noise levels

directly learns an intent embedding encoder and embeds the description of any intent into a high-dimensional semantic space. In this way, the model directly generates the corresponding intent embedding based on the natural description of a new intent and then identifies that intent. Many models that improve scalability mentioned below are designed with similar ideas: tags are moved from the output end of the model to the input end, and neural networks perform semantic encoding on the tags (tag names or natural descriptions of the tags) to obtain semantic vectors, whose semantic similarity is then matched.
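A minimal sketch of this tags-as-input idea, with a toy bag-of-hashed-words encoder standing in for CDSSM (the real model trains a convolutional encoder; everything here is illustrative):

```python
import torch
import torch.nn.functional as F

def toy_encoder(texts, dim=64):
    """Stand-in encoder (bag of hashed words); CDSSM would be used in practice."""
    out = torch.zeros(len(texts), dim)
    for i, text in enumerate(texts):
        for word in text.lower().split():
            out[i, hash(word) % dim] += 1.0
    return out

def rank_intents(utterance, intent_descriptions, encoder=toy_encoder):
    """Score every intent, including unseen ones, by cosine similarity
    between the utterance embedding and the description embeddings."""
    u = F.normalize(encoder([utterance]), dim=-1)          # (1, d)
    c = F.normalize(encoder(intent_descriptions), dim=-1)  # (n_intents, d)
    return (u @ c.t()).squeeze(0)                          # one score per intent

scores = rank_intents("book me a table", ["reserve a restaurant", "play music"])
```

Because new intents enter only through their descriptions on the input side, adding one requires no change to the model's output layer.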

The work of Rajendran et al. (2019) provides another idea: through human-machine collaboration, human customer service agents deal with user intents not in the training set after the system is launched. The model uses an additional neural parser to determine whether human customer service is required based on the dialog state vector extracted from the current model. If it is, the model routes the current dialog to online customer service; if not, the model makes a prediction itself. The parser, obtained through data learning, can determine whether the current dialog contains a new intent, and responses from customer service are regarded as correct by default. This human-machine collaboration mechanism effectively deals with user intents not found in the training set during online testing and significantly improves the accuracy of the dialog.

2.1.2 Changing Slots and Slot Values

In dialog state tracking involving multiple or complex domains, dealing with changing slots

Figure 5: Concept tagger structure

and slot values has always been a challenge. Some slots have non-enumerative slot values, for example, time, location, and user name. Their slot value sets, such as flights or movie theater schedules, change dynamically. In traditional DST, the slot and slot value sets remain unchanged by default, which greatly reduces system scalability.

Google researchers (Rastogi et al., 2017) proposed using a candidate set for slots with non-enumerative slot values. A candidate set is maintained for each slot. It contains at most k possible slot values mentioned in the dialog and assigns a score to each slot value to indicate the user's preference for it in the current dialog. The system uses a bidirectional RNN model to find the value of a slot in the current user statement and then scores and re-ranks it together with the existing slot values in the candidate set. In this way, the DST of each round only needs to make a judgment on a limited slot value set, allowing us to track non-enumerative slot values. To track slot values not in the set, we can use a sequence tagging model (Mesnil et al., 2013) or a semantic similarity matching model such as the neural belief tracker (Mrksic et al., 2016). Some works (Dai et al., 2018; Weld et al., 2021) also use CRFs for open-ontology slot filling.
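A sketch of the candidate-set bookkeeping described above; the scoring model itself is abstracted away, and the rule for combining old and new scores (here, taking the maximum) is our simplification:

```python
def update_candidate_set(candidates, new_values, k=7):
    """Sketch of a per-slot candidate set in the spirit of Rastogi et al. (2017).

    candidates : dict mapping slot value -> score carried over from earlier turns
    new_values : dict mapping slot value -> score detected this turn
                 (a bidirectional RNN would produce these scores)
    Keeps at most k values so the DST only judges a bounded set each round.
    """
    for value, score in new_values.items():
        candidates[value] = max(candidates.get(value, 0.0), score)
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])   # re-ranked, pruned candidate set

cands = {"7 pm": 0.4}
cands = update_candidate_set(cands, {"8 pm": 0.9})   # "8 pm" now ranks first
```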

The preceding are solutions for non-fixed slot values, but what about changing slots in the dialog ontology? In the work of Bapna et al. (2017), a slot description encoder is used to encode the natural language descriptions of existing and new slots. The resulting semantic vectors representing the slots are fed, together with user statements, into a Bi-LSTM model, and the identified slot values are output as sequence tags, as shown in Figure 5. The paper makes the acceptable assumption that the

natural language description of any slot is easy to obtain. Therefore, a concept tagger applicable to multiple domains is designed, and the slot description encoder is simply implemented as the sum of word vectors. Experiments show that this model can quickly adapt to new slots. Compared with the traditional method, this method greatly improves scalability. Further works (Liu et al., 2020; Shah et al., 2019) have built on the concept tagger for zero-shot cross-domain slot filling.
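A rough PyTorch rendering of the concept tagger: the slot description is embedded as a sum of word vectors and tiled across the utterance, so adding a new slot requires only its description, not new parameters. Hyperparameters and the IOB output layer are illustrative:

```python
import torch
import torch.nn as nn

class ConceptTagger(nn.Module):
    """Sketch of the concept tagger (Bapna et al., 2017): the slot is fed
    in as an embedded description, so new slots need no new parameters."""
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim * 2, dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(dim * 2, 3)    # IOB tags: Outside/Begin/Inside

    def forward(self, utterance_ids, slot_desc_ids):
        tokens = self.emb(utterance_ids)                    # (B, T, d)
        # Slot description encoder: a simple sum of word vectors, as in the paper.
        slot_vec = self.emb(slot_desc_ids).sum(dim=1)       # (B, d)
        slot_vec = slot_vec.unsqueeze(1).expand_as(tokens)  # tile over time steps
        h, _ = self.lstm(torch.cat([tokens, slot_vec], dim=-1))
        return self.out(h)                                  # per-token IOB logits

tagger = ConceptTagger(vocab=1000)
logits = tagger(torch.randint(0, 1000, (2, 12)), torch.randint(0, 1000, (2, 4)))
```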

With the development of sequence-to-sequence technology in recent years, many researchers are looking at ways to use end-to-end neural network models to generate the DST results as a sequence. Common techniques such as attention mechanisms and copy mechanisms are used to improve the generation quality. On the famous MultiWOZ dataset for multi-domain dialogs, the team led by Professor Pascale Fung from Hong Kong University of Science and Technology used a copy network to significantly improve the recognition accuracy of non-enumerative slot values (Wu et al., 2019a). Figure 6 shows the TRADE model proposed by the team. Each time a slot value is detected, the model performs semantic encoding on the combination of domain and slot and uses the result as the initial input of the RNN decoder. The decoder directly generates the slot value through the copy network. In this way, both non-enumerative slot values and changing slot values can be generated by the same model. Therefore, slot values can be shared between domains, allowing the model to be widely applied. Built upon the recently developed powerful transformers, several works (Mehri and Eskenazi, 2021; Lee et al., 2021) propose using GPT for slot-filling generation.
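The core of such copy-based generation is mixing a vocabulary distribution with an attention distribution over the dialog history, as in the pointer-generator sketch below (the computation of p_gen and the attention itself are omitted; the tensor shapes follow common practice rather than TRADE's exact code):

```python
import torch

def copy_generate(vocab_probs, attn_weights, src_token_ids, p_gen):
    """Sketch of the pointer/copy mixture used by TRADE-style decoders.

    vocab_probs  : (B, V) generation distribution over the vocabulary
    attn_weights : (B, T) attention over dialog-history tokens
    src_token_ids: (B, T) vocabulary ids of those history tokens
    p_gen        : (B, 1) probability of generating rather than copying
    """
    copy_probs = torch.zeros_like(vocab_probs)
    # Scatter the attention mass onto the vocabulary ids it points at.
    copy_probs.scatter_add_(1, src_token_ids, attn_weights)
    return p_gen * vocab_probs + (1.0 - p_gen) * copy_probs
```

Because copying draws words directly from the dialog history, values never seen in training remain reachable.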

Recent research tends to view multi-domain DST as a machine reading comprehension task and transform generative models such as TRADE into discriminative models (Zhou and Small, 2019; Zhang et al., 2020; Yu et al., 2021). Non-enumerative slot values are tracked by an extractive reading comprehension task like SQuAD (Rajpurkar et al., 2018), in which a text span in the dialog history and questions is used as the slot value. Enumerative slot values are tracked by a multi-choice reading comprehension task, in which the correct value is selected from the candidate values as the predicted slot value. By incorporating deep contextualized word representations such as ELMo and BERT, these new models obtain state-of-the-art results on the MultiWOZ dataset. The work from PolyAI (Henderson and Vulic, 2020) also leverages pre-trained slot-filling models for few-shot adaptation.

2.1.3 Changing System Actions

The last factor affecting scalability is the difficulty of pre-defining the system action space. As shown in Figure 7, when designing an electronic product recommendation system, you may overlook questions like how to upgrade the product operating system, but you cannot stop users from asking questions the system cannot answer. If the system action space is pre-defined, irrelevant answers may be provided to questions that have not been defined, greatly compromising the user experience.

In this case, we need to design a dialog policy network that helps the system quickly expand its action set. The first attempt was made by Microsoft (He et al., 2015), who modified the classic DQN structure to enable reinforcement learning in an unrestricted action space. The dialog task in that paper is a text game mission, where each round's action is a single sentence, the number of candidate actions is uncertain, and the story varies with the chosen action. The authors proposed a new model, the Deep Reinforcement Relevance Network (DRRN), which matches the current dialog state with the optional system actions by semantic similarity to obtain the Q function. Specifically, in a round of dialog, each action text of uncertain length is encoded by a neural network into a fixed-length system action vector, and the story background text is encoded by another neural network into a fixed-length dialog state vector. The two vectors are combined through an interaction function, such as a dot product, to produce the final Q value. Figure 8 shows the structure of the model. Experiments show that DRRN outperforms traditional DQN (using the padding technique) in the text games "Saving John" and "Machine of Death".
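A compact sketch of the DRRN idea: separate networks embed the state text and each candidate action text, and Q(s, a) is their dot product, so the action set can change size freely from turn to turn (layer sizes and input features are illustrative):

```python
import torch
import torch.nn as nn

class DRRN(nn.Module):
    """Sketch of DRRN (He et al., 2015): state and each candidate action
    text are embedded separately, and Q(s, a) is their interaction
    (here a dot product), so the number of actions can vary per turn."""
    def __init__(self, in_dim=100, hid=64):
        super().__init__()
        self.state_net = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.action_net = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())

    def forward(self, state_feat, action_feats):
        s = self.state_net(state_feat)          # (hid,)
        a = self.action_net(action_feats)       # (n_actions, hid)
        return a @ s                            # one Q value per candidate action

q = DRRN()(torch.randn(100), torch.randn(3, 100))   # 3 candidate actions this turn
```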

Figure 6: TRADE model framework

Figure 7: Example of a dialog where the dialog system encounters an undefined system action (Wang et al., 2019)

In another paper (Wang et al., 2019), the authors approach this problem from the perspective of the entire dialog system and propose the Incremental Dialogue System (IDS), as shown in Figure 9. IDS first encodes the dialog history into a context vector through the Dialog Embedding module and then uses a VAE-based Uncertainty Estimation module to evaluate, based on the context vector, a confidence level indicating whether the current system can give a correct answer. Similar to active learning, if the confidence level is higher than a threshold, DM scores all available actions and then predicts the probability distribution with a softmax function. If the confidence level is lower than the threshold, a human tagger is asked to tag the response of the current round (select the correct response or create a new one). The new data obtained in this way is added to the data pool to update the model online. With this human-teaching method, IDS not only supports learning in an unrestricted action space but also quickly

collects high-quality data, which makes it quite suitable for actual production.
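One IDS turn can be summarized by the confidence-gated routing below; the uncertainty estimator, policy, and human tagger are all stand-ins for the paper's modules, and the threshold value is arbitrary:

```python
def ids_step(context_vec, policy, uncertainty, tagger, pool, threshold=0.8):
    """Sketch of one IDS turn (Wang et al., 2019). `uncertainty` stands in
    for the VAE-based confidence estimator, `tagger` for the human labeler."""
    confidence = uncertainty(context_vec)
    if confidence >= threshold:
        return policy(context_vec)           # system answers on its own
    response = tagger(context_vec)           # human selects/creates a response
    pool.append((context_vec, response))     # new data for online model updates
    return response
```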

2.2 Shortcoming 2: Insufficient Tagged Data

The extensive application of dialog systems results in diversified data requirements. To train a task-oriented dialog system, as much domain-specific data as possible is needed, but quality tagged data is costly. Scholars have tried to solve this problem in several ways: (1) using machines to tag data to reduce tagging costs; (2) mining the dialog structure to use untagged data efficiently; (3) optimizing the data collection policy to efficiently obtain high-quality data; (4) using meta-learning approaches (Qian and Yu, 2019; Dai et al., 2020; Xu et al., 2020; Dingliwal et al., 2021); and (5) using pre-trained dialog models (Hosseini-Asl et al., 2020; Peng et al., 2020; Wu et al., 2020; Dai et al., 2021). In this survey we mainly discuss the first three, leaving the last two topics for future work.

2.2.1 Automatic Tagging

To address the cost and inefficiency of manual tagging, scholars hope to use supervised and unsupervised learning to let machines assist in tagging. Shi et al. (2018) proposed the auto-dialabel architecture, which automatically groups intents and slots in dialog data using the unsupervised method of hierarchical clustering, thereby automatically tagging the dialog data

Figure 8: DRRN model, in which round t has two candidate actions and round t+1 has three candidate actions

Figure 9: The overall framework of IDS

(the specific tag of each category still needs to be determined manually). This method is based on the assumption that expressions of the same intent share similar background features. Initial features extracted by the model include word vectors, part-of-speech (POS) tags, noun word clusters, and Latent Dirichlet Allocation (LDA) topics. All features are encoded by an auto-encoder into vectors of the same dimension and concatenated. Then, the inter-class distance calculated with a radial basis function (RBF) is used for dynamic hierarchical clustering: the classes closest to each other are merged automatically until the inter-class distance between all classes is greater than a threshold. Figure 10 shows the model framework.
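A rough approximation of this clustering step using SciPy's hierarchical clustering; the exact dynamic merging procedure and RBF parameterization in the paper may differ:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def auto_label(features, threshold=1.0, gamma=0.5):
    """Sketch of auto-dialabel-style grouping: hierarchical clustering that
    keeps merging until the inter-class distance exceeds a threshold.
    `features` stands for the concatenated auto-encoder vectors (word
    vectors, POS tags, LDA, ...); the RBF maps raw distances onto a
    bounded similarity-style scale."""
    dists = pdist(features)                        # pairwise distances
    rbf_dists = 1.0 - np.exp(-gamma * dists ** 2)  # RBF-based distance
    tree = linkage(rbf_dists, method="average")
    return fcluster(tree, t=threshold, criterion="distance")

labels = auto_label(np.random.rand(20, 32), threshold=0.6)
```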

In another paper (Haponchyk et al., 2018), supervised clustering is used to implement machine tagging. The authors view each dialog data record as a graph node and the clustering process as the identification of a minimum spanning forest. The model uses a support vector machine (SVM) to train a distance scoring model between nodes in a QA dataset through supervised learning. It then uses the structured model and a minimum-subtree spanning algorithm to derive the class information corresponding to the dialog data as a hidden variable, generating the best cluster structure to represent the user intent types.

2.2.2 Dialog Structure Mining

Due to the lack of high-quality tagged data for training dialog systems, finding ways to fully mine implicit dialog structures or information in untagged dialog data has become a popular area of research. Implicit dialog structures or information contribute, to some extent, to the design of dialog policies and the training of dialog models.

One paper (Shi et al., 2019) proposed to use unsupervised learning with a variational RNN (VRNN) to automatically learn the hidden structure of dialog data. The authors provide two models that can capture the dynamic information in a dialog: Discrete-VRNN (D-VRNN) and Direct-Discrete-VRNN (DD-VRNN). As shown in Figure 11, x_t denotes the t-th round of dialog, h_t denotes the hidden variable of the dialog history, and z_t denotes the hidden variable of the dialog structure (a one-dimensional one-hot discrete variable). The difference between the two models is that in D-VRNN the hidden variable z_t depends on h_{t-1}, while in DD-VRNN it depends on z_{t-1}. Based on the maximum likelihood of the entire dialog, VRNN uses common VAE methods to estimate the posterior

Figure 10: Auto-dialabel model

distribution of the hidden variable z_t.

The experiments in the paper show that VRNN is superior to the traditional HMM method. VRNN also adds the dialog structure information to the reward function, supporting faster convergence of the reinforcement learning model. Figure 12 shows the transition probabilities of the hidden variable z_t mined by D-VRNN from restaurant-domain dialogs. More recent works use structured attention networks (Qiu et al., 2020) and graph neural networks (Sun et al., 2021) to extract hidden dialog structures.

CMU scholars (Zhao et al., 2019) also tried to use the VAE method to infer system actions as hidden variables and directly use them for dialog policy selection. This can alleviate the problems caused by an insufficiently predefined system action space. As shown in Figure 13, for simplicity, an end-to-end dialog system framework is used in the paper. The baseline model is an RL model at the word level (that is, a dialog action is a word in the vocabulary). The model uses an encoder to encode the dialog history and a decoder to generate the response, and the reward function directly compares the generated response with the true response. Compared with the baseline model, the latent action model adds posterior probability inference between the encoder and the decoder and uses discrete hidden variables to represent dialog actions without any manual intervention. The experiments show that the end-to-end RL model based on latent actions is superior to the baseline model in terms of the diversity of generated statements and the task completion rate.

2.2.3 Data Collection

Recently, Google researchers proposed a method to quickly collect dialog data (Shah et al., 2018) (see Figure 14): First, two rule-based simulators interact to generate a dialog outline, which is a dialog flow framework represented by semantic tags. Then, the semantic tags are converted into natural language dialogs based on templates. Finally, crowd workers rewrite the natural statements to enrich the language expressions of the dialog data. This reverse data collection method features high collection efficiency and complete, highly usable data tags, reducing the cost and workload of data collection and processing.

This method is a machine-to-machine (M2M) data collection policy, in which a wide range of semantic tags for dialog data are generated and then crowdsourced to produce a large number of dialog utterances. However, the generated dialogs cannot cover all the possibilities of real scenarios. In addition, the effect depends on the simulators (Li et al., 2016; Liu and Lane, 2017; Chen et al., 2020; Tseng et al., 2021).

In relevant academic circles, two other methods are commonly used to collect data for dialog systems: human-to-machine (H2M) and human-to-human (H2H). The H2H method requires a multi-round dialog between a user, played by one crowdsourced worker, and the customer service personnel, played by another crowdsourced worker. The user proposes requirements based on specified dialog goals, such as buying an airplane ticket, and the customer service worker annotates the dialog tags and makes

Figure 11: D-VRNN and DD-VRNN

Figure 12: Dialog stream structure mined by D-VRNN from the dialog data related to restaurants

responses. This mode is called the Wizard-of-Oz framework; many dialog datasets, such as WOZ (Wen et al., 2016) and MultiWOZ (Budzianowski et al., 2018), were collected in this mode. The H2H method yields dialog data that is the most similar to that of actual service scenarios, but it is costly to design different interactive interfaces for different tasks and to clean up incorrect annotations. The H2M data collection policy lets users and trained machines interact with each other, so we can directly collect data online and continuously improve the DM model through RL. The famous DSTC2&3 dataset was collected in this way. The performance of the H2M method depends largely on the initial performance of the DM model. In addition, the data collected online contains a great deal of noise, which results in high clean-up costs and affects the efficiency of model optimization.

2.3 Shortcoming 3: Low Training Efficiency

With the successful application of deep RL to the game of Go, this method has also been widely used in task dialog systems. For example, the ACER dialog management method in one paper (Su et al., 2017) combines model-free deep RL with techniques such as experience replay, trust region constraints, and pre-training. This greatly improves the training efficiency and stability of RL algorithms in task dialog systems.

However, simply applying an RL algorithm cannot meet the actual requirements of dialog systems. One reason is that dialogs lack clear rules and reward functions, simple and clear action spaces, and perfect environment simulators that can generate hundreds of millions of quality interactive data records. Dialog tasks involve changing slot values, actions, and intents, which significantly increases the action space of the dialog system and makes it difficult to define. When traditional flat RL methods are used, the curse of dimensionality may occur due to the one-hot encoding of all system actions, so these methods are no longer suitable for handling complex dialogs with large action spaces. For this reason, scholars have tried many other methods, including model-free RL, model-based RL, and human-in-the-loop training.

2.3.1 Model-Free RL - HRL

Hierarchical Reinforcement Learning (HRL) divides a complex task into multiple sub-tasks to avoid the curse of dimensionality in traditional flat RL methods. In one paper (Peng et al., 2017), HRL was applied to task dialog systems for the first time. The authors

Figure 13: Baseline model and latent action model

Figure 14: Examples of dialog outline, template-based dialog generation, and crowdsourcing-based dialog rewrite

divided a complex dialog task into multiple sub-tasks by time. For example, a complex travel task can be divided into sub-tasks such as booking tickets, booking hotels, and renting cars. Accordingly, they designed a two-layer dialog policy network: one layer selects and sequences the sub-tasks, and the other executes a specific sub-task.

The DM model they proposed consists of two parts, as shown in Figure 15:

• Top-level policy: Selects a sub-task based on the dialog state.

• Low-level policy: Completes a specific dialog action in a sub-task.

The global dialog state tracker records the overall dialog state. After the entire dialog task is completed, the top-level policy receives an external reward.

The model also has an internal critic module that estimates the possibility of completing each sub-task (the degree of slot filling for the sub-task) based on the dialog state. The low-level policy receives an intrinsic reward from the internal critic module based on the degree of completion of the sub-task.

For complex dialogs, traditional RL methods select a basic system action at each step,

Figure 15: The HRL framework of a task-oriented dialog system

such as querying a slot value or confirming constraints. In HRL mode, a set of basic actions is first selected by the top-level policy, and then a basic action is selected from the current set by the low-level policy, as shown in Figure 16. This hierarchical division of the action space captures the temporal constraints between different sub-tasks, which facilitates the completion of composite tasks. In addition, the intrinsic reward effectively relieves the problem of sparse rewards, accelerating RL training, preventing the dialog from frequently switching between

Figure 16: Policy selection process of HRL

different sub-tasks, and improving the accuracy of action prediction. Of course, the hierarchical design of actions requires expert knowledge, and the types of sub-tasks need to be determined by experts. Recently, tools that can automatically discover dialog sub-tasks have appeared (Kristianto et al., 2018; Tang et al., 2018). Using unsupervised learning, these tools automatically segment the dialog state sequence of the whole dialog history, without the need to manually build a dialog sub-task structure.
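The control flow of the two-level policy can be sketched as follows; all callables are illustrative stand-ins, and the termination test for a sub-task is a simplification of the internal critic's role:

```python
def hrl_episode(state, top_policy, low_policies, internal_critic, env):
    """Sketch of the two-level HRL policy (Peng et al., 2017)."""
    subtask = top_policy(state)                    # e.g. "book_hotel"
    done_subtask = False
    while not done_subtask:
        action = low_policies[subtask](state)      # primitive dialog action
        state, extrinsic_reward = env.step(action)
        # Intrinsic reward from the internal critic: how close the sub-task
        # is to completion (degree of slot filling for the sub-task).
        intrinsic_reward = internal_critic(state, subtask)
        done_subtask = intrinsic_reward >= 1.0
    # The extrinsic reward for the whole task goes to the top-level policy.
    return state, extrinsic_reward
```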

2.3.2 Model-Free RL - FRL

Feudal Reinforcement Learning (FRL) is another solution to the problem of large action-space dimensionality. HRL divides a dialog policy into sub-policies along the time dimension based on different task stages, which reduces the complexity of policy learning. FRL instead divides a policy in the space dimension to restrict the action range of each sub-policy, which reduces the complexity of the sub-policies. FRL does not divide a task into sub-tasks; instead, it uses abstraction functions over the state space to extract useful features from dialog states. Such abstraction allows FRL to be applied and transferred between different domains, achieving high scalability.

Cambridge scholars applied FRL (Casanueva et al., 2018) to task dialog systems for the first time, dividing the action space by its relevance to the slots. With this done, only the natural structure of the action space is used, and no additional expert knowledge is required. They put forward the feudal policy structure shown in Figure 17. Its decision-making process is divided into two steps (a sketch follows the list):

1. Determine whether the next action requires a slot as a parameter.

2. Select the low-level policy and the next action for the corresponding slot based on the decision of the first step.
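A sketch of this two-step feudal decision; the three policy callables are stand-ins, and we assume the shared per-slot policy returns a (Q value, action) pair for each slot:

```python
def feudal_decision(state, master_policy, slot_free_policy, per_slot_policy, slots):
    """Sketch of the two-step feudal decision (Casanueva et al., 2018)."""
    # Step 1: does the next action take a slot as a parameter?
    if not master_policy(state):
        return slot_free_policy(state)            # e.g. "hello", "inform"
    # Step 2: one shared low-level policy scores an action for every slot
    # on slot-abstracted features; the highest-scoring pair is executed.
    best_q, best_action = max(per_slot_policy(state, s) for s in slots)
    return best_action                            # e.g. "request(area)"
```

Because the per-slot policy is shared and operates on slot-abstracted features, adding a slot does not add parameters, which is where the scalability comes from.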

In general, both HRL and FRL divide the high-dimensional complex action space in different ways to address the low training efficiency that traditional RL methods suffer from in large action spaces. HRL divides tasks in a way that matches human understanding, but expert knowledge is required to divide a task into sub-tasks. FRL divides complex tasks based on the logical structure of the actions and does not consider mutual constraints between sub-tasks.

2.3.3 Model-Based RL

The preceding RL methods are model-free: a large amount of weakly supervised data is obtained through trial-and-error interactions with the environment, and then a value network or policy network is trained on it, with the process itself independent of the environment. There is also model-based RL, as shown in Figure 18. Model-based RL directly models the environment to learn a probability transition function over states and rewards, namely an environment model. The system then interacts with the environment model to generate additional training data. Therefore, model-based RL is more sample-efficient than model-free RL, especially when interacting with the real environment is costly. However, the resulting performance depends on the quality of the environment modeling.

Using model-based RL to improve training efficiency is currently an active field of research. Microsoft first applied the classic Deep Dyna-Q (DDQ) algorithm to dialogs (Peng et al., 2018), as shown in panel (c) of Figure 19. Before DDQ training starts, a small amount of existing dialog data is used to pre-train the policy model and the world model. Then, DDQ is trained by repeating the following steps (sketched after the list):

• Direct RL: Interact with real users online, update the policy model, and store the dialog data.

• World model training: Update the world model based on the collected real dialog data.

• Planning: Use the dialog data obtained from interaction with the world model to train the policy model.
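One DDQ iteration can be sketched as the loop below; run_dialog, the replay buffer, and the fit/update methods are assumed helpers, not the paper's API:

```python
def train_ddq(policy, world_model, real_env, replay, planning_steps=5):
    """Sketch of one DDQ iteration (Peng et al., 2018)."""
    # 1. Direct RL: interact with real users and store the experience.
    real_episode = run_dialog(policy, real_env)
    replay.extend(real_episode)
    policy.update(real_episode)

    # 2. World model training: fit the environment model on real data.
    world_model.fit(replay)

    # 3. Planning: generate cheap simulated experience from the world
    #    model and use it for additional policy updates.
    for _ in range(planning_steps):
        simulated = run_dialog(policy, world_model)
        policy.update(simulated)
```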

Figure 17: Application of FRL in a task-oriented dialog system

Figure 18: Model-based RL process

The world model (shown in Figure 20) is a neural network that models the environment's state transition probabilities and rewards. Its inputs are the current dialog state and system action; its outputs are the next user action, the environment reward, and a dialog termination variable. The world model reduces the amount of human-machine interaction data required by DDQ for online RL (as shown in panel (a) of Figure 19) and avoids ineffective interactions with user simulators (as shown in panel (b) of Figure 19).
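A minimal rendering of such a world model as a multi-head MLP; the layer sizes and head parameterizations are illustrative:

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Sketch of a DDQ-style world model: given the dialog state and system
    action, predict the user action, the reward, and termination."""
    def __init__(self, state_dim, action_dim, n_user_actions, hid=80):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim + action_dim, hid), nn.Tanh())
        self.user_action = nn.Linear(hid, n_user_actions)  # next user action
        self.reward = nn.Linear(hid, 1)                    # environment reward
        self.done = nn.Linear(hid, 1)                      # termination logit

    def forward(self, state, action):
        h = self.shared(torch.cat([state, action], dim=-1))
        return self.user_action(h), self.reward(h), torch.sigmoid(self.done(h))

wm = WorldModel(state_dim=30, action_dim=10, n_user_actions=20)
ua_logits, r, done_prob = wm(torch.randn(4, 30), torch.randn(4, 10))
```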

Similar to a user simulator in the dialog field, the world model can simulate real user actions and interact with the system's DM. However, a user simulator is essentially an external environment used to mimic real users, while the world model is an internal model of the system.

Microsoft researchers have made several improvements to DDQ. To improve the authenticity of the dialog data generated by the world model, they proposed (Su et al., 2018) to improve the quality of the generated dialog data through adversarial training. Considering when to use data generated through interaction with the real environment and when to use data generated through interaction with the world model, they discussed feasible solutions in another paper (Wu et al., 2019b). They also discussed a unified dialog framework that includes interaction with real users (Zhang et al., 2019). This human-teaching concept has attracted attention in the industry as it can help in building DMs; it is further explained in the following sections.

2.3.4 Human-in-the-Loop

We hope to make full use of human knowledge and experience to generate high-quality data and improve the efficiency of model training. Human-in-the-loop RL (Abel et al., 2017) is a method that introduces humans into robot training: through carefully designed human-machine interaction, humans can efficiently guide the training of RL models. To further improve the training efficiency of task dialog systems, researchers are working to design effective human-in-the-loop methods based on the characteristics of dialog.

Google researchers proposed a composite learning method combining human teaching and RL (Abel et al., 2017) (as shown in Figure 21), which adds a human teaching stage between supervised pre-training and online RL, allowing humans to tag data to avoid the covariate shift caused by supervised pre-training (Ross et al., 2011).

Figure 19: Three RL architectures

Figure 20: Structure of the world model

Figure 21: Composite learning combining supervised pre-training, imitation learning, and online RL

Amazon researchers also proposed a similar human teaching framework (Abel et al., 2017): in each round of dialog, the system recommends four responses to a customer service expert, who decides whether to select one of them or create a new response, and then sends the selected or created response to the user. With this method, developers can quickly update the capabilities of the dialog system.

In the preceding methods, the system passively receives the data tagged by humans. However, a good system should actively ask questions and seek help from humans. One paper (Chen et al., 2017) introduced the companion learning architecture (as shown in Figure 22), which adds the role of a teacher (a human) to the traditional RL framework. The teacher can correct the responses of the dialog system (the student, represented by the switch on the left side of the figure) and evaluate the student's responses in the form of an intrinsic reward (the switch on the right side of the figure). To implement active learning, the authors put forward the concept of dialog decision certainty: the student policy network is sampled multiple times through dropout to obtain an estimate of the approximate maximum probability of the desired action, and the moving average of this maximum probability over several dialog rounds is used as the decision certainty of the student policy network. If the calculated certainty is lower than a target value, the system determines whether a teacher is required to correct errors and provide reward functions based on the gap between the calculated decision certainty and the target value. If the

calculated certainty is higher than the target value, the system stops learning from the teacher and makes judgments on its own.

The key to active learning is estimating the certainty of the dialog system regarding its own decisions. In addition to applying dropout to policy networks, other methods include using hidden variables as condition variables to calculate the Jensen-Shannon divergence between policy networks (Wang et al., 2019) and making judgments based on the dialog success rate of the current system (Zhang et al., 2019).
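The dropout-based certainty estimate can be sketched as follows; the policy network is assumed to contain dropout layers and return action logits, and the moving-average smoothing is a plain exponential average:

```python
import torch

def decision_certainty(policy_net, state, n_samples=20, momentum=0.9, prev=0.0):
    """Sketch of the dialog decision certainty from Chen et al. (2017):
    sample the policy with dropout active, take the mean maximum action
    probability, and smooth it with a moving average over dialog rounds."""
    policy_net.train()                 # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([policy_net(state).softmax(-1)
                             for _ in range(n_samples)])
    max_prob = probs.mean(0).max().item()   # approx. max action probability
    return momentum * prev + (1 - momentum) * max_prob   # moving average
```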

3 Dialog Management Framework of the Intelligent Robot Conversational AI Team

To ensure stability and interpretability, the industry primarily uses rule-based DM models. The Intelligent Robot Conversational AI Team at Alibaba's DAMO Academy began to explore learned DM models last year. When building a real dialog system, we need to solve two problems: (1) how to obtain a large amount of dialog data in a specific scenario and (2) how to use algorithms to maximize the value of the data.

Currently, we plan to complete the model framework design in four steps, as shown in Figure 23.

Step 1: First, use the dialog studio independently developed by the Intelligent Robot Conversational AI team to quickly build a dialog engine called TaskFlow based on rule-based dialog flows, and build a user simulator with similar dialog flows. Then, have the user simulator and TaskFlow continuously interact with each other to generate a large amount of dialog data.

Step 2: Train a neural network through supervised learning to build a preliminary DM model with capabilities basically equivalent to the rule-based dialog engine. The model can be extended by combining semantic similarity matching and end-to-end generation. Dialog tasks with a large action space are divided using the HRL method.

Step 3: In the development phase, have the system interact with an improved user simulator or AI trainers and continuously enhance its dialog capability using off-policy ACER RL algorithms.

Step 4: After the human-machine interaction experience is verified, launch the system and introduce human roles to collect real user interaction data. In addition, use UI designs that easily incorporate user feedback to continuously update and enhance the model. The collected human-machine dialog data is further analyzed and mined for customer insight.

At present, the RL-based DM model we developed can complete 80% of dialogs with the user simulator on moderately complex dialog tasks, such as booking a meeting room, as shown in Figure 24.

4 Summary

This article provides a detailed introduction to the latest research on DM models, focusing on three shortcomings of traditional DM models:

• Poor scalability

• Insufficient tagged data

• Low training efficiency

To address poor scalability, common methods for handling changes in user intents, dialog ontologies, and the system action space include semantic similarity matching, knowledge distillation, and sequence generation. To address insufficient tagged data, methods include automatic machine tagging, effective dialog structure mining, and efficient data collection policies. To address the low training efficiency of traditional DM models, methods such as HRL and FRL are used to divide action spaces hierarchically, and model-based RL methods are used to model the environment and improve training efficiency. Introducing human-in-the-loop training into the dialog system framework is also a current focus of research. Finally, we discussed the current progress of the DM model developed by the Intelligent Robot Conversational AI team of Alibaba's DAMO Academy. We hope this summary can provide some new insights to support your own research on DM.

Acknowledgement

The authors would like to thank Alibaba Cloud International Team for their efforts on the translation.

Figure 22: The teacher corrects the student's response (on the left) or evaluates the student's response (on the right).

Figure 23: Four steps of DM model design

Figure 24: Framework and evaluation indicators of the DM model developed by the Intelligent Robot Conversational AI team

References

David Abel, John Salvatier, Andreas Stuhlmuller,and Owain Evans. 2017. Agent-agnostic human-in-the-loop reinforcement learning. arXivpreprint arXiv:1701.04079.

Ankur Bapna, Gokhan Tur, Dilek Hakkani-Tur,and Larry Heck. 2017. Towards zero-shot framesemantic parsing for domain scaling. arXivpreprint arXiv:1707.02363.

Jonathan Berant, Andrew Chou, Roy Frostig, andPercy Liang. 2013. Semantic parsing on free-base from question-answer pairs. In Proceedingsof the 2013 conference on empirical methods innatural language processing, pages 1533–1544.

Antoine Bordes, Y-Lan Boureau, and Jason We-ston. 2016. Learning end-to-end goal-orienteddialog. arXiv preprint arXiv:1605.07683.

Pawe l Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes,Osman Ramadan, and Milica Gasic. 2018.Multiwoz–a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling.arXiv preprint arXiv:1810.00278.

Inigo Casanueva, Pawe l Budzianowski, Pei-Hao Su,Stefan Ultes, Lina Rojas-Barahona, Bo-HsiangTseng, and Milica Gasic. 2018. Feudal reinforce-ment learning for dialogue management in largedomains. arXiv preprint arXiv:1803.03232.

Lu Chen, Boer Lv, Chi Wang, Su Zhu, BowenTan, and Kai Yu. 2020. Schema-guided multi-domain dialogue state tracking with graph at-tention neural networks. In Proceedings of theAAAI Conference on Artificial Intelligence, vol-ume 34, pages 7521–7528.

Lu Chen, Xiang Zhou, Cheng Chang, RunzheYang, and Kai Yu. 2017. Agent-aware dropoutdqn for safe and efficient on-line dialogue policylearning. In Proceedings of the 2017 Conferenceon Empirical Methods in Natural Language Pro-cessing, pages 2454–2464.

Yun-Nung Chen, Dilek Hakkani-Tur, and Xi-aodong He. 2016. Zero-shot learning of intentembeddings for expansion by convolutional deepstructured semantic models. In 2016 IEEE In-ternational Conference on Acoustics, Speech andSignal Processing (ICASSP), pages 6045–6049.IEEE.

Yinpei Dai, Hangyu Li, Yongbin Li, Jian Sun, FeiHuang, Luo Si, and Xiaodan Zhu. 2021. Pre-view, attend and review: Schema-aware cur-riculum learning for multi-domain dialog statetracking. arXiv preprint arXiv:2106.00291.

Yinpei Dai, Hangyu Li, Chengguang Tang, Yong-bin Li, Jian Sun, and Xiaodan Zhu. 2020. Learn-ing low-resource end-to-end goal-oriented dialog

for fast and reliable system deployment. In Pro-ceedings of the 58th Annual Meeting of the As-sociation for Computational Linguistics, pages609–618.

Yinpei Dai, Yichi Zhang, Zhijian Ou, YanmengWang, and Junlan Feng. 2018. Elastic crfsfor open-ontology slot filling. arXiv preprintarXiv:1811.01331.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jian-feng Gao, Yun-Nung Chen, Faisal Ahmed, andLi Deng. 2016. Towards end-to-end reinforce-ment learning of dialogue agents for informationaccess. arXiv preprint arXiv:1609.00777.

Saket Dingliwal, Bill Gao, Sanchit Agarwal,Chien-Wei Lin, Tagyoung Chung, and DilekHakkani-Tur. 2021. Few shot dialogue statetracking using meta-learning. arXiv preprintarXiv:2101.06779.

Jianfeng Gao, Michel Galley, and Lihong Li. 2018.Neural approaches to conversational ai. In The41st International ACM SIGIR Conference onResearch & Development in Information Re-trieval, pages 1371–1374.

Iryna Haponchyk, Antonio Uva, Seunghak Yu,Olga Uryupina, and Alessandro Moschitti. 2018.Supervised clustering of questions into intentsfor dialog system applications. In Proceedingsof the 2018 Conference on Empirical Methods inNatural Language Processing, pages 2310–2321.

Ji He, Jianshu Chen, Xiaodong He, JianfengGao, Lihong Li, Li Deng, and Mari Osten-dorf. 2015. Deep reinforcement learning witha natural language action space. arXiv preprintarXiv:1511.04636.

Matthew Henderson and Ivan Vulic. 2020. Convex:Data-efficient and few-shot slot labeling. arXivpreprint arXiv:2010.11791.

Ehsan Hosseini-Asl, Bryan McCann, Chien-ShengWu, Semih Yavuz, and Richard Socher. 2020.A simple language model for task-oriented dia-logue. arXiv preprint arXiv:2005.00796.

Giovanni Yoko Kristianto, Huiwen Zhang,Bin Tong, Makoto Iwayama, and YoshiyukiKobayashi. 2018. Autonomous sub-domainmodeling for dialogue policy with hierarchicaldeep reinforcement learning. In Proceedings ofthe 2018 EMNLP Workshop SCAI: The 2ndInternational Workshop on Search-OrientedConversational AI, pages 9–16.

Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf.2021. Dialogue state tracking with a languagemodel using schema-driven prompting. arXivpreprint arXiv:2109.07506.

Wenqiang Lei, Xisen Jin, Min-Yen Kan, ZhaochunRen, Xiangnan He, and Dawei Yin. 2018. Se-quicity: Simplifying task-oriented dialogue sys-tems with single sequence-to-sequence architec-tures. In Proceedings of the 56th Annual Meet-ing of the Association for Computational Lin-guistics (Volume 1: Long Papers), pages 1437–1447.

Xiujun Li, Zachary C Lipton, Bhuwan Dhingra,Lihong Li, Jianfeng Gao, and Yun-Nung Chen.2016. A user simulator for task-completion dia-logues. arXiv preprint arXiv:1612.05688.

Bing Liu and Ian Lane. 2017. Iterative policylearning in end-to-end trainable task-orientedneural dialog models. In Proceedings of 2017IEEE Workshop on Automatic Speech Recogni-tion and Understanding (ASRU).

Zihan Liu, Genta Indra Winata, Peng Xu, andPascale Fung. 2020. Coach: A coarse-to-fineapproach for cross-domain slot filling. arXivpreprint arXiv:2004.11727.

Andrea Madotto, Chien-Sheng Wu, and Pas-cale Fung. 2018. Mem2seq: Effectively in-corporating knowledge bases into end-to-endtask-oriented dialog systems. arXiv preprintarXiv:1804.08217.

Shikib Mehri and Maxine Eskenazi. 2021. Gensf:Simultaneous adaptation of generative pre-trained models and slot filling. arXiv preprintarXiv:2106.07055.

Gregoire Mesnil, Xiaodong He, Li Deng, andYoshua Bengio. 2013. Investigation of recurrent-neural-network architectures and learning meth-ods for spoken language understanding. In In-terspeech, pages 3771–3775.

Nikola Mrksic, Diarmuid O Seaghdha, Tsung-Hsien Wen, Blaise Thomson, and SteveYoung. 2016. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprintarXiv:1606.03777.

Baolin Peng, Chunyuan Li, Jinchao Li, ShahinShayandeh, Lars Liden, and Jianfeng Gao. 2020.Soloist: Few-shot task-oriented dialog with asingle pretrained auto-regressive model. arXivpreprint arXiv:2005.05298.

Baolin Peng, Xiujun Li, Jianfeng Gao, JingjingLiu, Kam-Fai Wong, and Shang-Yu Su. 2018.Deep dyna-q: Integrating planning for task-completion dialogue policy learning. arXivpreprint arXiv:1801.06176.

Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. arXiv preprint arXiv:1704.03084.

Kun Qian and Zhou Yu. 2019. Domain adaptive dialog generation via meta learning. arXiv preprint arXiv:1906.03520.

Liang Qiu, Yizhou Zhao, Weiyan Shi, Yuan Liang, Feng Shi, Tao Yuan, Zhou Yu, and Song-Chun Zhu. 2020. Structured attention for unsupervised dialogue structure induction. arXiv preprint arXiv:2009.08552.

Janarthanan Rajendran, Jatin Ganhotra, and Lazaros C Polymenakos. 2019. Learning end-to-end goal-oriented dialog with maximal user task success and minimal human agent use. Transactions of the Association for Computational Linguistics, 7:375–386.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Osman Ramadan, Paweł Budzianowski, and Milica Gasic. 2018. Large-scale multi-domain belief tracking with knowledge sharing. arXiv preprint arXiv:1807.06517.

Abhinav Rastogi, Dilek Hakkani-Tur, and Larry Heck. 2017. Scalable multi-domain dialogue state tracking. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 561–568. IEEE.

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings.

Iulian Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Darsh J Shah, Raghav Gupta, Amir A Fayazi, and Dilek Hakkani-Tur. 2019. Robust zero-shot cross-domain slot filling with example values. arXiv preprint arXiv:1906.06870.

Pararth Shah, Dilek Hakkani-Tur, Bing Liu, and Gokhan Tur. 2018. Bootstrapping a neural conversational agent with dialogue self-play, crowd-sourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 41–51.

Chen Shi, Qi Chen, Lei Sha, Sujian Li, Xu Sun, Houfeng Wang, and Lintao Zhang. 2018. Auto-dialabel: Labeling dialogue data with unsupervised learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 684–689.

Weiyan Shi, Tiancheng Zhao, and Zhou Yu. 2019. Unsupervised dialog structure learning. arXiv preprint arXiv:1904.03736.

Pei-Hao Su, Paweł Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. 2017. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. arXiv preprint arXiv:1707.00130.

Shang-Yu Su, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Yun-Nung Chen. 2018. Discriminative Deep Dyna-Q: Robust planning for dialogue policy learning. arXiv preprint arXiv:1808.09442.

Yajing Sun, Yong Shan, Chengguang Tang, Yue Hu, Yinpei Dai, Jing Yu, Jian Sun, Fei Huang, and Luo Si. 2021. Unsupervised learning of deterministic dialogue structure with edge-enhanced graph auto-encoder. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13869–13877.

Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, and Tony Jebara. 2018. Subgoal discovery for hierarchical dialogue policy learning. arXiv preprint arXiv:1804.07855.

Bo-Hsiang Tseng, Yinpei Dai, Florian Kreyssig, and Bill Byrne. 2021. Transferable dialogue systems and user simulators. arXiv preprint arXiv:2107.11904.

Alan M Turing. 2009. Computing machinery and intelligence. In Parsing the Turing Test, pages 23–65. Springer.

Weikang Wang, Jiajun Zhang, Qian Li, Mei-Yuh Hwang, Chengqing Zong, and Zhifei Li. 2019. Incremental learning from scratch for task-oriented dialogue systems. arXiv preprint arXiv:1906.04991.

Weikang Wang, Jiajun Zhang, Han Zhang, Mei-Yuh Hwang, Chengqing Zong, and Zhifei Li. 2018. A teacher-student framework for maintainable dialog manager. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3803–3812.

Gellert Weisz, Paweł Budzianowski, Pei-Hao Su, and Milica Gasic. 2018. Sample efficient deep reinforcement learning for dialogue systems with large action spaces. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(11):2083–2097.

Joseph Weizenbaum. 1966. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.

Henry Weld, Xiaoqi Huang, Siqu Long, Josiah Poon, and Soyeon Caren Han. 2021. A survey of joint intent detection and slot-filling models in natural language understanding. arXiv preprint arXiv:2101.08091.

Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.

Chien-Sheng Wu, Steven Hoi, Richard Socher, and Caiming Xiong. 2020. TOD-BERT: Pre-trained natural language understanding for task-oriented dialogue. arXiv preprint arXiv:2004.06871.

Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019a. Transferable multi-domain state generator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743.

Yuexin Wu, Xiujun Li, Jingjing Liu, Jianfeng Gao, and Yiming Yang. 2019b. Switch-based active Deep Dyna-Q: Efficient adaptive planning for task-completion dialogue policy learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7289–7296.

Yumo Xu, Chenguang Zhu, Baolin Peng, and Michael Zeng. 2020. Meta dialogue policy learning. arXiv preprint arXiv:2006.02588.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Mengshi Yu, Jian Liu, Yufeng Chen, Jinan Xu, and Yujie Zhang. 2021. Cross-domain slot filling as machine reading comprehension. In IJCAI.

Jianguo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wang, Philip Yu, Richard Socher, and Caiming Xiong. 2020. Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 154–167, Barcelona, Spain (Online). Association for Computational Linguistics.

Zhirui Zhang, Xiujun Li, Jianfeng Gao, and Enhong Chen. 2019. Budgeted policy learning for task-oriented dialogue systems. arXiv preprint arXiv:1906.00499.

Tiancheng Zhao, Kaige Xie, and Maxine Eskenazi. 2019. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. arXiv preprint arXiv:1902.08858.

Li Zhou and Kevin Small. 2019. Multi-domain dialogue state tracking as dynamic knowledge graph enhanced question answering. arXiv preprint arXiv:1911.06192.