Edina: Building an Open-Domain Socialbot using Self-Dialogues · 1 Edina: Building an Open-Domain...

1

Edina: Building an Open-Domain

Socialbot using Self-Dialogues

ILCC, School of Informatics, University of Edinburgh

[email protected], [email protected],[email protected]

2

Conversational AI is everywhere

http://static4.uk.businessinsider.com/image/581ca089dd08954b518b45b6-1190-625/

we-put-siri-alexa-google-assistant-and-cortana-through-a-marathon-of-tests-to-see-whos-winning-\

the-virtual-assistant-race--heres-what-we-found.jpg

http://static4.uk.businessinsider.com/image/581ca089dd08954b518b45b6-1190-625/we-put-siri-alexa-google-assistant-and-cortana-through-a-marathon-of-tests-to-see-whos-winning-\the-virtual-assistant-race--heres-what-we-found.jpg



3

2016: The year of the chatbot

from ‘Tracxn Research, Chatbot Startup Landscape’, June 2016

4

Chatbot Applications

I Customer service

I IoT

I Other: help people with disabilities, etc.

5

Amazon vs. Google vs. Microsoft

https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E

https://www.bhphotovideo.com/images/images2500x2500/google_ga3a00417a14_home_1297281.jpg

https://blogs.msdn.microsoft.com/ukhe/2015/09/15/student-survival-tips-from-cortana/

https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E

https://www.bhphotovideo.com/images/images2500x2500/google_ga3a00417a14_home_1297281.jpg

https://blogs.msdn.microsoft.com/ukhe/2015/09/15/student-survival-tips-from-cortana/

6

Amazon Alexa Prize

I Goal: to build on open-domain conversation AI forcommercial purposes

I Currently, Alexa mostly is mostly rule-based (skills)

I 18 teams involved (12 sponsored by Amazon)

I Users in the U.S. evaluate the conversation with bot on ascale from 1 to 5

7

Our team

8

The problem(s)

9

Where do we start?

I How do we build a chatbot?I No idea!I Let’s look at previous work!

10

Rule-based bots: Mitsuku (try it at mistuku.com!)

11

Rule-based vs. Machine-learning

I Rule-basedI 3 Fully deterministicI 3 Output fully intelligibleI 7 Very constrainedI 7 Time-consuming, Difficult to maintainI 7 Full of fallback strategies

12

Machine-learning methods: Neural Networks

13


I Rule-basedI 3Fully deterministicI 3Output fully intelligibleI 7Very constrainedI 7Time-consuming, Difficult to maintainI 7Full of fallback strategies

I Machine-learningI 3 Easy to maintainI 3 Flexible, broader-coverageI 7 Non-deterministicI 7 Constrained to the domain of the training data

14

Where do we start?

I How do we build a chatbot?I No idea!I Let’s look at previous work!

I What does Amazon want?I Open-domainI The user needs to be happy!!!

15

Open-domain

16


I Rule-basedI 3Fully deterministicI 3Output fully intelligibleI 7 Very constrainedI 7Time-consuming, Difficult to maintainI 7Full of fallback strategies

I Machine-learningI 3 Easy to maintainI 3 Flexible, broader-coverageI 7Non-deterministicI 7Constrained to the domain of the training data

17


I Rule-basedI 3Fully deterministicI 3Output fully intelligibleI 7Very constrainedI 7Time-consuming, Difficult to maintainI 7Full of fallback strategies

I Machine-learningI 3Easy to maintainI 3Flexible, broader-coverageI 7Non-deterministicI 7 Constrained to the domain of the training data

18

Corpora

I OpenSubtitles: Crowdsourced movie subtitles

I Movie scripts from IMDB

I Fisher: Phone conversations

I Ubuntu dialogue corpus: Technical support forUbuntu-related problems

No corpora that fit our domain!!!

19

The user needs to be happy

I Avoid offensive language

I Avoid sensitive topics (politics, religion, sex)

I Be empathetic

20

Be empathetic

[User is looking for info on a stage 4 cancer]

BOT: Who is your favorite actor?USER: <says they have cancer>BOT: I like a few of his other films .USER: <mentions cancer again>BOT: Me too ! I like him in the notebook .

21


I Rule-basedI 3 Fully deterministicI 3 Output fully intelligibleI 7Very constrainedI 7Time-consuming, Difficult to maintainI 7Full of fallback strategies

I Machine-learningI 3Easy to maintainI 3Flexible, broader-coverageI 7 Non-deterministicI 7Constrained to the domain of the training data

22

What is ideal?

I A model that...I mostly machine-learning basedI feeds on clean data that is relevant to the task (what and

how the user wants it!)I maintainable from an engineering and financial perspectiveI outputs intelligible responses

23

What is ideal?

I A model that...I mostly machine-learning basedI feeds on clean data that is relevant to the task (what and

how the user wants it!)I maintainable from an engineering and financial perspectiveI outputs intelligible responses

24

Ask people!

I If you want to know what do people talk about and how theydo it, ask people.

I Two people conversing with each other on a topic

25

Ask people the Turkers!

I Crowdsourcing platform

I Create and upload a task (e.g. ‘have a conversation withanother user on a topic’)

I Have people around the world solve the task

I Collect data

https://pbs.twimg.com/profile_images/661394940816035840/1R9_KPHN.png

https://pbs.twimg.com/profile_images/661394940816035840/1R9_KPHN.png

26

Visual Dialogue(Abhishek et al., 2016)

27

However...

I Having two turkers to chat with each other requires goodtiming and a common ground (the image in VisDial)

E.g.A: Hey, have you seen Guardians of the Galaxy?B: NoA: Not your type I guess.B: Have you?A: I haveB: Sounds nice

I Costs double (when people two people at a time)

28

Self-dialogues

The Turker makes up a fictitious conversation

29

Self-dialogue: example

30

Self-dialogues, cont’d

I 3 Speed and set-up: takes less effort and waiting time togather data from a single user

I 3 Cost effectiveness: halves the cost; after an initial bulk,only sporadic updates to keep on track with trendy topics

I 3 Quality: the users is always an expert in what is talkingabout; knows about the entities introduced in the dialogues

I 3 Naturalness: the flow conversation is natural

I 7 Not 2-people conversations: further analysis (dialogueacts etc.) are hindered

31

Data collected

I 24,283 self-dialogues spread across 23 tasks.

I A peak of 2,307 conversations a day

I Total cost: US $17,947.54

You need a lot of $$$ for these tasks!

32

Data collected, cont’d

Topic/subtopic # Conversations # Words # Turns

Movies 4,126 814,842 82,018Action 414 37,037 4,140Comedy 414 36,401 4,140Fast & Furious 343 33,964 3,430Harry Potter 414 44,220 4,140Disney 2,331 232,573 23,287Horror 414 428,33 4,138Thriller 828 77,975 8,277Star Wars 1,726 178,351 17,260Superhero 414 40,967 4,140

Music 4,911 924,993 98,123Pop 684 62,383 6,840Rap / Hip-Hop 684 66,376 6,840Rock 684 63,349 6,837The Beatles 679 68,396 6,781Lady Gaga 558 49,313 5,566

Music and Movies 216 37,303 4,320

NFL Football 2,801 562,801 55,939

33

The system

34

System overview

35

A deterministic queue

I Queue of components: when a component fails, the next oneis called

1. EVI: a factoid Q&A component provided by Amazon2. Rule-based: deals with general chit-chat3. Edina’s likes and dislikes: a bit of personality (based on Wiki

views)4. Matching score: our main component. Retrieves the

most-likely answer from the self-dialogue database.5. Proactive: change the topic on its own volition6. Neural network: A generative neural network kicks in if

everything else fails.

36

A deterministic queue

I Queue of components: when a component fails, the next oneis called

1. EVI: a factoid Q&A component provided by Amazon2. Rule-based: deals with general chit-chat3. Edina’s likes and dislikes: a bit of personality (based on Wiki

views)4. Matching score: our main component. Retrieves the

most-likely answer from the self-dialogue database.5. Proactive: change the topic on its own volition6. Neural network: A generative neural network kicks in if

everything else fails.

37

Rule-based

I Bot’s identity: anonymized until the finals

I Edina’s favorites: favorite actor, artist, singer, etc.

I Sensitive topics: suicide, cancer, death as well as promptscontaining offensive contents that needed to be ‘gracefully’caught

I Topic shifting: deals with requests of topic shifting

I Games and jokes

I + a set of the most frequent prompts from Alexa users,provided by Amazon

38

Matching score

I Our main component

I Matches a user query q with the conversation contexts cof all potential responses from the pool of self-dialoguesgathered through AMT, to return the most likely response r(and a confidence score).

E.g.q: Have you seen Hidden Figures?

c−2: Any cool new movie?c−1: What about Hidden Figures?r : I thought Hidden Figures was very thin on the actualmathematics of it all. S: 0.87

39

Matching score - cont’d

I The matching score is an interpolation of bag-of-words,IDF-based scores (rare words are upweighted).

S(q, ri , ci ) =(Sc + Scr )(Sc)n

η+ λS2cq (1)

where Sc ,Scr , (Sc)n and S2cq are subscores and λ, η and nare constants.

40

Neural network

I Language model with multiplicative LSTM (Krause et al.,2017)

I Trained on OpenSubtitles and fine-tuned on our data

41

Evaluation

42

Evaluation

I Evaluating the usefulness of the matching score

I Qualitative evaluation

I Evaluations we haven’t done but we would like to do

43

Evaluating the matching score

I We sample conversation triplets from our self-dialogue poolA:It is my favorite Disney movie.B:I think Beauty and the Beast is my favorite.A:I think my favorite is ‘Sound of Music’

I We manually score the actual reply against what thematching score returns.

44

Evaluating the matching score - Results

Response type Mean human evaluation score

Human AMT worker 4.09± 0.15Matching score top 4, confidence > 0.7 3.60± 0.07Matching score top-4, confidence > 0.5 3.50± 0.08Matching score top-4 3.08± 0.09

Corr.: r = 0.259 (p < 0.0005)

45

Qualitative evaluation of conversation

B: What bands have you seen live?H: I’ve seen Sufjan Stevens, do you like them?B: What’s he like?H: Quite boring to be honestB: I know what you are talking about .H: Right?! What kind of bands have you seen live?B: Yeah! I saw Cake a few months ago, and just last week I sawThe Protomen.H: Really! How was Cake?B: Eating cake?H: No i’m not eating a cakeB: No, the band. Have you ever heard of them?H: No i haven’tB: She does more traditional type music, but I absolutely love theway she sings.H: I’ll have to check her outB: You should ! she ‘s the best singer for sure .

46

Things we haven’t done but we would have loved to do

I Assessing whether the entities mentioned in the self-dialoguesreflect the entities Alexa users talk about

I Using the scores from Alexa users to tune our system

Evaluating open-domain chatbots is difficult!

47

Conclusion

48

Conclusion

I Open-domain conversational AI is hard and still a (very) openproblem

I Data collection/annotation is a real challenge, butself-dialogues are efficient and surprisingly effective.

I Evaluation is still an open problem

I An hybrid-system is a reasonable solution for this challenge(rule-based x machine-learning x IR)

49

Final remarks

I We got 6th place (out of 15 team)

I ...despite being the underdogs of the competition

I Teamwork is hard but pays off!

50

What’s next?

51

https://imgflip.com/i/1yie8f

https://imgflip.com/i/1yie8f

Edina: Building an Open-Domain Socialbot using Self-Dialogues · 1 Edina: Building an Open-Domain...

Documents

Transcript of Edina: Building an Open-Domain Socialbot using Self-Dialogues · 1 Edina: Building an Open-Domain...