6653 Parkwood Rd, Edina, MN 55436 : Parkwood Knolls by RE/MAX Results Edina - Kris Lindahl
Edina: Building an Open-Domain Socialbot using Self-Dialogues · 1 Edina: Building an Open-Domain...
Transcript of Edina: Building an Open-Domain Socialbot using Self-Dialogues · 1 Edina: Building an Open-Domain...
1
Edina: Building an Open-Domain
Socialbot using Self-Dialogues
ILCC, School of Informatics, University of Edinburgh
2
Conversational AI is everywhere
http://static4.uk.businessinsider.com/image/581ca089dd08954b518b45b6-1190-625/
we-put-siri-alexa-google-assistant-and-cortana-through-a-marathon-of-tests-to-see-whos-winning-\
the-virtual-assistant-race--heres-what-we-found.jpg
3
2016: The year of the chatbot
from ‘Tracxn Research, Chatbot Startup Landscape’, June 2016
4
Chatbot Applications
I Customer service
I IoT
I Other: help people with disabilities, etc.
5
Amazon vs. Google vs. Microsoft
https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E
https://www.bhphotovideo.com/images/images2500x2500/google_ga3a00417a14_home_1297281.jpg
https://blogs.msdn.microsoft.com/ukhe/2015/09/15/student-survival-tips-from-cortana/
6
Amazon Alexa Prize
I Goal: to build on open-domain conversation AI forcommercial purposes
I Currently, Alexa mostly is mostly rule-based (skills)
I 18 teams involved (12 sponsored by Amazon)
I Users in the U.S. evaluate the conversation with bot on ascale from 1 to 5
7
Our team
8
The problem(s)
9
Where do we start?
I How do we build a chatbot?I No idea!I Let’s look at previous work!
10
Rule-based bots: Mitsuku (try it at mistuku.com!)
11
Rule-based vs. Machine-learning
I Rule-basedI 3 Fully deterministicI 3 Output fully intelligibleI 7 Very constrainedI 7 Time-consuming, Difficult to maintainI 7 Full of fallback strategies
12
Machine-learning methods: Neural Networks
13
Rule-based vs. Machine-learning
I Rule-basedI 3Fully deterministicI 3Output fully intelligibleI 7Very constrainedI 7Time-consuming, Difficult to maintainI 7Full of fallback strategies
I Machine-learningI 3 Easy to maintainI 3 Flexible, broader-coverageI 7 Non-deterministicI 7 Constrained to the domain of the training data
14
Where do we start?
I How do we build a chatbot?I No idea!I Let’s look at previous work!
I What does Amazon want?I Open-domainI The user needs to be happy!!!
15
Open-domain
16
Rule-based vs. Machine-learning
I Rule-basedI 3Fully deterministicI 3Output fully intelligibleI 7 Very constrainedI 7Time-consuming, Difficult to maintainI 7Full of fallback strategies
I Machine-learningI 3 Easy to maintainI 3 Flexible, broader-coverageI 7Non-deterministicI 7Constrained to the domain of the training data
17
Rule-based vs. Machine-learning
I Rule-basedI 3Fully deterministicI 3Output fully intelligibleI 7Very constrainedI 7Time-consuming, Difficult to maintainI 7Full of fallback strategies
I Machine-learningI 3Easy to maintainI 3Flexible, broader-coverageI 7Non-deterministicI 7 Constrained to the domain of the training data
18
Corpora
I OpenSubtitles: Crowdsourced movie subtitles
I Movie scripts from IMDB
I Fisher: Phone conversations
I Ubuntu dialogue corpus: Technical support forUbuntu-related problems
No corpora that fit our domain!!!
19
The user needs to be happy
I Avoid offensive language
I Avoid sensitive topics (politics, religion, sex)
I Be empathetic
20
Be empathetic
[User is looking for info on a stage 4 cancer]
BOT: Who is your favorite actor?USER: <says they have cancer>BOT: I like a few of his other films .USER: <mentions cancer again>BOT: Me too ! I like him in the notebook .
21
Rule-based vs. Machine-learning
I Rule-basedI 3 Fully deterministicI 3 Output fully intelligibleI 7Very constrainedI 7Time-consuming, Difficult to maintainI 7Full of fallback strategies
I Machine-learningI 3Easy to maintainI 3Flexible, broader-coverageI 7 Non-deterministicI 7Constrained to the domain of the training data
22
What is ideal?
I A model that...I mostly machine-learning basedI feeds on clean data that is relevant to the task (what and
how the user wants it!)I maintainable from an engineering and financial perspectiveI outputs intelligible responses
23
What is ideal?
I A model that...I mostly machine-learning basedI feeds on clean data that is relevant to the task (what and
how the user wants it!)I maintainable from an engineering and financial perspectiveI outputs intelligible responses
24
Ask people!
I If you want to know what do people talk about and how theydo it, ask people.
I Two people conversing with each other on a topic
25
Ask people the Turkers!
I Crowdsourcing platform
I Create and upload a task (e.g. ‘have a conversation withanother user on a topic’)
I Have people around the world solve the task
I Collect data
https://pbs.twimg.com/profile_images/661394940816035840/1R9_KPHN.png
26
Visual Dialogue(Abhishek et al., 2016)
27
However...
I Having two turkers to chat with each other requires goodtiming and a common ground (the image in VisDial)
E.g.A: Hey, have you seen Guardians of the Galaxy?B: NoA: Not your type I guess.B: Have you?A: I haveB: Sounds nice
I Costs double (when people two people at a time)
28
Self-dialogues
The Turker makes up a fictitious conversation
29
Self-dialogue: example
30
Self-dialogues, cont’d
I 3 Speed and set-up: takes less effort and waiting time togather data from a single user
I 3 Cost effectiveness: halves the cost; after an initial bulk,only sporadic updates to keep on track with trendy topics
I 3 Quality: the users is always an expert in what is talkingabout; knows about the entities introduced in the dialogues
I 3 Naturalness: the flow conversation is natural
I 7 Not 2-people conversations: further analysis (dialogueacts etc.) are hindered
31
Data collected
I 24,283 self-dialogues spread across 23 tasks.
I A peak of 2,307 conversations a day
I Total cost: US $17,947.54
You need a lot of $$$ for these tasks!
32
Data collected, cont’d
Topic/subtopic # Conversations # Words # Turns
Movies 4,126 814,842 82,018Action 414 37,037 4,140Comedy 414 36,401 4,140Fast & Furious 343 33,964 3,430Harry Potter 414 44,220 4,140Disney 2,331 232,573 23,287Horror 414 428,33 4,138Thriller 828 77,975 8,277Star Wars 1,726 178,351 17,260Superhero 414 40,967 4,140
Music 4,911 924,993 98,123Pop 684 62,383 6,840Rap / Hip-Hop 684 66,376 6,840Rock 684 63,349 6,837The Beatles 679 68,396 6,781Lady Gaga 558 49,313 5,566
Music and Movies 216 37,303 4,320
NFL Football 2,801 562,801 55,939
33
The system
34
System overview
35
A deterministic queue
I Queue of components: when a component fails, the next oneis called
1. EVI: a factoid Q&A component provided by Amazon2. Rule-based: deals with general chit-chat3. Edina’s likes and dislikes: a bit of personality (based on Wiki
views)4. Matching score: our main component. Retrieves the
most-likely answer from the self-dialogue database.5. Proactive: change the topic on its own volition6. Neural network: A generative neural network kicks in if
everything else fails.
36
A deterministic queue
I Queue of components: when a component fails, the next oneis called
1. EVI: a factoid Q&A component provided by Amazon2. Rule-based: deals with general chit-chat3. Edina’s likes and dislikes: a bit of personality (based on Wiki
views)4. Matching score: our main component. Retrieves the
most-likely answer from the self-dialogue database.5. Proactive: change the topic on its own volition6. Neural network: A generative neural network kicks in if
everything else fails.
37
Rule-based
I Bot’s identity: anonymized until the finals
I Edina’s favorites: favorite actor, artist, singer, etc.
I Sensitive topics: suicide, cancer, death as well as promptscontaining offensive contents that needed to be ‘gracefully’caught
I Topic shifting: deals with requests of topic shifting
I Games and jokes
I + a set of the most frequent prompts from Alexa users,provided by Amazon
38
Matching score
I Our main component
I Matches a user query q with the conversation contexts cof all potential responses from the pool of self-dialoguesgathered through AMT, to return the most likely response r(and a confidence score).
E.g.q: Have you seen Hidden Figures?
c−2: Any cool new movie?c−1: What about Hidden Figures?r : I thought Hidden Figures was very thin on the actualmathematics of it all. S: 0.87
39
Matching score - cont’d
I The matching score is an interpolation of bag-of-words,IDF-based scores (rare words are upweighted).
S(q, ri , ci ) =(Sc + Scr )(Sc)n
η+ λS2cq (1)
where Sc ,Scr , (Sc)n and S2cq are subscores and λ, η and nare constants.
40
Neural network
I Language model with multiplicative LSTM (Krause et al.,2017)
I Trained on OpenSubtitles and fine-tuned on our data
41
Evaluation
42
Evaluation
I Evaluating the usefulness of the matching score
I Qualitative evaluation
I Evaluations we haven’t done but we would like to do
43
Evaluating the matching score
I We sample conversation triplets from our self-dialogue poolA:It is my favorite Disney movie.B:I think Beauty and the Beast is my favorite.A:I think my favorite is ‘Sound of Music’
I We manually score the actual reply against what thematching score returns.
44
Evaluating the matching score - Results
Response type Mean human evaluation score
Human AMT worker 4.09± 0.15Matching score top 4, confidence > 0.7 3.60± 0.07Matching score top-4, confidence > 0.5 3.50± 0.08Matching score top-4 3.08± 0.09
Corr.: r = 0.259 (p < 0.0005)
45
Qualitative evaluation of conversation
B: What bands have you seen live?H: I’ve seen Sufjan Stevens, do you like them?B: What’s he like?H: Quite boring to be honestB: I know what you are talking about .H: Right?! What kind of bands have you seen live?B: Yeah! I saw Cake a few months ago, and just last week I sawThe Protomen.H: Really! How was Cake?B: Eating cake?H: No i’m not eating a cakeB: No, the band. Have you ever heard of them?H: No i haven’tB: She does more traditional type music, but I absolutely love theway she sings.H: I’ll have to check her outB: You should ! she ‘s the best singer for sure .
46
Things we haven’t done but we would have loved to do
I Assessing whether the entities mentioned in the self-dialoguesreflect the entities Alexa users talk about
I Using the scores from Alexa users to tune our system
Evaluating open-domain chatbots is difficult!
47
Conclusion
48
Conclusion
I Open-domain conversational AI is hard and still a (very) openproblem
I Data collection/annotation is a real challenge, butself-dialogues are efficient and surprisingly effective.
I Evaluation is still an open problem
I An hybrid-system is a reasonable solution for this challenge(rule-based x machine-learning x IR)
49
Final remarks
I We got 6th place (out of 15 team)
I ...despite being the underdogs of the competition
I Teamwork is hard but pays off!
50
What’s next?