Visual Dialog Presentation - University of Torontofidler/slides/2017/CSC2539/Sayyed_slides.pdfbot...
Transcript of Visual Dialog Presentation - University of Torontofidler/slides/2017/CSC2539/Sayyed_slides.pdfbot...
WhatisaDialogSystem?• Adialogsystemisamachine(computersystem)withthegoalofconversingwithhumanwithalogicalstructure.
• Thecommunicationwithmachinecanbedonethroughtext,speech,gesture andsoon.
• ANaturalDialogSystemisaformofdialogsystemthattriestoimproveusabilityandusersatisfactionbyimitatinghumanbehaviour.(Berg,2014)
• Turingtest:amachine'sabilityto exhibitintelligentbehaviour equivalentto,orindistinguishablefrom,thatofahuman.
DialogSystem(Chatbot)
Voice
Text
TypesofDialogSystem• Goal-orientedagents: itneedstounderstandtheuserinputandcompletearelatedtaskwithacleargoalwithinalimitednumberofdialogturns.
• Finite-State:Restaurantreservation,airlinebooking,…
• ActiveOntalogy/FrameBased:Personalassistsant,SIRI,Alexa,GoogleNow
• Chatbots:generalconversationwithawidescope
• Chit-chatting• Entertainment• Examples:ELISA,ALICE,APRRY,…
Goal-orientedAgent
KnowledgeBase
ExternalSystems
APICalls
Finite-StateDialog• Aseriesofquestionstobeansweredbyuser
• Fullcontroloftheconversationbythesystem
• Ignoringanyunrelatedanswers
• Simpletobuildandgoodforsimpletasks
• Onlyoneinformationatatime
• Verypracticalbutnotanaturaldialog
From:DanJurafsky slides
ShowmeallChinese
restaurantsinToronto.
ActiveOntology/FrameBased• Morenaturalconversationwithmixed-initiative(Conversationinitiativeshiftsbetweentheuserandthesystem)
• Usercanaskmultiplequestionsorgivemultipleinformationinonesentence
• UsingFrameandSlots:onceallmandatoryslotsinaframearefilled,itwillgeneratequerytoaknowledgebaseorexternalsystems.
• UsingNaturalLanguageUnderstandingtoextractslotsfromsentences(MLcanbeused).
Iwanttobookaflight
fromTorontotoLondon
onTuesdayMorning
LIST
LISTTYPE
CUISINE
LOCATION
BOOKING TYPE
FROM TO
DATE TIME
Sometextsfrom:DanJurafsky slides
ActiveOntology/FrameBased- continued
Voice Synthesis
Voice Synthesis
VoiceRecognition
DialogManagement
ActionSelection
LanguageUnderstanding
SessionContext
KnowledgeBase
Complete?VoiceInput
ClarifyingQuestion
BestOutcome
TextInput
Yes
No
SemanticInterpretation
MissingSlots
InferredUserInput
Basedonafigure fromJeromeBellegarda
Example:AmazonAlexa
AmazonEcho
AmazonAlexaservice
Voice
Utterances
CustomSkillService
AmazonDeveloperPortal
AmazonEchoAppMarket
RegisterSkill
SampleUtterances
IntentSchema
Skills
SlotsandSlotTypes
DB
ExternalSystems
Example:AmazonAlexa• Skills arevoiceenabledapps• ForeveryIntent wedefineasmanyaspossiblesampleutterances
• Sampleutterancescanhaveslots inthem
• Slotsarecategorizedbyslottypes
• Therearebuilt-inintentstostart orstop askilloraskforhelp.
SlotType“FAACODES”:AAC,AAF,AAHAAI,…
IntentSchema:{“intent”:“airportInfoIntent”,”slots”:[{“name”:“AIRPORTCODE”,“type”:“FAACODES”
}]}
SampleUtterances:airportInfoIntent {AIRPORTCODE}airportInfoIntent airportinto{AIRPORTCODE}airportInfoIntent flightdelay{AIRPORTCODE}airportInfoIntent info{AIRPORTCODE}airportInfoIntent flightstatus{AIRPORTCODE}airportInfoIntent airport{AIRPORTCODE}airportInfoIntent flightinfo{AIRPORTCODE}…
GeneralChatbotsRule-based:• Basedonpatternmatching(AIML,ChatScript,Regex,…)
• Usingmentalmodels• Threedifferentmemory:
• Utterance• Session• Global
Corpus-based:• Largecorpusdata• Deepneuralnetworks• Informationretrieval(mine
conversationsandretrievesimilarresponses)
<pattern>HELLO</pattern><random><li>Howdoyoudo.Pleasestateyourproblem..</li><li>Hi.Whatseemstobeyourproblem?.</li></random>
SamplePatterns(ELIZA)
<pattern>YOUARE*</pattern><random><li>WhatmakesyouthinkIam<star/>?.</li><li>DoesitpleaseyoutobelieveIam<star/>?.</li><li>Doyousometimeswishyouwere<star/>?.</li><li>Perhapsyouwould liketobe<star/>..</li></random>
Sometextsfrom:DanJurafsky slides
ANeuralConversationModel(O.Vinyals,Q.V.Le2015)• Sequencetosequence(Seq2seq)mappingusingrecurrentneuralnetworks(readstheinputsequenceonetokenatatime,andpredictstheoutputsequence,alsoonetokenatatime)
• Duringtraining,thetrueoutputsequenceisgiventothemodel
• Themodelistrainedtomaximizethecrossentropyofthecorrectsequencegivenitscontext
• Duringinference,thetrueoutputsequenceisnotobserved,simplyfeedthepredictedoutputtokenasinputtopredictthenextoutput (“greedy”inference)orusebeamsearch,andfeedseveralcandidatesattheprevioussteptothenextstep.Selectionbasedontheprobabilityofthesequence
ANeuralConversationModel– sampleoutput
Describeyourproblem:IforgotmypasswordMachine:hi,thisis<NAME>fromhelpdeskconnect,whatcani helpyouwithtoday?Human:hi,i forgotmypasswordMachine: okgreatHuman: canyoutellmehowi canrecoverit?Machine: i ’ll needtoverifywhoitisatthatmachineMachine: canwedoahangoutMachine: ?
Human:yesMachine: <anonymizedhangoutinstructionsandverifications>Human:cool,i amgoodnowMachine:how’sthislookinggood?Machine:i ’ll goaheadandclosethischat.Machine: takecare.:)Human:byeMachine: bye
• TrainedonITHelpdeskTroubleshootingdataset(Training:30Mtokens,Validation:3Mtokens)
• TrainedasinglelayerLSTMwith1024memorycellsusingstochasticgradientdescentwithgradientclipping.Thevocabularyconsistsofthemostcommon20Kwords,whichincludesspecialtokensindicatingturntakingandactor.
End-To-EndDialogSystemsUsingGenerativeHierarchicalNeuralNetworkModels(I.V.Serban et.al.2016)
• ThreeRNNmodules• EncoderRNN(encodingeachsub-sequenceintoavector)
• ContextRNN(encodesallprevioussub-sequencesintoavector)
• DecoderRNN(generatesthenextsub-sequence) * The randomness injected by the variable z corresponds to higher-level
decisions, like topic or sentiment of the sentence.
End-to-EndGoal-OrientedDialog(A.Bordes et.al2017)• Theworkhorseoftraditionaldialogsystemsisslot-filling• End-to-enddialogsystems,usuallybasedonneuralnetworks,shownpromisingperformanceinnongoal-orientedchit-chatsettings,wheretheyweretrainedtopredictthenextutteranceinsocialmediaandforumthreads
• Conductinggoal-orienteddialogrequiresskillsthatgobeyondlanguagemodeling,e.g.,askingquestionstoclearlydefineauserrequest,queryingKnowledgeBases(KBs),interpretingresultsfromqueries todisplayoptionstousersorcompletingatransaction
• Thepapershows:end-to-enddialogsystembasedonMemoryNetworkscanreachpromising,yetimperfect,performanceandlearntoperformnon-trivialoperations
End-to-EndGoal-OrientedDialog
Goal-orienteddialogtasks:• Auser(ingreen)chatswithabot(inblue)tobookatableatarestaurant.ModelsmustpredictbotutterancesandAPIcalls(indarkred).Task1teststhecapacityofinterpretingarequestandaskingtherightquestionstoissueanAPIcall.
• Task2checkstheabilitytomodifyanAPIcall.
• Task3and4testthecapacityofusingoutputsfromanAPIcall(inlightred)toproposeoptions(sortedbyrating)andtoprovideextra-information.
• Task5combineseverything.
End-to-EndGoal-OrientedDialog- results
Dataextractedfromarealonlineconciergeserviceperformingrestaurantbooking
Synthetic(generated)dataset
VisualDialog(A.Daset.al.2016)ComputerVisionandArtificialIntelligenceTrends:• Imageclassification• Scenerecognition• Objectdetection• Learningtoplayvideogames• ImageandvideoQA
What’sNext?• VisualDialog:Abilitytoholdameaningfuldialogwithhumansinnaturallanguageaboutvisualcontent
VisualDialog– PotentialApplications• AidingvisuallyimpairedusersinunderstandingtheirsurroundingsorsocialmediacontentAI:‘JohnjustuploadedapicturefromhisvacationinHawaii’,Human:‘Great,isheatthebeach?’,AI:‘No,onamountain’
• AidinganalystsinmakingdecisionsbasedonlargequantitiesofsurveillancedataHuman:‘Didanyoneenterthisroomlastweek?’,AI:‘Yes,27instancesloggedoncamera’,Human:‘Wereanyofthemcarryingablackbag?’
• InteractingwithanAIassistantHuman:‘Alexa– canyouseethebabyinthebabymonitor?’,AI:‘Yes,Ican’,Human:‘Ishesleepingorplaying?’
• Roboticsapplications(e.g.searchandrescuemission)Human:‘Istheresmokeinanyroomaroundyou?’,AI:‘Yes,inoneroom’,Human:‘Gothereandlookforpeople’
VisualDialogvs.DialogSystem• VisualDialogTask (visualanalogueoftheTuringTest): givenanimageI,ahistoryofadialogconsistingofasequenceofquestion-answerpairs,andanaturallanguagefollow-upquestion,thetaskforthemachineistoanswerthequestioninfree-formnaturallanguage.
• VisualDialogismorespecificthanageneralchatbot becausethedialogisaboutaspecificimage.
• VisualDialogisnot gearedtowardaspecificgoal(similartogoal-drivendialogsystems).Thereforeslot-fillingmethodswon’twork.
VisualDialogDataset– DataCollection• Gooddataforthistaskshouldincludedialogsthathave:
• Temporalcontinuity• Groundingintheimage• Mimicnatural‘conversational’exchanges
• CollectedvisualdialogdataonimagesfromtheCommonObjectsinContext(COCO)dataset,whichcontainsmultipleobjectsineverydayscenes.
• Freeform,open-ended naturallanguagequestionscollectedviatwoworkers chattingonAmazonMechanicalTurk(AMT)real-time
• The‘questioner’seesonlyasinglelineoftextdescribinganimage(captionfromCOCO);theimageremainshiddentothequestioner.
• Theirtaskistoaskquestionsaboutthishiddenimagesoasto‘imaginethescenebetter’
• The‘answerer’seestheimageandthecaption.Theirtaskistoanswerthequestionsaskedbytheirchatpartner.
VisualDialogDataset– Analysis• One dialog(10 question-answerpairs)on68k imagesfromCOCO(58k trainand10k val ),oratotalof680,000 QApairs
• Morenatural conversationcomparingtootherimageQAdatasetsbecausethequestionerdoesn’tseetheimage(novisualprimingbias)
• Highermean-lengthofanswers(3.1words)andlessbinaryanswers(e.g.‘Yes’,‘No’)
• Coreference indialog: 38%ofquestions,22%ofanswers,andnearlyall(99%)dialogscontainatleastonepronoun
• TemporalContinuityinDialogTopics: basedonhumanevaluationonsmaples,across10rounds,VisDial questionhave4:55+- 0:17topicsonaverage,confirmingthatthesearenotindependentquestions
VisualDialog- EvaluationProtocol• Evaluateindividualresponsesindependentlyateachround(t=1,2,…,10)inaretrievalormultiple-choicesetup
INPUT
OUTPUT
• Themodelisevaluatedonretrievalmetrics:
• rankofhumanresponse• recall@k ,i.e .existenceofthehumanresponseintop-krankedresponses
• meanreciprocalrank(MRR)ofthehumanresponse
• CandidateAnswers:ground-truth,answersto50similarquestions,30mostpopularanswers,19randomanswers
NeuralVisualDialogModels• Experimentedwiththeencoder-decodercombinations
• Encoders:convertinputs(I,H,Qt)intoajointrepresentation• Inallcases,werepresentIviathel2-normalizedactivationsfromthepenultimatelayerofVGG-16
• ForeachencoderE,weexperimentwithallpossibleablatedversions:E(Qt),E(Qt,I),E(Qt,H),E(Qt,I,H)
• Decoders:rank candidateanswers based on the jointrepresentationfromencoders
• Generative (LSTM)and Discriminative (Softmax)
VisualDialog- Encoders• LateFusion(LF)Encoder:
• TreatHasalongstringwiththeentirehistory(H0,…,Ht-1)concatenated.• Qt andHareseparatelyencodedwith2differentLSTMs• individualrepresentationsofparticipatinginputs(I,H,Qt)areconcatenatedandlinearlytransformedtoadesiredsizeofjointrepresentation.
• HierarchicalRecurrentEncoder(HRE):• Similararchitectureas‘HierarchicalLatentVariableEncoder-DecoderModel’
• MemoryNetwork(MN)Encoder:• EncodeQt withanLSTMtogeta512-dvector• encodeeachpreviousroundofhistory(H0,…,Ht-1)withanotherLSTMtogetatx512matrix.
• Computeinnerproductofquestionvectorwitheachhistoryvectortogetscoresoverpreviousrounds,whicharefedtoasoftmax togetattention-over-historyprobabilities.
VisualDialog– HierarchicalRecurrentEncoder
EachQA-pairindialoghistoryisindependentlyencodedbyanotherLSTMwithsharedweights
Theimage-questionrepresentation,computedforeveryroundfrom1throught,isconcatenatedwithhistoryrepresentationfromthepreviousroundandconstitutesasequenceofquestion-historyvectors
VisualDialog– MemoryNetwork Afully-connectedlayerandtanhnonlinearity
computeattentionoverthetfactsbyinnerproduct
VisualDialog- Decoders• Generative(LSTM):
• EncodedvectorissetastheinitialstateoftheLSTMlanguagemodel• Maximizesthelog-likelihoodofthegroundtruthanswersequencegivenitscorrespondingencodedrepresentation(trainedend-to-end)
• Usesthemodel’sloglikelihood scoresandrankcandidateanswers
• Discriminative(softmax):• ComputesdotproductsimilaritybetweentheinputencodingandanLSTMencodingofeachoftheansweroptions
• Thedotproductsarefedintoasoftmax tocomputetheposteriorprobabilityovertheoptions
• Maximizesthelogliklihoodofthecorrectoptionsandoptionsaresimplyrankedbasedontheirposteriorprobabilities.
VisualDialog– GenerativeDecoder
LSTM
EncodedVector
W1
LSTM
W2
W1
LSTM
Wn
Wn-1
• EncodedvectorissetastheinitialstateoftheLSTMlanguagemodel• Maximizesthelog-likelihoodofthegroundtruthanswersequencegivenitscorrespondingencodedrepresentation(trainedend-to-end)
• Usesthemodel’sloglikelihood scoresandrankcandidateanswers
VisualDialog– DiscriminativeDecoder
• ComputesdotproductsimilaritybetweentheinputencodingandanLSTMencodingofeachoftheansweroptions
• Thedotproductsarefedintoasoftmax tocomputetheposteriorprobabilityovertheoptions• Maximizesthelogliklihood ofthecorrectoptionsandoptionsaresimplyrankedbasedontheir
posteriorprobabilities
LSTM
EncodedVector
A1
LSTM
EncodedVector
A100
Softmax
PosteriorProbabilities
VisualDialog- Conclusions• Demonstratedthefirstvisualchatbot.• Theresultsandanalysisindicatesthatthereissignificantscopeforimprovement,theauthorsbelievethistaskcanserveasatestbed formeasuringprogresstowardsvisualintelligence.
PotentialImprovements:• Usingamodeltogenerateresponsesratherthanrankingcandidateanswers
• Includelanguagefeatures(e.g.part-of-speech)astheinputs• Extendittovideos