Open-ended Visual Question-Answering
[thesis][web][code]
Issey Masuda Mora, Santiago Pascual de la Puente, Xavier Giró i Nieto
Roadmap
Introduction, Related Work, Methodology, Results, Conclusions, Future Work
Introduction
Visual Question-Answering
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).
Task: predict the answer to a given question about an image
Visual Question-Answering: Types
Image types: Real images, Abstract scenes
Open-ended
Q: Does it appear to be rainy?
A: no
Q: What is just under the tree?
A: a ball
Multi-Choice
Q: How many slices of pizza are there?
A: 1, 2, 3, 4
Q: What is for dessert?
A: cake, ice cream, cheesecake, pie
Example
Question: What is bobbing in the water other than the boats?
Answer: buoys
Motivation
New visual Turing test
Motivation: AI research
● Multidisciplinary tasks
● Models able to perform more complex activities
● Different sub-problems tackled at once
Computer Vision
Knowledge Representation and Reasoning
Natural Language Processing
Related Work
Deep Learning
Credit: Google
VQA: Common approach
Visual representation: CNN
Textual representation: Word/sentence embedding + LSTM
Merge both representations, then predict the answer
Question: What object is flying?
Answer: Kite
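The pipeline on this slide can be sketched in a few lines of plain Python. This is only an illustrative toy, not the model from the thesis: the feature vectors stand in for the CNN and LSTM outputs, the merge is assumed to be a concatenation (the slide does not say which merge is used), the answer is assumed to be picked by a softmax classifier, and the names `merge`, `predict_answer`, `weights` and `answers` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge(visual, textual):
    """Assumption: fuse both representations by concatenation."""
    return visual + textual


def predict_answer(visual, textual, weights, answers):
    """Score every candidate answer with one linear layer over the fused vector."""
    fused = merge(visual, textual)
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs


# Toy stand-ins: real systems would use CNN features and an LSTM state here.
visual = [0.9, 0.1]   # hypothetical image features
textual = [0.2, 0.8]  # hypothetical question features
answers = ["kite", "dog", "car"]
weights = [
    [1.0, 0.0, 0.0, 1.0],  # row scoring "kite"
    [0.0, 1.0, 1.0, 0.0],  # row scoring "dog"
    [0.5, 0.5, 0.0, 0.0],  # row scoring "car"
]

answer, probs = predict_answer(visual, textual, weights, answers)
print(answer)
```

Other merge operations, such as an element-wise sum or product, would slot into `merge` without changing the rest of the sketch.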
Tools: Convolutional Neural Networks (CNN)
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
AlexNet
Tools: Word and Sentence embeddings
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
Experiments from: Socher et al. (2013b) and Collobert et al. (2011)
King - Man + Woman = Queen
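The analogy above can be reproduced with vector arithmetic and a nearest-neighbour search. The sketch below uses hand-made two-dimensional vectors (roughly "royalty" and "gender") purely for illustration; real embeddings are learned and have hundreds of dimensions. `TOY_VECTORS` and `analogy` are hypothetical names.

```python
import math

# Hand-made toy vectors: (royalty, gender). These are assumptions, not learned values.
TOY_VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("king", "man", "woman", TOY_VECTORS))
```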
Tools: Long Short-Term Memory networks (LSTM)
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
Methodology
First steps: Text-based QA
Extending text-based QA for VQA
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Substitute VGG-16 with KCNN
Liu, Z. (2015). Kernelized Deep Convolutional Neural Network for Describing Complex Images. arXiv preprint arXiv:1509.04581.
Sentence embedding and image projection
Image
Question
Answer
Results
VQA Dataset: Real Images, Open-ended questions
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. ICCV 2015.
1 image x 3 questions x 10 answers per question
Evaluation
Metric: [formula shown on the slide]
Script:
● Characters to lowercase
● Remove periods (unless decimal periods)
● Number words to digits
● Remove articles
● Add apostrophe to contractions
● Replace punctuation with space
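The normalization steps listed above, together with the accuracy defined in Antol et al. (2015), where an answer scores min(number of human annotators who gave that answer / 3, 1), can be sketched as follows. This is a simplified illustration and not the official evaluation script: the lookup tables are small hypothetical subsets, the order of the steps is an assumption, and the official script may differ in details.

```python
import re

# Illustrative subsets only; a full script would need far larger tables.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8",
                "nine": "9", "ten": "10"}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "isnt": "isn't", "cant": "can't", "wont": "won't"}


def normalize_answer(answer):
    """Apply the normalization steps listed on the slide (simplified)."""
    text = answer.lower()
    # Remove periods unless they sit between two digits (decimal periods).
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", "", text)
    # Replace the remaining punctuation with a space, keeping apostrophes
    # and the decimal periods that survived the previous step.
    text = re.sub(r"[^\w\s'.]", " ", text)
    words = []
    for word in text.split():
        word = NUMBER_WORDS.get(word, word)          # number words to digits
        if word in ARTICLES:                         # remove articles
            continue
        words.append(CONTRACTIONS.get(word, word))   # add apostrophes
    return " ".join(words)


def vqa_accuracy(predicted, human_answers):
    """Accuracy of one prediction: min(matching human answers / 3, 1)."""
    prediction = normalize_answer(predicted)
    matches = sum(1 for h in human_answers if normalize_answer(h) == prediction)
    return min(matches / 3.0, 1.0)


humans = ["buoys", "buoys"] + ["boats"] * 8
print(normalize_answer("The Two dogs."), vqa_accuracy("Buoys.", humans))
```

With ten human answers per question, a prediction gets full credit as soon as three annotators agree with it, and partial credit below that.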
VQA Challenge
53.62% accuracy, CVPR 2016 VQA Challenge
Real Images, Open-ended, test-standard dataset partition
Results in detail
Accuracy (%)   Validation set                     Test set
Model          Yes/No  Number  Other  Overall     Yes/No  Number  Other  Overall
Model 1        71.82   23.79   27.99  43.87       71.62   28.76   29.32  46.70
Model 3        75.02   28.60   29.30  46.32       -       -       -      -
Model 2        75.62   31.81   28.11  46.36       -       -       -      -
Model 5        78.15   32.79   33.91  50.32       78.15   36.20   35.26  53.03
Model 4        78.73   32.82   35.50  51.34       78.02   35.68   36.54  53.62
Results in context
Accuracy on a 0% to 100% scale:
Humans: 83.30%
UC Berkeley & Sony: 66.47%
Baseline LSTM&CNN: 54.06%
Baseline Nearest neighbor: 42.85%
Baseline Prior per question type: 37.47%
Baseline All yes: 29.88%
Ours: 53.62%
Comparison with the baseline
Our model
● Single word answer
● Generate answers
Baseline
● Multi word answers (hardcoded)
● Classify over the 1000 most common answers
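To make the baseline's side of this comparison concrete, the sketch below builds a fixed answer vocabulary from the most frequent training answers, which is what "classify over the 1000 most common answers" implies. It is a minimal illustration under assumptions: the function names and the toy answer list are hypothetical, and k is reduced to 2 so the effect is visible.

```python
from collections import Counter


def build_answer_vocabulary(training_answers, k=1000):
    """Keep the k most frequent answers; each one becomes a class."""
    counts = Counter(training_answers)
    return [answer for answer, _ in counts.most_common(k)]


def answer_to_class(answer, vocabulary):
    """Return the class index of an answer, or None if it is outside the vocabulary."""
    try:
        return vocabulary.index(answer)
    except ValueError:
        return None


# Hypothetical toy data; a real vocabulary would come from the training split.
toy_answers = ["yes", "yes", "no", "2", "yes", "no", "frisbee"]
vocabulary = build_answer_vocabulary(toy_answers, k=2)

print(vocabulary)
print(answer_to_class("frisbee", vocabulary))
```

An answer that falls outside the vocabulary has no class at all, so a classifier of this kind can never output it. A model that generates answers is not bound by a fixed list in the same way.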
Qualitative results: I
Qualitative results: II
Deep Python Project
https://github.com/imatge-upc/vqa-2016-cvprw
Research contribution: Extended abstract
VQA workshop, CVPR 2016
Research contribution: Extended abstract - Poster
… ticket to Las Vegas
Presenting our poster and extended abstract at CVPR 2016, Las Vegas, USA
VQA Challenge statistics: Answering method
Conclusions
Conclusion: Goals accomplished
✓ Present to VQA Challenge, CVPR 2016
✓ First GPI project using text processing techniques
✓ Create a scalable VQA model
✓ Build a modular and reusable software package
✓ Extended abstract accepted to VQA workshop, CVPR 2016
Conclusion: Personal overview
● Submission to VQA Challenge
● VQA, hot topic at CVPR 2016
● Model designed to generate answers instead of classifying them
● Question-Answer pair generation proposal
Future Work
Future work
Next steps
● Decoder for multiple word answers
● Character embedding
● Attention mechanisms
● Question-Answer pairs generation
Automatic Question-Answer Pairs Generation
Thank you!
Do you have any questions?
Project resource links
● Thesis: https://imatge.upc.edu/web/sites/default/files/pub/xMasuda-Mora_0.pdf
● Web page: http://imatge-upc.github.io/vqa-2016-cvprw/
● Source code: https://github.com/imatge-upc/vqa-2016-cvprw
Motivation: First steps towards QA Generation
AI System
Question: What is the man doing?
Answer: Surf
VQA: Counterexample
Dynamic Parameter Prediction Network (DPPnet)
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016
Experiments: Batch Normalization
Losses I
Losses II
Losses III
VQA Challenge statistics: Image modelling
VQA Challenge statistics: Question modelling