Open-ended Visual Question-Answering

Issey Masuda Mora, Santiago Pascual de la Puente, Xavier Giró i Nieto

Thesis: https://imatge.upc.edu/web/sites/default/files/pub/xMasuda-Mora_0.pdf

Web: http://imatge-upc.github.io/vqa-2016-cvprw/

Code: https://github.com/imatge-upc/vqa-2016-cvprw

Roadmap

● Introduction
● Related Work
● Methodology
● Results
● Conclusions
● Future Work

Introduction

Visual Question-Answering

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).

Predict the answer of a given question related to an image

Visual Question-Answering: Types

Image types: Real images, Abstract scenes

Answer types: Open-ended, Multi-Choice

Open-ended examples:

Q: Does it appear to be rainy?
A: no

Q: What is just under the tree?
A: a ball

Multi-Choice examples:

Q: How many slices of pizza are there?
A: 1, 2, 3, 4

Q: What is for dessert?
A: cake, ice cream, cheesecake, pie

Example

Question: What is bobbing in the water other than the boats?

Answer: buoys

Motivation

New visual Turing test

Motivation: AI research

● Multidisciplinary tasks
● Models able to perform more complex activities
● Different sub-problems tackled at once

Computer Vision

Knowledge Representation and Reasoning

Natural Language Processing



Related Work

Deep Learning

Credit: Google

VQA: Common approach

Question: What object is flying?

Visual representation: CNN

Textual representation: Word/sentence embedding + LSTM

Merge

Predict answer

Answer: Kite

Tools: Convolutional Neural Networks (CNN)

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

AlexNet

http://papers.nips.cc/paper/4824-imagenet-classification-w




Tools: Word and Sentence embeddings

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

http://papers.nips.cc/paper/5021-distributed-representations

Experiments from: Socher et al. (2013b) and Collobert et al. (2011)

http://papers.nips.cc/paper/5027-zero-shot-learning-through-cross-modal-transfer

King - Man + Woman = Queen
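The King - Man + Woman = Queen relation can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration and are not taken from any trained model; the helper names `cosine_similarity` and `solve_analogy` are likewise hypothetical. The idea is only to show the vector arithmetic followed by a nearest-neighbour search.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only (assumption).
TOY_EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.5, 0.4],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings=TOY_EMBEDDINGS):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded from the candidates.
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates,
               key=lambda w: cosine_similarity(target, embeddings[w]))


print(solve_analogy("king", "man", "woman"))  # queen
```

With real embeddings the vectors would have hundreds of dimensions and the vocabulary would be far larger, but the lookup follows the same pattern.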



Tools: Long Short-Term Memory networks (LSTM)

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.

http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6795963&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6795963






Methodology

First steps: Text-based QA

Extending text-based QA for VQA

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

http://arxiv.org/abs/1409.1556





Substitute VGG-16 with KCNN

Liu, Z. (2015). Kernelized Deep Convolutional Neural Network for Describing Complex Images. arXiv preprint arXiv:1509.04581.






Sentence embedding and image projection

Inputs: Image, Question

Output: Answer



Results

VQA Dataset: Real Images, Open-ended questions

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).

http://www.cv-foundation.org/openaccess/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html

1 (image) x 3 (questions) x 10 (answers)



Evaluation

Metric: [formula not preserved in this transcript]

Script:

● Characters to lowercase
● Remove periods (unless decimal periods)
● Number words to digits
● Remove articles
● Add apostrophe to contractions
● Replace punctuation with space
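The normalization steps listed above can be sketched as follows. This is a simplified illustration, not the official evaluation script: the number-word and contraction tables are small hypothetical subsets, and punctuation is replaced before the word-level steps so that tokens such as "two," are still recognised. The accuracy function follows the formula given in the VQA paper (Antol et al., 2015), min(number of humans who gave that answer / 3, 1); the official script differs in implementation details.

```python
import re

# Small illustrative lookup tables (assumptions, not the official lists).
ARTICLES = {"a", "an", "the"}
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3",
                "four": "4", "five": "5", "six": "6", "seven": "7",
                "eight": "8", "nine": "9", "ten": "10"}
CONTRACTIONS = {"dont": "don't", "isnt": "isn't", "cant": "can't"}


def normalize_answer(answer):
    """Apply the six normalization steps to a single answer string."""
    text = answer.lower()
    # Drop periods unless they sit between two digits (decimal periods).
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", "", text)
    # Replace other punctuation with a space; keep apostrophes and
    # the decimal periods that survived the previous step.
    text = re.sub(r"[^\w\s'.]", " ", text)
    words = []
    for word in text.split():
        word = NUMBER_WORDS.get(word, word)   # number words to digits
        word = CONTRACTIONS.get(word, word)   # add missing apostrophes
        if word not in ARTICLES:              # remove articles
            words.append(word)
    return " ".join(words)


def vqa_accuracy(predicted, human_answers):
    """min(#humans who gave the predicted answer / 3, 1)."""
    prediction = normalize_answer(predicted)
    matches = sum(1 for h in human_answers
                  if normalize_answer(h) == prediction)
    return min(matches / 3.0, 1.0)


print(normalize_answer("Two dogs, playing!"))  # 2 dogs playing
print(vqa_accuracy("Kite.", ["kite"] * 5 + ["a kite"] * 5))  # 1.0
```

Under this metric an answer counts as fully correct once at least three of the ten annotators gave it, which tolerates some disagreement among humans.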

VQA Challenge

CVPR 2016 VQA Challenge

53.62%

Real Images Open-ended, test-standard dataset partition

Results in detail

                  VALIDATION SET                      TEST SET
Model      Yes/No  Number  Other  Overall     Yes/No  Number  Other  Overall
Model 1    71.82   23.79   27.99  43.87       71.62   28.76   29.32  46.70
Model 3    75.02   28.60   29.30  46.32       -       -       -      -
Model 2    75.62   31.81   28.11  46.36       -       -       -      -
Model 5    78.15   32.79   33.91  50.32       78.15   36.20   35.26  53.03
Model 4    78.73   32.82   35.5   51.34       78.02   35.68   36.54  53.62

Results in context

Accuracy (0% to 100% scale):

● Humans: 83.30%
● UC Berkeley & Sony: 66.47%
● Baseline LSTM&CNN: 54.06%
● Ours: 53.62%
● Baseline Nearest neighbor: 42.85%
● Baseline Prior per question type: 37.47%
● Baseline All yes: 29.88%

Comparison with the baseline

Our model

● Single word answer
● Generate answers

Baseline

● Multi word answers (hardcoded)
● Classify over the 1000 most common answers
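To make the baseline's design concrete, here is a minimal sketch of how a fixed answer vocabulary of the K most common answers might be built, and how much of a set of answers such a classifier could reach at all. The function names and the toy answer list are hypothetical; the slides only state that the baseline classifies over the 1000 most common answers.

```python
from collections import Counter


def build_answer_vocabulary(train_answers, top_k=1000):
    """Map each of the top_k most frequent answers to a class index."""
    counts = Counter(train_answers)
    return {answer: index
            for index, (answer, _) in enumerate(counts.most_common(top_k))}


def coverage(vocabulary, answers):
    """Fraction of answers that a classifier over `vocabulary` can output."""
    if not answers:
        return 0.0
    return sum(1 for a in answers if a in vocabulary) / len(answers)


# Toy data, invented for illustration; top_k=2 stands in for 1000.
toy_answers = ["yes"] * 4 + ["no"] * 3 + ["2"] * 2 + ["kite"]
vocabulary = build_answer_vocabulary(toy_answers, top_k=2)
print(sorted(vocabulary))                 # ['no', 'yes']
print(coverage(vocabulary, toy_answers))  # 0.7
```

Any answer outside the vocabulary is unreachable for a classifier of this kind, whereas a model that generates answers is not bound to a fixed list.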

Qualitative results: I

Qualitative results: II

Deep Python Project

https://github.com/imatge-upc/vqa-2016-cvprw

Research contribution: Extended abstract

VQA workshop, CVPR 2016

Research contribution: Extended abstract - Poster

… ticket to Las Vegas

Presenting our poster and extended abstract at CVPR 2016, Las Vegas, USA

VQA Challenge statistics: Answering method



Conclusions

Conclusion: Goals accomplished

✓ Present to VQA Challenge, CVPR 2016
✓ First GPI project using text processing techniques
✓ Create a scalable VQA model
✓ Build a modular and reusable software package
✓ Extended abstract accepted to VQA workshop CVPR 2016

Conclusion: Personal overview

● Submission to VQA Challenge
● VQA, hot topic at CVPR 2016
● Model designed to generate answers instead of classifying them
● Question-Answer pair generation proposal



Future Work

Future work: Next steps

● Decoder for multiple word answers
● Character embedding
● Attention mechanisms
● Question-Answer pairs generation

Automatic Question-Answer Pairs Generation

Thank You!

Do you have any questions?

Project resource links

● Thesis: https://imatge.upc.edu/web/sites/default/files/pub/xMasuda-Mora_0.pdf
● Web page: http://imatge-upc.github.io/vqa-2016-cvprw/
● Source code: https://github.com/imatge-upc/vqa-2016-cvprw

Motivation: First steps towards QA Generation

AI System

Question: What is the man doing?

Answer: Surf

VQA: Counterexample

Dynamic Parameter Prediction Network (DPPnet)

Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016

http://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Noh_Image_Question_Answering_CVPR_2016_paper.html



Experiments: Batch Normalization

Losses I

Losses II

Losses III

VQA Challenge statistics: Image modelling

VQA Challenge statistics: Question modelling
