Open-ended Visual Question-Answering

Page 1: Open-ended  Visual Question-Answering

Open-ended Visual Question-Answering

[thesis] [web] [code]

Issey Masuda Mora, Santiago Pascual de la Puente, Xavier Giró i Nieto

Page 2: Open-ended  Visual Question-Answering

Roadmap

● Introduction
● Related Work
● Methodology
● Results
● Conclusions
● Future Work

Page 3: Open-ended  Visual Question-Answering

Introduction

Page 4: Open-ended  Visual Question-Answering

Visual Question-Answering

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).

Page 5: Open-ended  Visual Question-Answering

Predict the answer to a given question about an image

Page 6: Open-ended  Visual Question-Answering

Visual Question-Answering: Types

Image types: real images, abstract scenes

Answer formats: multiple-choice, open-ended

Q: Does it appear to be rainy?
A: no

Q: What is just under the tree?
A: a ball

Q: How many slices of pizza are there?
A (candidates): 1, 2, 3, 4

Q: What is for dessert?
A (candidates): cake, ice cream, cheesecake, pie
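The difference between the two answer formats can be sketched as follows, assuming a model that assigns a score to every answer string it knows. The scores are made up for illustration and the function names are hypothetical, not part of any VQA toolkit.

```python
def answer_open_ended(scores):
    """Open-ended: return the best-scoring answer from the model's whole answer vocabulary."""
    return max(scores, key=scores.get)


def answer_multiple_choice(scores, candidates):
    """Multiple-choice: return the best-scoring answer among the provided candidates only."""
    # Candidates the model has never scored get the lowest possible score.
    return max(candidates, key=lambda answer: scores.get(answer, float("-inf")))


# Hypothetical model scores for "How many slices of pizza are there?"
scores = {"8": 0.35, "2": 0.30, "3": 0.25, "1": 0.10}
print(answer_open_ended(scores))                             # 8
print(answer_multiple_choice(scores, ["1", "2", "3", "4"]))  # 2
```

Restricting the argmax to the candidate list is what separates the two settings: in multiple-choice the model cannot pick an answer outside the list, while in open-ended it can.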

Page 7: Open-ended  Visual Question-Answering

Example

Question: What is bobbing in the water other than the boats?
Answer: buoys

Page 8: Open-ended  Visual Question-Answering

Motivation

New visual Turing test

Page 9: Open-ended  Visual Question-Answering

Motivation: AI research

● Multidisciplinary tasks
● Models able to perform more complex activities
● Different sub-problems tackled at once

Computer Vision

Knowledge Representation and Reasoning

Natural Language Processing

Page 10: Open-ended  Visual Question-Answering

Related Work

Page 11: Open-ended  Visual Question-Answering

Deep Learning

Credit: Google

Page 12: Open-ended  Visual Question-Answering

VQA: Common approach

Question: What object is flying?
Answer: Kite

● Visual representation: CNN
● Textual representation: Word/sentence embedding + LSTM
● Merge
● Predict answer
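The textual branch (word embedding + LSTM) can be sketched in plain Python as below. This is a minimal, untrained illustration: the embedding table and LSTM weights are random, the sizes (8-dimensional embeddings, 16 hidden units) are arbitrary assumptions, and the helper names are hypothetical rather than taken from the project's code.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_lstm(input_size, hidden_size, seed=0):
    """Random weights for the input (i), forget (f), output (o) gates and candidate (g).
    Biases start at zero, except the forget gate bias, which starts at 1."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]

    params = {name: matrix() for name in ("Wi", "Wf", "Wo", "Wg")}
    params.update({"bi": [0.0] * hidden_size, "bf": [1.0] * hidden_size,
                   "bo": [0.0] * hidden_size, "bg": [0.0] * hidden_size})
    return params


def lstm_encode(word_vectors, params, hidden_size):
    """Run a single-layer LSTM over the words and return the last hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in word_vectors:
        z = list(x) + h  # concatenate current input and previous hidden state
        i = [sigmoid(a + b) for a, b in zip(matvec(params["Wi"], z), params["bi"])]
        f = [sigmoid(a + b) for a, b in zip(matvec(params["Wf"], z), params["bf"])]
        o = [sigmoid(a + b) for a, b in zip(matvec(params["Wo"], z), params["bo"])]
        g = [math.tanh(a + b) for a, b in zip(matvec(params["Wg"], z), params["bg"])]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h


EMBED_DIM, HIDDEN = 8, 16
rng = random.Random(1)
embeddings = {}  # lazily created random word vectors (stand-in for trained embeddings)


def embed(word):
    if word not in embeddings:
        embeddings[word] = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    return embeddings[word]


params = init_lstm(EMBED_DIM, HIDDEN)
question = "what object is flying".split()
question_vector = lstm_encode([embed(w) for w in question], params, HIDDEN)
print(len(question_vector))  # 16
```

The last hidden state serves as a fixed-size summary of the whole question, which is what gets merged with the visual representation.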

Page 13: Open-ended  Visual Question-Answering

Tools: Convolutional Neural Networks (CNN)

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

AlexNet
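The core operation a CNN such as AlexNet stacks is a 2D convolution (strictly, a cross-correlation) of a small kernel over the input, usually followed by a ReLU non-linearity. A minimal single-channel sketch in plain Python, with arbitrary example values, is shown below; real networks use many multi-channel kernels and optimized libraries.

```python
def conv2d(image, kernel, stride=1):
    """Valid (no padding) 2D cross-correlation of a single-channel image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            top, left = r * stride, c * stride
            row.append(sum(kernel[i][j] * image[top + i][left + j]
                           for i in range(kh) for j in range(kw)))
        output.append(row)
    return output


def relu(feature_map):
    """Element-wise ReLU: negative responses are set to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(conv2d(image, [[1, 1], [1, 1]]))         # [[12, 16], [24, 28]]
print(relu(conv2d(image, [[1, 0], [0, -1]])))  # [[0, 0], [0, 0]]
```

The output size follows (input size - kernel size) // stride + 1 in each dimension, which is why deeper layers see progressively smaller feature maps.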

Page 16: Open-ended  Visual Question-Answering

Methodology

Page 17: Open-ended  Visual Question-Answering

First steps: Text-based QA

Page 18: Open-ended  Visual Question-Answering

Extending text-based QA for VQA

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Page 19: Open-ended  Visual Question-Answering

Substitute VGG-16 with KCNN

Liu, Z. (2015). Kernelized deep convolutional neural network for describing complex images. arXiv preprint arXiv:1509.04581.

Page 20: Open-ended  Visual Question-Answering

Sentence embedding and image projection

[Diagram: Image, Question, Answer]
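One way to read the image projection step: a learned linear layer maps the CNN image feature vector into the same space as the sentence embedding so the two can be merged, and a softmax layer then scores every word in the answer vocabulary. The sketch below assumes an element-wise sum as the merge and a tanh activation on the projection, and uses tiny random weights and a toy four-word vocabulary. It illustrates the data flow only and is not necessarily how the thesis model merges the two modalities.

```python
import math
import random


def linear(W, b, v):
    """Dense layer: W @ v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_out, n_in, rng):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def predict_answer(image_features, sentence_embedding, params, answer_vocab):
    # Project the image features into the sentence-embedding space (assumed tanh activation).
    projected = [math.tanh(x) for x in linear(*params["img"], image_features)]
    # Merge both modalities; element-wise sum is an assumption for this sketch.
    merged = [a + b for a, b in zip(projected, sentence_embedding)]
    probs = softmax(linear(*params["out"], merged))
    best = max(range(len(probs)), key=probs.__getitem__)
    return answer_vocab[best], probs


rng = random.Random(0)
IMG_DIM, EMB_DIM = 12, 6  # toy sizes; real CNN features are much larger
answer_vocab = ["kite", "yes", "no", "2"]
params = {"img": random_layer(EMB_DIM, IMG_DIM, rng),
          "out": random_layer(len(answer_vocab), EMB_DIM, rng)}

image_features = [rng.random() for _ in range(IMG_DIM)]            # stand-in for CNN output
sentence_embedding = [rng.uniform(-1, 1) for _ in range(EMB_DIM)]  # stand-in for LSTM output
answer, probs = predict_answer(image_features, sentence_embedding, params, answer_vocab)
print(answer, round(sum(probs), 6))
```

The projection is what makes the merge possible at all: an element-wise operation needs both vectors to have the same dimension.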

Page 21: Open-ended  Visual Question-Answering

Results

Page 22: Open-ended  Visual Question-Answering

VQA Dataset: Real Images, Open-ended questions

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).

1 image x 3 questions x 10 answers per question

Page 23: Open-ended  Visual Question-Answering

Evaluation

Metric: Acc(ans) = min(#humans that said ans / 3, 1)

Script:
● Characters to lowercase
● Remove periods (unless decimal periods)
● Number words to digits
● Remove articles
● Add apostrophes to contractions
● Replace punctuation with space
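A simplified sketch of the normalization steps and the accuracy metric listed above. It approximates the official VQA evaluation script rather than reproducing it: the number-word and contraction tables are small illustrative subsets, and the punctuation handling is coarser. As in the official metric, the min(#matches / 3, 1) score is averaged over every subset of 9 of the 10 human answers.

```python
import re
from itertools import combinations

# Small illustrative subsets; the official script uses larger tables.
NUMBER_WORDS = {"none": "0", "zero": "0", "one": "1", "two": "2", "three": "3",
                "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
                "nine": "9", "ten": "10"}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "doesnt": "doesn't", "isnt": "isn't",
                "cant": "can't", "wont": "won't"}


def normalize_answer(answer):
    text = answer.lower()                             # characters to lowercase
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", "", text)    # drop periods unless decimal
    text = re.sub(r"[^\w\s'.]", " ", text)            # other punctuation -> space
    words = []
    for word in text.split():
        word = NUMBER_WORDS.get(word, word)           # number words to digits
        if word in ARTICLES:                          # remove articles
            continue
        words.append(CONTRACTIONS.get(word, word))    # add apostrophe to contractions
    return " ".join(words)


def vqa_accuracy(predicted, human_answers):
    """Average of min(#matching humans / 3, 1) over all leave-one-out subsets of annotators."""
    pred = normalize_answer(predicted)
    gts = [normalize_answer(a) for a in human_answers]
    scores = []
    for subset in combinations(range(len(gts)), len(gts) - 1):
        matches = sum(1 for i in subset if gts[i] == pred)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)


print(normalize_answer("Two Dogs."))  # 2 dogs
print(normalize_answer("The ball"))   # ball
print(round(vqa_accuracy("cake", ["cake"] * 2 + ["pie"] * 8), 2))  # 0.6
```

An answer given by 3 or more of the 10 annotators therefore scores close to 100%, while an answer given by only one or two annotators earns partial credit.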

Page 24: Open-ended  Visual Question-Answering

VQA Challenge

Page 25: Open-ended  Visual Question-Answering

CVPR 2016 VQA Challenge: 53.62%

Real Images, Open-ended task, test-standard dataset partition

Page 26: Open-ended  Visual Question-Answering

Results in detail

Accuracy (%) per answer type ("-" = not reported)

           VALIDATION SET                       TEST SET
Model      Yes/No  Number  Other  Overall      Yes/No  Number  Other  Overall
Model 1    71.82   23.79   27.99  43.87        71.62   28.76   29.32  46.70
Model 3    75.02   28.60   29.30  46.32        -       -       -      -
Model 2    75.62   31.81   28.11  46.36        -       -       -      -
Model 5    78.15   32.79   33.91  50.32        78.15   36.20   35.26  53.03
Model 4    78.73   32.82   35.5   51.34        78.02   35.68   36.54  53.62

Page 27: Open-ended  Visual Question-Answering

Results in context

● Humans: 83.30%
● UC Berkeley & Sony: 66.47%
● Baseline (LSTM & CNN): 54.06%
● Baseline (nearest neighbor): 42.85%
● Baseline (prior per question type): 37.47%
● Baseline (all yes): 29.88%
● Ours: 53.62%

Page 28: Open-ended  Visual Question-Answering

Comparison with the baseline

Our model

● Single-word answers
● Generates answers

Baseline

● Multi-word answers (hard-coded)
● Classifies over the 1000 most common answers
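Classifying over the 1000 most common answers amounts to building a fixed answer vocabulary from the training answers and treating each entry as a class. A minimal sketch with K as a parameter and a made-up list of answers; the function names are hypothetical. Answers outside the top K can never be predicted, which is the limitation the slide contrasts with generating answers.

```python
from collections import Counter


def build_answer_vocab(train_answers, k=1000):
    """Keep the k most frequent training answers; each one becomes a class."""
    return [answer for answer, _ in Counter(train_answers).most_common(k)]


def encode_answers(answers, vocab):
    """Map answers to class indices; -1 marks answers outside the vocabulary."""
    index = {answer: i for i, answer in enumerate(vocab)}
    return [index.get(answer, -1) for answer in answers]


def coverage(answers, vocab):
    """Fraction of answers that the classifier could possibly get right."""
    known = set(vocab)
    return sum(answer in known for answer in answers) / len(answers)


train_answers = ["yes", "no", "yes", "2", "yes", "no", "red"]
vocab = build_answer_vocab(train_answers, k=2)
print(vocab)                                        # ['yes', 'no']
print(encode_answers(["no", "red", "yes"], vocab))  # [1, -1, 0]
print(round(coverage(train_answers, vocab), 2))     # 0.71
```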

Page 29: Open-ended  Visual Question-Answering

Qualitative results: I

Page 30: Open-ended  Visual Question-Answering

Qualitative results: II

Page 31: Open-ended  Visual Question-Answering

Deep Python Project

https://github.com/imatge-upc/vqa-2016-cvprw

Page 32: Open-ended  Visual Question-Answering

Research contribution: Extended abstract

VQA workshop, CVPR 2016

Page 33: Open-ended  Visual Question-Answering

Research contribution: Extended abstract - Poster

Page 34: Open-ended  Visual Question-Answering

… ticket to Las Vegas

Page 35: Open-ended  Visual Question-Answering

Presenting our poster and extended abstract at CVPR 2016, Las Vegas, USA

Page 36: Open-ended  Visual Question-Answering

VQA Challenge statistics: Answering method

Page 37: Open-ended  Visual Question-Answering

Conclusions

Page 38: Open-ended  Visual Question-Answering

Conclusion

Goals accomplished

✓ Submitted to the VQA Challenge, CVPR 2016
✓ First GPI project using text processing techniques
✓ Created a scalable VQA model
✓ Built a modular and reusable software package
✓ Extended abstract accepted to the VQA workshop, CVPR 2016

Page 39: Open-ended  Visual Question-Answering

Conclusion: Personal overview

● Submission to the VQA Challenge
● VQA, a hot topic at CVPR 2016
● Model designed to generate answers instead of classifying them
● Question-Answer pair generation proposal

Page 40: Open-ended  Visual Question-Answering

Future Work

Page 41: Open-ended  Visual Question-Answering

Future work

Next steps

● Decoder for multiple-word answers
● Character embedding
● Attention mechanisms
● Question-Answer pair generation
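Attention mechanisms, listed above as future work, let the question decide which image regions matter most. A minimal sketch of dot-product soft attention over a set of region feature vectors; the plain dot-product scoring and the toy vectors are assumptions for illustration, not a design from this project.

```python
import math


def soft_attention(query, regions):
    """Weight each region by softmax(query . region) and return the weighted sum."""
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(regions[0])
    context = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return weights, context


question_vec = [1.0, 0.0]
regions = [[1.0, 0.0],   # region that matches the question
           [0.0, 1.0]]   # unrelated region
weights, context = soft_attention(question_vec, regions)
print([round(w, 3) for w in weights])  # [0.731, 0.269]
```

The attended vector `context` would replace a single global image feature, so the merge step sees mostly the regions relevant to the question.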

Page 42: Open-ended  Visual Question-Answering

Automatic Question-Answer Pairs Generation

Page 43: Open-ended  Visual Question-Answering

Thank You!

Do you have any questions?

Page 44: Open-ended  Visual Question-Answering

Project resource links

● Thesis: https://imatge.upc.edu/web/sites/default/files/pub/xMasuda-Mora_0.pdf
● Web page: http://imatge-upc.github.io/vqa-2016-cvprw/
● Source code: https://github.com/imatge-upc/vqa-2016-cvprw

Page 45: Open-ended  Visual Question-Answering

Motivation: First steps towards QA Generation

AI System

Question: What is the man doing?
Answer: Surf

Page 47: Open-ended  Visual Question-Answering

Experiments: Batch Normalization
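Batch normalization standardizes each feature over the mini-batch, then applies a learned scale (gamma) and shift (beta). A minimal training-mode sketch in plain Python follows; at inference time, frameworks use running averages of the mean and variance instead, which this sketch omits. The slide does not say where in the network batch normalization was applied, so this shows only the operation itself.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Training-mode batch normalization over a list of feature vectors."""
    n, dim = len(batch), len(batch[0])
    means = [sum(x[d] for x in batch) / n for d in range(dim)]
    variances = [sum((x[d] - means[d]) ** 2 for x in batch) / n for d in range(dim)]
    return [[gamma[d] * (x[d] - means[d]) / math.sqrt(variances[d] + eps) + beta[d]
             for d in range(dim)]
            for x in batch]


batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print([round(v, 3) for v in out[0]])  # [-1.225, -1.225]
```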

Page 48: Open-ended  Visual Question-Answering

Losses I

Page 49: Open-ended  Visual Question-Answering

Losses II

Page 50: Open-ended  Visual Question-Answering

Losses III

Page 51: Open-ended  Visual Question-Answering

VQA Challenge statistics: Image modelling

Page 52: Open-ended  Visual Question-Answering

VQA Challenge statistics: Question modelling
