Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding...

24
Universität Hamburg MIN-Fakultät Fachbereich Informatik TAMS-Folien Natural Language Visual Grounding with Keyword-Aware Attention Network Jinpeng Mi Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Fachbereich Informatik Technische Aspekte Multimodaler Systeme 8. Januar 2019 Jinpeng Mi 1

Transcript of Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding...

Page 1: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

TAMS-Folien

Natural Language Visual Grounding withKeyword-Aware Attention Network

Jinpeng Mi

Universität HamburgFakultät für Mathematik, Informatik und NaturwissenschaftenFachbereich Informatik

Technische Aspekte Multimodaler Systeme

8. Januar 2019

Jinpeng Mi 1

Page 2: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

TAMS-Folien

Gliederung

1. IntroductionNatural Language Visual GroundingAttention Mechanism

2. ArchitectureTextual Attention NetworkKeyword-aware Attention Network

3. ToDo

Jinpeng Mi 2

Page 3: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Introduction - Natural Language Visual Grounding TAMS-Folien

Natural Language Visual Grounding

I task: given a referring expression, localize the referred object orarea in an image

referring expression: a glass of water on the table

Jinpeng Mi 3

Page 4: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Introduction - Natural Language Visual Grounding TAMS-Folien

I applications: visual understanding systems, dialogue systems,natural language based interaction with intelligence agents, e.g.,robots

I main difficulties:

- how to learn the correlation between natural language referringexpression and visual domain (image region)

- how to locate the target object (the spatial relationship betweenobjects)

Jinpeng Mi 4

Page 5: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Introduction - Natural Language Visual Grounding TAMS-Folien

visual grounding is re-formulated three sub-problems:I which words to focus on in a referring expression

I where to look in an image

I which object to locate

Jinpeng Mi 5

Page 6: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Introduction - Natural Language Visual Grounding TAMS-Folien

public datasets

I RefCOCO: 19994 images, 142210 expressions

Jinpeng Mi 6

Page 7: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Introduction - Natural Language Visual Grounding TAMS-Folien

I RefCOCO+: 19992 images, 141564 expressions

Jinpeng Mi 7

Page 8: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Introduction - Natural Language Visual Grounding TAMS-Folien

I RefCOCOg: 25799 images, 95010 expressions (no test set)

Jinpeng Mi 8

Page 9: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Introduction - Attention Mechanism TAMS-Folien

Attention Mechanism

I inspired by how the human visual cortex employs visualattention mechanism to focus on informative regions in visualscenes

I first proposed in machine translation[1], image captioning[2]I type: hard attention and soft attention

[1]Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning toalign and translate, ICLR 2014.[2]Xu, K., Ba, J., Kiros, ..., Bengio, Y. Show, attend and tell: Neural image captiongeneration with visual attention, ICML 2015.

Jinpeng Mi 9

Page 10: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Introduction - Attention Mechanism TAMS-Folien

Architecture

I which words to focus onI where to look in an imageI which object to locate

Jinpeng Mi 10

Page 11: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Textual Attention Network TAMS-Folien

which words to focus on

I Syntactic Parsing

Jinpeng Mi 11

Page 12: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Textual Attention Network TAMS-Folien

which words to focus on

I referring expression filteringfiler insignificant words: determiner, coordinating conjunction, "to",interjection, modal words, linking verb, etc.I examplesraw: young man with blond hair wearing a white shirt and dark tiein a ballroomfiltered: young man with blond hair wearing white shirt dark tie inballroom

raw:a person standing behind a snowboarder with a blue jacket andblack pantsfiltered:person standing behind snowboarder with blue jacket blackpants

Jinpeng Mi 12

Page 13: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Textual Attention Network TAMS-Folien

which words to focus on

I acquire different weights for different words

glass of water on table

Embedding(GloVE)

BiLSTM

Textual Attention Network

module weights

Jinpeng Mi 13

Page 14: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Textual Attention Network TAMS-Folien

which words to focus on

I deep representation of a referring expression

et = embedding(wt), t ∈ [1,T ] (1)−→ht = BiLSTM(et ,

−−→ht−1) (2)

←−ht = BiLSTM(et ,

←−−ht−1) (3)

ht = [−→ht ,←−ht ] (4)

where T is the length of a filtered referring expression.

Jinpeng Mi 14

Page 15: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Textual Attention Network TAMS-Folien

which words to focus on

I Textual Attention Network

ut = tanh(Wwht + bw ) (5)

αt =exp(utTβw )∑Tt exp(utTβw )

(6)

rt = FC (αt � ht) (7)

where Ww , bw and βw are trainable vectors, rt is calculatedweights, � denotes element-wise production.* Yang Z, Yang D, Dyer C, et al. Hierarchical attention networks for documentclassification. Proceedings of NAACL-HLT 2016.

Jinpeng Mi 15

Page 16: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Textual Attention Network TAMS-Folien

which words to focus on

I result

Jinpeng Mi 16

Page 17: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Textual Attention Network TAMS-Folien

which words to focus on

Jinpeng Mi 17

Page 18: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Keyword-aware Attention Network TAMS-Folien

where to look in an image

I Keyword-aware Visual Attention Network

RoIs

Local Visual Feature

Pretained CNN

RetinaNet

Referring Expression

Representation

L2 Normalization

Keyword-aware Visual Attention

Network

VGG19

Conv5_4(7, 7, 512)

Jinpeng Mi 18

Page 19: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Keyword-aware Attention Network TAMS-Folien

where to look in an image

v′= Conv(v) (8)

sr = f (Wsrt + bs) (9)

Matten = softmax(sr � v′) (10)

where v′ denotes projected feature map, f is non-linear function,Ws and bs are trainable vectors, Matten is generated attention map,� denotes element-wise production.

Jinpeng Mi 19

Page 20: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Keyword-aware Attention Network TAMS-Folien

where to look in an image

Jinpeng Mi 20

Page 21: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Keyword-aware Attention Network TAMS-Folien

where to look in an image

Jinpeng Mi 21

Page 22: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

Architecture - Keyword-aware Attention Network TAMS-Folien

which object to locate

Local Spatial Feature

Pretained CNN Tile

Concatenate

Fully Connected

Referring Expression

Representation

Element-wise Product

MLP

VGG19

FC7(1, 4096)

(1, 1024)

Jinpeng Mi 22

Page 23: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

ToDo TAMS-Folien

ToDo

I debug and trainI adjust parametersI improve architectureI grasping experiments on PR2

Jinpeng Mi 23

Page 24: Natural Language Visual Grounding with Keyword-Aware ...€¦ · Natural Language Visual Grounding with Keyword-Aware Attention Network JinpengMi Universität Hamburg Fakultät für

Universität Hamburg

MIN-FakultätFachbereich Informatik

ToDo TAMS-Folien

Thank you for your attention!

Jinpeng Mi 24