Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are...

33
Relation Networks for Visual Modeling Han Hu Visual Computing Group Microsoft Research Asia (MSRA) Collaborators: Zheng Zhang, Yue Cao, Jiayuan Gu, Jiarui Xu, Jifeng Dai, Yichen Wei, Stephen Lin, Liwei Wang, Zhenda Xie, Fangyun Wei https://ancientmooner.github.io /

Transcript of Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are...

Page 1: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Relation Networks for Visual Modeling

Han Hu

Visual Computing Group

Microsoft Research Asia (MSRA)

Collaborators: Zheng Zhang, Yue Cao, Jiayuan Gu, Jiarui Xu, Jifeng Dai, Yichen Wei, Stephen Lin, Liwei Wang, Zhenda Xie, Fangyun Wei

https://ancientmooner.github.io/

Page 2: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Human Brain

• Human cortex can universally perceive different senses

figure credit to J. Sharma et al.

Page 3: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Intelligent Machines

• A universal learning pipeline

data prediction

Loss(prediction, label)back-propagation

Page 4: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Intelligent Machines

• Particular basic model for different task/data

convolutionLSTM, GRU, convolution, self-attention, …

graph networks

Page 5: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Universal Basic Models for Intelligent Machines?

Page 6: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Relation Networks: Towards Universal Basic Models

graph neural networks (self)-attention

query embed key embed

“player”

similarity set

value embed

feat set

weighted sumout feat for

“player”

left figure credit to P. Battaglia et al.

similar things: graph neural networks, self-attention, …

Page 7: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Relation Networks for Graph Data

T. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. ICLR 2018

Page 8: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Relation Networks for NLP

A. Vaswani and et al. Attention Is All You Need. NIPS 2017

Page 9: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Relation Networks for Visual Modeling

pixel-pixel object-pixel object-object

ConvolutionVariants

RoIAlign

Relation Networks

None

Relation Networks

Relation Networks

our study timeline

Page 10: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Object-Object Relation Modeling

Han Hu*, Jiayuan Gu*, Zheng Zhang*, Jifeng Dai and Yichen Wei. Relation Networks for Object Detection. CVPR 2018

Page 11: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Object-Object Relation Modeling

It is much easier to detect the glove if we know there is a baseball player.

Han Hu*, Jiayuan Gu*, Zheng Zhang*, Jifeng Dai and Yichen Wei. Relation Networks for Object Detection. CVPR 2018

Page 12: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Object Relation Module

relation relation relation

concat

input

relationoutput

Plug-and-Play

✓Parallel, learnable, no additional supervision, translational invariant, stackable

(d-dim)

(d-dim)

Key Feature✓Relative Geometric Term

✓Multiple Relation Branches

✓Shortcut

Page 13: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

OR

M

The First Fully End-to-End Object Detector

Learnableduplicate removal

FC

FC

CO

NV

sJoint instance

recognition

OR

M

OR

M

× 𝑟1 × 𝑟2

back propagation steps

S. Ren et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015

Page 14: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Results on COCO Object Detection

*Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported)

+3.0 mAP

+2.0 mAP

+1.0 mAP

• less than 10% computation overhead on all backbones

Page 15: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Object Pairs with High Relation Weights

reference object other objects contributing high weights

instance recognition duplicate removal

Page 16: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Class Co-Occurrence Information is Learnt

Class Co-occurrence Probability Learnt Attentional Weights

𝑟 = 0.90

Page 17: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Extension: Spatial-Temporal Object Relation

Jiarui Xu, Yue Cao, Zheng Zhang and Han Hu. Spatial-Temporal Relation Networks for Multi-Object Tracking. Tech Report 2018

Page 18: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Learnable Object-Pixel Relation (vs. RoIAlign)

Geometric

Appearance

Image Feature to Region Feature

Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei and Jifeng Dai. Learning Region Features for Object Detection. ECCV 2018

Page 19: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Pixel-Pixel Relation Modeling

convolution ConvNets

Page 20: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Question I: Can We Go Beyond Convolution?

Can we model the patterns by one channel?

convolution = template matching

Han Hu, Zheng Zhang, Zhenda Xie and Stephen Lin. Local Relation Networks for Visual Recognition. Tech Report 2019

template -> compose

Page 21: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Related Works: Capsule Networks

Increase Increasedecrease decrease

• Not aligned well with modern learning infrastructure

Figure credit by Aurélien Géron S. Sabour et al. Dynamic Routing Between Capsules. NIPS2017

Page 22: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Related Works: Non-Local Neural Networks

X. Wang et al. Non-local Neural Networks. CVPR2018

• Complementary to ConvNets

Page 23: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Beyond Convolution: Local Relation Layer

geometric composability (prior)

deeper

= relation network + locality + geometric prior + scalar key/query

appearance composability

deeper

Page 24: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Local Relation Network (LR-Net)

Totally convolution free!

Page 25: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Classification on ImageNet (26 Layers)

Page 26: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Robust to Adversarial Attacks

Page 27: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Question II: Do Non-local Networks Work Well Due to Relation Learning?

Yue Cao*, Jiarui Xu*, Stephen Lin, Fangyun Wei and Han Hu. GCNet: Non-local Networks meet SE-Net and Beyond. Tech Report 2019

attention maps for different query pixels

Page 28: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Explicit Query-Independent Attention Map

• Simplified Non-Local Blocks

The same accurate but significantly reducing computation!

Page 29: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Meet SE-Net (2017 ImageNet Champion)

Page 30: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Abstraction and New Instantiation

new instantiation

1. Global Attention Pooling (Simplified NL-Net)

2. Bottleneck transform (SE-Net)

3. Addition (Simplified NL-Net)

Page 31: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

COCO Object Detection Results

• Baseline: Mask R-CNN + ResNet50 + FPN

method AP (bbox) AP (mask) #param FLOPs

baseline 37.2 33.8 44.4M 279.4G

NL-Net 38.0 34.7 46.5M 288.7G

SE-Net 38.2 34.7 46.9M 279.5G

GC-Net 39.4 35.7 46.9M 279.6G

Page 32: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Discussion: versus Deformable ConvNets

• Both can model content aware adaptiveness

• Verification vs. Regression

• Generality (arbitrary vs. grid)

• Partly complementary

[1] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu and Yichen Wei. Deformable Convolutional Networks. In ICCV 2017.[2] Xizhou Zhu, Han Hu, Stephen Lin and Jifeng Dai. Deformable ConvNets v2: More Deformable, Better Results. In CVPR 2019.[3] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang and Stephen Lin. RepPoints: Point Set Representation for Object Detection. Tech Report.

relation networks deformable conv

Page 33: Relation Modeling for Computer Vision - GitHub Pages · *Faster R-CNN with ResNet-101 model are used (evaluation on minival/test-dev are reported) +3.0 mAP +2.0 mAP +1.0 mAP •less

Thanks!pixel-pixel object-pixel object-object

ConvolutionVariants

RoIAlign

Relation Networks

None

Relation Networks

Relation Networks

Relation Network is All You Need for AI——SkyNet