
High Level Computer Vision

Deep Learning for Computer Vision Part 4

Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de

https://www.mpi-inf.mpg.de/hlcv

High Level Computer Vision - June 14, 2017

Overview

• Recurrent Neural Networks
  ‣ motivation for recurrent neural networks
  ‣ a particularly successful RNN: Long Short Term Memory (LSTM)
  ‣ slide credit to Andrej Karpathy, Jeff Donahue and Marcus Rohrbach

• Yann LeCun…
  ‣ What’s Wrong With Deep Learning (keynote June 2015)
  ‣ slide credit to Yann LeCun (and Xiaogang Wang)

Recurrent Networks offer a lot of flexibility:

slide credit: Andrej Karpathy

Sequences in Vision

Sequences in the input… (many-to-one)

[figure: example activities: Jumping, Dancing, Fighting, Eating, Running]

slide credit: Jeff Donahue

Sequences in Vision

Sequences in the output… (one-to-many)

[figure: an image described as "A happy brown dog."]

slide credit: Jeff Donahue

Sequences in Vision

Sequences everywhere! (many-to-many)

[figure: example: "A dog jumps over a hurdle."]

slide credit: Jeff Donahue

ConvNets (Krizhevsky et al., NIPS 2012)

slide credit: Jeff Donahue

Problem #1

The input is fixed-size and static: a single 224 × 224 image. What to do when the input is not?

slide credit: Jeff Donahue

Problem #2 (Krizhevsky et al., NIPS 2012)

The output is a single choice from a fixed list of options (e.g. dog, horse, fish, snake, cat). Treating every possible description ("a happy brown dog", "a big brown dog", "a happy red dog", "a big red dog", …) as its own option does not scale.

slide credit: Jeff Donahue

Recurrent Networks offer a lot of flexibility:

slide credit: Andrej Karpathy

Language Models

Word-level language model. Similar to:
Recurrent Neural Network Based Language Model [Tomas Mikolov, 2010]

slide credit: Andrej Karpathy

Suppose we had the training sentence "cat sat on mat".

We want to train a language model: P(next word | previous words)

i.e. we want these to be high:
  P(cat | [<S>])
  P(sat | [<S>, cat])
  P(on | [<S>, cat, sat])
  P(mat | [<S>, cat, sat, on])
  P(<E> | [<S>, cat, sat, on, mat])

slide credit: Andrej Karpathy

Suppose we had the training sentence "cat sat on mat".

We want to train a language model: P(next word | previous words)

First, suppose we had only a finite, 1-word history, i.e. we want these to be high:
  P(cat | <S>)
  P(sat | cat)
  P(on | sat)
  P(mat | on)
  P(<E> | mat)

slide credit: Andrej Karpathy

[diagram: a Recurrent Neural Network unrolled over the training sentence "cat sat on mat": inputs x0 = <START>, x1 = "cat", x2 = "sat", x3 = "on", x4 = "mat"; hidden states h0 … h4; outputs y0 … y4]

• Word embeddings: 300 (learnable) numbers associated with each word in the vocabulary
• Hidden layer (e.g. 500-D vectors): h4 = tanh(Wxh * x4), extended to a Recurrent Neural Network by adding the recurrent term: h4 = tanh(Wxh * x4 + Whh * h3)
• 10,001-D class scores: softmax over 10,000 words and a special <END> token: y4 = Why * h4

slide credit: Andrej Karpathy
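To make the recurrence concrete, here is a minimal NumPy sketch of one forward pass through this unrolled language model. The 300-D embeddings and 500-D hidden state follow the slide; the toy vocabulary, the random (untrained) weights and all variable names are my own choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; the slide assumes 10,000 words plus a special <END> token.
vocab = ["<S>", "<E>", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, E, H = len(vocab), 300, 500          # vocab size, embedding dim, hidden dim

# Learnable parameters (randomly initialised, untrained).
W_embed = rng.normal(0, 0.01, (V, E))   # 300 numbers per word
W_xh = rng.normal(0, 0.01, (E, H))      # input-to-hidden
W_hh = rng.normal(0, 0.01, (H, H))      # hidden-to-hidden (the recurrence)
W_hy = rng.normal(0, 0.01, (H, V))      # hidden-to-scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One pass over "<S> cat sat on mat", predicting the next word at every step.
sentence = ["<S>", "cat", "sat", "on", "mat"]
h = np.zeros(H)                         # initial hidden state
for word in sentence:
    x = W_embed[word_to_id[word]]       # embed the current word
    h = np.tanh(x @ W_xh + h @ W_hh)    # h_t = tanh(Wxh x_t + Whh h_{t-1})
    p = softmax(h @ W_hy)               # y_t: distribution over the next word
    print(word, "-> most likely next word:", vocab[int(p.argmax())])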

Training this on a lot of sentences would give us a language model: a way to predict P(next word | previous words).

[diagram, built up step by step: start with x0 = <START>, compute h0 and y0, and sample a word ("cat") from y0; feed the sample back in as x1, compute h1 and y1, and sample again ("sat"); continue through "on" and "mat" until the model samples <END>, then we are done]

slide credit: Andrej Karpathy
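The sampling procedure the slides step through can be sketched the same way. The parameters below are again untrained stand-ins, so the sampled text is meaningless; the point is the loop: sample from the softmax, feed the sampled word back in, stop at <E>.

import numpy as np

rng = np.random.default_rng(1)
vocab = ["<S>", "<E>", "cat", "sat", "on", "mat"]
V, E, H = len(vocab), 300, 500
W_embed = rng.normal(0, 0.01, (V, E))
W_xh = rng.normal(0, 0.01, (E, H))
W_hh = rng.normal(0, 0.01, (H, H))
W_hy = rng.normal(0, 0.01, (H, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Feed <S>, sample a word from y_t, feed it back in, and stop as soon as
# <E> is sampled (or after a maximum length).
h = np.zeros(H)
word_id = vocab.index("<S>")
generated = []
for _ in range(20):                                  # cap the length
    x = W_embed[word_id]
    h = np.tanh(x @ W_xh + h @ W_hh)
    p = softmax(h @ W_hy)
    word_id = int(rng.choice(V, p=p))                # sample, don't argmax
    if vocab[word_id] == "<E>":
        break
    generated.append(vocab[word_id])

print(" ".join(generated))                           # gibberish: the weights are untrained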

Image captioning: a training example, an image with the caption "straw hat".

[diagram: the same RNN unrolled over "<START> straw hat": inputs x0 = <START>, x1 = "straw", x2 = "hat"; hidden states h0, h1, h2; outputs y0, y1, y2; an image feature vector v conditions the first hidden state]

before: h0 = tanh(Wxh * x0)
now:    h0 = tanh(Wxh * x0 + Wih * v)

slide credit: Andrej Karpathy
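The only change relative to the plain language model is the extra term Wih * v in the first hidden state, where v is the image feature. A minimal sketch of that change; the 4096-D size for v is my assumption (an AlexNet-style fc feature), not something the slide states.

import numpy as np

rng = np.random.default_rng(2)
E, H, D_img = 300, 500, 4096            # embedding, hidden, CNN feature dims (4096 assumed)

W_xh = rng.normal(0, 0.01, (E, H))
W_ih = rng.normal(0, 0.01, (D_img, H))  # new: image-to-hidden projection

x0 = rng.normal(0, 0.01, E)             # embedding of <START> (stand-in)
v = rng.normal(0, 1.0, D_img)           # CNN feature of the training image (stand-in)

h0_before = np.tanh(x0 @ W_xh)                # language model only
h0_now    = np.tanh(x0 @ W_xh + v @ W_ih)     # conditioned on the image
print(h0_before.shape, h0_now.shape)          # both (500,)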

At test time, given a new test image:

[diagram: feed x0 = <START>, compute h0 and y0 (conditioned on the image), and sample the first word, "straw"; feed it back in, compute h1 and y1, and sample "hat"; continue until the <END> token is sampled, then finish]

slide credit: Andrej Karpathy
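Putting the two previous sketches together, test-time captioning is just the sampling loop started from an image-conditioned first step. In this sketch the CNN feature is a random stand-in, the vocabulary and dimensions are invented for illustration, and, following the slide's change to h0 only, the image term enters only at the first step.

import numpy as np

rng = np.random.default_rng(3)
vocab = ["<START>", "<END>", "straw", "hat", "a", "man", "in"]
V, E, H, D_img = len(vocab), 300, 500, 4096

W_embed = rng.normal(0, 0.01, (V, E))
W_xh = rng.normal(0, 0.01, (E, H))
W_hh = rng.normal(0, 0.01, (H, H))
W_ih = rng.normal(0, 0.01, (D_img, H))
W_hy = rng.normal(0, 0.01, (H, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

v = rng.normal(0, 1.0, D_img)                    # CNN feature of the test image (stand-in)

caption, h, word_id = [], np.zeros(H), vocab.index("<START>")
for t in range(15):
    x = W_embed[word_id]
    image_term = v @ W_ih if t == 0 else 0.0     # image enters only at the first step
    h = np.tanh(x @ W_xh + h @ W_hh + image_term)
    p = softmax(h @ W_hy)
    word_id = int(rng.choice(V, p=p))            # sample the next word
    if vocab[word_id] == "<END>":
        break
    caption.append(vocab[word_id])

print(" ".join(caption))                         # meaningless: weights are untrained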

Sequence Learning

• Instances of the form x = <x1, x2, x3, …, xT>
• Variable sequence length T
• Learn a transition function f with parameters W
• f should update the hidden state ht and output yt

  h0 := 0
  for t = 1, 2, 3, …, T:
      <yt, ht> = fW(xt, ht-1)

[diagram: the cell f maps (xt, ht-1) to (yt, ht)]

slide credit: Jeff Donahue

Sequence Learning

Equivalent to a T-layer deep network, unrolled in time:

[diagram: the same cell f applied at every step t = 1 … T, starting from h0 = 0, passing the hidden states h1, h2, …, hT-1 forward and emitting outputs z1, z2, …, zT]

slide credit: Jeff Donahue

Sequence Learning

• What should the transition function f be?
• At a minimum, we want something non-linear and differentiable

[diagram: the cell f maps (xt, ht-1) to (zt, ht)]

slide credit: Jeff Donahue

Sequence Learning

• First attempt — a “vanilla” RNN (sketched in code below):

    ht = σ(Whx xt + Whh ht-1 + bh)
    zt = σ(Whz ht + bz)

• Problems
  ‣ Difficult to train — vanishing/exploding gradients
  ‣ Unable to “select” inputs, hidden state, outputs

slide credit: Jeff Donahue
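A direct transcription of these two equations into NumPy, plugged into the h0 := 0 / for t = 1 … T loop from the earlier slide. σ is taken to be tanh here, and all sizes are toy values of my choosing.

import numpy as np

rng = np.random.default_rng(4)
D_in, H, D_out = 8, 16, 4               # toy input, hidden and output sizes

W_hx = rng.normal(0, 0.1, (H, D_in))
W_hh = rng.normal(0, 0.1, (H, H))
b_h  = np.zeros(H)
W_hz = rng.normal(0, 0.1, (D_out, H))
b_z  = np.zeros(D_out)

def sigma(a):
    return np.tanh(a)                   # any squashing non-linearity; tanh here

def f_W(x_t, h_prev):
    # One step of the vanilla RNN from the slide.
    h_t = sigma(W_hx @ x_t + W_hh @ h_prev + b_h)
    z_t = sigma(W_hz @ h_t + b_z)
    return z_t, h_t

# Unroll over a variable-length sequence, as in the earlier pseudocode.
T = 5
xs = [rng.normal(size=D_in) for _ in range(T)]
h = np.zeros(H)                         # h0 := 0
for x_t in xs:
    z_t, h = f_W(x_t, h)
print(z_t.shape, h.shape)               # (4,) (16,)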

Sequence Learning

• LSTM - Long Short Term Memory [Hochreiter & Schmidhuber, 1997]
• Selectively propagate or forget hidden state
• Allows long-term dependencies to be learned
• Effective for
  ‣ speech recognition
  ‣ handwriting recognition
  ‣ translation
  ‣ parsing

slide credit: Marcus Rohrbach

LSTM for sequence modeling

[figures: LSTM cell diagrams from Marcus Rohrbach, "LRCN - an Architecture for Visual Recognition, Description, and Question Answering"]

slide credit: Marcus Rohrbach

Sequence Learning: LSTM (Hochreiter & Schmidhuber, 1997)

[diagrams of the LSTM cell at its two extremes: with the forget gate at 1 and the input gate at 0, the cell exactly remembers the previous cell state ct-1 and discards the input; with the forget gate at 0 and the input gate at 1, it forgets the previous cell state and only uses the modulated input W·xt]

slide credit: Jeff Donahue
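For reference, a sketch of one step of a commonly used LSTM cell (forget, input and output gates plus a tanh candidate; published variants differ in details such as peephole connections). The comments mark the two extremes from the diagrams above: a forget gate near 1 with an input gate near 0 copies the previous cell state, while the opposite setting keeps only the modulated input. Sizes and weights are toy values.

import numpy as np

rng = np.random.default_rng(5)
D, H = 8, 16                                     # input and hidden/cell sizes (toy values)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix per gate: input (i), forget (f), output (o), and candidate (g).
W = {k: rng.normal(0, 0.1, (H, D + H)) for k in "ifog"}
b = {k: np.zeros(H) for k in "ifog"}

def lstm_step(x_t, h_prev, c_prev):
    xh = np.concatenate([x_t, h_prev])
    i = sigmoid(W["i"] @ xh + b["i"])            # input gate: let new info in?
    f = sigmoid(W["f"] @ xh + b["f"])            # forget gate: keep old cell state?
    o = sigmoid(W["o"] @ xh + b["o"])            # output gate: expose the cell state?
    g = np.tanh(W["g"] @ xh + b["g"])            # modulated input (candidate values)
    c = f * c_prev + i * g                       # f ~ 1, i ~ 0  =>  c ~ c_prev (remember exactly)
    h = o * np.tanh(c)                           # f ~ 0, i ~ 1  =>  c ~ g      (use only the input)
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):              # a 5-step toy sequence
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)                          # (16,) (16,)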

LRCN

• Long-term Recurrent Convolutional Networks
• End-to-end trainable framework for sequence problems in vision

[diagram: CNN → LSTM]

slide credit: Jeff Donahue

Image Description

[diagram: the CNN encodes the image; an LSTM produces the caption "<BOS> a dog is jumping <EOS>" as sequential output, one word per step, where each word is an embedded one-hot vector. Three variants are shown: a single LSTM layer, two stacked LSTM layers, and two LSTM layers in a factored arrangement]

slide credit: Jeff Donahue

Image Description: Flickr30k [1], Caption-to-Image Recall@1

  Architecture            Recall@1
  Single Layer            14.1%
  Two Layer                3.8%
  Two Layer, Factored     17.5%
  Four Layer, Factored    15.8%

[1] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.

slide credit: Jeff Donahue

Image Description: COCO [1], CIDEr-D c5 scores

              CaffeNet    VGGNet [2]
  Raw         68.8%       77.3%
  Finetuned   75.8%       83.9%

[1] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.
[2] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.

slide credit: Jeff Donahue

Image Description

[figures: image description examples]

slide credit: Jeff Donahue

Activity Recognition

[diagram: sequential input; each video frame is passed through a CNN and an LSTM, and the per-frame predictions (e.g. studying, running, jumping, jumping) are averaged to give the clip label "jumping"]

slide credit: Jeff Donahue
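The "Average" box above is literally an average of the per-frame class distributions. A tiny sketch with made-up per-frame probabilities standing in for the CNN+LSTM outputs:

import numpy as np

rng = np.random.default_rng(6)
classes = ["studying", "running", "jumping"]

# Stand-in for per-frame class probabilities (4 frames here); in LRCN these
# would come from the CNN+LSTM, not from random numbers.
frame_scores = rng.random((4, len(classes)))
frame_probs = frame_scores / frame_scores.sum(axis=1, keepdims=True)

clip_probs = frame_probs.mean(axis=0)            # average the per-frame predictions
print("per-frame:", [classes[i] for i in frame_probs.argmax(axis=1)])
print("clip label:", classes[int(clip_probs.argmax())])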

Activity Recognition: UCF101 classification accuracy

                      RGB      Optical Flow   RGB+Flow
  Single-Frame CNN    67.7%    72.2%          78.8%
  LRCN                68.2%    77.5%          82.7%

Khurram Soomro, Amir Roshan Zamir and Mubarak Shah. UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. CRCV-TR-12-01, November 2012.

slide credit: Jeff Donahue

Video Description

[diagram: sequential input and output; each frame is passed through a CNN into a stack of LSTMs, and a decoding stage of LSTMs then emits "<BOS> a dog is jumping <EOS>"]

slide credit: Jeff Donahue

Video Description

Coherent Multi-Sentence Video Description with Variable Level of Detail. A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal and B. Schiele. GCPR, 2014.

MPII TACoS Multi-Level Dataset

slide credit: Jeff Donahue

Video Description

[diagram: pre-trained detector predictions are fed to the LSTM encoder; the LSTM decoder then emits "<BOS> a man cuts vegetables <EOS>"]

slide credit: Jeff Donahue

Video Description: generation accuracy (BLEU)

  Approach   BLEU
  SMT        26.9%
  LRCN       28.8%

slide credit: Jeff Donahue

Video Description

[diagram: a CNN processes each frame, the per-frame features are averaged, and a single LSTM decodes "<BOS> a dog is jumping <EOS>"]

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. "Translating Videos to Natural Language Using Deep Recurrent Neural Networks." NAACL 2015 (oral). http://arxiv.org/abs/1412.4729

slide credit: Jeff Donahue
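In this approach the per-frame CNN features are mean-pooled into a single vector before decoding. A sketch with random stand-in features; the 4096-D feature size is my assumption.

import numpy as np

rng = np.random.default_rng(7)
T, D = 30, 4096                               # number of frames, CNN feature size (assumed)

frame_features = rng.normal(size=(T, D))      # stand-in for per-frame CNN activations
video_feature = frame_features.mean(axis=0)   # mean-pool over time: one vector per clip

# video_feature now plays the same role as the single-image feature v in the
# captioning sketches above: it conditions the LSTM that decodes the sentence.
print(video_feature.shape)                    # (4096,)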

Example results range from "Wow I can’t believe that worked" to "Well, I can kind of see it" to "Not sure what happened there..."

"The Unreasonable Effectiveness of Recurrent Neural Networks" (karpathy.github.io)

slide credit: Andrej Karpathy

Character-level language model example

Vocabulary: [h,e,l,o]

Example training sequence: “hello”

slide credit: Andrej Karpathy
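Setting up the training data for this example is just index and one-hot bookkeeping: at every step the input is the current character and the target is the next one.

import numpy as np

vocab = ["h", "e", "l", "o"]
char_to_id = {c: i for i, c in enumerate(vocab)}

text = "hello"
# Input/target pairs: "h"->"e", "e"->"l", "l"->"l", "l"->"o".
inputs  = [char_to_id[c] for c in text[:-1]]
targets = [char_to_id[c] for c in text[1:]]

# One-hot encode the inputs (4-D vectors, one per character in the vocabulary).
X = np.eye(len(vocab))[inputs]
print(X)
print("targets:", [vocab[t] for t in targets])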


Try it yourself: char-rnn on Github (uses Torch7)

slide credit: Andrej Karpathy

Cooking Recipes

[generated text samples]

slide credit: Andrej Karpathy

Obama Speeches

[generated text samples]

slide credit: Andrej Karpathy

Learning from Linux Source Code

[generated code samples]

slide credit: Andrej Karpathy

Yoav Goldberg’s n-gram experiments

Order-10 n-gram model on Shakespeare:

[sample output]

But on Linux source code:

[sample output]

slide credit: Andrej Karpathy

Visualizing and Understanding Recurrent Networks. Andrej Karpathy*, Justin Johnson*, Li Fei-Fei (on arXiv.org as of June 2015)

slide credit: Andrej Karpathy

Hunting interpretable cells

[figures: activations of individual cells visualized over text: a quote detection cell, a line length tracking cell, an if-statement cell, a quote/comment cell, a code depth cell, and a "something interesting" cell (not quite sure what)]

slide credit: Andrej Karpathy

Yann LeCun: What’s Wrong With Deep Learning (keynote, June 2015)

[figures: slides from Yann LeCun’s keynote and from Xiaogang Wang]

slide credit: Yann LeCun and Xiaogang Wang