Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 2 (by Kate Saenko)


Modeling Images, Videos and Text Using the Caffe Deep Learning Library (Part 2)

Kate Saenko

Microsoft Summer Machine Learning School, St Petersburg 2015

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS


Intro to Caffe

http://tutorial.caffe.berkeleyvision.org/

Caffe Tour: follow along with the tutorial slides!


Sequence Learning

• Instances of the form x = <x1, x2, x3, …, xT>

• Variable sequence length T

• Learn a transition function f with parameters W

• f should update the hidden state ht and output yt:

h0 := 0
for t = 1, 2, 3, …, T:
    <yt, ht> = fW(xt, ht-1)

[Diagram: one step of the block f, taking xt and ht-1 as inputs and producing ht and yt]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
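A minimal NumPy sketch of this recurrence (not Caffe code; the toy transition function is just a placeholder to show how the hidden state ht is threaded through the loop and how the length T can vary):

import numpy as np

def f(x_t, h_prev):
    # Toy transition: any non-linear, differentiable function of (x_t, h_prev) would do.
    h_t = np.tanh(x_t + h_prev)       # updated hidden state
    y_t = h_t                         # output for this time step
    return y_t, h_t

x = np.random.randn(7, 4)             # one sequence: T = 7 steps, 4 features each
h_t = np.zeros(4)                     # h0 := 0
outputs = []
for x_t in x:                         # t = 1, 2, ..., T
    y_t, h_t = f(x_t, h_t)
    outputs.append(y_t)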

Sequence Learning

[Diagram: the same block f repeated across time steps t = 1 … T; each copy takes xt and the previous hidden state ht-1 and produces ht and zt]

Equivalent to a T-layer deep network, unrolled in time

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Sequence Learning

• What should the transition function f be?

• At a minimum, we want something non-linear and differentiable

[Diagram: one step of f, mapping (xt, ht-1) to (ht, zt)]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Sequence Learning

• A “vanilla” RNN:

ht = σ(Whx xt + Whh ht-1 + bh)
zt = σ(Whz ht + bz)

• Problems:
  • Difficult to train: vanishing/exploding gradients
  • Unable to “select” inputs, hidden state, outputs

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
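A minimal NumPy sketch of one vanilla-RNN step, following the two equations above (the sizes and random weights are illustrative, and this is not Caffe's implementation):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

D, H, O = 4, 8, 3                      # input, hidden, and output sizes (arbitrary)
W_hx = 0.1 * np.random.randn(H, D)     # input-to-hidden weights
W_hh = 0.1 * np.random.randn(H, H)     # hidden-to-hidden weights
W_hz = 0.1 * np.random.randn(O, H)     # hidden-to-output weights
b_h, b_z = np.zeros(H), np.zeros(O)

def rnn_step(x_t, h_prev):
    h_t = sigmoid(W_hx @ x_t + W_hh @ h_prev + b_h)   # ht = σ(Whx xt + Whh ht-1 + bh)
    z_t = sigmoid(W_hz @ h_t + b_z)                   # zt = σ(Whz ht + bz)
    return z_t, h_t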

Background - LSTM Unit

• RNNs have the vanishing gradient problem

• Solution: LSTM by Hochreiter & Schmidhuber (1997)

• Keeps information for a long time

• Memory cell:
  – keep gate = 1 maintains state
  – write gate allows input
  – read gate allows output
  – differentiable

[Diagram: a memory cell (storing e.g. 1.73) connected to the rest of the RNN, with keep, write, and read gates controlling whether the state is maintained, written to, or read out]

based on slide by Geoff Hinton

Background - LSTM Unit

[Diagram: the memory cell over time. The value 1.7 is written into the cell (write gate = 1), maintained across several time steps (keep gate = 1), read out later (read gate = 1), and then discarded (keep gate = 0)]

based on slide by Geoff Hinton
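A tiny Python sketch of this idealized memory cell (binary keep/write/read gates as in the Hinton-style picture above; the real LSTM uses learned, continuous gates):

def memory_cell_step(state, x, keep, write, read):
    # keep gate maintains the old state, write gate admits the input
    state = keep * state + write * x
    # read gate decides whether the stored value is output to the rest of the RNN
    output = read * state
    return state, output

# Trace of the example above: 1.7 is written, kept, then read out two steps later.
state = 0.0
for x, keep, write, read in [(1.7, 0, 1, 0), (0.0, 1, 0, 0), (0.0, 1, 0, 1)]:
    state, out = memory_cell_step(state, x, keep, write, read)
    print(state, out)   # state stays at 1.7; out is 1.7 only on the step where read = 1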

Sequence Learning

Long Short-Term Memory (LSTM)

Proposed by Hochreiter and Schmidhuber, 1997

Sequence Learning

LSTM (Hochreiter & Schmidhuber, 1997)

• Allows long-term dependencies to be learned

• Effective for:
  • speech recognition
  • handwriting recognition
  • translation
  • parsing

Sequence Learning

LSTM (Hochreiter & Schmidhuber, 1997)

[Diagram: gate settings (1 for the previous cell state ct-1, 0 for the new input) that exactly remember the previous cell state and discard the input]

Jeff Donahue

Sequence Learning

LSTM (Hochreiter & Schmidhuber, 1997)

[Diagram: gate settings (0 for the previous cell state, 1 for the new input Wxt) that forget the previous cell state and only remember the input]

Jeff Donahue
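A minimal NumPy sketch of one full LSTM step in its common modern form (input, forget, and output gates plus a candidate update; the stacked weight layout and sizes are illustrative, not Caffe's implementation):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

D, H = 4, 8                                 # input and hidden sizes (arbitrary)
W = 0.1 * np.random.randn(4 * H, D + H)     # stacked weights for the i, f, o, g blocks
b = np.zeros(4 * H)

def lstm_step(x_t, h_prev, c_prev):
    a = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(a[0*H:1*H])                 # input gate: how much new content to write
    f = sigmoid(a[1*H:2*H])                 # forget gate: how much of c_prev to keep
    o = sigmoid(a[2*H:3*H])                 # output gate: how much of the cell to expose
    g = np.tanh(a[3*H:4*H])                 # candidate cell content
    c_t = f * c_prev + i * g                # f = 1, i = 0: exactly remember c_prev (slide above)
    h_t = o * np.tanh(c_t)
    return h_t, c_t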

Image Description

[Diagram: a CNN encodes the image; its features feed a chain of LSTM units that produce the sequential output “<BOS> a man is talking <EOS>”, one word per time step]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
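A schematic Python sketch of greedy caption decoding with such a model (everything here, including the random weights, toy vocabulary, and the simplified recurrent update, is an illustrative stand-in rather than the actual CNN + LSTM captioner):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["<BOS>", "<EOS>", "a", "man", "is", "talking"]
V, H, F = len(vocab), 16, 32                 # vocab size, hidden size, CNN feature size (toy)

W_embed = 0.1 * rng.normal(size=(V, H))      # word embedding table
W_img   = 0.1 * rng.normal(size=(F, H))      # projects CNN features into the recurrent net
W_out   = 0.1 * rng.normal(size=(H, V))      # hidden state -> word scores

def step(x, h):                              # simplified recurrent update (stand-in for the LSTM)
    return np.tanh(x + h)

image_feat = rng.normal(size=F)              # pretend CNN output for one image
h = step(image_feat @ W_img, np.zeros(H))

word, caption = "<BOS>", []
for _ in range(10):                          # greedy decoding: feed back the best word each step
    h = step(W_embed[vocab.index(word)], h)
    word = vocab[int(np.argmax(h @ W_out))]
    if word == "<EOS>":
        break
    caption.append(word)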

Video Description

[Diagram: sequential input & output. A CNN encodes each video frame; a two-layer stack of LSTMs first reads the frame features, then, starting from <BOS>, produces the caption “a man is talking <EOS>” one word per time step]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Sequence learning features now available in Caffe.

Check out PR #2033 “Unrolled recurrent layers (RNN, LSTM)”

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Training Sequence Models

• At training time, we want the model to predict the next time step given all previous time steps: p(wt+1 | w1:t)

• Example: “A bee buzzes.”

  t  input   output
  0  <BOS>   a
  1  a       bee
  2  bee     buzzes
  3  buzzes  <EOS>

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
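A minimal Python sketch of building these next-word training pairs from a tokenized caption (illustrative preprocessing, not Caffe code):

caption = "a bee buzzes".split()
tokens = ["<BOS>"] + caption + ["<EOS>"]

# The input at step t is token t; the training target is the following token.
pairs = list(zip(tokens[:-1], tokens[1:]))
# -> [('<BOS>', 'a'), ('a', 'bee'), ('bee', 'buzzes'), ('buzzes', '<EOS>')]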

Sequence Input Format

• First input: “cont” (continuation) indicators (T x N)

• Second input: data (T x N x D)

• Example with N = 2 streams and T = 6 time steps per batch (cont = 0 marks the first word of a new sequence; a sequence can continue into the next batch):

  batch 1, stream 1:   a    dog  fetches  <EOS>  the    bee
  cont:                0    1    1        1      0      1
  batch 1, stream 2:   cat  in   a        hat    <EOS>  a
  cont:                0    1    1        1      1      0
  batch 2…, stream 1:  buzzes  <EOS>  a      …
  cont:                1       1      0      …
  batch 2…, stream 2:  tree    falls  <EOS>  …
  cont:                1       1      1      …

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
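A minimal NumPy sketch of assembling the cont indicators for one such batch (illustrative preprocessing; the T x N layout and the 0-marks-a-new-sequence convention follow the example above):

import numpy as np

T, N = 6, 2
# Time steps at which a new sequence begins in each of the N streams of batch 1.
sequence_starts = [[0, 4],    # stream 1: "a dog fetches <EOS>" then "the bee ..."
                   [0, 5]]    # stream 2: "cat in a hat <EOS>" then "a ..."

cont = np.ones((T, N), dtype=np.float32)   # 1 = continues the previous time step
for n, starts in enumerate(sequence_starts):
    cont[starts, n] = 0                    # 0 at the first time step of every sequence

print(cont[:, 0])   # [0. 1. 1. 1. 0. 1.]
print(cont[:, 1])   # [0. 1. 1. 1. 1. 0.]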

Sequence Input Format

• Inference is exact over arbitrarily many batches

• Backpropagation is approximate: truncated at batch boundaries

[Same batch layout as above: N = 2 streams, T = 6 time steps, with sequences continuing from batch 1 into batch 2…]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Sequence Input Format

• Words are usually represented as one-hot vectors

[Diagram: a vocabulary with 8000 words, indexed a = 0, the = 1, am = 2, …, cat = 3999, bee = 4000, dog = 4001, …, run = 7999; the word “bee” becomes an 8000-D vector that is all 0s except for a 1 at the index of the word]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
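A minimal NumPy sketch of this one-hot encoding (a toy vocabulary stands in for the 8000-word one above):

import numpy as np

vocab = {"a": 0, "the": 1, "am": 2, "cat": 3, "bee": 4, "dog": 5, "run": 6}   # toy vocabulary

def one_hot(word, vocab):
    v = np.zeros(len(vocab), dtype=np.float32)   # all 0s ...
    v[vocab[word]] = 1.0                         # ... except a 1 at the word's index
    return v

print(one_hot("bee", vocab))   # [0. 0. 0. 0. 1. 0. 0.]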

Sequence Input Format

• EmbedLayer projects the one-hot vector to a dense, lower-dimensional embedding

[Diagram: the word “bee”, an 8000-D one-hot vector with a single 1 at index 4000 (the index of the word “bee” in our vocabulary), is mapped to a 500-D embedding of “bee”, e.g. (2.7e-2, 1.5e-1, 3.1e+2, …)]

layer {
  type: "Embed"
  embed_param {
    input_dim: 8000
    num_output: 500
  }
}

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
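A minimal NumPy sketch of what this projection amounts to: multiplying the one-hot vector by an 8000 x 500 weight matrix is the same as selecting one row of that matrix (Caffe's EmbedLayer takes the word index as input and does the row lookup directly):

import numpy as np

V, E = 8000, 500                       # vocabulary size and embedding size, as in the layer above
W = 0.01 * np.random.randn(V, E)       # the embedding weights learned by the layer

bee_index = 4000
one_hot = np.zeros(V)
one_hot[bee_index] = 1.0

via_matmul = one_hot @ W               # projecting the one-hot vector...
via_lookup = W[bee_index]              # ...is just picking row 4000 of the weight matrix
assert np.allclose(via_matmul, via_lookup)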

Image Description

[Two diagram slides; figures not transcribed]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Video Description

[Diagram: the same CNN + stacked-LSTM architecture; N time steps to watch the video, then M time steps to produce the caption “a man is talking <EOS>” starting from <BOS>]

Venugopalan et al., “Sequence to Sequence -- Video to Text,” 2015.
http://arxiv.org/abs/1505.00487

N timesteps: watch video; M timesteps: produce caption

Jeff Donahue
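A schematic Python sketch of the S2VT encode-then-decode timing (toy stand-ins throughout; the real model is the stacked CNN + LSTM network defined in Caffe):

import numpy as np

rng = np.random.default_rng(0)
H = 16                                            # hidden size (toy value)

def step(x, h):                                   # stand-in for the stacked LSTM update
    return np.tanh(x + h)

frames = [rng.normal(size=H) for _ in range(8)]   # pretend per-frame CNN features, N = 8
h = np.zeros(H)

# Phase 1: watch the video for N time steps; no words are produced yet.
for f in frames:
    h = step(f, h)

# Phase 2: produce the caption for M time steps, feeding back the previous word.
prev_word_vec = rng.normal(size=H)                # pretend embedding of <BOS>
for _ in range(5):                                # M = 5 decoding steps in this sketch
    h = step(prev_word_vec, h)
    prev_word_vec = h                             # in the real model: embed the predicted word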

Language model trained on image captions (notebook)

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015