Intelligent Perception

S. M. Ali Eslami

December 2016

Underlying scene Observation

?

1. How should the scene be represented?

2. How should the representation be computed?

Learning paradigms

[Diagram: Supervised Learning, with input x, representation z, and label y ("horse")]

[Diagram: Input, Computer, Output ("Horse" / "Cow"). The Algorithm consists of: Preprocessing, Feature Extraction, Feature Selection, Learned Discrimination, Calibration]

[Diagram: Input, Computer, Output ("Horse" / "Cow"). The Algorithm consists of: Stage 1, Stage 2, Stage 3, Stage 4, Stage 5]

Introduction

● Optimize directly for the end loss

● End-to-end training, no engineered inputs

● With enough data, learn a big non-linear function

● Supervised labeling is often enough for transferable representations

● Large labeled dataset + big / deep neural network + GPUs
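The first two bullets can be illustrated with a deliberately tiny sketch. It is not the kind of deep network the slide has in mind: it is a linear classifier on raw inputs, trained by gradient descent directly on the cross-entropy loss, with no engineered features in between. The toy data, the labels, and the function names `train` and `predict` are assumptions made for this illustration.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def train(data, steps=2000, lr=0.5, seed=0):
    """Fit weights w and bias b by gradient descent on the mean cross-entropy loss."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    b = 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the cross-entropy w.r.t. the pre-activation
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # One update of every parameter, driven only by the end loss.
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    score = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return "horse" if score > 0.5 else "cow"


# Hypothetical toy data: raw 2-D inputs, label 1 = "horse", label 0 = "cow".
DATA = [((2.0, 1.0), 1), ((1.5, 2.0), 1), ((-1.0, -1.5), 0), ((-2.0, -0.5), 0)]
w, b = train(DATA)
```

Replacing the single linear map with a deep non-linear network changes the model, not the recipe: the parameters are still adjusted only by the gradient of the final loss.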

Deep Supervised Learning

Clarifai (2014)


Text Classification: Zhang et al. (2015)

Video Classification: Simonyan et al. (2014)

● Innovation continues
  ○ Inception (Szegedy et al., 2015)
  ○ Residual connections (He et al., 2015)
  ○ Batchnorm (Ioffe et al., 2015)

● Performance is continuously improving


Szegedy et al. (2015)
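Of the innovations listed, the residual connection is the easiest to show in a few lines. The sketch below is a simplified, fully connected version for illustration only. Published residual networks use convolutions and batch normalisation inside the block, which are omitted here, and the helper names are assumptions.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, bias, v):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def residual_block(weights, bias, v):
    """Return v + relu(W v + b): the layer learns a correction on top of the identity."""
    f = relu(linear(weights, bias, v))
    return [a + c for a, c in zip(v, f)]


# With all-zero parameters the block passes its input through unchanged.
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
identity_out = residual_block(zero_w, zero_b, [1.0, -2.0])
```

Because the skip path carries the input forward untouched, stacking many such blocks starts out close to the identity function.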

Where does the data come from?

What is the correct representation?

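Learning paradigms

[Diagram: Supervised Learning, with input x, representation z, and label y ("horse"). Reinforcement Learning, with input x, representation z, action a ("left"), and environment (env)]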

Human-level control in ATARI

End-to-end reinforcement learning

Mnih et al. (2015)

http://www.youtube.com/watch?v=Iw4tTDGUlIY

How much experience do we really need?

Learning paradigms

[Diagram: Supervised Learning (x, z, y: "horse"), Reinforcement Learning (x, z, a: "left", env), and Generative Modelling (x, z, and a Decoder that maps z back to x)]

Learning paradigms

[Diagram: Supervised Learning, Reinforcement Learning, and Generative Modelling (x, z, Decoder, x), with outputs y and a. Example annotations: "horse", "left", env, (2.3, -1, 0.5, 3), "not blinking"]

Highly structured

General Purpose Graphics Programming

Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, and Joshua B. Tenenbaum (2013)

Partially structured

A Stochastic Grammar of Images

Song-Chun Zhu and David Mumford (2007)

Partially structured

S. M. Ali Eslami and Christopher K. I. Williams (2012)

Fully unstructured

Geoffrey Hinton (2006); Antti Rasmus et al. (2016); Jeff Donahue et al. (2016)

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models

S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, Geoffrey Hinton

Neural Information Processing Systems (NIPS), 2016

Motivation

● To obtain object-based representations

● To learn from orders-of-magnitude less data

[Diagram: Model with image x and cause z ("blue brick")]

[Diagram: Model with image x and a single cause z, shown for "blue brick" and for "pile of bricks"]

not sufficient for: grasping, counting, transfer, generalisation

[Diagram: Model with image x and a single cause z, next to a model with image x and two causes z1, z2. Labels: "blue brick", "red brick", "pile of bricks"]

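[Diagram: Model with image x and cause z, next to a model in which each object i has its own z_what and z_where, producing y_att for object 1 and object 2, which together form x. Labels: "blue brick", "red brick", "pile of bricks"; "blue brick" above, "red brick" below]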

[Diagram: Model and Inference Network. Image x with latents z1, z2, z3; hidden states h1, h2, h3; a Decoder producing y. In the final version each latent is split into z_pres, z_what, and z_where]

[Diagram: Model and Inference Network side by side. Model: per-object z_what and z_where produce y_att for each object, which together form x. Inference Network: hidden states h1, h2, h3 produce z_pres, z_what, and z_where for each of z1, z2, z3, with a Decoder producing y]

focus on representation, not reconstruction

output is a set: order? count?

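[Diagram: a single step i of the network, with labels x, y, hidden state h for step i, z_pres, z_where, z_what, attention-window input x_att and output y_att, per-step output y for step i, and a VAE block]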
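The generative side of the diagrams above can be sketched as a toy sampling procedure. This is only an illustration of the z_pres / z_what / z_where split, not the model from the paper. Here z_what is reduced to a single brightness value, z_where to an integer offset, and the "decoder" to a constant patch. All parameter names and default values are assumptions.

```python
import random


def sample_scene(rng, max_objects=3, p_present=0.6, canvas=8, patch=2):
    """Sample a toy scene: a variable-length list of (z_what, z_where) and an image x."""
    x = [[0.0] * canvas for _ in range(canvas)]
    objects = []
    for _ in range(max_objects):
        z_pres = rng.random() < p_present  # is there another object?
        if not z_pres:
            break  # the first absent object ends the scene
        z_what = rng.uniform(0.5, 1.0)  # appearance, here just a brightness
        z_where = (rng.randrange(canvas - patch + 1),
                   rng.randrange(canvas - patch + 1))  # top-left corner
        # Toy decoder: z_what becomes a small patch y_att of constant brightness.
        y_att = [[z_what] * patch for _ in range(patch)]
        # Write y_att onto the canvas at z_where (patches add where they overlap).
        r0, c0 = z_where
        for r in range(patch):
            for c in range(patch):
                x[r0 + r][c0 + c] += y_att[r][c]
        objects.append((z_what, z_where))
    return objects, x


objects, x = sample_scene(random.Random(0))
```

The returned `objects` list has a different length from scene to scene, which is the sense in which the representation is a set whose count must itself be inferred.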

Demo reel

http://www.youtube.com/watch?v=EKsgjR4Txk4

Omniglot

Representational power

[Figure: digit images (e.g. 6, 9) with downstream questions "Sum?" and "Increasing order?" (answers shown: no / yes)]

Additional structure

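[Diagram: image x and latent z. Learned: a distributed vector that correlates with "blue brick"]

[Diagram: image x and latent z. Learned: a distributed vector that correlates with "blue brick". Specified: class=brick, colour=blue, position=P, rotation=R]

[Diagram: network with hidden states h1, h2, h3, latents z1, z2, z3, and a Decoder producing y from x, with the latents labelled "specified"]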

Inverse graphics

Policy learning

Table-top

MNIST

Unsupervised Learning of 3D Structure from Images

Danilo Rezende, S. M. Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, Nicolas Heess

Neural Information Processing Systems (NIPS), 2016

Motivation

● To recover 3D structure from 2D images

● To form stable representations, regardless of camera position

● Inherently ill-posed
  ○ All objects appear under self-occlusion, infinite explanations
  ○ Therefore build statistical models to know what’s likely and what’s not

● Even with models, inference is intractable
  ○ Important to capture multi-modal explanations

● How are 3D scenes best represented?
  ○ Meshes or voxels?

● Where is training data collected from?

Projection operators
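A projection operator maps a 3D representation to a 2D image. As a minimal sketch, assuming a binary voxel grid indexed as `volume[depth][row][col]` and a fixed camera looking along the depth axis, an orthographic projection can be written as below. The function name and the indexing convention are assumptions for this illustration, not the operators used in the paper.

```python
def project(volume):
    """Orthographic projection along depth: a pixel is lit if any voxel behind it is occupied."""
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    return [
        [max(volume[d][r][c] for d in range(depth)) for c in range(width)]
        for r in range(height)
    ]


# Two different 2x2x2 volumes: one voxel at the front, or the same voxel one step back.
front = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
back = [[[0, 0], [0, 0]], [[1, 0], [0, 0]]]
same_image = project(front) == project(back)
```

The two volumes differ but produce the same image, a small instance of why recovering 3D structure from a single 2D view is ill-posed.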


Unconditional samples


Class-conditional samples

http://www.youtube.com/watch?v=bT_i5NGxuL0



Multi-modality of inference


3D structure from multiple 2D images


Inferring object meshes


http://www.youtube.com/watch?v=OW_cOyoE3oA

http://www.youtube.com/watch?v=sq8sf1Nfg2A

http://www.youtube.com/watch?v=RXVMvAdv4Wo

Recap

● Deep Supervised Learning

● Deep Reinforcement Learning

● Model-based Methods

● Structured / Unstructured Generative Models
