Intelligent Perception
S. M. Ali Eslami
December 2016
Underlying scene Observation
?
1. How should the scene be represented?
2. How should the representation be computed?
Learning paradigms
[Figure: Supervised Learning: x, z, y ("horse")]
[Figure: Input → Computer → Output (Horse, Cow). Algorithm: Preprocessing, Feature Extraction, Feature Selection, Learned Discrimination, Calibration]
[Figure: Input → Computer → Output (Horse, Cow). Algorithm: Stage 1, Stage 2, Stage 3, Stage 4, Stage 5]
Introduction
● Optimize directly for the end loss
● End-to-end training, no engineered inputs
● With enough data, learn a big non-linear function
● Supervised labeling is often enough for transferable representations
● Large labeled dataset + big / deep neural network + GPUs
Deep Supervised Learning
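A minimal sketch of "optimize directly for the end loss", assuming a toy two-class problem that no linear model can solve. The network size, learning rate, data, and function names (`init_params`, `forward`, `loss_fn`, `train`) are illustrative assumptions, not taken from the talk. A small non-linear network is trained end to end on raw inputs, with no engineered features.

```python
import numpy as np

def init_params(rng, n_in=2, hidden=16):
    """Random weights for a one-hidden-layer network (sizes are assumptions)."""
    w1 = rng.normal(scale=0.5, size=(n_in, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.5, size=hidden)
    b2 = 0.0
    return w1, b1, w2, b2

def forward(params, x):
    w1, b1, w2, b2 = params
    h = np.tanh(x @ w1 + b1)                   # learned features, not engineered ones
    p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))   # predicted P(y = 1 | x)
    return h, p

def loss_fn(params, x, y):
    """Mean cross-entropy: the end loss that training optimizes directly."""
    _, p = forward(params, x)
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def train(params, x, y, steps=2000, lr=0.5):
    """Full-batch gradient descent with hand-written backpropagation."""
    w1, b1, w2, b2 = params
    n = len(y)
    for _ in range(steps):
        h, p = forward((w1, b1, w2, b2), x)
        d_logits = (p - y) / n                       # gradient of the loss w.r.t. logits
        d_h = np.outer(d_logits, w2) * (1 - h ** 2)  # backprop through tanh
        w2 = w2 - lr * (h.T @ d_logits)
        b2 = b2 - lr * d_logits.sum()
        w1 = w1 - lr * (x.T @ d_h)
        b1 = b1 - lr * d_h.sum(axis=0)
    return w1, b1, w2, b2

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(400, 2))
y = (x[:, 0] * x[:, 1] > 0).astype(float)   # hypothetical XOR-style labels

params0 = init_params(rng)
loss_before = loss_fn(params0, x, y)
loss_after = loss_fn(train(params0, x, y), x, y)
print(loss_before, loss_after)
```

The same loop scales in spirit to the deep networks on these slides; in practice an automatic-differentiation library and GPUs replace the hand-written gradients.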
Clarifai (2014)
Deep Supervised Learning
Text Classification: Zhang et al. (2015)
Video Classification: Simonyan et al. (2014)
● Innovation continues
○ Inception (Szegedy et al., 2015)
○ Residual connections (He et al., 2015)
○ Batchnorm (Ioffe et al., 2015)
● Performance is continuously improving
Deep Supervised Learning
Szegedy et al. (2015)
Where does the data come from?
What is the correct representation?
Learning paradigms
[Figure: Supervised Learning: x, z, y ("horse"). Reinforcement Learning: x, z, a ("left"), env]
Human-level control in ATARI
End-to-end reinforcement learning
Mnih et al. (2015)
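A minimal sketch of the reinforcement learning paradigm (observe a state, choose an action such as "left", receive reward from the environment). This is tabular Q-learning on a hypothetical five-state chain, not the deep Q-network of Mnih et al. (2015), which replaces the table with a convolutional network over raw pixels. The environment, constants, and function names are assumptions for illustration.

```python
import numpy as np

N_STATES, LEFT, RIGHT = 5, 0, 1

def step(state, action):
    """Hypothetical chain environment: reward 1 for reaching the right end."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, 2))            # one value per (state, action) pair
    for _ in range(episodes):
        state = 0
        for _ in range(100):               # cap on episode length
            action = int(rng.integers(2))  # uniformly random behaviour policy
            nxt, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * q[nxt].max())
            q[state, action] += alpha * (target - q[state, action])
            state = nxt
            if done:
                break
    return q

q = q_learning()
greedy = q[:-1].argmax(axis=1)   # greedy action in each non-terminal state
print(greedy)
```

Even this toy problem takes hundreds of episodes of random interaction, which gives some feel for the question of how much experience is really needed.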
How much experience do we really need?
Learning paradigms
[Figure: Supervised Learning: x, z, y ("horse"). Reinforcement Learning: x, z, a ("left"), env. Generative Modelling: x, z, Decoder, x, y]
Learning paradigms
[Figure: Supervised Learning: x, z, y ("horse"). Reinforcement Learning: x, z, a ("left"), env. Generative Modelling: x, z, Decoder, x, y. Added labels: (2.3, -1, 0.5, 3); "not blinking"]
Highly structured
General Purpose Graphics Programming
Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, and Joshua B. Tenenbaum (2013)
Partially structured
A Stochastic Grammar of Images
Song-Chun Zhu and David Mumford (2007)
Partially structured
S. M. Ali Eslami and Christopher K. I. Williams (2012)
Fully unstructured
Geoffrey Hinton (2006) Antti Rasmus et al. (2016) Jeff Donahue et al. (2016)
Attend, Infer, Repeat: Fast Scene Understanding with Generative Models
S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, Geoffrey Hinton
Neural Information Processing Systems (NIPS), 2016
To obtain object-based representations
To learn from orders-of-magnitude less data
Motivation
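[Figure: Model / Image / Cause. Image x with a single cause z: "blue brick"]
[Figure: Model / Image / Cause. Image x with a single cause z: "blue brick", "pile of bricks"]
Not sufficient for: grasping, counting, transfer, generalisation
[Figure: Model / Image / Cause. Image x with a single cause z ("pile of bricks") and with separate causes z1, z2 ("blue brick", "red brick")]
[Figure: Model / Image / Cause. Image x with causes z1, z2, each split into zwhat and zwhere, with per-object outputs yatt1, yatt2 and y1, y2. Labels: "pile of bricks"; "blue brick", "red brick"; "blue brick, above", "red brick, below"]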
[Figure: Model and Inference Network. Model: x with a single z, and x with z1, z2, z3. Inference Network: x, z, Decoder, y, and x with hidden states h1, h2, h3 producing z1, z2, z3, Decoder, y]
[Figure: Model and Inference Network. Each zi is split into zpres, zwhat and zwhere (i = 1, 2, 3). Model: zwhat and zwhere for each object, with per-object outputs yatt1, yatt2 and y1, y2 forming x]
focus on representation not reconstruction
output is a set
order? count?
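[Figure: one step i of the network. Nodes: x, y, hidden state hi, zpres, zwhere, zwhat, attended patches xatt and yatt, a VAE between xatt and yatt, and output yi]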
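A minimal sketch of how the per-object latents on these slides could generate a scene, assuming a simplified setup: zpres decides whether another object exists, zwhat is decoded into a small patch, and zwhere places the patch on the canvas. The linear `decode_patch`, integer-offset placement (standing in for a differentiable spatial transformer), and all sizes and probabilities are assumptions for illustration, not the trained model from the paper.

```python
import numpy as np

def decode_patch(z_what, weights):
    """Hypothetical stand-in decoder: linear map squashed to (0, 1)."""
    size = int(np.sqrt(weights.shape[1]))
    return (1.0 / (1.0 + np.exp(-(z_what @ weights)))).reshape(size, size)

def sample_scene(rng, weights, canvas=32, max_objects=3, p_pres=0.7):
    """Sample a variable number of objects and render them onto one canvas."""
    z_dim = weights.shape[0]
    patch = int(np.sqrt(weights.shape[1]))
    y = np.zeros((canvas, canvas))
    objects = []
    for _ in range(max_objects):
        z_pres = rng.random() < p_pres
        if not z_pres:
            break  # the first z_pres = 0 ends the scene, so the count varies
        z_what = rng.normal(size=z_dim)                # appearance code
        r, c = rng.integers(0, canvas - patch + 1, 2)  # z_where: top-left corner
        y_att = decode_patch(z_what, weights)          # object in its own frame
        y[r:r + patch, c:c + patch] += y_att           # place object in the scene
        objects.append({"z_what": z_what, "z_where": (int(r), int(c))})
    return y, objects

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 64))   # z_dim = 4, patch = 8 x 8 (assumed sizes)
y, objects = sample_scene(rng, weights)
print(len(objects), y.shape)
```

The returned `objects` list is the object-based representation: its length answers "count?", while the order of its entries carries no meaning, matching "output is a set".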
Omniglot
Representational power
[Figure: questions "Sum?" and "Increasing order?"; example labels: 6, 9, no, yes]
Additional structure
[Figure: x, z. Learned: distributed vector that correlates with "blue brick"]
Additional structure
[Figure: x, z. Learned: distributed vector that correlates with "blue brick". Specified: class=brick, colour=blue, position=P, rotation=R]
[Figure: Decoder with x, y, hidden states h1, h2, h3 and latents z1, z2, z3]
Additional structure
specified
Inverse graphics
Policy learning
Table-top MNIST
Unsupervised Learning of 3D Structure from Images
Danilo Rezende, S. M. Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, Nicolas Heess
Neural Information Processing Systems (NIPS), 2016
To recover 3D structure from 2D images
To form stable representations, regardless of camera position
● Inherently ill-posed
○ All objects appear under self-occlusion, infinite explanations
○ Therefore build statistical models to know what’s likely and what’s not
● Even with models, inference is intractable
○ Important to capture multi-modal explanations
● How are 3D scenes best represented?
○ Meshes or voxels?
● Where is training data collected from?
Motivation
Projection operators
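A minimal sketch of what a projection operator does, assuming a voxel occupancy grid and an orthographic silhouette projection along one axis. This particular operator and the box-shaped object are illustrative assumptions; the talk does not specify its operators at this level. The point is that one 3D representation yields different 2D images depending on the viewing direction.

```python
import numpy as np

def project(volume, axis=0):
    """Orthographic silhouette: a pixel is on if any voxel along the ray is occupied."""
    return volume.max(axis=axis)

# Hypothetical scene: one box-shaped object in an 8 x 8 x 8 voxel grid.
volume = np.zeros((8, 8, 8))
volume[2:6, 3:5, 1:7] = 1.0

front = project(volume, axis=0)   # 2 x 6 silhouette, 12 pixels on
side = project(volume, axis=1)    # 4 x 6 silhouette, 24 pixels on
top = project(volume, axis=2)     # 4 x 2 silhouette, 8 pixels on
print(front.sum(), side.sum(), top.sum())
```

Going the other way is the ill-posed part: many different volumes produce the same `front` image, which is why inference needs a statistical model of likely shapes.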
Unconditional samples
Class-conditional samples
Multi-modality of inference
3D structure from multiple 2D images
Inferring object meshes
● Deep Supervised Learning
● Deep Reinforcement Learning
● Model-based Methods
● Structured / Unstructured Generative Models
Recap