Picture: A Probabilistic Programming Language for Scene...

2
Picture: A Probabilistic Programming Language for Scene Perception Tejas D Kulkarni MIT [email protected] Pushmeet Kohli Microsoft Research [email protected] Joshua B Tenenbaum MIT [email protected] Vikash Mansinghka MIT [email protected] Abstract Recent progress on probabilistic modeling and statisti- cal learning, coupled with the availability of large training datasets, has led to remarkable progress in computer vision. Generative probabilistic models, or “analysis-by-synthesis” approaches, can capture rich scene structure but have been less widely applied than their discriminative counterparts, as they often require considerable problem-specific engineering in modeling and inference, and inference is typically seen as requiring slow, hypothesize-and-test Monte Carlo meth- ods. Here we present Picture, a probabilistic programming language for scene understanding that allows researchers to express complex generative vision models, while auto- matically solving them using fast general-purpose inference machinery. Picture provides a stochastic scene language that can express generative models for arbitrary 2D/3D scenes, as well as a hierarchy of representation layers for compar- ing scene hypotheses with observed images by matching not simply pixels, but also more abstract features (e.g., contours, deep neural network activations). Inference can flexibly integrate advanced Monte Carlo strategies with fast bottom- up data-driven methods. Thus both representations and inference strategies can build directly on progress in discrim- inatively trained systems to make generative vision more robust and efficient. We use Picture to write programs for 3D face analysis, 3D human pose estimation, and 3D object reconstruction – each competitive with specially engineered baselines. 1. Introduction Probabilistic scene understanding systems aim to pro- duce high-probability descriptions of scenes conditioned on observed images or videos, typically either via discrimina- tively trained models or generative models in an “analysis by synthesis” framework. Discriminative approaches lend them- selves to fast, bottom-up inference methods and relatively knowledge-free, data-intensive training regimes, and have been remarkably successful on many recognition problems [1, 4]. Generative approaches hold out the promise of ana- lyzing complex scenes more richly and flexibly [8, 2, 5, 6, 3], but have been less widely embraced for two main reasons: Inference typically depends on slower forms of approximate inference, and both model-building and inference can in- volve considerable problem-specific engineering to obtain robust and reliable results. These factors make it difficult to develop simple variations on state-of-the-art models, to thoroughly explore the many possible combinations of mod- eling, representation, and inference strategies, or to richly integrate complementary discriminative and generative mod- eling approaches to the same problem. More generally, to handle increasingly realistic scenes, generative approaches will have to scale not just with respect to data size but also with respect to model and scene complexity. This scaling will arguably require general-purpose frameworks to com- pose, extend and automatically perform inference in complex structured generative models – tools that for the most part do not yet exist. Here we present Picture, a probabilistic programming language that aims to provide a common representation lan- guage and inference engine suitable for a broad class of generative scene perception problems. We see probabilistic programming as key to realizing the promise of “vision as inverse graphics”. Generative models can be represented via stochastic code that samples hypothesized scenes and generates images given those scenes. Rich deterministic and stochastic data structures can express complex 3D scenes that are difficult to manually specify. Multiple representation and inference strategies are specifically designed to address the main perceived limitations of generative approaches to vision. Instead of requiring photo-realistic generative mod- els with pixel-level matching to images, we can compare hypothesized scenes to observations using a hierarchy of more abstract image representations such as contours, dis- criminatively trained part-based skeletons, or deep neural network features. Available Markov Chain Monte Carlo (MCMC) inference algorithms include not only traditional Metropolis-Hastings, but also more advanced techniques for inference in high-dimensional continuous spaces, such as el- liptical slice sampling, and Hamiltonian Monte Carlo which can exploit the gradients of automatically differentiable ren- derers. These top-down inference approaches are integrated with bottom-up and automatically constructed data-driven 1

Transcript of Picture: A Probabilistic Programming Language for Scene...

Page 1: Picture: A Probabilistic Programming Language for Scene ...sunw.csail.mit.edu/2015/papers/75_Kulkarni_SUNw.pdfPicture: A Probabilistic Programming Language for Scene Perception Tejas

Picture: A Probabilistic Programming Language for Scene Perception

Tejas D KulkarniMIT

[email protected]

Pushmeet KohliMicrosoft Research

[email protected]

Joshua B TenenbaumMIT

[email protected]

Vikash MansinghkaMIT

[email protected]

Abstract

Recent progress on probabilistic modeling and statisti-cal learning, coupled with the availability of large trainingdatasets, has led to remarkable progress in computer vision.Generative probabilistic models, or “analysis-by-synthesis”approaches, can capture rich scene structure but have beenless widely applied than their discriminative counterparts, asthey often require considerable problem-specific engineeringin modeling and inference, and inference is typically seenas requiring slow, hypothesize-and-test Monte Carlo meth-ods. Here we present Picture, a probabilistic programminglanguage for scene understanding that allows researchersto express complex generative vision models, while auto-matically solving them using fast general-purpose inferencemachinery. Picture provides a stochastic scene language thatcan express generative models for arbitrary 2D/3D scenes,as well as a hierarchy of representation layers for compar-ing scene hypotheses with observed images by matching notsimply pixels, but also more abstract features (e.g., contours,deep neural network activations). Inference can flexiblyintegrate advanced Monte Carlo strategies with fast bottom-up data-driven methods. Thus both representations andinference strategies can build directly on progress in discrim-inatively trained systems to make generative vision morerobust and efficient. We use Picture to write programs for3D face analysis, 3D human pose estimation, and 3D objectreconstruction – each competitive with specially engineeredbaselines.

1. Introduction

Probabilistic scene understanding systems aim to pro-duce high-probability descriptions of scenes conditioned onobserved images or videos, typically either via discrimina-tively trained models or generative models in an “analysis bysynthesis” framework. Discriminative approaches lend them-selves to fast, bottom-up inference methods and relativelyknowledge-free, data-intensive training regimes, and havebeen remarkably successful on many recognition problems[1, 4]. Generative approaches hold out the promise of ana-lyzing complex scenes more richly and flexibly [8, 2, 5, 6, 3],

but have been less widely embraced for two main reasons:Inference typically depends on slower forms of approximateinference, and both model-building and inference can in-volve considerable problem-specific engineering to obtainrobust and reliable results. These factors make it difficultto develop simple variations on state-of-the-art models, tothoroughly explore the many possible combinations of mod-eling, representation, and inference strategies, or to richlyintegrate complementary discriminative and generative mod-eling approaches to the same problem. More generally, tohandle increasingly realistic scenes, generative approacheswill have to scale not just with respect to data size but alsowith respect to model and scene complexity. This scalingwill arguably require general-purpose frameworks to com-pose, extend and automatically perform inference in complexstructured generative models – tools that for the most partdo not yet exist.

Here we present Picture, a probabilistic programminglanguage that aims to provide a common representation lan-guage and inference engine suitable for a broad class ofgenerative scene perception problems. We see probabilisticprogramming as key to realizing the promise of “vision asinverse graphics”. Generative models can be representedvia stochastic code that samples hypothesized scenes andgenerates images given those scenes. Rich deterministic andstochastic data structures can express complex 3D scenesthat are difficult to manually specify. Multiple representationand inference strategies are specifically designed to addressthe main perceived limitations of generative approaches tovision. Instead of requiring photo-realistic generative mod-els with pixel-level matching to images, we can comparehypothesized scenes to observations using a hierarchy ofmore abstract image representations such as contours, dis-criminatively trained part-based skeletons, or deep neuralnetwork features. Available Markov Chain Monte Carlo(MCMC) inference algorithms include not only traditionalMetropolis-Hastings, but also more advanced techniques forinference in high-dimensional continuous spaces, such as el-liptical slice sampling, and Hamiltonian Monte Carlo whichcan exploit the gradients of automatically differentiable ren-derers. These top-down inference approaches are integratedwith bottom-up and automatically constructed data-driven

1

Page 2: Picture: A Probabilistic Programming Language for Scene ...sunw.csail.mit.edu/2015/papers/75_Kulkarni_SUNw.pdfPicture: A Probabilistic Programming Language for Scene Perception Tejas

Scene Language

Approximate Renderer

Representation LayerScene

ID

⌫(.)IR

e.g. Deep Neural Net, Contours, Skeletons, Pixels

⌫(ID)⌫(IR)

(a)

S⇢

�(⌫(ID), ⌫(IR))

Likelihood or Likelihood-free Comparator

orP (ID|IR, X)

RenderingTolerance

X⇢

ObservedImage

(b)

Given current

Inference EngineAutomatically

producesMCMC, HMC, Elliptical Slice,

Data-driven proposals

.

.

.

qP ((S⇢, X⇢) ! (S0⇢, X 0⇢))

qhmc(S⇢real ! S0⇢

real)

qslice(S⇢real ! S0⇢

real)

qdata((S⇢, X⇢) ! (S0⇢, X 0⇢))

New(S⇢, X⇢) (S0⇢, X 0⇢)

ID

and image

3D Face program

3D object program

Random samples

drawn from example

probabilistic programs

IR

(c)

3D human-pose program (d)

ObservedImage

Inferred(reconstruction)

Inferred model re-rendered with

novel poses

Inferred model re-rendered with

novel lighting

ObservedImage Picture Baseline Observed

Image Picture Baseline

Figure 1: Overview: (a) All models share a common template; only the scene description S and image ID changes across problems. Everyprobabilistic graphics program f defines a stochastic procedure that generates both a scene description and all the other information neededto render an approximation IR of a given observed image ID . The program f induces a joint probability distribution on these programtraces ρ. Every Picture program has the following components. Scene Language: Describes 2D/3D scenes and generates particular scenerelated trace variables Sρ ∈ ρ during execution. Approximate Renderer: Produces graphics rendering IR given Sρ and latents Xρ forcontrolling the fidelity or tolerance of rendering. Representation Layer: Transforms ID or IR into a hierarchy of coarse-to-fine imagerepresentations ν(ID) and ν(IR) (deep neural networks, contours and pixels). Comparator: During inference, IR and ID can be comparedusing a likelihood function or a distance metric λ (as in Approximate Bayesian Computation [7]). (b) Inference Engine: Automaticallyproduces a variety of proposals and iteratively evolves the scene hypothesis S to reach a high probability state given ID . (c): Representativerandom scenes drawn from probabilistic graphics programs for faces, objects, and bodies. (d) Illustrative results: We demonstrate Pictureon a variety of 3D computer vision problems and check their validity with respect to ground truth annotations and task-specific baselines.

proposals, which can dramatically accelerate inference byeliminating most of the “burn in” time of traditional samplersand enabling rapid mode-switching.

We demonstrate Picture on three challenging vision prob-lems: inferring the 3D shape and detailed appearance offaces, the 3D pose of articulated human bodies, and the 3Dshape of medially-symmetric objects. The vast majority ofcode for image modeling and inference is reusable acrossthese and many other tasks. We shows that Picture yieldsperformance competitive with optimized baselines on eachof these benchmark tasks.

References[1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ra-

manan. Object detection with discriminatively trained part-based models. PAMI, 2010.

[2] V. Jampani, S. Nowozin, M. Loper, and P. V. Gehler. Theinformed sampler: A discriminative approach to bayesian in-ference in generative computer vision models. arXiv preprintarXiv:1402.0859, 2014.

[3] Y. Jin and S. Geman. Context and hierarchy in a probabilisticimage model. In Computer Vision and Pattern Recognition,2006 IEEE Computer Society Conference on, volume 2, pages2145–2152. IEEE, 2006.

[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet clas-sification with deep convolutional neural networks. In NIPS,pages 1106–1114, 2012.

[5] M. M. Loper and M. J. Black. Opendr: An approximate differ-entiable renderer. In ECCV 2014. 2014.

[6] V. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. Tenenbaum.Approximate bayesian image interpretation using generativeprobabilistic graphics programs. In Advances in Neural Infor-mation Processing Systems, pages 1520–1528, 2013.

[7] R. D. Wilkinson. Approximate bayesian computation (abc)gives exact results under the assumption of model error. Statisti-cal applications in genetics and molecular biology, 12(2):129–141, 2013.

[8] Y. Zhao and S.-C. Zhu. Image parsing via stochastic scenegrammar. In NIPS, 2011.