
Combining learned and analytical models for predicting action effects from sensory data

Alina Kloss∗, Stefan Schaal†∗, and Jeannette Bohg‡

∗Autonomous Motion Department, Max Planck Institute for Intelligent Systems, Germany
†Computational Learning and Motor Control Lab, University of Southern California, USA
‡Department of Computer Science, Stanford University, USA

Abstract—One of the most basic skills a robot should possess is predicting the effect of physical interactions with objects in the environment. This enables optimal action selection to reach a certain goal state. Traditionally, dynamics are approximated by physics-based analytical models. These models rely on specific state representations that may be hard to obtain from raw sensory data, especially if no knowledge of the object shape is assumed. More recently, we have seen learning approaches that can predict the effect of complex physical interactions directly from sensory input. It is however an open question how far these models generalize beyond their training data. In this work, we investigate the advantages and limitations of neural network based learning approaches for predicting the effects of actions based on sensory input and show how analytical and learned models can be combined to leverage the best of both worlds. As physical interaction task, we use planar pushing, for which there exists a well-known analytical model and a large real-world dataset. We propose to use a convolutional neural network to convert raw depth images or organized point clouds into a suitable representation for the analytical model and compare this approach to using neural networks for both, perception and prediction. A systematic evaluation of the proposed approach on a very large real-world dataset shows two main advantages of the hybrid architecture. Compared to a pure neural network, it significantly (i) reduces required training data and (ii) improves generalization to novel physical interaction.

I. INTRODUCTION

We approach the problem of predicting the consequences of physical interaction with objects in the environment based on raw sensory data. Traditionally, interaction dynamics are described by a physics-based analytical model [26, 19, 28] which relies on a certain representation of the environment state. This approach has the advantage that the underlying function and the input parameters to the model have physical meaning and can therefore be transferred to problems with variations of these parameters. They also make the underlying assumptions in the model transparent. However, defining such models for complex scenarios and extracting the required state representation from raw sensory data may be very hard, especially if no assumptions about the shape of objects are made.

More recently, we have seen approaches that successfully replace the physics-based models with learned ones [29, 4, 22, 16, 3]. While often more accurate than analytical models, these methods still assume a predefined state representation as input and do not address the problem of how it may be extracted from the raw sensory data.

Some neural network based methods instead simultaneously learn a representation of the input and the associated dynamics from large amounts of training data, e.g. [5, 2, 24, 8]. They have shown impressive results in predicting the effect of physical interactions. In [2], the authors argue that a neural network may benefit from choosing its own representation of the input data instead of being forced to use a predefined state representation. They reason that a problem can often be parametrized in different ways and that some of these parametrizations might be easier to obtain from the given sensory input than others. The disadvantage of a learned representation is however that it usually cannot be mapped to physical quantities. This makes it hard to intuitively understand the learned functions and representations. In addition, it remains unclear how these models could be transferred to similar problems. Neural networks often have the capacity to memorize their training data [27] and learn a mapping from inputs to outputs instead of the "true" underlying function. This can make perfect sense if the training data covers the whole problem domain. However, when data is sparse (e.g. because a robot learns by experimenting), the question of how to generalize beyond the training data becomes very important.

Our hypothesis is that using prior knowledge from existing physics-based models can provide a way to reduce the amount of required training data and at the same time ensure good generalization beyond the training domain. In this paper, we thus investigate using neural networks for extracting a suitable representation from raw sensory data that can then be consumed by an analytical model for prediction. We compare this hybrid approach to using a neural network for both perception and prediction and to the analytical model applied on ground truth input values.

As example physical interaction task, we choose planar pushing. For this task, a well-known physical model [19] is available as well as a large, real-world dataset [26] which we augmented with simulated images. Given a depth image of a tabletop scene with one object and the position and movement of the pusher, our models need to predict the object position in the given image and its movement due to the push. Although the state-space of the object is rather low-dimensional (2D position plus orientation), pushing is already a quite involved manipulation problem: The system is under-actuated and the relationship between the push and the object movement is highly non-linear. The pusher can slide along the object and dynamics change drastically when it transitions between sticking and sliding contact or makes and breaks contact.

Our experiments show that despite relying on depth images to extract position and contact information, all our models perform similarly to the analytical model applied on the ground truth state. Given enough training data and evaluated inside of its training domain, the pure neural network implementation performs best and even outperforms the analytical model baseline significantly. However, when it comes to generalization to new actions, the hybrid approach is much more accurate. Additionally, we find that the hybrid approach needs significantly less training data than the neural network model to arrive at a high prediction accuracy.

A. Contributions

In this work, we make the following contributions:

• We show how analytical dynamics models and neural networks can be combined and trained end-to-end to predict the effects of robot actions based on depth images or organized point clouds.

• We compare this hybrid approach to using a pure neural network for learning both, perception and prediction. Evaluations on a real world physical interaction task demonstrate improved data efficiency and generalization when including the analytical model into the network over learning everything from scratch.

• For evaluation, we augmented an existing dataset of planar pushing with depth and RGB images and additional contact information. The code for this is available online.

The present paper is an extension of preliminary results in [15] (under review). Here we extend our evaluations and add results for an architecture that combines the analytical model with a learned error-correction term to better compensate for possible inaccuracies of the analytical model. While in previous work the visual setup was limited to a top-down view of the scene, we also demonstrate in Section VI that our proposed architecture is still applicable without this simplifying assumption.

B. Outline

This paper is structured as follows: We begin with a review of related work in Section II. In Section III we formally describe the problem we address, introduce an analytical model for planar pushing (III-A) and the dataset we used for our experiments (III-B).

Section IV introduces and compares the different approaches for learning perception and prediction. The evaluation in Section V focuses on the data-efficiency (V-C) and generalization abilities (V-D, V-E, V-F) of the different architectures and thus mainly addresses the prediction part of the problem. The perception task is kept simple for these experiments by using a top-down view of the scene. This changes in Section VI where we demonstrate that the hybrid approach also performs well in a less constrained visual setup. Section VII finally summarizes our results and gives an outlook to future work.

II. RELATED WORK

A. Models for pushing

Analytical models of quasi-static planar pushing have been studied extensively in the past, starting with Mason [21]. Goyal et al. [9] introduced the limit surface to relate frictional forces with object motion, and much work has been done on different approximate representations of it [10, 11]. In this work, we use a model by Lynch et al. [19], which relies on an ellipsoidal approximation of the limit surface.

More recently, there has also been a lot of work on data-driven approaches to pushing [29, 4, 22, 16, 3]. Kopicki et al. [16] describe a modular learner that outperforms a physics engine for predicting the results of 3D quasi-static pushing, even when generalizing to unseen actions and object shapes. This is achieved by providing the learner not only with the trajectory of the global object frame, but also with multiple local frames that describe contacts. The approach however requires knowledge of the object pose from an external tracking system and the learner does not place the contact-frames itself. Bauza and Rodriguez [3] train a heteroscedastic Gaussian Process that predicts not only the object movement under a certain push, but also the expected variability of the outcome. The trained model outperforms an analytical model [19] given very few training examples. It is however specifically trained for one object and generalization to different objects is not attempted. Moreover, this work, too, assumes access to the ground truth state, including the contact point and the angle between the push and the object surface.

B. Learning dynamics based on raw sensory data

Many recent approaches in reinforcement learning aim to solve the so called "pixels to torque" problem, where the network processes images to extract a representation of the state and then directly returns the required action to achieve a certain task [18, 17]. Jonschkowski and Brock [13] argue that the state-representation learned by such methods can be improved by enforcing robotic priors on the extracted state, that may include e.g. temporal coherence. This is an alternative way of including basic principles of physics in a learning approach, compared to what we propose here. While policy learning requires understanding the effect of actions, the above methods do not acquire an explicit dynamics model. We are interested in learning such an explicit model, as it enables optimal action selection (potentially over a larger time horizon). The following papers share this aim.

Agrawal et al. [2] consider a learning approach for pushing objects. Their network takes as input the pushing action and a pair of images: one before and one after a push. After encoding the images, two different network streams attempt to predict (i) the encoding of the second image given the first and the action and (ii) the action necessary to transition from the first to the second encoding. Simultaneously training for both tasks improves the results on action prediction. The authors do not enforce any physical models or robotic priors. As the learned models directly operate on image encodings instead of physical quantities, we cannot compare the accuracy of the forward prediction part (i) to our results.

SE3-Nets [5] process organized (i.e. image shaped) 3D point clouds and an action to predict the next point cloud. For each object in the scene, the network predicts a segmentation mask and the parameters of an SE3 transform (linear velocity, rotation angle and axis). In newer work [6], an intermediate step is added that computes the 6D pose of each object before predicting the transforms based on this more structured state representation. The output point cloud is obtained by transforming all input pixels according to the transform for the object they correspond to. The resulting predictions are very sharp and the network is shown to correctly segment the objects and determine which are affected by the action. An evaluation of the generalization to new objects or forces was however not performed.

Our own architecture is inspired by this work. The pure neural network we use to compare to our hybrid approach can be seen as a simplified variant of SE3-Nets that predicts SE2 transforms (see Sec. IV). Since we define the loss directly on the predicted movement of the object, we omit predicting the next observation and the segmentation masks required for this. We also use a modified perception network, which relies mostly on a small image patch around the robot end-effector.

The approach of Finn et al. [8] is similar to [5] and explores different possibilities of predicting the next frame of a sequence of actions and RGB images using recurrent neural networks.

Visual Interaction Networks [24] also take temporal information into account. A convolutional neural network encodes consecutive images into a sequence of object states. Dynamics are predicted by a recurrent network that considers pairs of objects to predict the next state of each object.

C. Combining analytical models and learning

The idea of using analytical models in combination with learning has also been explored in previous work. Degrave et al. [7] implemented a differentiable physics engine for rigid body dynamics in Theano and demonstrate how it can be used to train a neural network controller. In [23], the authors significantly improve Gaussian Process learning of inverse dynamics by using an analytical model of robot dynamics with fixed parameters as the mean function or as feature transform inside the covariance function of the GP. Both works however do not cover visual perception. Most recently, Wu et al. [25] used a graphics and physics engine to learn to extract object-based state representations in an unsupervised way: Given a sequence of images, a network learns to produce a state representation that is predicted forward in time using the physics engine. The graphics engine is used to render the predicted state and its output is compared to the next image as training signal. In contrast to the aforementioned work, we not only combine learning and analytical models, but also evaluate the advantages and limitations of this approach.

Finally, [20] present an interesting approach to learning functions by training a neural network to combine a number of mathematical base operations (like multiplication, division, sine and cosine). This enables their "Equation Learner" to learn functions which generalize beyond the domain of the training data, just like traditional analytical models. Training these networks is however challenging and involves training many different models and choosing the best in an additional model selection step.

III. PROBLEM STATEMENT

Our aim is to analyse the benefits of combining neural networks with analytical models. We therefore compare this hybrid approach to models that exclusively rely on either approach. As a test bed, we use planar pushing, for which a well-known analytical model and a real-world dataset are available.

We consider the following problem: The input consists of a depth image D_t of a tabletop scene with one object and the pusher at time t, the starting position p_t of the pusher and its movement between this and the next timestep, u_t = p_{t+1} − p_t. With this information, the models need to predict the object position o_t before the push is applied and its movement v^o_t = o_{t+1} − o_t due to the push.

This can be divided into two subproblems:

Perception: Extract a suitable state representation x_t of the scene at time t (before the push) from the input image. This representation should contain the object position o_t:

f_perception(D_t) = x_t

Prediction: Given the state representation x_t, the position p_t of the pusher and its movement u_{t:t+1}, predict how the object will move:

f_prediction(x_t, p_t, u_{t:t+1}) = v^o_{t:t+1}
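Viewed as code, this decomposition is simply two chained functions. The following Python stubs (hypothetical names of our own, not from the paper) only fix the interface:

```python
import numpy as np

def f_perception(depth_image: np.ndarray) -> np.ndarray:
    """Map the depth image D_t to a state representation x_t that contains
    at least the object position o_t (e.g. via a convolutional network)."""
    raise NotImplementedError

def f_prediction(x_t: np.ndarray, p_t: np.ndarray, u_t: np.ndarray) -> np.ndarray:
    """Map state, pusher position and pusher movement to the object motion
    v^o_{t:t+1} = [v_x, v_y, omega] (e.g. via the analytical pushing model)."""
    raise NotImplementedError

# chained: v_o = f_prediction(f_perception(D_t), p_t, u_t)
```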

In the following sections, we will introduce an analytical model for computing f_prediction and the dataset of real robot pushes that we use for training and evaluation.

A. An Analytical Model of Planar Pushing

We use the analytical model of quasi-static planar pushing that was devised by Lynch et al. [19]. It predicts the object movement v^o given the pusher velocity u, the contact point c and associated surface normal n as well as two friction-related parameters l and µ. The model is illustrated in Figure 1, which also contains a list of symbols. Note that this model is still approximate and far from perfectly modelling the stochastic process of planar pushing [26].

Fig. 1: Overview and illustration of the terminology for pushing.
  o        position of the object
  v^o      linear and angular object velocity
  v^p      linear velocity at the contact point - effective push velocity
  p        position of the pusher
  u        linear pusher velocity - action
  c        contact point (global)
  c'       contact point relative to o
  n        surface normal at c
  l        ratio between maximal torsional and linear friction force
  µ        friction coefficient pusher-object
  f_b      left or right boundary force of the friction cone
  m_b      torques corresponding to the boundary forces
  v^{o,b}  object velocities resulting from boundary forces
  v^{p,b}  effective push velocities corresponding to the boundary forces
  b = l, r placeholder for left or right boundary
  s        contact indicator, s ∈ [0, 1]
  k        rotation axis

Predicting the effect of a push with this model has two stages: First, it determines whether the push is stable ("sticking contact") or whether the pusher will slide along the object ("sliding contact"). In the first case, the velocity of the object at the contact point will be the same as the velocity of the pusher. In the sliding case however, the pusher movement can be almost orthogonal to the resulting motion at the contact point. We call the motion at the contact point "effective push velocity" v^p. It is the output of the first stage. Given v^p and the contact point, the second stage then predicts the resulting translation and rotation of the object's centre of mass.

Stage 1: Determining the contact type and computing v^p: To determine the contact type (slipping or sticking), we have to find the left and right boundary forces f_l, f_r of the friction cone (i.e. the forces for which the pusher will just not start sliding along the object) and the corresponding torques m_l, m_r. The opening angle α of the friction cone is defined by the friction coefficient µ between pusher and object. The forces and torques are then computed by

α = arctan(µ)                                                    (1)
f_l = R(−α) n        f_r = R(α) n                                (2)
m_l = c'_x f_{l,y} − c'_y f_{l,x}        m_r = c'_x f_{r,y} − c'_y f_{r,x}        (3)

where R(α) denotes a rotation matrix given α and c' = c − o is the contact point relative to the object's centre of mass.

To relate the forces to object velocities, Lynch et al. [19] use an ellipsoidal approximation to the limit surface. To simplify notation, we use subscript b to refer to quantities associated with either the left l or right r boundary forces. v^{o,b} and ω^{o,b} denote linear and angular object velocity, respectively. v^{p,b} are the push velocities that would create the boundary forces. They span the so called "motion cone".

v^{o,b} = ω^{o,b} (l² / m_b) f_b                                 (4)
v^{p,b} = ω^{o,b} (l²/m_b f_b + k × c')                          (5)

ω^{o,b} acts as a scaling factor. Since we are only interested in the direction of v^{p,b} and not in its magnitude, we set ω^{o,b} = m_b:

v^{p,b} = l² f_b + m_b k × c'                                    (6)

To compute the effective push velocity v^p, we need to determine the contact case: If the push velocity lies outside of the motion cone, the contact will slip. The resulting effective push velocity then acts in the direction of the boundary velocity v^{p,b} which is closer to the push direction:

v^p = ((u · n) / (v^{p,b} · n)) v^{p,b}                          (7)

Otherwise contact is sticking and we can use the pusher velocity as effective push velocity, v^p = u. When the norm of n is zero (due to e.g. a wrong prediction of the perception neural network), we set the output v^{p,b} to zero.
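To make Stage 1 concrete, here is a minimal NumPy sketch following equations (1)-(7). The function and variable names are ours, k × c' is written out as the planar cross product, and the cone-membership and "closer boundary" tests are simplified compared to a careful implementation:

```python
import numpy as np

def rot2d(angle):
    """2D rotation matrix R(angle)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def effective_push_velocity(u, c_rel, n, mu, l):
    """Stage 1: decide sticking vs. sliding contact and return v^p.

    u: pusher velocity (2,), c_rel: contact point c' relative to the object
    centre of mass (2,), n: surface normal at the contact (2,),
    mu: pusher-object friction coefficient, l: torsional/linear friction ratio.
    """
    if np.linalg.norm(n) == 0.0:                 # degenerate perception output
        return np.zeros(2)

    alpha = np.arctan(mu)                        # (1) friction cone opening angle
    f_l, f_r = rot2d(-alpha) @ n, rot2d(alpha) @ n            # (2)
    m_l = c_rel[0] * f_l[1] - c_rel[1] * f_l[0]               # (3)
    m_r = c_rel[0] * f_r[1] - c_rel[1] * f_r[0]

    k_cross_c = np.array([-c_rel[1], c_rel[0]])  # k x c' in the plane
    v_pl = l**2 * f_l + m_l * k_cross_c          # (6) motion cone boundaries
    v_pr = l**2 * f_r + m_r * k_cross_c

    def cross(a, b):                             # planar cross product
        return a[0] * b[1] - a[1] * b[0]

    sticking = (cross(v_pl, u) * cross(v_pl, v_pr) >= 0 and
                cross(v_pr, u) * cross(v_pr, v_pl) >= 0)
    if sticking:
        return u                                 # sticking contact: v^p = u

    def closeness(v):                            # cosine to the push direction
        return np.dot(u, v) / (np.linalg.norm(v) + 1e-9)

    # sliding: scale the closer boundary velocity according to eq. (7)
    v_pb = v_pl if closeness(v_pl) > closeness(v_pr) else v_pr
    return (np.dot(u, n) / (np.dot(v_pb, n) + 1e-9)) * v_pb
```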

Stage 2: Using v^p to predict the object motion: Given the effective push velocity v^p and the contact point c' relative to the object centre of mass, we can compute the linear and angular velocity v^o = [v^o_x, v^o_y, ω] of the object.

The object will of course only move if the pusher is in contact with the object. To use the model also in cases where no force acts on the object, we introduce the contact indicator variable s. It takes values between zero and one and is multiplied with v^p to switch off responses when there is no contact. We allow s to be continuous instead of binary to give the model a chance to react to the pusher making or breaking contact during the interaction.

v^o_x = ((l² + c'_x²) s v^p_x + c'_x c'_y s v^p_y) / (l² + c'_x² + c'_y²)        (8)
v^o_y = ((l² + c'_y²) s v^p_y + c'_x c'_y s v^p_x) / (l² + c'_x² + c'_y²)        (9)
ω = (c'_x v^o_y − c'_y v^o_x) / l²                                               (10)
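Equations (8)-(10) translate directly into a few lines of NumPy; this sketch (names ours) also shows where the contact indicator s enters:

```python
import numpy as np

def object_velocity(v_p, c_rel, s, l):
    """Stage 2: object motion [v^o_x, v^o_y, omega] from the effective push.

    v_p: effective push velocity (2,), c_rel: contact point c' relative to the
    object centre of mass (2,), s: contact indicator in [0, 1],
    l: torsional/linear friction ratio.
    """
    cx, cy = c_rel
    denom = l**2 + cx**2 + cy**2
    vo_x = ((l**2 + cx**2) * s * v_p[0] + cx * cy * s * v_p[1]) / denom   # (8)
    vo_y = ((l**2 + cy**2) * s * v_p[1] + cx * cy * s * v_p[0]) / denom   # (9)
    omega = (cx * vo_y - cy * vo_x) / l**2                                # (10)
    return np.array([vo_x, vo_y, omega])
```

Chained with the Stage 1 sketch above, object_velocity(effective_push_velocity(u, c_rel, n, mu, l), c_rel, s, l) gives the full forward prediction.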

Discussion of Underlying Assumptions: The analytical model is built on three simplifying assumptions: (i) quasi-static pushing, i.e. the force applied to the object is big enough to move the object, but not to accelerate it, (ii) the pressure distribution of the object on the surface is uniform and the limit-surface of frictional forces can be approximated by an ellipsoid, (iii) the friction coefficient between surface and object is constant.

The analysis performed by Yu et al. [26] shows that assumptions (ii) and (iii) are violated frequently by real world data. Assumption (i) holds for push velocities below 50 mm/s. In addition, the contact situation may change during pushing (as the pusher may slide along the object and even lose contact), such that the model predictions become increasingly inaccurate the longer ahead it needs to predict in one step.

B. Data

We use the MIT Push Dataset [26] for our experiments. It contains object pose and force recordings from real robot experiments, where eleven different planar objects are pushed on four different surface materials. For each object-surface combination, the dataset contains about 6000 pushes that vary in the manipulator ("pusher") velocity and acceleration, the point on the object where the pusher makes contact and the angle between the object surface and the push direction. Pushes are 5 cm long and data was recorded at 250 Hz.

Fig. 2: Rendered objects of the Push Dataset [26]: rect1-3, butter, tri1-3, hex, ellip1-3. Red dots indicate the subset of contact points we use to collect a test set with held-out pushes for experiment V-D.

As this dataset does not contain RGB or depth images, we render them using OpenGL and the mesh-data supplied with the dataset. In this work, we only use the depth images; RGB will be considered in future work. A rendered scene consists of a flat surface with one of four textures (for the four surface materials), on which one of the objects is placed. The pusher is represented by a vertical cylinder, with no arm attached. Figures 2 and 3 show the different objects and example images. We also annotated the dataset with all information necessary to apply the analytical model to use it as a baseline. The code for annotation and rendering images is available online at https://github.com/mcubelab/pdproc.

For each experiment, we construct datasets for training and testing from a subset of the Push Dataset. As the analytical model does not take acceleration of the pusher into account, we only use push variants with zero pusher acceleration. We however do evaluate on data with high pusher velocities that break the quasi-static assumption made in the analytical model (in Sec. V-E). One data point in our datasets consists of a depth image showing the scene before the push is applied, the object position before and after the push and the initial position and movement of the pusher. The prediction horizon is 0.5 seconds in all datasets. More information about the specific datasets for each experiment can be found in the corresponding sections.

We use data from multiple randomly chosen timesteps of each sequence in the Push Dataset. Some of the examples thus contain shorter push-motions than others, as the pusher starts moving with some delay or ends its movement during the 0.5 seconds time-window. To achieve more visual variance and to balance the number of examples per object type, we sample a number of transforms of the scene relative to the camera for each push. Finally, about a third of our dataset consists of examples where we moved the pusher away from the object, such that it is not affected by the push movement.

Fig. 3: Rendered RGB and depth images on surfaces plywood and abs.

IV. COMBINING NEURAL NETWORKS AND ANALYTICAL MODELS

We now introduce the neural network variants that we will analyse in the following section. All architectures share the same first network stage that processes raw depth images and outputs a lower-dimensional encoding and the object position. Given this output, the pushing action (movement u and position p) of the pusher and the friction parameters µ and l², the second part of these networks predicts the linear and angular velocity v^o of the object. This predictive part differs between the network variants. While three of them (simple, full, error) use variants of the analytical dynamics model established in Sec. III-A, variant neural has to learn the dynamics with a neural network. The prediction part has about 1.8 million trainable parameters for all variants except for error, which has 2.7 million parameters.

We implement all our networks as well as the analytical model in tensorflow [1], which allows us to propagate gradients through the analytical models just like any other layer.

A. Perception

The architecture of the network part that processes the image is depicted in Fig. 4. We assume that the robot knows the position of its end-effector, which allows us to extract a small (80 × 80 pixel) image patch ("glimpse") around the tip of the pusher. If the pusher is close enough to the object to make contact, the contact point and the normal to the object surface can be estimated from this smaller image. It thus serves as an attention-mechanism to focus the computations on the most relevant part of the image. Only the position of the object needs to be estimated from the full image. Taken together, this is all the information necessary to predict the object movement.

The glimpse is processed with three convolutional layers with ReLu non-linearity, each followed by max-pooling and batch normalization [12]. The full image is processed with a sequence of four convolutional and three deconvolution layers, of which the last has only one channel. This output feature map resembles an object segmentation map. We use spatial softmax [17] to get the pixel location of the object centre.
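For reference, spatial softmax reduces the single-channel feature map to the expected pixel coordinate of its soft-argmax. A minimal TensorFlow sketch of this operation (our own helper, assuming a statically known image size, not the authors' released code):

```python
import tensorflow as tf

def spatial_softmax(feature_map):
    """Expected pixel location of a single-channel feature map.

    feature_map: [batch, height, width, 1] -> returns [batch, 2] as (row, col).
    """
    h = int(feature_map.shape[1])
    w = int(feature_map.shape[2])
    logits = tf.reshape(feature_map, [-1, h * w])
    probs = tf.nn.softmax(logits)                      # attention over pixels
    rows, cols = tf.meshgrid(tf.range(h, dtype=tf.float32),
                             tf.range(w, dtype=tf.float32), indexing="ij")
    coords = tf.stack([tf.reshape(rows, [-1]),
                       tf.reshape(cols, [-1])], axis=1)  # [h*w, 2]
    return tf.matmul(probs, coords)                    # expected (row, col)
```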

Initial experiments showed that not using the glimpse strongly decreased performance for all networks. We also found that using both, the glimpse and an encoding of the full image, for estimating all physical parameters was disadvantageous: Using the full image increases the number of trainable parameters in the prediction network but adds no information that is not already contained in the glimpse.


B. Prediction

Neural Network only (neural): Figure 5 a) shows the prediction part of the variant neural, which uses a neural network to learn the dynamics of pushing. The input to this part is a concatenation of the output from perception with the action and friction parameter l. The network processes this input with three fully-connected layers before predicting the object velocity v^o. All intermediate fully-connected layers use ReLu non-linearities. The output layers do not apply a non-linearity.
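A hedged Keras sketch of this prediction head follows; the 512/256/128 layer widths are read off the architecture figure and may differ from the released implementation:

```python
import tensorflow as tf

def neural_prediction_head():
    """Three fully-connected ReLu layers mapping the concatenated
    [glimpse encoding, object position, action, l] to v^o = [v_x, v_y, omega]."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(3, activation=None),   # no output non-linearity
    ])

# usage sketch:
# v_o = neural_prediction_head()(tf.concat([encoding, obj_pos, action, l], -1))
```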

Full analytical model (hybrid): This variant uses the complete analytical model as described in Section III-A. Several fully-connected layers extract the necessary input values from the glimpse encoding and the action, as shown in Figure 5 b). These are the contact point c, the surface normal n and the contact indicator s. For predicting s, we use a sigmoidal non-linearity to limit the predicted values to [0, 1].

Simplified analytical model (simple): Simple (Figure 5 c) only uses the second stage of the analytical model. As for hybrid, a neural network extracts the model inputs (effective push velocity v^p, contact point c) from the encoded glimpse and the action.

We use this variant as a middle ground between the two other options: It still contains the main mechanics of how an effective push at the contact point moves the object, but leaves it to the neural network to deduce the effective push velocity from the scene and the action. This gives the model more freedom to correct for possible shortcomings of the analytical model. We expect these to manifest mostly in the first stage of the model, as small errors can have a big effect there when they influence whether a contact is estimated as sticking or slipping. The second stage of the analytical model does not specify how the input action relates to the object movement, and simple therefore allows us to evaluate the importance of this particular aspect of the analytical model.

Full analytical model + error term (error): One concern when using a predefined analytical model is that the trained network cannot improve over the performance of the analytical model. If the analytical model is inaccurate, the hybrid architecture can only compensate to some degree by manipulating the input values of the model, i.e. by predicting "incorrect" values for the components of the state representation. This limits its ability to compensate for model errors as it might not be possible to account for all types of errors in this way.

As an alternative, we propose to learn an error-correction term which is added to the output prediction of the analytical model. The error-term is thus not constrained by the model and should be able to compensate for a broader class of model errors.

Figure 5 d) shows the architecture. As input for predicting the error-term, we use the same values that neural receives for predicting the object velocity, i.e. the glimpse encoding, the action, the predicted object position and the friction parameter. Note that we do not propagate gradients to the inputs of the error-prediction module. The intuition behind this is that we do not want the error-prediction to interfere with the prediction of the inputs for the analytical model. We evaluate the effect of this architectural decision in Section V-H. A second variant that we compare to in this section aims to improve the generalizability of the error-prediction to faster push movements. This is achieved by normalizing the input action to unit length before feeding it into the error-prediction module.
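A minimal sketch of this gradient-blocking choice, using tf.stop_gradient (module and variable names are ours, not the released code):

```python
import tensorflow as tf

def error_corrected_prediction(analytical_v_o, error_net, glimpse_encoding,
                               action, object_position, l):
    """Add a learned correction to the analytical prediction while preventing
    its gradients from reshaping the shared perception features."""
    error_inputs = tf.concat(
        [glimpse_encoding, action, object_position, l], axis=-1)
    # block gradients so the error term cannot interfere with the inputs
    # predicted for the analytical model
    error_term = error_net(tf.stop_gradient(error_inputs))
    return analytical_v_o + error_term
```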

Fig. 4: Perception part for all network variants. White boxes represent tensors, arrows represent network layers (green) and dataflow (black).

C. Training

For training, we use the Adam optimizer [14] with a learning rate of 10^-4 and a batch-size of 32 for 75,000 steps. The loss L penalizes the Euclidean distance between the predicted and the real object position in the input image (pos), the Euclidean error of the predicted object translation (trans), the error in the magnitude of translation (mag) and in angular movement (rot) in degree (instead of radian, to ensure that all components of the loss have the same order of magnitude). We use weight decay with λ = 0.001.

Let v̂^o and ô denote the predicted and v^o, o the real object movement and position. w are the network weights and ν^o = [v^o_x, v^o_y] denotes linear object velocity.

L(v̂^o, ô, v^o, o) = trans + mag + rot + pos + λ Σ_w ‖w‖

trans = ‖ν̂^o − ν^o‖        mag = | ‖ν̂^o‖ − ‖ν^o‖ |
rot = (180/π) |ω̂ − ω|       pos = ‖ô − o‖
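Assembled in code, the loss might look as follows (a sketch with our own tensor names; in practice the weight-decay sum can also be delegated to layer regularizers):

```python
import math
import tensorflow as tf

def pushing_loss(pred_vel, pred_pos, true_vel, true_pos, weights, lam=0.001):
    """pred_vel/true_vel: [batch, 3] = [v_x, v_y, omega] with omega in radian,
    pred_pos/true_pos: [batch, 2] object position in the input image."""
    lin_pred, lin_true = pred_vel[:, :2], true_vel[:, :2]
    trans = tf.norm(lin_pred - lin_true, axis=-1)
    mag = tf.abs(tf.norm(lin_pred, axis=-1) - tf.norm(lin_true, axis=-1))
    rot = (180.0 / math.pi) * tf.abs(pred_vel[:, 2] - true_vel[:, 2])
    pos = tf.norm(pred_pos - true_pos, axis=-1)
    decay = lam * tf.add_n([tf.norm(w) for w in weights])   # weight decay term
    return tf.reduce_mean(trans + mag + rot + pos) + decay
```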

When using the variant hybrid, a major challenge is the contact indicator s: In the beginning of training, the direction of the predicted object movement is mostly wrong. s therefore receives a strong negative gradient, causing it to decrease quickly. Since the predicted motion is effectively multiplied by s, a low s results in the other parts of the network receiving small gradients and thus greatly slows down training. We therefore add the error in the magnitude of the predicted velocity to the loss to prevent s from decreasing too far in the early training phase.

Fig. 5: Prediction parts of the four network variants neural, hybrid, simple and error. White boxes represent tensors, arrows show network layers (green) and dataflow (black). The red bar in architecture (d) indicates that no gradients are propagated to the inputs of this layer.

V. EVALUATING GENERALIZATION

In this section, we test our hypothesis that using an analytical model for prediction together with a neural network for perception improves data efficiency and leads to better generalization than using neural networks for both, perception and prediction. We evaluate how the performance of the networks depends on the amount of training data and how well they generalize to (i) pushes with new pushing angles and contact points, (ii) new push velocities and (iii) unseen objects.

For the experiments in this section, we use a top-down view of the scene, such that the object can only move in the image plane and the z-coordinate of all scene components remains constant. This simplifies the application of the analytical model by removing the need for an additional transform between the camera and the table. It also simplifies the perception task and allows us to focus this evaluation on the comparison of the hybrid and the purely neural approach. In Section VI we will show how to extend the proposed model to work on more difficult camera settings.

A. Baselines

We use three baselines in our experiments. All of them use the ground truth input values of the analytical model (action, object position, contact point, surface normal, contact indicator and friction coefficients) instead of depth images, and thus only need to predict the object velocity, but not its initial position. If the pusher makes contact with the object during the push, but is not in contact initially, we use the contact point and normal from when contact is first made and shorten the action accordingly. Note that this gives the baseline models a big advantage over architectures that have to infer the input values from raw sensory data.

The first baseline is just the average translation and rotation over the dataset. This is equal to the error when always predicting zero movement, and we therefore name it zero. The second (physics) is the full analytical model evaluated on the ground truth input values. In addition to the networks described in Section IV, we also train a neural network (neural dyn) on the (ground truth) input values of the analytical model. It uses the same architecture of fully-connected layers for prediction as neural. This allows us to evaluate whether neural benefits from being able to choose its own state representation.

B. Metrics

For evaluation, we compute the average Euclidean distance between the predicted and the ground truth object translation (trans) and position (pos) in millimetres as well as the average error on object rotation (rot) in degree. As our datasets differ in the overall object movement, we report errors on translation and rotation normalized by the average motion in the corresponding dataset.
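As a sketch of how the reported numbers are computed (our own NumPy helper; prediction and ground truth arrays are assumed to be aligned per example):

```python
import numpy as np

def normalized_errors(pred_trans, true_trans, pred_rot_deg, true_rot_deg):
    """Return (trans, rot) errors, each divided by the average ground-truth
    motion of the dataset; translations are [N, 2] in mm, rotations [N] in degree."""
    trans_err = np.linalg.norm(pred_trans - true_trans, axis=-1).mean()
    rot_err = np.abs(pred_rot_deg - true_rot_deg).mean()
    avg_trans = np.linalg.norm(true_trans, axis=-1).mean()
    avg_rot = np.abs(true_rot_deg).mean()
    return trans_err / avg_trans, rot_err / avg_rot
```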

C. Data efficiency

The first hypothesis we test is that combining the analytical model with a neural network for perception reduces the required training data as compared to a pure neural network.

Data: We use a dataset that contains all objects from the MIT Push dataset and all pushes with velocity 20 mm/s and split it randomly into training and test set. This results in about 190k training examples and about 38k examples for testing. To evaluate how the networks' performance develops with the amount of training data, we train the models on different subsets of the training split with sizes from 2500 to the full 190k. We always evaluate on the full test split. To reduce the influence of dataset composition especially on the small datasets, we average results over multiple different datasets with the same size.


Results: Figure 6 shows how the errors in predicted translation, rotation and object position develop with more training data and Table I contains numeric values for training on the biggest and smallest training split. As expected, the combined approach of neural network and analytical model (hybrid and error) already performs very well on the smallest dataset (2500 examples) and beats the other models including the neural dyn baseline, which uses the ground truth state representation, by a large margin. It takes more than 20k training examples for the other models to reach the performance of hybrid, where predicting rotation seems to be harder to learn than translation.

Despite having to rely on raw depth images instead of the ground truth state representation, all models perform at least close to the physics baseline when using the full training set. However, only the pure neural network and the hybrid model with error-correction are able to improve on the baseline. This shows that the analytical model limits hybrid in fitting the training data perfectly, since the model itself is not perfect and does not allow for overfitting to noise in the training data. Neural and error have more freedom for fitting the training distribution, which however also increases the risk of overfitting.

Combining the learned error-correction with the fixed analytical model is especially helpful for predicting the translation of the object. To also improve the prediction of rotations, the model needs more than 20k training examples, which is similar to neural. While neural makes a larger improvement on the full dataset, error combines the comparably good performance of hybrid on few training examples with the ability to improve on the model given enough data.

The variant simple, which uses only the second part of the analytical model, also combines learning and a fixed model for predicting the dynamics. But in contrast to error, this variant seems to combine the disadvantages of both approaches: It needs much more training data than hybrid but is still limited by the performance of the analytical model and gets quickly outperformed by the pure neural network when more data is available.

The comparison of neural and the baseline neural dyn shows that despite having access to the ground truth data, neural dyn actually performs worse than neural on the full dataset. This seems to agree with the theory of Agrawal et al. [2] that training perception and prediction end-to-end and letting the network choose its own state representation instead of forcing it to use a predefined state may be beneficial for neural learning.

TABLE I: Error in predicted translation (trans) and rotation (rot) as percentage of the average movement (standard errors in brackets). pos denotes the error in predicted object position. Values shown are for training on the full training set (190k examples) and on a 2500 examples subset.

              trans             rot               pos [mm]
  2.5k
  neural      33.6 (0.18) %     62.54 (0.42) %    0.46 (0.002)
  simple      32.3 (0.19) %     53.6 (0.37) %     0.44 (0.002)
  hybrid      25.4 (0.17) %     45.5 (0.36) %     0.46 (0.002)
  error       24.7 (0.16) %     46.8 (0.36) %     0.45 (0.002)
  neural dyn  32.6 (0.19) %     63.5 (0.46) %     -
  190k
  neural      17.4 (0.12) %     33.4 (0.28) %     0.31 (0.002)
  simple      19.3 (0.13) %     35.7 (0.3) %      0.33 (0.002)
  hybrid      19.3 (0.13) %     36.1 (0.3) %      0.32 (0.002)
  error       18.4 (0.12) %     34.6 (0.29) %     0.31 (0.002)
  neural dyn  19.2 (0.12) %     36.3 (0.29) %     -
  physics     18.95 (0.13) %    35.4 (0.3) %      -
  zero        2.95 (0.02) mm    1.9 (0.01)°       -

D. Generalization to new pushing angles and contact points

The previous experiment showed the performance of the different models when testing on a dataset with a very similar distribution to the training set. Here, we evaluate the performance of the networks on held-out push configurations that were not part of the training data. Note that while the test set contains combinations of object pose and push action that the networks have not encountered during training, the pushing actions or object poses themselves do not lie outside of the training data value range. This experiment thus tests the model's interpolation abilities.

Data: We again train the networks on a dataset that contains all objects and pushes with velocity 20 mm/s. For constructing the test set, we collect all pushes with (i) pushing angles ±20° and 0° to the surface normal (independent from the contact points) and (ii) at a set of contact points illustrated in Figure 2 (independent from the pushing angle).

The remaining pushes are split randomly into a training and a validation set, which we use to monitor the training process. There are about 114k data points in the training split, 23k in the validation split and 91k in the test set.

Results: As Table II shows, hybrid and error perform best for predicting the object velocity for pushes that were not part of the training set. Although still being close, none of the networks can outperform the physics baseline on this test set. Note that the difficulty of the test set in this experiment differs from the one in the previous experiment, as can be seen from the different performance of the physics baseline: Due to the central contact points and small pushing angles, the test set contains a high proportion of pushes with sticking contact, for which the movement of the object is similar to the movement of the pusher. This makes it hard to compare the results between Table I and Table II in terms of absolute values.

With more than 100k training examples, we supply enough data for the pure neural model to clearly outperform the combined approach and the baseline in the previous experiment (i.e. when the test set is similar to the training set, see Figure 6). The fact that neural now performs worse than hybrid and physics indicates that its advantage over the physics baseline may not come from it learning a more accurate dynamics model. Instead, it probably memorizes specific input-output combinations that the analytical model cannot predict well, for example due to noisy object pose data.

The same is supposedly also true for error, which cannot improve on hybrid for new pushes and even performs a little worse for predicting rotations when evaluated on the test set.


Fig. 6: Prediction errors versus training set size (x-axis in logarithmic scale). Errors on translation and rotation are given as percentage of the average movement in the test set. The model-based architecture hybrid performs much better than the other networks when training data is sparse.

TABLE II: Prediction errors for testing on pushes with pushing angles and contact points not seen during training.

              trans             rot               pos [mm]
  neural      16.5 (0.06) %     36.1 (0.17) %     0.31 (0.001)
  simple      16.4 (0.06) %     37.1 (0.18) %     0.31 (0.001)
  hybrid      15.6 (0.07) %     35.3 (0.19) %     0.31 (0.001)
  error       15.6 (0.07) %     34.5 (0.18) %     0.32 (0.001)
  neural dyn  18.1 (0.07) %     44.1 (0.2) %      -
  physics     14.6 (0.06) %     32.8 (0.18) %     -
  zero        4.36 (0.013) mm   2.27 (0.009)°     -

The difference in performance is however much smaller than for neural.

In contrast to hybrid and error, simple again does not seem to profit from the simplified analytical model for generalization and performs similarly to neural.

If we supply less training data, the difference between hybrid and the other networks is again much more pronounced: Hybrid achieves 20.3 % translation and 43.8 % rotation error whereas neural lies at 38.7 % and 63.4 % respectively.

E. Generalization to Different Push Velocities

In this experiment, we test how well the networks generalize to unseen push velocities. In contrast to the previous experiment, the test actions in this experiment have a different value range than the actions in the training data, and we are thus looking at extrapolation. As neural networks are usually not good at extrapolating beyond their training domain, we expect the model-based network variants to generalize better to push velocities not seen during training.

Data: We use the networks that were trained in the first experiment (V-C) on the full (190k) training set. The push velocity in the training set is thus 20 mm/s. We evaluate on datasets with different push velocities ranging from 10 mm/s to 150 mm/s.

Results: Results are shown in Figure 7. Since the input action does not influence perception of the object position, we only report the errors on the predicted object motion.

On higher velocities, we see a very large difference between the performance of our combined approach and the pure neural network. Neural's predictions quickly become very inaccurate, with the error on predicted rotation rising to more than 88 % of the error when always predicting zero movement. The performance of hybrid on the other hand is most constant over the different push velocities and declines only slightly more than the physics baseline.

Fig. 7: Errors on predicted translation and rotation for testing on different push velocities. All models were trained on push velocity 20 mm/s. While hybrid stays close to the physics baseline, the other models have trouble extrapolating.

A reason for the decreasing performance of physics on higher velocities is that the quasi-static assumption is violated: For pushes faster than 50 mm/s, the object gets accelerated and can continue sliding even after contact to the pusher was lost.

Simple, neural, error and neural-dyn all get worse with increasing velocity, but simple degrades much less when predicting rotations. The reason for this is that all four architectures struggle mostly with predicting the correct magnitude of the object translation and not so much with predicting the translation's direction. By using the second stage of the analytical model, simple has information about how the direction of the object translation relates to its rotation, which results in much more accurate predictions.

The advantage of hybrid for extrapolation lies in the first stage of the analytical model, which allows it to scale its predictions according to the magnitude of the action. This is in essence a multiplication operation. However, a general multiplication of inputs cannot be expressed using only fully-connected layers (as used by simple, neural, neural dyn and the error-prediction part of error) because fully-connected layers essentially perform weighted additions of their inputs. So instead of learning the underlying function, the networks are forced to resort to memorizing input-output relations for the magnitude of the object motion, which explains why extrapolation does not work well.

When combining the prediction of the analytical model with a learned error-term, the resulting model thus suffers from the same issues as the other network-based variants. The decline is however less pronounced than for neural, especially for predicting translations, and only starts after 50 mm/s. A possible reason for this is that the error-correction term is rather small compared to the output of the analytical model. This means that the weights with which the action enters the computation of the error term are smaller than for neural, simple or neural dyn. In V-H, we show that the error-prediction can also be made more robust to higher velocities by normalizing the push action before using it as input to the error-prediction.

F. Generalization to Different Objects

This experiment tests how well the networks generalize to unseen object shapes and how many different objects the networks have to see during training to generalize well.

Data: We train the networks on three different datasets: With one object (butter), two objects (butter and hex) and three objects (butter, hex and one of the ellipses or triangles). The datasets with fewer objects contain more augmented data, such that the total number of data points is about 35k training examples in each. As test sets, we use one dataset containing the three ellipses and one containing all triangles. While this is less training data than in the previous experiments, it should be sufficient for the pure neural network to perform as well as hybrid, since the test sets contain only few objects.

Results: The results in Figure 8 show that neural is consistently worse than the other networks when predicting rotations. It also improves most notably when one example of the test objects is in the training set. The differences between the models are less pronounced when predicting translation, except for simple, which performs particularly badly on triangles. The different models do not differ much when predicting position, which is not surprising, since they share the same perception architecture. The architecture with added error-term does not perform very differently from hybrid, which implies that the error-correction term does not depend much on the shape of the object.

In general, all models perform surprisingly well on ellipses, even if the models only had access to data from the butter object. Reaching the baseline performance on triangles is however only possible with a triangle in the training set.

Predicting the object's position is most sensitive to the shapes seen during training: It generalizes well to ellipses, which have similar shape and size as the butter or hex object. The triangles, on the other hand, are very different from the other objects in the dataset, and the error for localizing triangles is ten times higher than for ellipses.

G. Visualizations

As a qualitative evaluation, we plot the predictions of our networks, the physics baseline and the ground truth object motion for 200 repetitions of the same push configuration. All pushes have the same pushing angle (0°), velocity (20 mm/s) and contact point, but we sample a different transformation of the whole scene for each push, such that the object's pose in the image varies between the pushes. The networks were trained on the full dataset from Experiment V-C.
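The scene randomization for these repetitions can be pictured as applying a random planar rigid transform to the object pose before rendering; the sketch below is our own illustration, and the sampling ranges are not the ones used for the dataset:

```python
import numpy as np

def random_scene_transform(pose, rng, max_shift=0.05):
    """Apply a random planar (SE(2)) transform to an object pose (x, y, theta).

    pose: (x [m], y [m], theta [rad]) in the world frame.
    The shift range and full rotation range are illustrative only.
    """
    dx, dy = rng.uniform(-max_shift, max_shift, size=2)
    dtheta = rng.uniform(-np.pi, np.pi)
    x, y, theta = pose
    return (x + dx, y + dy, theta + dtheta)

rng = np.random.default_rng(0)
# 200 repetitions of the same push, each rendered with a differently transformed scene.
poses = [random_scene_transform((0.0, 0.0, 0.0), rng) for _ in range(200)]
```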

The results shown in Figure 9 illustrate that the ground truth object motion for the same push varies greatly, with two distinct modes. By comparing with the prediction of the analytical model, we can estimate how much of this variance is due to slight changes in the push configuration (which also reflect in the analytical model) and how much is caused by other, non-deterministic effects.

The predictions of hybrid and the analytical model are very similar. This shows that the state representation that the neural network part of hybrid predicts is mostly accurate. This is confirmed in Figure 10, where we plot the estimated contact points and normals. Adding an error-correction term to the hybrid architecture improves the average estimation quality a little, but also increases the variance of the predictions.

The other models, too, make good predictions in this example, but especially simple and neural-dyn have more variance in the direction of the predicted translation. Figure 10 also shows that simple is not very accurate in predicting the contact points. It is however possible that the error in the contact point prediction is compensated for by vp, which in simple is not predicted by the analytical model but estimated by the neural network. It is also interesting to see that neural, neural-dyn and simple all make very similar predictions that slightly overestimate the rotation.

H. Evaluation of the error-architecture

The previous results have shown that adding a learned error-correction term to the output of the analytical model in the hybrid architecture enables the network to improve over the performance of the analytical model. The error model we analysed is able to outperform hybrid and the physics baseline if the training set and the test set are similar (see Experiment V-C).

As explained in Section IV, we chose to block the propagation of gradients from the error-correction module to the glimpse-encoding, because we did not want the error-computation to interfere with the prediction of the state representation. Here, we also evaluate an architecture error-grad that does not block the gradient propagation.

Page 11: Combining learned and analytical models for predicting action … · 2018-10-22 · Combining learned and analytical models for predicting action effects from sensory data Alina Kloss

[Figure 8 plots: prediction errors in translation [%], rotation [%] and position [mm] over the set of training objects; legend: neural, simple, hybrid, error, physics.]

Fig. 8: Prediction errors in translation, rotation and position on objects not seen during training. Training objects are shown on the x-axis. The top row shows results for evaluating on ellipses, the bottom row on triangles. All networks generalize well to ellipses, but are worse for triangles, where the error in predicted position is ten times higher than for the other objects. Neural particularly struggles with predicting rotations of previously unseen objects.

TABLE III: Evaluation of different architectures for predicting an error-correction term. In contrast to error, error-grad allows the propagation of gradients from the error-prediction module to the glimpse encoding. Error-norm instead normalizes the push action to unit length before using it as input to the error-prediction. Values shown are for training on the full training set (190k examples). Results for hybrid and neural are repeated for reference.

             trans            rot              pos [mm]
neural       17.4 (0.12)%     33.4 (0.28)%     0.31 (0.002)
hybrid       19.3 (0.13)%     36.1 (0.3)%      0.32 (0.002)
error        18.4 (0.12)%     34.6 (0.29)%     0.31 (0.002)
error-grad   17.9 (0.12)%     34.4 (0.29)%     0.29 (0.002)
error-norm   18.3 (0.12)%     35.3 (0.29)%     0.31 (0.002)
physics      18.95 (0.13)%    35.4 (0.3)%      -
zero         2.95 (0.02) mm   1.9 (0.01)°      -

This architecture manages to beat hybrid by an even bigger margin, as shown in Table III.

The downside of propagating the gradients becomes apparent if we look at generalization to new pushing velocities: While the predictions of error become worse with increasing velocity, they still remain more accurate than the predictions of neural, as illustrated in Figure 11. Error-grad, on the other hand, performs even worse than the pure neural network. A reason for this difference could be that error-grad relies more strongly on the error-correction term than error. This allows it to fit the training data more closely, but at the same time impedes generalization to unseen actions.

As explained before, the reason for the decline in performance when extrapolating is that the neural networks cannot scale their predictions correctly according to the input velocity. One possibility to make the error-prediction more robust to higher input velocities is the architecture we call error-norm. In this model, we scale the push action to unit length before using it as input to the error-prediction. This makes the error-prediction independent of the magnitude of the action, while still giving it information about the push direction. The resulting model performs only slightly worse than error inside the training domain, but much better for extrapolation. It is still worse than hybrid though, since it cannot properly scale the error term to match higher velocities.
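As a minimal sketch of this normalization step (our own illustration; the function and variable names are hypothetical):

```python
import numpy as np

def error_norm_input(push_action, eps=1e-8):
    """Scale the push action to unit length before it enters the
    error-prediction module (the 'error-norm' variant): the error term
    then depends on the push direction but not on its magnitude."""
    return push_action / (np.linalg.norm(push_action) + eps)

# Pushes in the same direction at 20 mm/s and 100 mm/s yield the same input,
# so the learned error term cannot (and need not) scale with velocity.
print(error_norm_input(np.array([20.0, 0.0])))   # [1. 0.] up to rounding
print(error_norm_input(np.array([100.0, 0.0])))  # [1. 0.] up to rounding
```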

Using the error-correction term of course becomes much more interesting if the analytical model is bad. To test how well the hybrid and error architectures can compensate for wrong models, we manipulate the friction parameter l by setting it to 1.5 or 3 times its real value. The results are shown in Table IV.

Manipulating the friction values is especially harmful for predicting the rotation of the object, and both hybrid and error perform much better than the physics baseline with the wrong friction parameter. The visualization in Figure 12 shows that both models predicted incorrect contact points to counter the effect of the higher friction value. This makes sense, since the location of the contact point influences the tradeoff between how much the object rotates and how much it translates. The predictions from error deviate farther from the ground truth values, which shows that the additional error-term does not prevent the model from manipulating the input values to the analytical model. Instead, it achieves its good results by combining both forms of correction.

VI. EXTENSION TO NON-TRIVIAL VIEWPOINTS

In the previous part, we used depth images that showed a top-down view of the scene. This simplified the perception part and allowed us to focus our experiments on comparing the different architectures for learning the dynamics model. In this section, we briefly describe what changes when we move away from the top-down perspective and show that the proposed hybrid approach still works well on scenes that were recorded from an arbitrary viewpoint.


(a) Ground truth (b) Physics [19] (c) Hybrid (d) Simple

(e) Neural (f) Neural-dyn (g) Error

Fig. 9: Qualitative evaluation on 200 repeated pushes with the same push configuration (angle, velocity, contact point). The green rectangles show the (predicted) pose of the object after the push and the blue lines illustrate the object's translation (for better visibility, we upscaled the lines by a factor of 5). The thicker orange rectangle is the average ground truth position of the object after the push. Red crosses indicate the predicted initial object positions. All models predict the movement of the object and its initial position well, but cannot capture the multimodal distribution of the ground truth data.

(a) Contact points predicted by simple (b) Contact points predicted by hybrid (c) Contact points predicted by error
(d) Contact normal predicted by hybrid (e) Contact normal predicted by error

Fig. 10: Predicted contact points and normals from 200 repeated pushes with the same push configuration (angle, velocity, contact point). The black point marks the (average) ground truth contact point. While hybrid and error make fairly accurate predictions, simple predicts the contact points not on the edge of the object but close to its center.

Inspired by Byravan and Fox [5], we extend the input data from depth images to full 3D point clouds. These point clouds are still image-shaped and can thus be treated like normal images whose channels encode coordinate values instead of colour or intensity.
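As a small illustration (our own sketch, not code from the paper), such an organized point cloud is just a three-channel, image-shaped array and can be fed to the same convolutional layers as a depth image:

```python
import numpy as np

H, W = 120, 160  # image resolution, illustrative values

# One 3D point per pixel: an H x W x 3 array whose channels hold the
# x, y, z coordinates instead of colour or intensity.
point_cloud = np.zeros((H, W, 3), dtype=np.float32)

# A convolutional network consumes this like any other image; only the
# number of input channels changes (3 instead of 1 for a depth image).
batch = point_cloud[None]        # shape (1, H, W, 3)
print(batch.shape)
```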

A. Challenges

For describing the scene geometrically, we need three coordinate frames: the world frame, the camera frame and a frame that is attached to the object. In our rendered scenes, the origin of the world frame is located at the centre of the table and its x-y plane is aligned with the table surface. The object frame is attached to the centre of mass of the object.

In planar pushing, we expect that the object can only move and rotate in the x-y plane of its supporting surface; the x-y planes of the object and world frame are thus aligned. The movement of the object is described relative to the world frame.

The camera frame is located at the sensor position and its z-axis points towards the origin of the world frame. Depth images contain the distance of each visible point to the sensor along the z-axis, i.e. the z coordinate of that point in the camera frame. Finally, a 3D point (x y z)^T in camera coordinates is mapped to a pixel (u v)^T by a perspective projection.


[Figure 11 plots: translation [%] and rotation [%] errors over push velocity [mm/s] (10 to 150 mm/s); legend: physics, neural, hybrid, error, error-grad, error-norm.]

Fig. 11: Evaluation of the different architectures for predicting an error-correction term on unseen push velocities. All models were trained on push velocity 20 mm/s. None of the error-prediction models is as robust as hybrid to higher input velocities. Error-norm performs best because its predicted error terms are independent of the push velocity. Error-grad presumably relies more on the error-prediction term than the other architectures and therefore performs worst outside of the training domain.

TABLE IV: Prediction errors of physics, hybrid and error when using a manipulated friction parameter l. In contrast to physics, both neural networks can compensate for the resulting error of the analytical model. Hybrid can however only modify the input values of the analytical model, while error can correct the model's output directly and thus compensates for the error of the analytical model much better.

                   trans            rot              pos [mm]
1.5 · l  hybrid    20.7 (0.13)%     40.5 (0.32)%     0.3 (0.002)
         error     19.2 (0.14)%     35.9 (0.3)%      0.25 (0.002)
         physics   23.9 (0.15)%     46.1 (0.37)%     -
3 · l    hybrid    25.1 (0.15)%     66.9 (0.45)%     0.31 (0.002)
         error     19.6 (0.13)%     37.2 (0.3)%      0.32 (0.002)
         physics   35.6 (0.23)%     80.1 (0.53)%     -

With focal length f, this projection is given by Eq. (11):

\begin{pmatrix} u \\ v \end{pmatrix} = \frac{f}{z} \begin{pmatrix} x \\ y \end{pmatrix} \qquad (11)

In the special case of a top-down view, the x-y planes of the world and camera frame are aligned, such that we can apply the analytical model presented in Section III-A in both frames. In addition, the measured depth (z in the camera frame) of the table and the object is independent of their x and y position. This reduces the perspective projection to a scaling operation with a constant factor and allowed us to predict the object movement directly in pixel space. It also made segmenting the object very easy, since it is associated with one specific depth value.
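A small numeric sketch of this projection and of the top-down special case (our own illustration; the focal length is an arbitrary value):

```python
import numpy as np

def project(point_cam, f):
    """Perspective projection of a camera-frame point (x, y, z) to pixel
    coordinates (u, v), as in Eq. (11): (u, v) = f/z * (x, y)."""
    x, y, z = point_cam
    return np.array([f * x / z, f * y / z])

f = 525.0  # focal length in pixels (illustrative)

# Top-down view: table and object share (roughly) the same z, so the
# projection reduces to a constant scaling of x and y ...
print(project((0.10, 0.05, 1.0), f), project((0.20, 0.10, 1.0), f))

# ... whereas from an arbitrary viewpoint the same planar displacement spans
# a different number of pixels depending on the distance z to the camera.
print(project((0.10, 0.05, 0.8), f), project((0.10, 0.05, 1.2), f))
```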

Without the assumption of a top-down view, predicting object movement in pixel space becomes more challenging, since the same movement spans fewer pixels the further away the object is from the camera. In addition, perspective distortion now affects the shape and size of the objects in pixel space, which is e.g. relevant for localizing the object.

B. Network architecture

We now describe the changes we made to the architectures from Section V to adapt to new camera configurations. To be able to relate pixel coordinates to 3D coordinates, we assume that we have access to the parameters of the camera (focal length) and the transform between camera frame and world frame.

1) Perception: The perception part remains largely as it was before. As our architecture estimates the position of the object in pixel space (using spatial softmax), we use the given camera parameters and transform to calculate the corresponding position in 3D world coordinates. For this we need the z coordinate that corresponds to the predicted pixel coordinates, which we get by interpolating between the values of the four pixels closest to the predicted coordinates. Due to inaccuracies in the depth values and the projection matrix, this operation introduces an error of about 0.3 mm, which we however consider negligible.
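A sketch of this conversion (our own illustration; function and variable names are hypothetical, and pixel coordinates are assumed to be relative to the principal point, as in Eq. (11)):

```python
import numpy as np

def bilinear_z(depth, u, v):
    """Depth at sub-pixel location (u, v), interpolated from the four
    closest pixels. depth is an H x W array; u indexes columns, v rows."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * depth[v0, u0] +
            du * (1 - dv) * depth[v0, u0 + 1] +
            (1 - du) * dv * depth[v0 + 1, u0] +
            du * dv * depth[v0 + 1, u0 + 1])

def pixel_to_world(u, v, depth, f, T_world_cam):
    """Invert Eq. (11) using the interpolated depth and map the resulting
    camera-frame point into the world frame with the 4x4 camera-to-world
    transform T_world_cam."""
    z = bilinear_z(depth, u, v)
    p_cam = np.array([u * z / f, v * z / f, z, 1.0])  # homogeneous point
    return (T_world_cam @ p_cam)[:3]
```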

2) Prediction and Training: If we do not use a top-down view of the scene, the question becomes relevant in which coordinate frame the network should predict the contact point and the normal: pixel space, camera coordinates or world coordinates. We decided to continue using pixel coordinates, since predictions in this space can be most directly related to the input image and the predicted feature maps. After prediction, we however have to transfer the predicted contact point and normal from pixel space to world coordinates, since the analytical model can only be used in the world frame, where the movement of the object falls exclusively onto the x-y plane.

We found training the model on data from arbitrary viewpoints more challenging than in the top-down scenario. To facilitate the process, we modified the network and the training loss to treat contact prediction and velocity prediction separately: Instead of using the predicted contact indicator in the analytical model, the network now predicts the contact indicator s as a separate output. We interpret s as the probability that the pusher is in contact with the object and place a cross-entropy loss on it. For the predicted velocity, we only penalize errors if there was (ground truth) contact. This prevents non-informative gradients from reaching the velocity prediction if the object did not move and also solves the problem that the contact prediction tries to compensate for wrong velocity predictions.

At test time, the predicted contact indicator is multiplied with the predicted object movement from the analytical model to get the final velocity prediction.

The new loss for a single training example looks as follows: Let ŝ, v̂o and ô denote the predicted and s, vo, o the real contact indicator, object movement and position.


(a) Physics (b) Hybrid (c) Error
(d) Contact points predicted by hybrid (e) Contact points predicted by error (f) Contact normal predicted by hybrid (g) Contact normal predicted by error

Fig. 12: Predicted movement, contact points and normals from 200 repeated pushes when using a wrong friction parameter (1.5 · l). The black point marks the (average) ground truth contact point. Both networks compensate for the wrong friction parameter by predicting the contact point in a slightly wrong position, but the deviation from the ground truth is stronger for error, which also flips the direction of the predicted normal.

Here, w are the network weights, νo = [vo,x, vo,y] denotes the linear component of the object velocity and ω its angular component.

\begin{align*}
L(\hat{s}, \hat{v}_o, \hat{o}, s, v_o, o) &= \text{trans} + \text{rot} + \text{pos} + \lambda_2\,\text{contact} + \lambda_1 \sum_{w} \| w \| \\
\text{trans} &= s\, \| \hat{\nu}_o - \nu_o \| \\
\text{rot} &= s\, \tfrac{180}{\pi}\, | \hat{\omega} - \omega | \\
\text{pos} &= \| \hat{o} - o \| \\
\text{contact} &= -\big( s \log(\hat{s}) + (1 - s) \log(1 - \hat{s}) \big)
\end{align*}

To ensure that all components of the loss are of the same magnitude, we compute the translation and position error in millimetres and the rotation in degrees. We set λ2 = 10 and λ1 = 0.001.
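A minimal sketch of this per-example loss and of the test-time gating (our own rendering of the formulas above; the weight-regularization term is omitted and all names are illustrative):

```python
import numpy as np

def push_loss(s_pred, v_pred, omega_pred, o_pred,
              s_true, v_true, omega_true, o_true,
              lambda_contact=10.0, eps=1e-7):
    """Translation and rotation errors are only penalized when there was
    ground-truth contact (s_true = 1); the contact indicator itself gets a
    cross-entropy loss. Translation/position in mm, rotation in degrees."""
    trans = s_true * np.linalg.norm(v_pred - v_true)
    rot = s_true * np.degrees(np.abs(omega_pred - omega_true))
    pos = np.linalg.norm(o_pred - o_true)
    contact = -(s_true * np.log(s_pred + eps) +
                (1.0 - s_true) * np.log(1.0 - s_pred + eps))
    return trans + rot + pos + lambda_contact * contact

def test_time_velocity(s_pred, v_analytical):
    """At test time, the predicted contact probability gates the velocity
    predicted by the analytical model."""
    return s_pred * v_analytical
```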

C. Evaluation

To show that our approach is not dependent on the top-down perspective, we train and evaluate it on images collected from a more natural viewpoint: The camera is located at (0, −0.25, 0.4) in world coordinates, which corresponds to a viewer standing in front of the table instead of hovering above its centre. Figure 13 shows the depth channel of three example images.

Data: As in Section V, we use a dataset that contains all objects from the MIT Push dataset and all pushes with velocity 20 mm/s for evaluating on this new viewpoint. We split it randomly into about 190k training examples and about 37k examples for testing.

Results: After 75k training steps, the hybrid model performs very similarly to the top-down case: 19.4% translation error and 37.2% error in rotation, compared to 19.3% and 36.1% in the top-down case. The only notable difference is that the error in predicted position climbs from 0.32 mm to 1.12 mm. This can be partially explained by the inaccuracy introduced when converting from pixel coordinates to world coordinates. It can also be an effect of the perspective distortion that influences the shape and size of the objects depending on how close they are to the camera. In relation to the size of the object, an error of around 1 mm is however still very small.

VII. CONCLUSION AND FUTURE WORK

In this paper, we considered the problem of predicting the effect of physical interaction from raw sensory data. We compared a pure neural network approach to a hybrid approach that uses a neural network for perception and an analytical model for prediction. Our test bed involved pushing of planar objects on a surface - a non-linear, discontinuous manipulation task for which we have both millions of data points and an analytical model.

We observed two main advantages of the hybrid architecture. Compared to the pure neural network, it significantly (i) reduces the required training data and (ii) improves generalization to novel physical interaction.


Fig. 13: Example depth images recorded from a non-top-down viewpoint. The depth values increase towards the back of the scene and the perspective transform affects the shape of the objects.

The analytical model aids generalization by limiting the ability of the hybrid architecture to (over-)fit the training data and by providing multiplication operations for scaling the output according to the input action. This kind of mathematical operation is hard to learn for fully-connected architectures and requires many parameters and training examples for covering a large value range. The drawback of the hybrid approach is that it cannot easily improve on the performance of the underlying analytical model.

Neural, on the other hand, can beat both the hybrid approach and the analytical model (with ground truth input values) if trained on enough data. This however only holds when we evaluate on actions encountered during training and does not transfer to new push configurations, velocities or object shapes. The challenge in these cases is that the distributions of the training and test data differ significantly.

To enable the hybrid approach to improve more on the prediction accuracy of its analytical model, we experimented with learning an additional additive error-correction term. These models are almost as data-efficient as hybrid and can to some extent retain the ability, provided by the analytical model, to generalize to different test data. Our experiments with a wrong analytical model showed that they can compensate for errors of the model much better than hybrid, which can only influence the prediction by manipulating the input values of the analytical model.

The last architecture, simple, showed that combining learning and analytical models is not automatically guaranteed to lead to good performance. By replacing the first stage of the analytical model with a neural network, we instead combined the disadvantages of both approaches: The architecture needs lots of training data and does not generalize well to new pushes, because it misses the part of the analytical model that explains the influence of the pushing action on the resulting object velocity. In contrast to the pure neural network, it however also cannot improve on the performance of the analytical model.

A limitation of the presented hybrid approach is that it may be hard to find an accurate analytical model for some physical processes and that not all existing models are suitable for our approach, as they e.g. need to be differentiable everywhere. Especially the switching dynamics encountered when the contact situation changes proved to be challenging, and more work needs to be done in this direction. In such cases, learning the predictive model with a neural network is still a very good option.

In perception, on the other hand, the strengths of neural networks can be well exploited to extract the input parameters of the analytical model from raw sensory data. By training end-to-end through a given model, we can avoid the effort of labelling data with the ground truth state. Our experiments also showed that training end-to-end allows the hybrid models to compensate for smaller errors in the analytical model by adjusting the predicted input values.

Using the state representation of the analytical model for the hybrid architecture also has the advantage that the predictions of the network can be visualized and interpreted. This is not easily possible for the intermediate representations learned by the pure neural network. Our results however suggest that the pure neural network benefits from being free to choose its own state representation, as learning the dynamics model from the ground truth state representation led to worse prediction results.

In the future, we want to extend our work to more complex perception problems, like training on RGB images or scenes with multiple objects. An interesting question is whether perception on point clouds could be facilitated by using methods like pointnet++ [?] that are specifically designed for this type of input data, instead of treating the point cloud like a normal image.

A logical next step is also using our hybrid models for predicting more than one step into the future. Working on sequences makes it, for example, possible to guide learning by enforcing constraints like temporal consistency or to exploit temporal cues like optical flow. The models could also be used in a filtering scenario to track the state of an object and at the same time infer latent variables of the system like the friction coefficients. Similarly, we can also use the learned models to plan and execute robot actions using model predictive control.

ACKNOWLEDGMENTS

This research was supported in part by National Science Foundation grants IIS-1205249, IIS-1017134, EECS-0926052, the Office of Naval Research, the Okawa Foundation, the German Research Foundation and the Max Planck Society. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations.

NOTES

1 We also evaluated two different prediction horizons but found no significant effect on the performance.

2 We provide these inputs as friction-related information cannot be obtained from single images. Estimation from sequences is considered future work.

REFERENCES

[1] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Pulkit Agrawal, Ashvin V. Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Inform. Process. Syst., pages 5074–5082, 2016.

[3] M. Bauza and A. Rodriguez. A probabilistic data-driven model for planar pushing. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3008–3015, May 2017. doi: 10.1109/ICRA.2017.7989345.

[4] Dominik Belter, Marek Kopicki, Sebastian Zurek, and Jeremy Wyatt. Kinematically optimised predictions of object motion. In Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, pages 4422–4427. IEEE, 2014.

[5] Arunkumar Byravan and Dieter Fox. SE3-Nets: Learning rigid body motion using deep neural networks. In Robotics and Automation (ICRA), 2017 IEEE Int. Conf. on, pages 173–180. IEEE, 2017.

[6] Arunkumar Byravan, Felix Leeb, Franziska Meier, and Dieter Fox. SE3-Pose-Nets: Structured deep dynamics models for visuomotor planning and control. To appear at Robotics and Automation (ICRA), 2018 IEEE Int. Conf. on, abs/1710.00489, 2017.

[7] Jonas Degrave, Michiel Hermans, Joni Dambre, and Francis Wyffels. A differentiable physics engine for deep learning in robotics. CoRR, abs/1611.01652, 2016.

[8] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Inform. Process. Syst. 29, pages 64–72, 2016.

[9] Suresh Goyal, Andy Ruina, and Jim Papadopoulos. Planar sliding with dry friction part 1. Limit surface and moment function. Wear, 143(2):307–330, 1991. ISSN 0043-1648. doi: https://doi.org/10.1016/0043-1648(91)90104-3.

[10] Soo Hong Lee and Mark Cutkosky. Fixture planning with friction. Journal of Engineering for Industry, 113, 08 1991. URL http://manufacturingscience.asmedigitalcollection.asme.org/article.aspx?articleid=1447458.

[11] Robert D. Howe and Mark R. Cutkosky. Practical force-motion models for sliding manipulation. The International Journal of Robotics Research, 15(6):557–572, 1996. doi: 10.1177/027836499601500603.

[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. 32nd Int. Conf. on Machine Learning, volume 37, pages 448–456, 07–09 Jul 2015.

[13] Rico Jonschkowski and Oliver Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, Oct 2015. ISSN 1573-7527. doi: 10.1007/s10514-015-9459-7.

[14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[15] Alina Kloss, Stefan Schaal, and Jeannette Bohg. Combining learned and analytical models for predicting action effects. CoRR, abs/1710.04102, 2018. URL http://arxiv.org/abs/1710.04102. Submitted to RSS'18.

[16] Marek Kopicki, Sebastian Zurek, Rustam Stolkin, Thomas Moerwald, and Jeremy L. Wyatt. Learning modular and transferable forward models of the motions of push manipulated objects. Autonomous Robots, 41(5):1061–1082, 6 2017. doi: 10.1007/s10514-016-9571-3.

[17] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. J. Mach. Learning Research, 17(39):1–40, 2016.

[18] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.

[19] K. M. Lynch, H. Maekawa, and K. Tanie. Manipulation and active sensing by pushing using tactile feedback. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, volume 1, pages 416–421, Jul 1992. doi: 10.1109/IROS.1992.587370.

[20] Georg Martius and Christoph H. Lampert. Extrapolation and learning equations. CoRR, abs/1610.02995, 2016. URL http://arxiv.org/abs/1610.02995.

[21] Matthew T. Mason. Mechanics and planning of manipulator pushing operations. The International Journal of Robotics Research, 5(3):53–71, 1986.

[22] Tekin Mericli, Manuela Veloso, and H. Levent Akın. Push-manipulation of complex passive mobile objects using experimentally acquired motion models. Autonomous Robots, 38(3):317–329, 2015.

[23] Duy Nguyen-Tuong and Jan Peters. Using model knowledge for learning inverse dynamics. In Robotics and Automation (ICRA), 2010 IEEE Int. Conf. on, pages 2677–2682. IEEE, 2010.

[24] N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran. Visual interaction networks. ArXiv e-prints, June 2017.

[25] Jiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Josh Tenenbaum. Learning to see physics via visual de-animation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 152–163. Curran Associates, Inc., 2017.

[26] K. T. Yu, M. Bauza, N. Fazeli, and A. Rodriguez. More than a million ways to be pushed. A high-fidelity experimental dataset of planar pushing. In 2016 IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), pages 30–37, Oct 2016. doi: 10.1109/IROS.2016.7758091.

[27] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.

[28] L. Zhang and J. C. Trinkle. The application of particle filtering to grasping acquisition with visual occlusion and tactile sensing. In 2012 IEEE Int. Conf. on Robotics and Automation, pages 3805–3812, May 2012.

[29] Jiaji Zhou, Robert Paolini, J. Andrew Bagnell, and Matthew T. Mason. A convex polynomial force-motion model for planar sliding: Identification and application. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 372–377. IEEE, 2016.