
Occlusion Reasoning for Object Detection under Arbitrary Viewpoint

Edward Hsiao, Student Member, IEEE, and Martial Hebert, Member, IEEE

Abstract—We present a unified occlusion model for object instance detection under arbitrary viewpoint. Whereas previous approaches primarily modeled local coherency of occlusions or attempted to learn the structure of occlusions from data, we propose to explicitly model occlusions by reasoning about 3D interactions of objects. Our approach accurately represents occlusions under arbitrary viewpoint without requiring additional training data, which can often be difficult to obtain. We validate our model by incorporating occlusion reasoning with the state-of-the-art LINE2D and Gradient Network methods for object instance detection and demonstrate significant improvement in recognizing texture-less objects under severe occlusions.

Index Terms—occlusion reasoning, object detection, arbitrary viewpoint

1 INTRODUCTION

Occlusions are common in real world scenes and are a major obstacle to robust object detection. While texture-rich objects can be detected under severe occlusions with distinctive local features, such as SIFT [1], many man-made objects have large uniform regions. These texture-less objects are characterized by their contour structure, which is often ambiguous even without occlusions. Instance detection of texture-less objects compounds this ambiguity by requiring recognition under arbitrary viewpoint with severe occlusions as shown in Fig. 1. While much research has addressed each component separately (texture-less objects [2]–[5], arbitrary viewpoint [6]–[8], occlusions [9]–[11]), addressing them together is extremely challenging. The main contributions of this paper are (i) a concise model of occlusions under arbitrary viewpoint without requiring additional training data and (ii) a method to capture global visibility relationships without combinatorial explosion.

In the past, occlusion reasoning for object detection has been extensively studied [11]–[13]. One common approach is to model occlusions as regions that are inconsistent with object statistics [10], [14], [15] and to enforce local coherency with a Markov Random Field [16] to reduce noise in these classifications. While assuming that any inconsistent region is an occlusion is valid if occlusions happen uniformly over an object, it ignores the fact that there is structure to occlusions for many objects. For example, in real world environments, objects are usually occluded by other objects resting on the same surface. Thus it is often more likely for the bottom of an object to be occluded than the top of an object [17].

Recently, researchers have attempted to learn the structure of occlusions from data [9], [18].

• The authors are with The Robotics Institute, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA, 15213. E-mail: {ehsiao,hebert}@cs.cmu.edu

Fig. 1: Example detections of (left) cup and (right) pitcher under severe occlusions.

With enough data, these methods can learn an accurate model of occlusions. However, obtaining a broad sampling of occluder objects is usually difficult, resulting in biases to the occlusions of a particular dataset. This becomes more problematic when considering object detection under arbitrary view [6], [19], [20]. Learning approaches need to learn a new model for each view of an object and require the segmentation of the occluder for each training image. This is intractable, especially when recent studies [6] have claimed that approximately 2000 views are needed to sample the view space of an object. A key contribution of our approach is to represent occlusions under arbitrary viewpoint without requiring additional annotated training data of occlusion segmentations. We demonstrate that our approach accurately models occlusions by using only information about the distribution of object dimensions in an environment and the size of the object of interest, and that learning occlusions from data does not give better performance.

Researchers have shown in the past that incorporating 3D geometric understanding of scenes [21], [22] improves the performance of object detection systems. Following these approaches, we propose to reason about occlusions by explicitly modeling 3D interactions of objects. For a given environment, we compute physical statistics of objects in the scene and represent an occluder


as a probabilistic distribution of 3D blocks. The physical statistics need only be computed once for a particular environment and can be used to represent occlusions for many objects in the scene. By reasoning about occlusions in 3D, we effectively provide a unified occlusion model for different viewpoints of an object as well as different objects in the scene.

We incorporate occlusion reasoning with object detection by: (i) a bottom-up stage which hypothesizes the likelihood of occluded regions from the image data, followed by (ii) a top-down stage which uses prior knowledge represented by the occlusion model to score the plausibility of the occluded regions. We combine the output of the two stages into a single measure to score a candidate detection.

The focus of this paper is to demonstrate that a relatively simple model of 3D interaction of objects can be used to represent occlusions effectively for instance detection of texture-less objects under arbitrary view. Recently, there has been significant progress in simple and efficient template matching techniques [6], [23] for instance detection. These approaches work extremely well when objects are largely visible, but degrade rapidly when faced with strong occlusions in heavy background clutter. We incorporate our occlusion reasoning with two state-of-the-art gradient-based template matching methods, LINE2D [6] and Gradient Network (GN) [5], and demonstrate significant improvement in detection performance on the challenging CMU Kitchen Occlusion Dataset (CMU KO8) [24].

2 RELATED WORK

Occlusion reasoning has been widely used in many areas from object recognition to segmentation and tracking. While the literature is extensive, there has been comparatively little work on modeling occlusions from different viewpoints and using 3D information until recently. In the following, we review current techniques for occlusion reasoning and broadly classify them into four categories. We begin by discussing classical approaches which use object statistics, part-based models and multiple images, and then discuss more recent approaches on incorporating 3D information.

2.1 Inconsistent Object Statistics

Occlusions are commonly modeled as regions which are inconsistent with object statistics. Girshick et al. [14] use an occluder part in their grammar model when all parts cannot be placed. Wang et al. [10] use the scores of individual HOG filter cells, while Meger et al. [15] use depth inconsistency from 3D sensor data to classify occlusions. To reduce noise in occlusion classifications, local coherency of regions is often enforced [16]. Our approach hypothesizes occlusions as regions which are inconsistent with object statistics, and then performs higher level reasoning about the likelihood of the resulting occlusion pattern.

2.2 Multiple Images

When multiple images in a sequence are provided, adjacent frames are often used to disambiguate the object from the occluders. The location of objects is typically passed to subsequent frames to identify potential occlusions. For example, Shu et al. [25] pass the information of occluded parts to the following frame to penalize them when detecting. Ess et al. [26] keep tracks alive and extrapolate the state of occluded objects using an Extended Kalman Filter. Xing et al. [27] use a temporal sliding window to generate a set of reliable tracklets and use them to disambiguate fully occluded objects. Kowdle et al. [28] use motion cues and a smooth motion prior in a Markov Random Field framework to segment the scene into depth layers. Our approach differs from these methods in that we directly operate on a single image.

2.3 Part-based Models

While global object templates work well for detecting objects that are unoccluded, they quickly degrade when occlusions are present. A common representation is to separate an object into a set of parts, so that the overall detector is more robust to individual sections being occluded. Tang et al. [29] leverage the fact that occlusions often form characteristic patterns and extend the Deformable Parts Model (DPM) [30] for joint person detection. Vedaldi and Zisserman [31] decompose the HOG descriptor into small blocks which can selectively switch between either an object descriptor or an occlusion descriptor. For human pose estimation, Sigal and Black [32] encode occlusion relationships between body parts using hidden binary variables. Shu et al. [25] examine the contribution of each part using a linear SVM and adapt the classifier to use unoccluded parts which maximize the probability of detection. Wu and Nevatia [33] assign the responses of multiple part detectors into object hypotheses that maximize the joint likelihood. Given the visibility confidences of the parts on an object, our approach reasons about its likelihood in the real world.

2.4 3D Reasoning

More recently, with the advent of Kinect [34] and more affordable 3D sensors, there has been increasing work on introducing 3D information for occlusion reasoning. Having 3D data provides richer information of the world, such as depth discontinuities and object size. Wojek et al. [35] combine object and part detectors based on their expected visibility using a 3D scene model. Contemporary with this work, Pepik et al. [36] leverage fine-grained 3D annotated urban street scenes to mine distinctive, reoccurring occlusion patterns. Detectors are then trained for each of these patterns. Zia et al. [37] model occlusions on a 3D geometric object class model by enumerating a small finite set of occlusion patterns.


Fig. 2: Occlusion model. (left) Example camera view of an object (gray) and occluder (red). (middle) Projected width of occluder, w̄, for a rotation of θ. (right) Projected height of occluder, h̄, and projected height of object, H̄obj, for an elevation angle of ψ. Notice that h̄ is the projection of h onto the image plane since this is the maximum height that can occlude the object, while H̄obj is the apparent height of the object silhouette. An occluder needs a projected height of h̄ ≥ H̄obj to fully occlude the object.

Wang et al. [38] build a depth-encoded context model using RGB-D information and extend Hough voting to include both the object location and its visibility pattern. Our approach uses the key idea of reasoning about occlusions in 3D, but works directly on a 2D image.

3 OCCLUSION MODEL

Occlusions in real world scenes are often caused by a solid object resting on the same surface as the object of interest. In our model, we approximate occluding objects by their 3D bounding box and demonstrate how to compute occlusion statistics of an object under different camera viewpoints, c, defined by an elevation angle ψ and azimuth θ.

Let X = {X1, ..., XN} be a set of N points on the object with their visibility states represented by a set of binary variables V = {V1, ..., VN} such that if Vi = 1, then Xi is visible. For occlusions Oc under a particular camera viewpoint c, we want to compute occlusion statistics for each point in X. Unlike other occlusion models which only compute an occlusion prior P(Vi|Oc), we propose to also model the global relationship between visibility states, P(Vi|V-i, Oc), where V-i = V \ Vi. Through our derivation, we observe that P(Vi|Oc) captures the classic intuition that the bottom of the object is more likely to be occluded than the top. More interesting is P(Vi|V-i, Oc), which captures the structural layout of an occlusion. The computation of both occlusion properties reduces to integral geometry [39] (an entire field dedicated to geometric probability theory).

We make a couple of approximations to tractably derive the occlusion statistics. Specifically, since objects which occlude each other are usually physically close together, we approximate the objects to be on the same support surface, and we approximate the perspective effects over the range of object occlusions to be negligible.

3.1 Representation under different viewpoints

The likelihood that a point on an object is occluded depends on the angle the object is being viewed from. Most methods that learn the structure of occlusions from data [9] require a separate occlusion model for each view of every object. These methods do not scale well when considering detection of many objects under arbitrary view.

In the following, we propose a unified representation of occlusions under arbitrary viewpoint of an object. Our method requires only the statistics of object dimensions, which is obtained once for a given environment and can be shared across many objects for that environment.

The representation we propose is illustrated in Fig. 2. For a specific viewpoint, we represent the portion of a block that can occlude the object as a bounding box with dimensions corresponding to the projected height h̄ and the projected width w̄ of the block.

The object of interest, on the other hand, is represented by its silhouette in the image. Initially, we derive our model using the bounding box of the silhouette with dimensions H̄obj and W̄obj, and then relax our model to use the actual silhouette (Section 3.4).

First, we compute the projected width w̄ of an occluder with width w and length l, as shown by the top-down view in Fig. 2. In our convention, w̄ = w for an azimuth of θ = 0. Using simple geometry, the projected width is:

   w̄(θ) = w · |cos θ| + l · |sin θ|.   (1)

Since θ is unknown for an occluding object, we obtain a distribution of w̄ assuming all rotations about the vertical axis are equally likely. The distribution of w̄ over θ ∈ [0, 2π] is equivalent to the distribution over any π/2 interval. Thus, the distribution of w̄ is computed by transforming a uniformly distributed random variable on [0, π/2] by (1).


Fig. 3: Computation of the occlusion prior. (a) We consider the center positions of a block (red) which occlude the object. The base of the block is always below the object, since we assume they are on the same surface. (b) The set of positions is defined by the yellow rectangle, which has area A_{Oc}. (c) The set of positions which occlude the object while keeping Xi visible is defined by the green region, which has area A_{Vi,Oc}.

The resulting probability density of w̄ is given by:

   pw̄(w̄) = (2 / (π √(w² + l²))) · (1 − w̄²/(w² + l²))^(−1/2),   w ≤ w̄ < l,
   pw̄(w̄) = (4 / (π √(w² + l²))) · (1 − w̄²/(w² + l²))^(−1/2),   l ≤ w̄ < √(w² + l²).   (2)

The full derivation of this density is provided in Appendix A.
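
To make the transformation concrete, the following minimal Python sketch (not part of the original paper) draws rotations uniformly over a π/2 interval, projects a hypothetical w × l block with (1), and discretizes the resulting samples into a stand-in for pw̄(w̄); the dimensions and bin count are illustrative assumptions.

    import numpy as np

    def projected_width_samples(w, l, n=100000, seed=0):
        # Draw rotations uniformly over a pi/2 interval and project with Eq. (1).
        theta = np.random.default_rng(seed).uniform(0.0, np.pi / 2.0, n)
        return w * np.abs(np.cos(theta)) + l * np.abs(np.sin(theta))

    # Hypothetical occluder dimensions (width <= length).
    w, l = 5.0, 12.0
    samples = projected_width_samples(w, l)

    # Discretized stand-in for p(w_bar), supported on [w, sqrt(w^2 + l^2)].
    bins = np.linspace(w, np.hypot(w, l), 51)
    p_wbar, _ = np.histogram(samples, bins=bins, density=True)

    # Sanity check: under a uniform rotation, E[w_bar] = 2 (w + l) / pi.
    assert abs(samples.mean() - 2.0 * (w + l) / np.pi) < 0.05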

Next, we compute the projected height h̄ of an occluder, as illustrated by the side view of Fig. 2. We define h̄ to be the projection of h on the image plane, as this corresponds to the maximum height that can occlude the object given our assumptions. Blocks with different width and length, but the same height, will have the same occlusion of the object vertically. Thus, for an elevation angle ψ and an occluding block with height h, the projected height h̄ is:

   h̄(ψ) = h · cos ψ.   (3)

The projected height of the object, H̄obj, is slightly different in that it accounts for the apparent height of the object silhouette. An object is fully occluded vertically only if h̄ ≥ H̄obj. To compute H̄obj, we need the distance, Dobj, from the closest edge to the farthest edge of the object. Following the computation of the projected width w̄, we have Dobj(θ) = Wobj · |sin θ| + Lobj · |cos θ|. The projected height of the object at an elevation angle ψ is then given by:

   H̄obj(θ, ψ) = Hobj · |cos ψ| + Dobj(θ) · |sin ψ|.   (4)

Finally, the projected width of the object, W̄obj, is computed using the aspect ratio of the silhouette bounding box.
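
As a small illustration of (1), (3) and (4) (not from the paper), the sketch below computes the projected occluder height and the apparent object height for a hypothetical viewpoint; the dimensions are made-up and the angle convention follows the equations above.

    import numpy as np

    def projected_occluder_height(h, psi):
        # Eq. (3): projected height of an occluding block at elevation angle psi.
        return h * np.cos(psi)

    def projected_object_height(H_obj, W_obj, L_obj, theta, psi):
        # Eq. (4), using the closest-to-farthest edge distance D_obj(theta).
        D_obj = W_obj * np.abs(np.sin(theta)) + L_obj * np.abs(np.cos(theta))
        return H_obj * np.abs(np.cos(psi)) + D_obj * np.abs(np.sin(psi))

    # Hypothetical dimensions (cm) viewed from a 30 degree elevation angle.
    psi, theta = np.deg2rad(30.0), np.deg2rad(0.0)
    h_bar = projected_occluder_height(20.0, psi)            # occluder block of height 20
    H_bar = projected_object_height(15.0, 10.0, 10.0, theta, psi)
    print(h_bar >= H_bar)   # True only if the block can fully occlude vertically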

3.2 Occlusion Prior

Given the representation derived in Section 3.1, we want to compute a probability for a point on the object being occluded. Many systems which attempt to address occlusions assume that they occur randomly and uniformly across the object. However, recent studies [17] have shown that there is structure to occlusions for many objects.

We begin by deriving the occlusion prior using an occluding block with projected dimensions (w̄, h̄) and then extend the formulation to use a probabilistic distribution of occluding blocks of different sizes. The occlusion prior specifies the probability P(Vi|Oc) that a point on the object Xi = (xi, yi) is visible given an occlusion of the object. This involves estimating the area, A_{Oc}, covering the set of block positions that occlude the object (shown by the yellow region in Fig. 3b), and estimating the area, A_{Vi,Oc}, covering the set of block positions that occlude the object while keeping Xi visible (shown by the green region in Fig. 3c). The occlusion prior is then just the ratio of these two areas:

   P(Vi|Oc) = A_{Vi,Oc} / A_{Oc}.   (5)

From Fig. 3b, a block (red) will occlude the object if its center is inside the yellow region. The area of this region, A_{Oc}, is:

   A_{Oc} = (W̄obj + w̄) · h̄.   (6)

Next, from Fig. 3c, this region can be partitioned into a region where the occluding block occludes Xi (blue) and a region where it does not (green). A_{Vi,Oc} corresponds to the area of the green region and can be computed as:

   A_{Vi,Oc} = W̄obj · h̄ + w̄ · min(h̄, yi).   (7)

The derivation is provided in Appendix B.

Now that we have derived the occlusion prior using a particular occluding block, we extend the formulation to a distribution of blocks of different sizes. Let pw̄(w̄) and ph̄(h̄) be the distributions of w̄ and h̄, respectively. To simplify notation, we define µw̄ and µh̄ to be the expected projected width and height of the occluders under these distributions, and define βy(yi) = ∫ min(h̄, yi) · ph̄(h̄) dh̄.


Fig. 4: Example of (a) occlusion prior P(Vi|Oc), (b,c) conditional likelihood P(Vi|Vj, Oc) and P(Vi|Vk, Oc) given two separate points Xj and Xk individually, (d) approximate conditional likelihood P(Vi|Vj, Vk, Oc) from (12), and (e) explicit conditional likelihood P(Vi|Vj, Vk, Oc) from (10).

The average areas, A_{Oc} and A_{Vi,Oc}, are then given by:

   A_{Oc} = (W̄obj + µw̄) · µh̄,   (8)
   A_{Vi,Oc} = W̄obj · µh̄ + µw̄ · βy(yi).   (9)

This derivation assumes that the joint distribution pw̄,h̄(w̄, h̄) can be separated into pw̄(w̄) and ph̄(h̄). For household objects, we empirically verified that this approximation holds. In practice, the areas are computed by discretizing the distributions, and Fig. 4(a) shows an example occlusion prior. Fig. 5 shows how the distribution changes under different camera viewpoints. Our model is able to capture that the top of the object is much less likely to be occluded when viewed from a higher elevation angle than from a lower one. This is because the projected height of occluders is shorter the higher the elevation angle.
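
The discretized computation of the prior can be sketched as below (an illustration, not the authors' code): βy(yi) is approximated by an average over samples of h̄, and the point coordinate yi is assumed to be measured upward from the base of the object; all dimensions are hypothetical.

    import numpy as np

    def occlusion_prior(y, W_obj_bar, mu_w, mu_h, h_bar_samples):
        # Eqs. (5), (8), (9): probability that a point at height y stays visible.
        beta_y = np.mean(np.minimum(h_bar_samples, y))   # discretized beta_y(y_i)
        A_occ = (W_obj_bar + mu_w) * mu_h                # Eq. (8)
        A_vis = W_obj_bar * mu_h + mu_w * beta_y         # Eq. (9)
        return A_vis / A_occ

    # Hypothetical projected object width and projected occluder samples (pixels).
    rng = np.random.default_rng(0)
    h_bar_samples = rng.uniform(20.0, 120.0, 10000)      # stand-in for p(h_bar)
    w_bar_samples = rng.uniform(20.0, 150.0, 10000)      # stand-in for p(w_bar)
    mu_w, mu_h, W_obj_bar = w_bar_samples.mean(), h_bar_samples.mean(), 100.0

    # The prior increases with height: the bottom of the object is more likely
    # to be occluded than the top.
    for y in (0.0, 40.0, 80.0, 160.0):
        print(y, occlusion_prior(y, W_obj_bar, mu_w, mu_h, h_bar_samples))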

3.3 Occlusion Conditional Likelihood

Most occlusion models only account for local coherency and the prior probability that a point on the object is occluded. Ideally, we want to compute a global relationship between all visibility states V on the object. While this is usually infeasible combinatorially, we show how a tractable approximation can be derived in the following.

Let X_{V-i} be the visible subset of X according to V-i. We want to compute the probability P(Vi|V-i, Oc) that a point Xi is visible given the visibility of X_{V-i}. Following Section 3.2, the conditional likelihood is given by:

   P(Vi|V-i, Oc) = A_{Vi,V-i,Oc} / A_{V-i,Oc}.   (10)

This computation involves estimating the area A_{V-i,Oc}, covering the set of block positions that occlude the object while keeping X_{V-i} visible, and the area A_{Vi,V-i,Oc}, covering the set of block positions that occlude the object while keeping both Xi and X_{V-i} visible.

We first consider the case where we condition on one visible point, Xj (i.e., X_{V-i} = {Xj}). To compute P(Vi|Vj, Oc), we already have A_{Vj,Oc} from (9), so we just need A_{Vi,Vj,Oc}. The computation follows from Section 3.2, so we omit the details and just provide the results below. The detailed derivation is provided in Appendix C. If we let βx(xi, xj) = ∫ min(w̄, |xi − xj|) · pw̄(w̄) dw̄, then:

   A_{Vi,Vj,Oc} = (W̄obj − |xi − xj|) · µh̄
                + ( ∫_0^{|xi−xj|} (|xi − xj| − w̄) · pw̄(w̄) dw̄ ) · µh̄
                + βx(xi, xj) · βy(yi) + µw̄ · βy(yj).   (11)

We can generalize the conditional likelihood to k visible points (i.e., |X_{V-i}| = k) by counting as above; however, the number of cases increases combinatorially. We make the approximation that the point Xj ∈ X_{V-i} with the highest conditional likelihood P(Vi|Vj, Oc) provides all the information about the visibility of Xi. This observation assumes that given Vj, the visibility of Xi is independent of the visibility of all other points (i.e., Vi ⊥ V-i \ Vj | Vj) and allows us to compute the global visibility relationship P(Vi|V-i, Oc) without combinatorial explosion. The approximation of P(Vi|V-i, Oc) is then:

   P(Vi|V-i, Oc) ≈ P(Vi|Vj*, Oc),   (12)
   Vj* = argmax_{Vj ∈ V-i} P(Vi|Vj, Oc).   (13)

For example, Fig. 4(d,e) shows the approximate conditional likelihood and the exact one for |X_{V-i}| = 2. Fig. 5 shows how the distribution changes under different camera viewpoints.
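
The max approximation of (12)-(13) is straightforward to implement; the sketch below (illustrative only, with toy probabilities) takes a matrix of pairwise conditionals P(Vi|Vj, Oc) and the current visibility labeling and returns the approximate conditional for every point.

    import numpy as np

    def conditional_visibility(pairwise, visible):
        # Eqs. (12)-(13): for each point i, keep the best single visible point j.
        # pairwise[i, j] holds P(V_i | V_j, O_c); `visible` is the labeling V_-i.
        masked = np.where(visible[None, :], pairwise, -np.inf)  # drop occluded j
        np.fill_diagonal(masked, -np.inf)                       # exclude j == i
        return masked.max(axis=1)

    # Toy example with 4 points; point 3 is hypothesized to be occluded.
    pairwise = np.array([[1.0, 0.9, 0.7, 0.6],
                         [0.9, 1.0, 0.8, 0.5],
                         [0.7, 0.8, 1.0, 0.4],
                         [0.6, 0.5, 0.4, 1.0]])
    visible = np.array([True, True, True, False])
    print(conditional_visibility(pairwise, visible))   # [0.9, 0.9, 0.8, 0.6]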

3.4 Arbitrary object silhouette

The above derivation can easily be relaxed to use the actual object silhouette. The idea is to subtract the area, A_{s}, covering the set of block positions that occlude the object bounding box but not the silhouette, from the areas described in Sections 3.2 and 3.3. Fig. 7 shows two example block positions. An algorithm to compute A_{s} is provided in Appendix D. The occlusion prior and conditional likelihood are then given by:

   P(Vi|Oc) = (A_{Vi,Oc} − A_{s}) / (A_{Oc} − A_{s}),   (14)
   P(Vi|V-i, Oc) = (A_{Vi,V-i,Oc} − A_{s}) / (A_{V-i,Oc} − A_{s}).   (15)


Fig. 5: The occlusion prior and conditional distribution under different camera viewpoints. We show the (left) model viewpoint, (middle) occlusion prior and (right) occlusion conditional likelihood.


Fig. 6: Examples of occlusion hypotheses. (a) For a true detection, the occluded points (red) are consistent with our model. (b) For a false positive, the top of the object is hypothesized to be occluded while the bottom is visible, which is highly unlikely according to our model.

Fig. 7: Using an arbitrary object silhouette. (left) Object. (right) Two blocks in red which occlude the object bounding box in gray, but not the silhouette in black.

4 OBJECT DETECTION

Given our occlusion model from Section 3, we augment an object detection system by (i) a bottom-up stage which hypothesizes occluded regions using the object detector, followed by (ii) a top-down stage which measures the consistency of the hypothesized occlusion with our model. We explore using the occlusion prior and occlusion conditional likelihood for scoring and show in our evaluation that both are informative for object detection. Initially, we assume that the hypothesized visibility labeling Vz is provided for a sliding window location z. We describe in detail in Section 5.3 how to obtain Vz for different instances of algorithms.

Given Vz, we want a metric of how well the occluded regions agree with our model. Intuitively, we should penalize points that are hypothesized to be occluded by the object detector but are highly likely to be visible according to our occlusion model. From this intuition, we propose the following detection score:

   scoref(Vz) = (1/N) Σ_{i=1}^{N} Vi^z − f(Vz),   (16)

where f(V) is a penalty function for occlusions. A higher score indicates a more confident detection, and for detections with no occlusion, the score is 1. For detections with occlusion, the penalty f(V) increases with the number of occluded points that are inconsistent with the model. In the following, we propose two penalty functions, fOPP(V) and fOCLP(V), based on the occlusion prior and occlusion conditional likelihood of Section 3.

4.1 Occlusion Prior Penalty

The occlusion prior penalty (OPP) gives a high penalty to locations that are hypothesized to be occluded but have a high prior probability P(Vi^z|Oc) of being visible. Intuitively, once the prior probability drops below some level λ, the point should be considered part of a valid occlusion and should not be penalized. This corresponds to a hinge loss function Γ(P, λ) = max((P − λ)/(1 − λ), 0). The linear penalty we use is then:

   fOPP(V) = (1/N) Σ_{i=1}^{N} [(1 − Vi) · Γ(P(Vi|Oc), λp)].   (17)

4.2 Occlusion Conditional Likelihood Penalty

The occlusion conditional likelihood penalty (OCLP), on the other hand, gives a high penalty to locations that are hypothesized to be occluded but have a high probability P(Vi|V-i, Oc) of being visible given the visibility labeling of all other points V-i. Using the same penalty function formulation as the occlusion prior penalty, we have:

   fOCLP(V) = (1/N) Σ_{i=1}^{N} [(1 − Vi) · Γ(P(Vi|V-i, Oc), λc)].   (18)
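
A compact sketch of the scoring in (16)-(18) is given below (again illustrative, not the authors' implementation); the visibility labeling and probabilities are toy values, and λ plays the role of λp or λc depending on which occlusion property is passed in.

    import numpy as np

    def hinge(p, lam):
        # Gamma(P, lambda) = max((P - lambda) / (1 - lambda), 0)
        return np.maximum((p - lam) / (1.0 - lam), 0.0)

    def detection_score(V, p_visible, lam):
        # Eq. (16) with the penalty of Eq. (17)/(18): V is the hypothesized
        # visibility labeling (1 = visible); p_visible holds P(V_i|O_c) for OPP
        # or P(V_i|V_-i,O_c) for OCLP.
        V = np.asarray(V, dtype=float)
        penalty = np.mean((1.0 - V) * hinge(np.asarray(p_visible), lam))
        return V.mean() - penalty

    # Toy labeling: 6 of 8 points visible; of the two occluded points, one is a
    # plausible occlusion (0.2) and one is implausible (0.99).
    V = [1, 1, 1, 1, 1, 1, 0, 0]
    p = [0.9, 0.9, 0.8, 0.9, 0.7, 0.8, 0.2, 0.99]
    print(detection_score(V, p, lam=0.95))   # OCLP-style score: 0.75 - 0.1 = 0.65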

5 EVALUATION

In order to evaluate our occlusion model's performance for object instance detection, two sets of experiments were conducted: the first for a single view of an object and the second for multiple views of an object.


Fig. 8: Example detection results under severe occlusions in cluttered household environments.

While in practice one would only detect objects under multiple views, it is important to tease apart the effect of occlusion from the effect of viewpoint.

In each set of experiments, we explore the benefits of (i) using only the bottom-up stage and (ii) incorporating prior knowledge of occlusions with the top-down stage. When evaluating the bottom-up stage, we hypothesize the occluded region and consider the score of only the visible portions of the detection. This score is equivalent to the first term of (16).

The parameters of our occlusion model were calibrated on images not in the dataset and were kept the same for all objects and all experiments. The occlusion parameters were set to λp = 0.5 and λc = 0.95. We show in Section 5.8 that our model is not sensitive to the exact choice of these parameters.

5.1 CMU Kitchen Occlusion Dataset (CMU KO8)

Many object recognition algorithms work well in controlled scenes, but fail when faced with real-world conditions exhibiting strong viewpoint and illumination changes, occlusions and clutter. Current datasets for object detection under multiple viewpoints either contain objects on simple backgrounds [40] or have minimal occlusions [6], [8]. For evaluation under a more natural setting, the dataset we collected consists of common household objects in real, cluttered environments under various levels of occlusion. Our dataset contains 1600 images of 8 objects and is split evenly into two parts: 800 for a single view of an object and 800 for multiple views of an object. The single-view part contains ground truth labels of the occlusions, and Fig. 9 shows that our dataset contains roughly equal amounts of partial occlusion (1-35%) and heavy occlusion (35-80%) as defined by [17], making this dataset very challenging.

For multiple-view evaluation, we focus our viewpoint variation primarily on the elevation angle, as occlusion patterns for different azimuth angles are similar. While it may be harder to recognize certain objects for certain azimuth angles, we are focused only on the relative performance change from using an occlusion model. For our experiments, we use 25 model images for each object, which is the same sampling density as [6].


Fig. 9: Dataset occlusion statistics. Our dataset contains roughly equal amounts of partial occlusions (1-35%) and heavy occlusions (35-80%).


Fig. 10: Distribution of (left) heights and (right) length and width of occluders in household environments.

Each model image was collected with a calibration pattern to ground truth the camera viewpoint (ψ, θ) and to rectify the object silhouette to be upright. The test data was collected by changing the camera viewpoint and the scene around a stationary object. A calibration pattern was used to ground truth the position of the object.

5.2 Validity of Occlusion Model

To derive the occlusion probabilities, we approximated occluder objects to be bounding boxes which are on the same surface as the object. While this approximation is consistent with the occlusion types observed by Dollar et al. in [17], we further validate the approximation on our dataset. Given the groundtruth occlusion labels in the dataset, we consider an occluded pixel to be consistent with the approximation if there are no un-occluded object pixels below it. From Fig. 11, for 80% of the images, over 90% of the occluded pixels are consistent.

5.3 Algorithms

We validate our approach by incorporating occlusion reasoning with two state-of-the-art methods for instance detection under arbitrary viewpoint, LINE2D [6] and Gradient Networks (GN) [5]. For fair comparison, we use the same M sampled edge points xi for all the methods. These points are specified relative to the template center, z. We give a brief description of the algorithms in our comparison below.

5.3.1 LINE2D [6]

The LINE2D method is a current state-of-the-art system for instance detection under arbitrary viewpoint.


Fig. 11: Validity of occlusion model. (left) We show in blue the occluded pixels which satisfy our approximation, and in red those that do not. (right) For each object instance in the dataset, we evaluate the percentage of occluded pixels which satisfy our approximation that they can be explained by a bounding box with a base lower than the object. For 80% of the images, over 90% of the occluded pixels can be explained with our model.

It represents an object by a template of sampled edge points, each with a quantized orientation. For every scanning window location, a similarity score is computed between the gradient of each model point and the image. In [6], the gradient orientations are quantized into 8 orientation bins, and the similarity for point xi is the cosine of the smallest quantized orientation difference, ∆θi, between its orientation and the image orientations in a 7 × 7 neighborhood of xi + z. The score of a window is Σ_{i=1}^{M} cos(∆θi). We kept all other parameters for LINE2D the same as [6]. We tested our implementation on a subset of the dataset provided by the authors of [6] and observed negligible difference in performance.

5.3.2 robust-LINE2D (rLINE2D) [24]

Since the LINE2D method returns continuous values, we need to threshold it to obtain the occlusion hypothesis. We consider a point to be occluded if the image gradient and the model gradient have different quantized orientations. Fig. 6 shows example occlusion hypotheses. This produces a binary descriptor for each scanning window location. The score of a window is Σ_{i=1}^{M} δ(∆θi = 0), where δ(t) = 1 if t is true. We refer to this method as robust-LINE2D (rLINE2D). In the evaluation, we show that simply binarizing the descriptor with rLINE2D significantly outperforms LINE2D in cluttered scenes with severe occlusions.
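
The rLINE2D scoring and occlusion hypothesis can be sketched as follows (a simplified illustration: for brevity it compares one precomputed quantized orientation per point instead of searching the 7 × 7 neighborhood described above, and the orientation bins are toy values).

    import numpy as np

    def rline2d_score(model_bins, image_bins):
        # A point counts only if its quantized orientation (0..7) matches the
        # image; non-matching points form the occlusion hypothesis V_z.
        match = np.asarray(model_bins) == np.asarray(image_bins)
        V_z = match.astype(int)        # 1 = hypothesized visible
        return int(match.sum()), V_z

    # Toy template of 10 edge points: 7 orientations agree, 3 do not.
    model_bins = np.array([0, 1, 2, 3, 4, 5, 6, 7, 0, 1])
    image_bins = np.array([0, 1, 2, 3, 4, 5, 6, 2, 3, 4])
    score, V_z = rline2d_score(model_bins, image_bins)
    print(score, V_z)   # 7 [1 1 1 1 1 1 1 0 0 0]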

5.3.3 Gradient Networks (GN) [5]

The LINE2D and rLINE2D methods only consider local gradient information when matching the shape. Since these approaches do not account for edge connectivity, this often results in misclassification of occlusion in cluttered regions where the local gradient orientation matches accidentally. To address this issue, our Gradient Networks method captures contour connectivity directly on low-level image gradients. For each image pixel, the algorithm estimates the probability that it matches a template shape. Our results in [5] show significant improvement in shape matching and object detection in natural scenes with severe clutter and occlusions.


TABLE 1: Single view. Average precision.

Object      LINE2D   rLINE2D   rLINE2D+OPP   rLINE2D+OCLP   GN     GN+OPP   GN+OCLP
bakingpan   0.12     0.23      0.27          0.36           0.44   0.51     0.61
colander    0.26     0.60      0.66          0.74           0.76   0.75     0.80
cup         0.25     0.64      0.65          0.75           0.84   0.84     0.89
pitcher     0.14     0.66      0.69          0.76           0.71   0.64     0.77
saucepan    0.12     0.51      0.52          0.53           0.85   0.83     0.85
scissors    0.09     0.21      0.21          0.26           0.42   0.41     0.45
shaker      0.05     0.36      0.48          0.59           0.47   0.52     0.64
thermos     0.24     0.56      0.68          0.68           0.73   0.79     0.76
Mean        0.16     0.47      0.52          0.58           0.65   0.66     0.72

TABLE 2: Multiple view. Average precision.

Object      LINE2D   rLINE2D   rLINE2D+OPP   rLINE2D+OCLP   GN     GN+OPP   GN+OCLP
bakingpan   0.06     0.12      0.12          0.15           0.78   0.61     0.79
colander    0.19     0.55      0.58          0.67           0.72   0.71     0.79
cup         0.13     0.44      0.46          0.50           0.85   0.86     0.87
pitcher     0.08     0.22      0.32          0.34           0.53   0.60     0.67
saucepan    0.09     0.40      0.42          0.41           0.90   0.88     0.89
scissors    0.05     0.20      0.20          0.26           0.72   0.68     0.81
shaker      0.08     0.16      0.25          0.25           0.41   0.50     0.52
thermos     0.08     0.30      0.44          0.48           0.66   0.74     0.81
Mean        0.09     0.30      0.35          0.38           0.70   0.70     0.77

The GN algorithm returns a shape similarity ΩS for each pixel given the template S(z). For fair comparison, we apply a 7 × 7 max spatial filter (i.e., equivalent to LINE2D) to ΩS, resulting in Ω̄S. The template score at z is then Σ_{i=1}^{M} Ω̄S(xi + z). We use the soft shape model with normalized gradient magnitudes for the edge potential, and both color and orientation for the appearance. Since the similarity measure Ω̄S(xi + z) is calibrated so that it can be interpreted as a probability, we generate the occlusion hypothesis by thresholding this value at 0.5.

Our algorithm takes on average 2 ms per location z on a 3GHz Core2 Duo CPU. In practice, we run our algorithm only at the hypothesis detections of rLINE2D [24]. The combined computation time is about 1 second per image.

5.4 Distribution of Occluder Sizes

The distribution of object sizes varies in different environments. For a particular scenario, it is natural to only consider objects as occluders if they appear in that environment. The statistics of objects can be obtained from the Internet [41] or, in the household scenario, simply from 100 common household items. Fig. 10 shows the distributions for household objects.

From real world dimensions, we can compute the projected width and height distributions, pw̄(w̄) and ph̄(h̄), for a given camera viewpoint. The projected width distribution is the same for all viewpoints and is obtained by computing the probability density from (2) for each pair of width and length measurements. These densities are discretized and averaged to give the final distribution of w̄.

The projected height distribution, on the other hand, depends on the elevation angle ψ. From (3), h̄ is a factor cos ψ of h. Thus, the projected height distribution, ph̄(h̄), is computed by subsampling ph(h) by cos ψ.
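
As a sketch of how the projected distributions can be discretized from a table of physical dimensions (the dimensions below are hypothetical, and a uniform grid over θ stands in for averaging the densities of (2)):

    import numpy as np

    def projected_dim_samples(dims, psi, n_theta=180):
        # dims = [(w, l, h), ...] physical occluder dimensions.  Widths are
        # projected under rotations spanning [0, pi/2] (Eq. 1); heights are
        # scaled by cos(psi) (Eq. 3).
        dims = np.asarray(dims, dtype=float)
        theta = np.linspace(0.0, np.pi / 2.0, n_theta)
        w_bar = (dims[:, 0:1] * np.abs(np.cos(theta)) +
                 dims[:, 1:2] * np.abs(np.sin(theta))).ravel()
        h_bar = dims[:, 2] * np.cos(psi)
        return w_bar, h_bar

    # Hypothetical household occluder dimensions in cm (width, length, height).
    dims = [(10, 10, 15), (6, 20, 8), (12, 18, 25), (9, 9, 30)]
    w_bar, h_bar = projected_dim_samples(dims, psi=np.deg2rad(30.0))
    mu_w, mu_h = w_bar.mean(), h_bar.mean()   # expected values used in Eqs. (8)-(9)
    print(mu_w, mu_h)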

5.5 Single view

We first evaluate the performance for single view object detection. An object is correctly detected if the intersection-over-union (IoU) of the predicted bounding box and the ground truth bounding box is greater than 0.5. Each object is evaluated on all 800 images in this part of the dataset, and Fig. 12 shows the precision-recall plot. To summarize the performance, we report the Average Precision in Table 1. A few example detections are shown in Fig. 8.

From the table, the rLINE2D method already significantly outperforms the baseline LINE2D method. One issue with the LINE2D gradient similarity metric (i.e., cosine of the orientation difference) is that it gives high score even to orientations that are very different, resulting in false positives with high scores. The rLINE2D metric of considering only points with the same quantized orientation is more robust to background clutter in the presence of occlusions. However, since rLINE2D only considers information very locally around each edge point, there are still many misclassifications in background clutter. The Gradient Network method performs substantially better than rLINE2D by incorporating edge connectivity directly on low-level gradients. This results in an improvement of 18% in average precision.

When rLINE2D is augmented with occlusion reasoning, there is an absolute improvement of 5% for OPP and 11% for OCLP.
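
The detection criterion above is the standard bounding-box overlap test; a minimal sketch (with made-up boxes) is:

    def iou(box_a, box_b):
        # Intersection-over-union of two (x1, y1, x2, y2) boxes; a detection is
        # counted correct when IoU with the ground truth exceeds 0.5.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    print(iou((0, 0, 100, 100), (50, 0, 150, 100)))   # 1/3 -> not a correct detection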


Fig. 12: Precision-recall results for single view. There is significant improvement in performance by using occlusion reasoning.


Fig. 13: Precision-recall results for multiple views. The overall performance is lower than in the single view experiments due to more false positives, but importantly, we observe similar gains from using our occlusion reasoning.

For GN, there is a corresponding improvement of 1% and 7%. This indicates that both occlusion properties are informative for object detection. The disparity between the gains of OPP and OCLP suggests that accounting for global occlusion layout by OCLP is more informative than considering the a priori occlusion probability of each point individually by OPP. In particular, OCLP improves over OPP when one side of the object is completely occluded, as shown in Fig. 14. Although the top of the object is validly occluded, OPP assigns a high penalty. By over-penalizing true detections, OPP often makes the performance worse. OCLP, on the other hand, always performs at least on par with the baseline methods and usually performs substantially better.


Fig. 14: A typical case where OCLP performs better than OPP. (left) For OPP, the false positives in red have higher scores than the true detection in green. The occluded region at the top of the true detection is over-penalized. (right) For OCLP, the true detection is the top detection.


Fig. 15: Performance under different occlusion levels. While our methods improve performance under all levels of occlusion, we see larger gains under heavy occlusions.

Fig. 15 shows the performance under different levels of occlusion. Here, the detection rate is the percentage of top detections which are correct. Our occlusion reasoning improves object detection under both low (0-35%) and high (>35%) levels of occlusion, but provides significantly larger gains for heavy occlusions. For unoccluded objects, we verified that the occlusion model did not degrade performance.

5.6 Multiple views

Next we evaluate the performance for object detection under multiple views. Fig. 13 shows the precision-recall plots and Table 2 reports the Average Precision. Again, we obtain significant improvement gains over the LINE2D system. The performance is lower for rLINE2D due to more false positives from increasing the number of templates, but the relative gains at 5% for OPP and 8% for OCLP are similar to the single view case. GN, on the other hand, is a much more robust template matching technique. With more templates, GN is able to find better aligned viewpoints while keeping a low false positive rate, leading to increased performance over the single view case. We again see similar gains for OCLP at 7%. This demonstrates that our model is effective for representing occlusions under arbitrary view.

Fig. 16: Typical failure cases of OCLP. (left) The pitcher is occluded by the handle of the pot, which is not accurately modeled by a block. (right) The scissors are occluded by a plastic bag resting on top of them. In these cases, OCLP over-penalizes the detections.

Fig. 6 shows a typical false positive that can only be filtered by our occlusion reasoning. Although a majority of the points match well and the missing parts are largely coherent, the detection is not consistent with our occlusion model and is thus penalized and filtered.

Fig. 16 shows a couple of failure cases where our assumptions are violated. In the first image, the pot occluding the pitcher is not accurately modeled by its bounding box. In the second image, the occluding object rests on top of the scissors. Even though we do not handle these types of occlusions, our model represents the majority of occlusions and is thus able to increase the overall detection performance.

5.7 Learning From Data

To verify that our model accurately represents occlusions in real world scenes, we rerun the above experiments with occlusion priors and conditional likelihoods learned from data. We use the detailed groundtruth occlusion masks in the single view portion of the dataset to obtain the empirical distributions. Fig. 17 compares learning the occlusion prior and occlusion conditional likelihood using 10, 20, 40 and 80 images with our analytical model. The distributions from the empirical and analytical models are very similar.

We use 5-fold cross-validation for quantitative evaluation, and Fig. 18 shows the results using different numbers of images for learning. The learned occlusion properties, lOPP and lOCLP, correspond to their explicit counterparts, OPP and OCLP. The learned occlusion prior, lOPP, performs slightly better than OPP. This is a result of the slightly different distribution seen in Fig. 17, where the sides of the object are more likely to be occluded in the dataset. The learned occlusion conditional likelihood, lOCLP, performs essentially the same as OCLP, but requires 80 images for every view of every object to achieve the same level of performance.

5.8 Parameter Sensitivity

The two parameters of our occlusion reasoning approach are λp and λc, corresponding to the hinge loss parameters for OPP and OCLP.


Fig. 17: Learning the occlusion distributions from data. From left to right, columns 1-4 show using 10, 20, 40, and 80 images for learning the occlusion prior (top) and the occlusion conditional likelihood (bottom). The last column shows the distribution of our analytical model.


Fig. 18: Learning the occlusion prior and conditional likelihood using groundtruth occlusion segmentations from the single view portion of the dataset. The dotted lines show the performance of our analytic approach, which does not depend on the number of training images. We show the learned occlusion properties, lOPP (red) and lOCLP (blue), corresponding to OPP and OCLP. lOPP needs about 20 images to perform slightly better than OPP. lOCLP needs about 60 to 80 images to achieve the same level of performance as OCLP.

To verify that our approach is not sensitive to the exact choice of these parameters, we evaluated the performance of the occlusion reasoning for a range of parameter values. Fig. 19 shows the sensitivity of λp and λc when augmenting both rLINE2D and GN.


Fig. 19: Parameter sensitivity for (left) OPP and (right) OCLP. The last data point for OCLP is plotted at λc = 0.999, which penalizes only points that the model believes are definitely visible (i.e., P(Vi|V-i, Oc) = 1). The exact choice of the parameter does not affect the performance of the methods significantly.

From the figure, the performance is relatively constant for a wide range of parameter values and is thus robust to the exact choice.

6 CONCLUSION

The main contribution of this paper is to demonstrate that a simple model of 3D interaction of objects can be used to represent occlusions effectively for object detection under arbitrary viewpoint without requiring additional training data. We propose a tractable method to capture global visibility relationships and show that it is more informative than the typical a priori probability of a point being occluded. Our results on a challenging dataset of texture-less objects under severe occlusions demonstrate that our approach can significantly improve object detection performance.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation under ERC Grant No. EEEC-0540865.


REFERENCES

[1] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, 2004.
[2] A. Toshev, B. Taskar, and K. Daniilidis, “Object detection via boundary structure segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[3] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, “Groups of adjacent contour segments for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
[4] V. Ferrari, T. Tuytelaars, and L. Van Gool, “Object detection by contour segment networks,” in Proceedings of European Conference on Computer Vision, 2006.
[5] E. Hsiao and M. Hebert, “Gradient networks: Explicit shape matching without extracting edges,” in Proceedings of AAAI Conference on Artificial Intelligence, 2013.
[6] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit, “Gradient response maps for real-time detection of texture-less objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[7] E. Hsiao, A. Collet, and M. Hebert, “Making specific features less discriminative to improve point-based 3d object recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[8] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” in Proceedings of IEEE International Conference on Robotics and Automation, 2011.
[9] T. Gao, B. Packer, and D. Koller, “A segmentation-aware object detection model with occlusion handling,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[10] X. Wang, T. Han, and S. Yan, “An hog-lbp human detector with partial occlusion handling,” in Proceedings of IEEE International Conference on Computer Vision, 2009.
[11] H. Plantinga and C. Dyer, “Visibility, occlusion, and the aspect graph,” International Journal of Computer Vision, 1990.
[12] W. Grimson, T. Lozano-Perez, and D. Huttenlocher, Object recognition by computer. MIT Press, 1990.
[13] M. Stevens and J. Beveridge, Integrating Graphics and Vision for Object Recognition. Kluwer Academic Publishers, 2000.
[14] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester, “Object detection with grammar models,” in Proceedings of Neural Information Processing Systems, 2011.
[15] D. Meger, C. Wojek, B. Schiele, and J. J. Little, “Explicit occlusion reasoning for 3d object detection,” in Proceedings of British Machine Vision Conference, 2011.
[16] R. Fransens, C. Strecha, and L. Van Gool, “A mean field em-algorithm for coherent occlusion handling in map-estimation problems,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[17] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[18] S. Kwak, W. Nam, B. Han, and J. H. Han, “Learn occlusion with likelihoods for visual tracking,” in Proceedings of IEEE International Conference on Computer Vision, 2011.
[19] H. Su, M. Sun, L. Fei-Fei, and S. Savarese, “Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories,” in Proceedings of IEEE International Conference on Computer Vision, 2009.
[20] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, B. Schiele, and L. Van Gool, “Towards multi-view object class detection,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[21] S. Bao, M. Sun, and S. Savarese, “Toward coherent object detection and scene layout understanding,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[22] D. Hoiem, A. Efros, and M. Hebert, “Putting objects in perspective,” International Journal of Computer Vision, 2008.
[23] S. Hinterstoisser, V. Lepetit, S. Ilic, P. Fua, and N. Navab, “Dominant orientation templates for real-time detection of texture-less objects,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[24] E. Hsiao and M. Hebert, “Occlusion reasoning for object detection under arbitrary viewpoint,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[25] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, “Part-based multiple-person tracking with partial occlusion handling,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[26] A. Ess, K. Schindler, B. Leibe, and L. Van Gool, “Improved multi-person tracking with active occlusion handling,” in Proceedings of IEEE International Conference on Robotics and Automation, Workshop on People Detection and Tracking, 2009.
[27] J. Xing, H. Ai, and S. Lao, “Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[28] A. Kowdle, A. Gallagher, and T. Chen, “Revisiting depth layers from occlusions,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[29] S. Tang, M. Andriluka, and B. Schiele, “Detection and tracking of occluded people,” in Proceedings of British Machine Vision Conference, 2012.
[30] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[31] A. Vedaldi and A. Zisserman, “Structured output regression for detection with partial truncation,” in Proceedings of Neural Information Processing Systems, 2009.
[32] L. Sigal and M. J. Black, “Measure locally, reason globally: Occlusion-sensitive articulated pose estimation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[33] B. Wu and R. Nevatia, “Detection and segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses,” International Journal of Computer Vision, vol. 82, no. 2, pp. 185–204, 2009.
[34] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[35] C. Wojek, S. Walk, S. Roth, and B. Schiele, “Monocular 3d scene understanding with explicit occlusion reasoning,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[36] B. Pepik, M. Stark, P. Gehler, and B. Schiele, “Occlusion patterns for object class detection,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[37] M. Z. Zia, M. Stark, and K. Schindler, “Explicit occlusion modeling for 3D object class representations,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[38] T. Wang, X. He, and N. Barnes, “Learning structured Hough voting for joint object detection and occlusion reasoning,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[39] L. Santalo, Integral geometry and geometric probability. Addison-Wesley Publishing Co., Reading, MA, 1976.
[40] M. Sun, G. Bradski, B. Xu, and S. Savarese, “Depth-encoded hough voting for joint object detection and shape recovery,” in Proceedings of European Conference on Computer Vision, 2010.
[41] J. F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Criminisi, “Photo clip art,” in ACM SIGGRAPH, 2007.
[42] R. V. Hogg and E. Tanis, Probability and Statistical Inference, 8th ed. Pearson, 2009.

Edward Hsiao received the BS degree with honor in electrical engineering from the California Institute of Technology in 2008. He received the PhD degree in robotics from The Robotics Institute at Carnegie Mellon University in 2013. He was the recipient of the National Science Foundation Graduate Research Fellowship. His current research interests in computer vision include recognizing texture-less objects under arbitrary viewpoint in cluttered scenes and increasing the robustness of object detection under occlusions. He is a student member of the IEEE.


Martial Hebert is a professor at The Robotics Institute, Carnegie Mellon University. His current research interests include object recognition in images, video, and range data, scene understanding using context representations, and model construction from images and 3D data. His group has explored applications in the areas of autonomous mobile robots, both in indoor and in unstructured, outdoor environments, automatic model building for 3D content generation, and video monitoring. He has served on the program committees of the major conferences in the computer vision area. He is a member of the IEEE.


APPENDICES

Many of the results derived for the Occlusion Prior and Occlusion Conditional Likelihood are from the classic field of integral geometry [39]. In the following, we show the detailed derivations.

APPENDIX A
PROBABILITY DENSITY OF $\bar{w}$

We show how to transform a uniform variable over a $\frac{\pi}{2}$ interval by (1) using the distribution function technique [42]. First, let us simplify the equation for the projected width:

$\bar{w}(\theta) = w \cos\theta + l \sin\theta$  (19)

$= \sqrt{w^2 + l^2} \left[ \frac{w}{\sqrt{w^2 + l^2}} \cos\theta + \frac{l}{\sqrt{w^2 + l^2}} \sin\theta \right]$  (20)

$= \sqrt{w^2 + l^2} \left[ \cos\!\left(\tan^{-1}\frac{l}{w}\right) \cos\theta + \sin\!\left(\tan^{-1}\frac{l}{w}\right) \sin\theta \right]$  (21)

$= \sqrt{w^2 + l^2} \cos\!\left[\theta - \tan^{-1}\!\left(\frac{l}{w}\right)\right].$  (22)
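As a quick numerical sanity check (not part of the original derivation), the short snippet below verifies the identity in (19)-(22); the block dimensions w and l are arbitrary test values:

```python
import numpy as np

# Check that w*cos(theta) + l*sin(theta) equals
# sqrt(w^2 + l^2) * cos(theta - atan(l/w)), as in (19)-(22).
w, l = 0.3, 0.8                                   # arbitrary test dimensions
theta = np.linspace(0.0, np.pi / 2, 200)
lhs = w * np.cos(theta) + l * np.sin(theta)
rhs = np.sqrt(w**2 + l**2) * np.cos(theta - np.arctan2(l, w))
assert np.allclose(lhs, rhs)                      # identity holds to machine precision
```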

Since the transformation over any $\frac{\pi}{2}$ interval is equivalent, the shift by $\tan^{-1}\left(\frac{l}{w}\right)$ is irrelevant. For simplicity of derivation, consider the interval $[-\theta_1, \theta_2]$, where $\theta_1 = \cos^{-1}\left(\frac{l}{\sqrt{w^2+l^2}}\right)$ and $\theta_2 = \cos^{-1}\left(\frac{w}{\sqrt{w^2+l^2}}\right)$. We define the random variable $\Theta$ to have a uniform density over this interval:

$p_\Theta(\theta) = \begin{cases} 2/\pi, & -\theta_1 \le \theta \le \theta_2 \\ 0, & \text{else.} \end{cases}$  (23)

To compute the probability density of $\bar{w}$, we apply the transformation $g(\theta) = \bar{w}(\theta)/\sqrt{w^2+l^2} = \cos\theta$ to $\Theta$ to produce the random variable $Y$ (i.e., $Y = g(\Theta)$). The distribution function technique calculates the density $p_Y(y)$ of $Y$ by first finding the cumulative distribution function $P_Y(y)$ and then taking the derivative. There are two cases.

Case 1: $\cos\theta_2 < y < \cos\theta_1$

$P_Y(y) = \int_{\cos^{-1} y}^{\theta_2} \frac{2}{\pi}\, d\theta$  (24)

$= -\frac{2}{\pi}\cos^{-1} y + \frac{2}{\pi}\theta_2$  (25)

$p_Y(y) = \frac{2}{\pi\sqrt{1-y^2}}$  (26)

Case 2: $\cos\theta_1 < y < 1$

$P_Y(y) = \int_{-\theta_1}^{-\cos^{-1} y} \frac{2}{\pi}\, d\theta + \int_{\cos^{-1} y}^{\theta_2} \frac{2}{\pi}\, d\theta$  (27)

$= -\frac{4}{\pi}\cos^{-1} y + \frac{2}{\pi}(\theta_1 + \theta_2)$  (28)

$p_Y(y) = \frac{4}{\pi\sqrt{1-y^2}}$  (29)

Thus we have that:

$p_Y(y) = \begin{cases} \frac{2}{\pi\sqrt{1-y^2}}, & \cos\theta_2 < y < \cos\theta_1 \\ \frac{4}{\pi\sqrt{1-y^2}}, & \cos\theta_1 < y < 1 \end{cases}$  (30)

Substituting in $\theta_1$, $\theta_2$ and $y$, we obtain the probability density function $p_{\bar{w}}(\bar{w})$ in (2).
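As an aside (not in the original text), the density in (30) can be checked empirically: sample $\Theta$ uniformly on $[-\theta_1, \theta_2]$, set $Y = \cos\Theta$, and compare the probability mass of each branch against the sampled frequencies. The minimal sketch below assumes that interval; the dimensions w, l, the sample size, and the random seed are arbitrary choices of the sketch:

```python
import numpy as np

# Monte Carlo check of the branch masses implied by (30):
#   P(cos(theta1) < Y < 1)           = (4/pi) * theta1
#   P(cos(theta2) < Y < cos(theta1)) = (2/pi) * (theta2 - theta1)
# where Y = cos(Theta) and Theta is uniform on [-theta1, theta2].
w, l = 0.3, 0.8                                    # arbitrary block dimensions
theta1 = np.arccos(l / np.hypot(w, l))             # = arctan(w / l)
theta2 = np.arccos(w / np.hypot(w, l))             # = arctan(l / w)

rng = np.random.default_rng(0)
theta = rng.uniform(-theta1, theta2, size=1_000_000)
y = np.cos(theta)

print(np.mean(y > np.cos(theta1)), 4 / np.pi * theta1)              # upper branch
print(np.mean(y <= np.cos(theta1)), 2 / np.pi * (theta2 - theta1))  # lower branch
```

The two printed pairs should agree to within Monte Carlo error.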

APPENDIX B
OCCLUSION PRIOR

We show how to compute the area $A_{V_i,O_c}$ covering all the possible positions of the red block in Fig. 3a which occlude the object but keep the point $X_i$ visible. This region is specified in green in Fig. 20. To compute the area $A_{V_i,O_c}$, we break it up into the area of three parts:

$A_{V_i,O_c} = \Delta_{1,1} + \Delta_{1,2} + \Delta_2.$  (31)

From the figure, we can see that the purple region has area:

$\Delta_{1,1} + \Delta_{1,2} = W_{obj} \cdot \bar{h}.$  (32)

Note that this area does not depend on the position of $X_i$. On the other hand, the area of the yellow region does depend on the $y$ coordinate of $X_i$. If the projected height $\bar{h}$ of the block is shorter than $y_i$, the height of the yellow region is $\bar{h}$. If it is taller, there are fewer possible positions of the red block and the height of the yellow region is $y_i$. Thus its area is:

$\Delta_2 = \begin{cases} \bar{w} \cdot \bar{h}, & \bar{h} \le y_i \\ \bar{w} \cdot y_i, & \bar{h} > y_i \end{cases}$  (33)

Combining (32) and (33), we get the area $A_{V_i,O_c}$ in (7).
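For concreteness, (31)-(33) can be transcribed directly into a few lines of code. The sketch below is illustrative only; the function and argument names are ours, and all quantities are assumed to be expressed in the same image units:

```python
def area_vi_oc(W_obj, w_bar, h_bar, y_i):
    """Area A_{Vi,Oc} of occluder positions that occlude the object while
    keeping point X_i visible, following (31)-(33).

    W_obj: projected object width; (w_bar, h_bar): projected occluder width
    and height; y_i: height of X_i above the bottom of the object.
    """
    purple = W_obj * h_bar              # Delta_{1,1} + Delta_{1,2}, eq. (32)
    yellow = w_bar * min(h_bar, y_i)    # Delta_2, eq. (33)
    return purple + yellow
```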

APPENDIX C
OCCLUSION CONDITIONAL LIKELIHOOD

We show how to compute the area $A_{V_i,V_j,O_c}$ covering all the possible positions of the red block in Fig. 3a which occlude the object but keep both points $X_i$ and $X_j$ visible. Without loss of generality, assume that $X_i$ is lower than $X_j$ (i.e., $y_i \le y_j$). If this is not the case, we can simply switch the points. The region is specified in green in Fig. 21. To compute the area $A_{V_i,V_j,O_c}$, we break it up into the area of five parts:

$A_{V_i,V_j,O_c} = \Lambda_{1,1} + \Lambda_{1,2} + \Lambda_2 + \Lambda_3 + \Lambda_4$  (34)

From the figure, we can see that the purple region has area:

$\Lambda_{1,1} + \Lambda_{1,2} = (W_{obj} - |x_i - x_j|) \cdot \bar{h}$  (35)


Fig. 20: Detailed illustration of how to compute the area $A_{V_i,O_c}$ covering all the possible positions where the red block in Fig. 3a occludes the object while keeping $X_i$ visible.

Fig. 21: Detailed illustration of how to compute the area $A_{V_i,V_j,O_c}$ covering all the possible positions where the red block in Fig. 3a occludes the object while keeping both $X_i$ and $X_j$ visible.

The area of the yellow region is the same as for the occlusion prior, and it does not depend on the location of $X_j$:

$\Lambda_2 = \begin{cases} \bar{w} \cdot \bar{h}, & \bar{h} \le y_i \\ \bar{w} \cdot y_i, & \bar{h} > y_i \end{cases}$  (36)

The blue region covers the positions of the block which can fit in between $X_i$ and $X_j$. If the projected width $\bar{w}$ is greater than $|x_i - x_j|$, the block cannot fit in this region and the area is 0. However, if it is less than $|x_i - x_j|$, the width of the blue region is $|x_i - x_j| - \bar{w}$. Thus the area is:

$\Lambda_3 = \begin{cases} (|x_i - x_j| - \bar{w}) \cdot \bar{h}, & \bar{w} \le |x_i - x_j| \\ 0, & \bar{w} > |x_i - x_j| \end{cases}$  (37)

The orange region covers the positions of the block that are below $X_j$. When the projected width $\bar{w}$ is less than $|x_i - x_j|$, the computation of the area is similar to $\Lambda_2$. However, when it is greater, the possible horizontal positions of the block are restricted by point $X_i$; in this case, instead of $\bar{w}$ positions, there are only $|x_i - x_j|$ positions. Thus the area is:

$\Lambda_4 = \begin{cases} \bar{w} \cdot \bar{h}, & \bar{w} \le |x_i - x_j|,\ \bar{h} \le y_j \\ \bar{w} \cdot y_j, & \bar{w} \le |x_i - x_j|,\ \bar{h} > y_j \\ |x_i - x_j| \cdot \bar{h}, & \bar{w} > |x_i - x_j|,\ \bar{h} \le y_j \\ |x_i - x_j| \cdot y_j, & \bar{w} > |x_i - x_j|,\ \bar{h} > y_j. \end{cases}$  (38)

Combining (35), (36), (37), (38) and simplifying, we get:

$A_{V_i,V_j,O_c} = (W_{obj} - |x_i - x_j|) \cdot \bar{h} + \bar{w} \cdot \min(\bar{h}, y_i) + \delta(\bar{w} \le |x_i - x_j|) \cdot (|x_i - x_j| - \bar{w}) \cdot \bar{h} + \min(\bar{w}, |x_i - x_j|) \cdot \min(\bar{h}, y_j)$  (39)

Integrating over the projected width and height distributions $p_{\bar{w}}$ and $p_{\bar{h}}$, we get the average area in (11).
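The combined expression (39) admits a similar direct transcription. The sketch below assumes $y_i \le y_j$, as in the derivation; the names are ours and purely illustrative:

```python
def area_vi_vj_oc(W_obj, w_bar, h_bar, x_i, y_i, x_j, y_j):
    """Area A_{Vi,Vj,Oc} of occluder positions that occlude the object while
    keeping both X_i and X_j visible, following (34)-(39).
    Assumes y_i <= y_j; swap the two points otherwise.
    """
    dx = abs(x_i - x_j)
    area = (W_obj - dx) * h_bar                # Lambda_{1,1} + Lambda_{1,2}, eq. (35)
    area += w_bar * min(h_bar, y_i)            # Lambda_2, eq. (36)
    if w_bar <= dx:                            # Lambda_3, eq. (37)
        area += (dx - w_bar) * h_bar
    area += min(w_bar, dx) * min(h_bar, y_j)   # Lambda_4, eq. (38)
    return area
```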

APPENDIX D
COMPUTING AREA FOR SILHOUETTE

We show how to compute the area of the silhouette $A_s$ used in (14) and (15). Given a mask $M$, we extract the height of the lowest point $Y_M(x)$ relative to the bottom of the mask for each unique position $x \in X_M$.


Algorithm 1 Mask Sliding Min, $\Omega(X_M, Y_M, \bar{w}, \bar{h})$
Require: bottom of mask $(X_M, Y_M)$, projected width $\bar{w}$, projected height $\bar{h}$
1: $Y_M = \min(Y_M, \bar{h})$
2: for $x = \min(X_M) - \frac{\bar{w}}{2} \to \max(X_M) + \frac{\bar{w}}{2}$ do
3:   $Z(x) = \min_{x' \in [x - \bar{w}/2,\ x + \bar{w}/2]} Y_M(x')$
4: end for
5: return $Z$

For an occluder with projected width and height $(\bar{w}, \bar{h})$, the area covering all the positions that intersect the bounding box but not the silhouette is given by

$A_s = \int \Omega(X_M, Y_M, \bar{w}, \bar{h}) \cdot dx,$  (40)

where $\Omega(X_M, Y_M, \bar{w}, \bar{h})$ is the Mask Sliding Min shown in Algorithm 1. This function computes, for each position $x$, the highest position at which an occluder can be placed without intersecting the mask: this position is lower than the height of the occluder and lower than the height of all mask points within an interval $[-\bar{w}/2, \bar{w}/2]$ of $x$. For a distribution of occluding blocks $p_{\bar{w}}(\bar{w})$ and $p_{\bar{h}}(\bar{h})$ for $\bar{w}$ and $\bar{h}$ respectively, the average area is then given by:

$\iiint \Omega(X_M, Y_M, \bar{w}, \bar{h}) \cdot p_{\bar{w}}(\bar{w}) \cdot p_{\bar{h}}(\bar{h}) \cdot dx \cdot d\bar{w} \cdot d\bar{h}.$  (41)
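To make the computation concrete, a minimal Python sketch of the Mask Sliding Min and of (40) is given below. It is not the original implementation: it assumes unit pixel spacing (so the integral over $x$ becomes a sum), represents the mask bottom as a dictionary from column to height, and guards against gaps in the mask columns:

```python
import numpy as np

def mask_sliding_min(X_M, Y_M, w_bar, h_bar):
    """Omega(X_M, Y_M, w_bar, h_bar) from Algorithm 1.

    X_M: iterable of occupied mask columns; Y_M: dict mapping each column
    in X_M to the height of the lowest mask point in that column.
    """
    Y = {x: min(Y_M[x], h_bar) for x in X_M}        # line 1: cap by occluder height
    half = w_bar / 2.0
    Z = {}
    for x in np.arange(min(X_M) - half, max(X_M) + half + 1.0):   # line 2
        window = [Y[c] for c in X_M if x - half <= c <= x + half]
        Z[x] = min(window) if window else h_bar     # line 3 (fallback if no column in range)
    return Z

def silhouette_area(X_M, Y_M, w_bar, h_bar):
    """A_s in (40), with the integral over x replaced by a unit-spaced sum."""
    return sum(mask_sliding_min(X_M, Y_M, w_bar, h_bar).values())
```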