
End-to-End Integration of a Convolutional Network, Deformable Parts Model and Non-Maximum Suppression

Li Wan, David Eigen, Rob Fergus
Dept. of Computer Science, Courant Institute, New York University

Deformable Parts Models and Convolutional Networks each have achieved notable performance in object detection. Yet these two approaches find their strengths in complementary areas: DPMs are well-versed in object composition, modeling fine-grained spatial relationships between parts; likewise, ConvNets are adept at producing powerful image features, having been discriminatively trained directly on the pixels.

In this paper, we propose a new model (shown in Fig. 1) that combines these two approaches, obtaining the advantages of each. We train this model using a new structured loss function that considers all bounding boxes within an image, rather than isolated object instances. This enables the non-maximal suppression (NMS) operation, previously treated as a separate post-processing stage, to be integrated into the model. This allows for discriminative training of our combined ConvNet + DPM + NMS model in end-to-end fashion. We evaluate our system on the PASCAL VOC 2007 and 2011 datasets, achieving competitive results on both benchmarks.

For a given input image x, we first construct an image pyramid of appearance features $\phi_A(x)$ using the first five layers of a Convolutional Network pre-trained for the ImageNet classification task. We first train an eight-layer classification model, composed of five convolutional feature extraction layers plus three fully-connected classification layers. After this network has been trained, we discard the three fully-connected layers, replacing them with the DPM. The five convolutional layers are then used to extract appearance features.
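To make this step concrete, here is a minimal sketch of building such an appearance pyramid. It is an illustration, not the authors' code: the use of torchvision's AlexNet as the pre-trained five-conv-layer network and the particular pyramid scales are assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# ImageNet-pretrained classifier; we keep only the convolutional stack.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()
conv_layers = alexnet.features  # five conv layers (+ ReLU/pool); FC layers discarded

def appearance_pyramid(image, scales=(1.0, 0.75, 0.5, 0.25)):
    """phi_A(x): convolutional feature maps of the image at several scales."""
    feats = []
    with torch.no_grad():
        for s in scales:
            xs = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
            feats.append(conv_layers(xs))
    return feats

# Usage: appearance_pyramid(torch.randn(1, 3, 500, 500))
```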

The deformation layer finds the optimal part locations, accounting forboth apperance and a deformation cost that models the spatial relation of thepart to the root. Given appearance scores Fpart

v,p , part location p relative tothe root, and deformation parameters wD,v,p for each part, the deformed partresponses are the following (input variables (xs,y) omitted):

Fdefv,p = max

δi,δ j

Fpartv,p [pi +δi, p j +δ j]−wpart

D,v,pφD(δi,δ j) (1)

where Fpartv,p [pi +δi, p j +δ j] is the part response map Fpart

v,p (xs,y) shifted byspatial offset (pi + δi, p j + δ j), and φD(δi,δ j) = [|δi|, |δ j|,δ 2

i ,δ2j ]

T is the

shape deformation feature. wpartD,y,v ≥ 0 are the deformation weights.
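A brute-force reading of Eq. (1), evaluated for a single root placement, is easy to state in code. The sketch below is a hypothetical illustration (names and the offset range are assumptions); DPM-style implementations compute this max efficiently with a generalized distance transform rather than an explicit loop.

```python
import numpy as np

def deformed_part_response(F_part, p, w_d, max_offset=2):
    """F_part: 2-D part response map F^part_{v,p};
    p: (p_i, p_j) part anchor relative to the root;
    w_d: non-negative weights for phi_D = [|di|, |dj|, di^2, dj^2]."""
    H, W = F_part.shape
    best = -np.inf
    for di in range(-max_offset, max_offset + 1):
        for dj in range(-max_offset, max_offset + 1):
            i, j = p[0] + di, p[1] + dj
            if 0 <= i < H and 0 <= j < W:
                phi_d = np.array([abs(di), abs(dj), di**2, dj**2])
                # appearance score minus deformation cost, Eq. (1)
                best = max(best, F_part[i, j] - w_d @ phi_d)
    return best
```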

Combining the scores of root, parts and object views is done using an AND-like accumulation over parts to form a score $F_v$ for each view $v$, followed by an OR-like maximum over views to form the final object score $F$:

$$F_v(x_s, y) = F^{\text{root}}_v(x_s, y) + \sum_{p \in \text{parts}} F^{\text{def}}_{v,p}(x_s, y) \quad (2)$$

$$F(x_s, y) = \max_{v \in \text{views}} F_v(x_s, y) \quad (3)$$

$F(x_s, y)$ is then the final score map for class $y$ at scale $s$, given the image $x$, as shown in Fig. 2.
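A tiny sketch of Eqs. (2)–(3), under the assumption that all score maps are same-sized 2-D arrays (names are hypothetical):

```python
import numpy as np

def object_score(F_root, F_def):
    """F_root[v]: root score map for view v;
    F_def[v]: list of deformed part score maps for view v."""
    # AND-like: accumulate root and part evidence within each view (Eq. 2)
    F_v = [F_root[v] + sum(F_def[v]) for v in range(len(F_root))]
    # OR-like: pixelwise max over views (Eq. 3)
    return np.maximum.reduce(F_v)
```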

Final-prediction loss. We train with a loss that takes into account the NMS step used in inference. In contrast to bootstrapping with a hard negative pool, as in [2, 1], we consider each image individually when determining positive and negative examples, accounting for NMS and the views present in the image itself.

The NMS stage produces a set of assignments predicted by the model, $A = \{(b_i, y_i, r_i)\}_{i=1\ldots n}$, from the set $B$ of all possible assignments. We compose the loss using two terms, $C(A)$ and $C(A')$. The first, $C(A)$, measures the cost incurred by the assignment currently predicted by the model, while $C(A')$ measures the cost incurred by an assignment close to the ground truth.
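For concreteness, a standard greedy NMS pass that could produce the assignment set A from candidate detections looks like the sketch below (helper names and the 0.5 overlap threshold are assumptions, not taken from the paper):

```python
def iou(a, b):
    """Intersection-over-union of boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda z: (z[2] - z[0]) * (z[3] - z[1])
    return inter / (area(a) + area(b) - inter)

def nms(candidates, thresh=0.5):
    """candidates: (box b, label y, score r) triples from B; returns A."""
    A = []
    for cand in sorted(candidates, key=lambda c: -c[2]):
        # keep a candidate only if it does not overlap an already-kept box
        if all(iou(cand[0], kept[0]) < thresh for kept in A):
            A.append(cand)
    return A
```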

This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.

[Figure 1 diagram: Image Pyramid → ConvNet Features → Response Pyramid → Deformable Parts Model (per view: root and part1 ... part9 with Def layers, combined by AND; OR over object views) → NMS → Final Prediction (e.g. bicycle, person); fprop/bprop arrows span the full pipeline.]

Figure 1: An overview of our system: (i) a convolutional network extracts features from an image pyramid; (ii) a set of deformable parts models (each capturing a different view) are applied to the convolutional feature maps; (iii) non-maximal suppression is applied to the resulting response maps, yielding bounding box predictions.

[Figure 2 diagram: per-view AND over a root filter and deformed parts (part1 ... part9, each through a Def layer), followed by an OR over views.]

Figure 2: Overview of the top part of our network architecture

The current prediction cost $C(A)$ is:

$$C(A) = \underbrace{\sum_{(b_i, y_i, r_i) \in A} H(r_i, y_i)}_{C_P(A)} + \underbrace{\sum_{(b_j, y_j, r_j) \in S(A)} H(r_j, 0)}_{C_N(A)} \quad (4)$$

where $H(r, y) = \mathbb{I}(y > 0)\max(0, 1 - r)^2 + \mathbb{I}(y = 0)\max(0, r + 1)^2$, i.e. a squared hinge error.
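Eq. (4) and the squared hinge $H$ translate directly to code. This is a sketch under assumed names; $S(A)$, the set of suppressed assignments, is supplied by the NMS stage:

```python
def H(r, y):
    """Squared hinge error: positives should score above +1,
    background below -1."""
    if y > 0:
        return max(0.0, 1.0 - r) ** 2
    return max(0.0, r + 1.0) ** 2

def C(A, S_of_A):
    """C(A) = C_P(A) + C_N(A), Eq. (4)."""
    C_P = sum(H(r, y) for (_b, y, r) in A)        # kept assignments
    C_N = sum(H(r, 0) for (_b, _y, r) in S_of_A)  # suppressed as background
    return C_P + C_N
```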

[1] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, September 2010. ISSN 0162-8828. doi: 10.1109/TPAMI.2009.167. URL http://dx.doi.org/10.1109/TPAMI.2009.167.

[2] Long (Leo) Zhu, Yuanhao Chen, Alan Yuille, and William Freeman. Latent hierarchical structural learning for object detection. In 2010 IEEE Conference on Computer Vision and Pattern Recognition, pages 1062–1069, 2010. doi: 10.1109/CVPR.2010.5540096.