Abstract - arXiv · 1. Introduction With widespread use of surveillance cameras, the need for...


Self-learning Scene-specific Pedestrian Detectors using a Progressive Latent Model

Qixiang Ye 1,4, Tianliang Zhang 1, Qiang Qiu 4, Baochang Zhang 2, Jie Chen 3, and Guillermo Sapiro 4

1 EECE, University of Chinese Academy of Sciences. 2 ASEE, Beihang University. 3 CMV, Oulu University. 4 ECE, Duke University.

[email protected]; [email protected]

Abstract

In this paper, a self-learning approach is proposed for solving the scene-specific pedestrian detection problem without any human annotation. The self-learning approach is deployed as progressive steps of object discovery, object enforcement, and label propagation. In the learning procedure, object locations in each frame are treated as latent variables that are solved with a progressive latent model (PLM). Compared with conventional latent models, the proposed PLM incorporates a spatial regularization term to reduce ambiguities in object proposals and to enforce object localization, and also a graph-based label propagation to discover harder instances in adjacent frames. With difference of convex (DC) objective functions, PLM can be efficiently optimized with concave-convex programming, which guarantees the stability of self-learning. Extensive experiments demonstrate that, even without annotation, the proposed self-learning approach outperforms weakly supervised learning approaches, while achieving performance comparable to transfer learning and fully supervised approaches.

1. Introduction

With the widespread use of surveillance cameras, the need for automatically detecting objects, e.g., pedestrians, has significantly increased. Recent methods [9, 13, 18, 27] have achieved encouraging progress in detecting objects in images. However, their performance in video scenes is limited for the following main reasons: 1) Supervised learning of detectors for different scenes requires repeated human effort; 2) Offline-trained detectors usually degrade with changes in the scene or camera; 3) Scene-specific cues including object resolution, occlusions, and background structures are not incorporated into the detectors [29].

Learning scene-specific detectors, which aims at modeling objects in video scenes by incorporating scene-specific discriminative information, has been increasingly investigated [19, 25, 31]. To learn scene-specific detectors with less human supervision, transfer learning and semi-supervised learning are commonly used [19, 25, 31]. Transfer learning adapts pre-trained detectors to new specific domains, reduces annotation requirements, and improves detector performance [35, 36, 37]. Semi-supervised learning saves human annotation effort by initially training detectors with a few annotated examples, and incrementally improving the detectors by extending the sample domains [11, 25, 41]. However, transfer learning is challenged when the object appearance in the target domains differs significantly from that in the source domains, while semi-supervised models might drift away from the intended aims given noisy or unrelated samples [25]. Most importantly, both methods require partial object-level annotations and therefore do not fully eliminate human supervision.

Figure 1. Proposed self-learning framework. Given a video where pedestrians are dominant moving objects, self-learning progressively constructs a scene-specific detector using object discovery, object enforcement, and label propagation procedures. (Diagram labels: a video of pedestrians; negative images; progressive latent model; object discovery, object enforcement, label propagation; iteration; score maps; input video.)

As a promising direction, recent unsupervised video object discovery techniques [23, 26, 39] have improved significantly and are expected to break the bottleneck of self-taught learning in practical applications. This paper discusses the possibility of self-learning pedestrian detectors in specific and dynamically changing scenes, e.g., a city square, to build a pedestrian detection system in a fully unsupervised manner, given video sequences where pedestrians are the dominant moving objects and additional negative images randomly collected from the Web (Fig. 1). The problem of self-learning is decomposed into three main components: object discovery, object enforcement, and label propagation. Object discovery is implemented with a latent SVM method [43], which outputs coarse models and annotations by minimizing frame-level classification error. Object enforcement aims at enforcing object localization and reducing ambiguity, i.e., discriminating object parts from the objects themselves, by leveraging a spatial regularization objective. Label propagation optimizes a graph-based objective function to gradually discover harder positive instances in frames. It also enables the self-learning framework to find complex sample domains, e.g., a manifold space comprising multi-posture and multi-view objects [42]. The three procedures are formulated in a progressive latent model (PLM) with difference of convex (DC) objective functions, which are efficiently optimized with concave-convex programming in a progressive manner.

arXiv:1611.07544v1 [cs.CV] 22 Nov 2016

The main contributions of this paper are: (1) A self-learning pedestrian detection framework, which is deployed as iterative procedures of object discovery, object enforcement, and label propagation, posing a new direction in the field of (unsupervised) object detection; (2) A progressive latent model (PLM), which uses spatial-temporal regularization to reduce the ambiguity of discovered samples, as well as addressing the stability of self-learning; and (3) Extensive experiments on the PETS2009, Towncenter, PNN-Parking-Lot2/Pizza, CUHK Square, and 24-Hours datasets, conducted to verify the performance of the proposed approach.

2. Related Works

Pedestrian detection using supervised methods has been extensively investigated [4, 10, 21, 32, 42, 45]. This work, however, is more related to scene-specific detection using transfer learning, online learning, weakly supervised learning, and unsupervised object discovery.

Transfer learning: The motivation behind transfer learning is that contexts and object distributions in target domains might be leveraged to improve the performance of pre-trained detectors in source domains. Researchers have explored context cues [35, 37], confidence propagation [37, 44], and virtual-real world adaptation [33] to realize smooth transfer. Gaussian process regression [40] and super-pixel region clustering [29] have been explored to select “safe” samples in target domains. Large margin embedding [22] and transductive multi-view embedding [15] have been explored to expand detector horizons. Researchers have also used domain adaptation to construct a self-learning camera [16].

Transfer learning can clearly reduce human annotation. Nevertheless, it suffers from the concept gap problem, i.e., the major differences in object appearance, viewpoint, and illumination between source and target domains. When the gap is significant, the adaptation of pre-trained models becomes non-smooth or infeasible. By contrast, self-learning initializes and improves detectors in the same scenes, naturally avoiding the concept gap problem.

Online/semi-supervised learning: Online learning and semi-supervised learning improve scene-specific detectors by taking advantage of the continuous incoming data stream from the target domains. Classical detection-by-tracking (DBT) [1, 24] initializes the system using offline-trained detectors and leverages temporal cues to extend sample domains and cancel detection errors. Tracking-Learning-Detection (TLD) [20] initializes the system with a single sample, and uses tracking and online learning to boost detectors. Despite the popularity of DBT and TLD approaches, recent studies [25] demonstrated that the simple combination of detection with tracking might produce poor detectors because the errors from both detection and tracking could be amplified in a coupled system. A P-N expert [20] is used in TLD to control precision and recall rates, guaranteeing the learning stability as a linear dynamic system. The learning stability of our approach can also be guaranteed, as the difference of convex (DC) objective functions of PLM converge at each learning iteration.

Weakly supervised learning: The inputs of weakly supervised learning (WSL) are image/video-level tags (object category), and the algorithm discovers objects while learning detectors [23, 30]. A general assumption behind WSL is that objects of the same category come from a potential cluster while the backgrounds are diverse. Under such an assumption, clustering [8, 34], tracking [23], boosting [38], region matching [6], graph labeling [30], and multi-instance learning [7, 28] are used to find the correspondence of objects, suppress the backgrounds, and learn detectors.

WSL alternates between sample labeling and detector learning in a way similar to Expectation Maximization optimization. Due to the missing annotations, however, this optimization is non-convex and therefore prone to getting stuck in a local minimum and outputting wrong labelings [3]. Cinbis et al. [7] use a multi-fold splitting of the training set, while Bilen et al. [3] use convex clustering to avoid getting stuck with wrong labels. This work alleviates the local optima problem in a more reasonable way by introducing regularization terms that encode domain knowledge, i.e., intra-frame hard-negative mining and inter-frame similarity propagation.

Unsupervised video object discovery: An early approach developed in [38] learns a scene-specific object detector by online boosting of part detectors, but it requires general seed detectors learned offline. Recent research [23, 39] formulates unsupervised video object discovery as a combination of two complementary steps: discovery and tracking. The first step establishes correspondences between prominent regions across video frames, and the second step associates successive similar object regions within the same video. Xiao et al. [39] propose a fully unsupervised video object proposal approach which first discovers a set of easy-to-group instances by clustering and then updates its appearance model to gradually detect harder instances using the initial detector and temporal consistency. This unsupervised approach can automatically generate object proposals, but cannot output precise detections.

3. Proposed Self-learning Framework

In the supervised object detection setting, the locations of training samples would simply be given, while in self-learning, the annotations of object locations are not available. The primary objective of self-learning is guiding the missing annotations to a solution that disentangles object samples from noisy object proposals, as shown in Fig. 2.

3.1. Progressive Latent Model

Modeling: The self-learning framework is decomposed into three basic procedures: object discovery, object enforcement, and label propagation. Given a set of object proposals that have salient object-like appearance and motion (Fig. 2a and Fig. 2b), the object discovery step aims to find object windows in video frames that best discriminate positive video frames from the negative images. The object enforcement step discovers hard negatives that help reduce falsely localized object parts, as well as improving object localization. The label propagation step mines harder instances of the corresponding object throughout the entire video (Fig. 2c and Fig. 2d). The three procedures iterate until an error-rate-based stability criterion is met.

Let $x \in \mathcal{X}$ denote a video frame or a negative image, and let $y \in \mathcal{Y}$, $\mathcal{Y} = \{0, 1\}$, be a label denoting whether $x$ contains a pedestrian object. $y = 1$ indicates that there is at least one pedestrian in the frame, while $y = 0$ indicates a frame without a pedestrian object or a negative image. Self-learning is formulated with a multi-objective function that jointly determines the latent object $h$ and a latent model $\beta$ in a progressively optimized procedure,

$$
\begin{aligned}
\{h^*, \beta^*\} &= \arg\min_{\beta, h} F_{(\mathcal{X},\mathcal{Y})}(\beta, h) \\
&= \arg\min_{\beta, h} F_l(\beta, h) - \lambda F_s(\beta) + \gamma F_g(\beta, h),
\end{aligned}
\tag{1}
$$

Figure 2. Object discovery from noisy proposals. (a) The score map in the first learning iteration and (b) candidate objects (red boxes) discovered. (c) The score map in the fifth learning iteration. (d) Candidate objects (red boxes) and hard negatives (yellow boxes). (Best viewed in color.)

where $F_l(\beta, h)$, $F_s(\beta)$ and $F_g(\beta, h)$ ¹, as defined below, are the objectives for object discovery, spatial regularization, and score propagation, respectively. $\lambda$ and $\gamma$ are regularization factors.

Object Discovery: The object discovery procedure is implemented with a latent SVM (LSVM) model to choose object proposals that best discriminate positive frames from negative images,

$$
\{y^*, h^*, \beta^*\} = \arg\max_{y \in \mathcal{Y},\, h \in \mathcal{H},\, \beta} \beta^T \cdot v(x, y, h), \tag{2}
$$

where $v(x, y, h)$ denotes a normalized feature vector, i.e., HOG features. $\mathcal{H}$ denotes the set of object proposals, made up of proposals $H_i$, $i = 1, ..., N$, from video frames. Basically, solving Eq. 2 produces a high score $\beta^T \cdot v(x, y, h)$ for each positive frame ($y = 1$) and a low score for each negative image ($y = 0$). Concretely, we learn the model $\beta$ on a collection of video frames and negative images $\mathcal{X} = \{(x_i, y_i)\}$, $i = 1, ..., N$, with

$$
\min_{\beta, h} F_l(\beta, h) = \min_{\beta, h} \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{N} l(\beta, x_i, y_i, h), \tag{3}
$$

where $C$ is a regularization factor and $l$ is a difference-of-convex loss function defined as

$$
l(\beta, x_i, y_i, h) = \max_{y, h} \left( \beta^T \cdot v(x_i, y, h) + \Delta(y_i, y) \right) - \max_{h} \beta^T \cdot v(x_i, y_i, h), \tag{4}
$$

where $\Delta(y_i, y) = 0$ if $y = y_i$, and 1 otherwise. Eqs. 3 and 4 aim at choosing the highest scoring proposals $h$ and discriminating them from the other configurations, defining a max-margin formulation to measure the mismatch between the image, label, and proposals.

¹ $(\mathcal{X}, \mathcal{Y})$ is omitted for short.
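To make Eq. 4 concrete, the following Python sketch evaluates the loss for a single frame. It assumes a common latent-SVM convention that the text does not spell out: the joint feature $v(x, y, h)$ equals the proposal descriptor when $y = 1$ and the zero vector when $y = 0$. The names `joint_feature` and `latent_loss` are illustrative and do not come from the paper.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def joint_feature(phi: Sequence[float], y: int) -> List[float]:
    # Assumption (not stated in the paper): v(x, 1, h) is the proposal
    # descriptor and v(x, 0, h) is the zero vector.
    return list(phi) if y == 1 else [0.0] * len(phi)


def latent_loss(beta: Sequence[float],
                proposals: Sequence[Sequence[float]],
                y_true: int) -> float:
    """Eq. 4 for one frame whose proposals are given as descriptors."""
    # First term: loss-augmented maximum over all labels and proposals.
    augmented = max(
        dot(beta, joint_feature(phi, y)) + (0.0 if y == y_true else 1.0)
        for y in (0, 1)
        for phi in proposals
    )
    # Second term: best-scoring proposal under the frame's own label.
    best_true = max(dot(beta, joint_feature(phi, y_true)) for phi in proposals)
    return augmented - best_true


proposals = [[2.0, 0.0], [0.0, 1.0]]
# A weak model scores the best proposal at 0.4, inside the unit margin.
weak_loss = latent_loss([0.2, 0.0], proposals, y_true=1)
# A stronger model clears the margin on the positive frame.
strong_loss = latent_loss([1.0, -1.0], proposals, y_true=1)
```

Because the first maximum always includes the true label, the loss is never negative; it reaches zero once the best proposal of a positive frame scores at least one unit above the alternative label.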

Object Enforcement: The object discovery procedure optimizes image-level classification rather than sample-level classification. Once the image-level classification objective is optimized, the learning procedure stops, whether or not the sample-level classification is optimized [43]. Considering that all positive images contain object parts but none of the negative images does, LSVM could falsely select object parts as positive samples, since Eq. 3 is non-convex and easily gets stuck in a local minimum.

Motivated by the success of hard negative mining [17], we propose using spatial regularization to enforce the localization of objects and the model. Denoting by $H_i$ the object proposals in frame $i$ and by $h'$ the hard negatives corresponding to an object $h$ in a video frame, we define a function to maximize the distance between the potential object and its spatial neighbors,

$$
\max_{\beta} F_s(\beta) = \sum_{i=1}^{N} \sum_{\substack{h \in H_i \\ h' \in \Omega_{H_i, h}}} \left\| \beta^T \cdot \left( v(x_i, h) - v(x_i, h') \right) \right\|^2, \tag{5}
$$

where $\Omega_{H_i, h}$ denotes the spatial neighbors of $h$ in $H_i$. The spatial neighbors are high-score object parts and surrounding image patches whose IoU (Intersection over Union) with $h$ lies in the interval (0.0, 0.25). Eq. 5 optimizes the model $\beta$ using fixed $h$, and thus is a convex regularization function. Such a function enforces the latent model, yielding a consistent and significant boost in object localization with a progressive learning procedure.
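The neighbor selection behind Eq. 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, only the IoU criterion is applied (the additional "high score" filter mentioned above is left out), and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_neighbors(h: Box, proposals: Sequence[Box],
                      low: float = 0.0, high: float = 0.25) -> List[Box]:
    """Proposals whose IoU with h lies in the open interval (low, high)."""
    return [p for p in proposals if low < iou(h, p) < high]


def spatial_term(beta: Sequence[float], feat_h: Sequence[float],
                 neighbor_feats: Sequence[Sequence[float]]) -> float:
    """Inner sum of Eq. 5 for one object: squared score gaps to neighbors."""
    total = 0.0
    for feat_n in neighbor_feats:
        gap = sum(b * (fh - fn) for b, fh, fn in zip(beta, feat_h, feat_n))
        total += gap * gap
    return total


obj = (0.0, 0.0, 10.0, 10.0)
candidates = [(8.0, 0.0, 18.0, 10.0),    # slight overlap: hard negative
              (1.0, 0.0, 11.0, 10.0),    # heavy overlap: same object
              (20.0, 20.0, 30.0, 30.0)]  # no overlap: unrelated region
hard_negatives = spatial_neighbors(obj, candidates)
```

Using an open interval excludes both disjoint patches (IoU of 0) and near-duplicates of the object itself, which matches the stated range (0.0, 0.25).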

Label Propagation: The object discovery procedure outputs only one sample for each frame. To mine more positives and negatives, we propose using inter-frame label propagation for incremental learning.

Suppose there are $l$ labeled samples from previous learning iterations. We select $u = l \times (r - 1.0)$ high-scored proposals as unlabeled samples, where $r > 1.0$ is the learning rate, related to the expected density of pedestrians. Given labeled samples $h_i$, $i = 1, ..., l$, and unlabeled proposals $h_j$, $j = l+1, ..., l+u$, a kNN graph in the feature space is first constructed. Each graph vertex corresponds to a sample and is linked to its nearest neighbors: $h_i$ and $h_j$ are connected if one of them is among the other's $k$ nearest neighbors [46]. The graph-based label propagation procedure is defined as

$$
g(\beta, h_j) = \frac{\sum_{k=1}^{l} w_{jk}\, g(\beta, h_k)}{\sum_{k=1}^{l} w_{jk}}, \quad j = l+1, ..., l+u,
$$

where $w_{jk}$ denotes the edge weight defined with a Gaussian function on the Euclidean distance between $h_j$ and $h_k$. This is equivalent to a convex optimization problem [46],

$$
\begin{aligned}
\min_{g(\beta, h)} F_g(\beta, h) &= \min_{g(\beta, h)} \sum_{i=1}^{l} \sum_{j=l+1}^{l+u} w_{ij} \left( g(\beta, h_i) - g(\beta, h_j) \right)^2 \\
\text{s.t.}\quad g(\beta, h_i) &= y_i, \quad i = 1, ..., l,
\end{aligned}
\tag{6}
$$

where $g(\beta, h_j)$ is the propagated score of proposal $h_j$ and $y_i$ is the label of the frame/image that $h_i$ belongs to.
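As a rough illustration of the propagation rule above, the Python sketch below builds a symmetric kNN graph with Gaussian weights and assigns each unlabeled sample the weighted average of its labeled neighbors' scores. The bandwidth `sigma`, the value of `k`, and the fallback score of 0.0 for samples without labeled neighbors are assumptions made for the example, not settings from the paper.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_graph(points: Sequence[Point], k: int, sigma: float) -> List[List[float]]:
    """Symmetric kNN graph: i and j are linked if either is among the
    other's k nearest neighbors; edge weights are Gaussian in distance."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(points[i], points[j]))
        for j in others[:k]:
            w = math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))
            weights[i][j] = w
            weights[j][i] = w
    return weights


def propagate_labels(labeled: Sequence[Point], labels: Sequence[float],
                     unlabeled: Sequence[Point],
                     k: int = 2, sigma: float = 1.0) -> List[float]:
    """Score each unlabeled sample by the weighted mean of labeled neighbors."""
    points = list(labeled) + list(unlabeled)
    weights = knn_graph(points, k, sigma)
    n_labeled = len(labeled)
    scores = []
    for j in range(n_labeled, len(points)):
        total = sum(weights[j][i] for i in range(n_labeled))
        if total == 0.0:
            scores.append(0.0)  # assumption: no labeled neighbor, score 0
        else:
            scores.append(sum(weights[j][i] * labels[i]
                              for i in range(n_labeled)) / total)
    return scores


# One positive at 0.0 and one negative at 10.0 in a 1-D feature space.
scores = propagate_labels(labeled=[(0.0,), (10.0,)], labels=[1.0, 0.0],
                          unlabeled=[(1.0,), (9.0,), (5.0,)])
```

Samples close to a labeled positive inherit a score near 1, those close to a negative a score near 0, and the sample midway between the two receives 0.5, which makes it a natural candidate to leave unlabeled in the current iteration.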

Progressive Optimization: In the learning procedure, the optimization of $F_s(\beta)$ (object enforcement) and $F_g(\beta, h)$ (label propagation) depends on the results of $F_l(\beta, h)$. Eq. 1 is thus a progressive model, where $F_l$, $F_s$ and $F_g$ are alternately optimized. According to Eq. 4, $F_l$ can be written as $A(x) - B(x)$ and $F$ can be written as $A(x) - B(x) + C(x) - D(x)$. This means that the objective functions of Eq. 1 can be written as differences of convex functions, which allows us to optimize them with a two-step Concave-Convex Procedure (CCCP) [43]. The first-step CCCP for $F_l$ discovers potential pedestrian objects in frames and initializes the latent model; the second-step CCCP for $\gamma F_g - \lambda F_s$ performs object enforcement and label propagation. The two-step CCCP progressively optimizes the PLM until the change in the estimated sample error rate is negligible. CCCP algorithms guarantee that the optimization of difference of convex objective functions converges to a local minimum or saddle point [43]. Therefore, iteratively applying the two-step CCCP algorithm while keeping the sample error rate decreasing (discussed in Sec. 3.3) can guarantee the stability of self-learning.
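The CCCP mechanism itself can be seen on a toy problem. The Python sketch below is not the PLM objective; it minimizes the one-dimensional difference of convex functions $f(x) = x^4 - 2x^2$, chosen only because each CCCP step has a closed form. At every iteration the concave part $-2x^2$ is replaced by its tangent at the current point and the remaining convex problem is solved exactly.

```python
import math
from typing import List


def objective(x: float) -> float:
    """Toy DC function: convex part x**4 minus convex part 2 * x**2."""
    return x ** 4 - 2.0 * x ** 2


def cccp_step(x_t: float) -> float:
    # Linearize the subtracted convex part 2 * x**2 at x_t (slope 4 * x_t),
    # then minimize x**4 - slope * x, whose stationary point solves
    # 4 * x**3 = slope.
    slope = 4.0 * x_t
    return math.copysign(abs(slope / 4.0) ** (1.0 / 3.0), slope)


def cccp(x0: float, tol: float = 1e-10, max_iter: int = 100) -> List[float]:
    """Run CCCP from x0 and return the whole trajectory of iterates."""
    trajectory = [x0]
    for _ in range(max_iter):
        x_new = cccp_step(trajectory[-1])
        trajectory.append(x_new)
        if abs(x_new - trajectory[-2]) < tol:
            break
    return trajectory


trajectory = cccp(0.1)
values = [objective(x) for x in trajectory]
```

Starting from 0.1, the iterates move to the local minimum at $x = 1$ and the objective never increases along the way, which is the monotone behaviour the stability argument relies on. Starting exactly at $x = 0$ the iteration stays put, a small example of CCCP stopping at a stationary point that is not a minimum.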

3.2. Self-learning a Detector

With the proposed PLM, a self-learning approach is implemented as described in Fig. 3. The proposal generation component localizes potential objects using objectness, motion, and appearance cues. The proposal ranking component chooses the high-ranked proposals as positive candidates, and low-ranked proposals as negatives. The proposal tracking component helps in finding proposals in successive video frames. The PLM identifies positives and hard negatives from the given proposals. With the mined positive samples, a DPM detector $f_\beta(h)$ is trained to perform pedestrian detection.

Given a video with a static background, a motion score map is calculated for each video frame with a background modeling algorithm. On the motion score map, detection proposals (as shown in Fig. 2b) are extracted using the EdgeBoxes approach [47], according to which edge maps are computed first, and contours, i.e., edge groups, are obtained by aggregating high-affinity edges. On the contours, the regions of high confidence are extracted as object proposals using a sliding window strategy over locations, scales, and aspect ratios. From the second iteration, with an initialized detector, a sliding window strategy is used to generate object proposals, as shown in Fig. 2d. To extend the proposals in the temporal domain, a KLT tracking algorithm is employed to track and collect proposals from frame $t$ to frame $t + \tau$, where $\tau$ is empirically set to 10. Before feeding these spatial-temporal proposals to the learning algorithm, their aspect ratios are normalized to the average aspect ratio. To prevent falsely choosing static backgrounds in videos of sparse pedestrians, the average background probability of a proposal is required to be larger than a threshold, empirically set to 0.20 in our experiments.

Figure 3. Block diagram of the proposed self-learning approach. (Blocks: video sequence; background modeling; proposal generation; proposal ranking; proposal tracking; negative images; spatial and temporal proposals; labeled samples [+] and [-]; PLM; scene-specific detector.)

We propose using a combinatorial score, i.e., $f(h) = \alpha^T \cdot (f_\beta(h), f_m(h), f_o(h))$, to choose high-ranked proposals, where $\alpha^T$ is a ranking weight vector. $f_\beta(h)$, $f_m(h)$ and $f_o(h)$ are the detection, motion, and objectness scores, respectively. The motion score $f_m(h)$ of a proposal is defined as the average motion score of all pixels in its image region. The objectness score $f_o(h)$ is defined by calculating contours in the proposal regions [47]. A larger score gives higher confidence that the proposal is an object. The detection score $f_\beta(h)$ is calculated by the learned detector from the second learning iteration onward. From this iteration, the proposal region centers are set as root locations, around which we use a sliding window to localize proposals.

In each learning iteration, the ranking weight vector $\alpha^T$ is updated using a zero-space regression method [5], which performs learning without using output values. It basically minimizes the regression error of all samples, as well as maximizing the distance from a hyperplane to the origin. This results in a weight vector which captures regions in the input sample space where the probability density of the data is found, and enables the proposal ranking to be adaptive.
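The combined score $f(h)$ and the pixel-averaged motion score can be sketched in Python as below. The data layout is an assumption made for the example: the motion map is a list of rows, boxes are `(x1, y1, x2, y2)` with exclusive upper bounds, and the weight vector is fixed by hand rather than learned with the regression method of [5].

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), upper bounds exclusive


def motion_score(motion_map: Sequence[Sequence[float]], box: Box) -> float:
    """Average motion value over all pixels inside the box."""
    x1, y1, x2, y2 = box
    pixels = [motion_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(pixels) / len(pixels)


def combined_score(proposal: Dict[str, float], alpha: Sequence[float]) -> float:
    """f(h) = alpha^T . (detection, motion, objectness)."""
    return (alpha[0] * proposal["detection"]
            + alpha[1] * proposal["motion"]
            + alpha[2] * proposal["objectness"])


def rank_proposals(proposals: List[Dict], alpha: Sequence[float]) -> List[Dict]:
    """Sort proposals from highest to lowest combined score."""
    return sorted(proposals, key=lambda p: combined_score(p, alpha), reverse=True)


motion_map = [[0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]]
proposals = [
    {"name": "A", "detection": 0.9, "motion": 0.2, "objectness": 0.5},
    {"name": "B", "detection": 0.1, "motion": 0.6, "objectness": 0.9},
    {"name": "C", "detection": 0.5, "motion": 0.5, "objectness": 0.5},
]
alpha = (0.5, 0.4, 0.1)  # hand-picked illustrative weights
ranked_names = [p["name"] for p in rank_proposals(proposals, alpha)]
```

In the first learning iteration no detector exists yet, so a caller of this sketch would set the detection score to 0.0 (or the corresponding weight to zero) and rank on motion and objectness alone.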

3.3. Error Rate Discussion

PLM incorporates a label propagation procedure, which iteratively introduces new samples and updates the model. In this procedure, the primary problems to be solved are avoiding model drift and reducing the error rate. Eq. 6 implies that a larger $\gamma$ value introduces more newly labeled samples, as well as a larger error rate $\xi$, and vice versa. The number of newly labeled samples $u$ is an implicit function of $\gamma$, $u(\gamma)$. The value of $\gamma$ must guarantee that the error rate of newly labeled samples is smaller than that of existing samples, meaning that the error rate of the training set is monotonically non-increasing. A large $\gamma$ is also desirable, since it implies that more samples can be labeled in each iteration. To decide the value of $\gamma$, an optimization objective function is defined:

$$
\begin{aligned}
&\max_{\gamma, \beta, y_j} \ \gamma \\
&\text{s.t.}\quad \xi_{u(\gamma)} \le \xi_l, \\
&\phantom{\text{s.t.}\quad} \frac{1}{l + u(\gamma)} \sum_{j=1}^{l + u(\gamma)} \left( f_\beta(h_j) - y_j \right) \le \frac{1}{l} \sum_{i=1}^{l} \left( f_\beta(h_i) - y_i \right),
\end{aligned}
\tag{7}
$$

where $l$ and $u(\gamma)$, respectively, denote the numbers of labeled samples in previous iterations and unlabeled samples in the current iteration.

The optimization of Eq. 7 guarantees that the estimated error rate of newly labeled samples $\xi_{u(\gamma)}$ is smaller than that of labeled samples $\xi_l$ by finding a proper $\gamma$ in each learning iteration. $\gamma$ is optimized with a linear search algorithm [12], which searches the interval [0.0, 1.0] with step size 0.1 and updates the scores $f_\beta(h_j)$ at each step. Meanwhile, $y_j$ is estimated with $y_j = f_\beta(\cdot)$, with which the error rate $\xi_{u(\gamma)}$ is calculated.
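The search over $\gamma$ can be sketched in Python as a grid scan that keeps the largest value satisfying the error-rate constraint. The callback `estimate_new_error` is a hypothetical stand-in for re-running label propagation with a given $\gamma$ and measuring $\xi_{u(\gamma)}$; how that estimate is produced is not part of this sketch.

```python
from typing import Callable, Optional


def search_gamma(estimate_new_error: Callable[[float], float],
                 labeled_error: float,
                 step: float = 0.1) -> Optional[float]:
    """Largest gamma on the grid [0, 1] whose estimated error rate for
    newly labeled samples does not exceed that of the labeled samples.
    Returns None when no grid value satisfies the constraint."""
    best = None
    n_steps = int(round(1.0 / step))
    for s in range(n_steps + 1):
        gamma = s * step
        if estimate_new_error(gamma) <= labeled_error:
            best = gamma
    return best


# Hypothetical error model: more aggressive propagation, more label noise.
def toy_error(gamma: float) -> float:
    return 0.05 + 0.3 * gamma


chosen = search_gamma(toy_error, labeled_error=0.22)
```

With this toy error model the constraint holds up to $\gamma \approx 0.57$, so the scan settles on 0.5. Scanning the whole grid instead of stopping at the first violation keeps the sketch correct even if the estimated error is not monotone in $\gamma$.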

4. Experiments

4.1. Datasets and Performance Metrics

The proposed approach is evaluated on five real-world datasets (six sequences) captured with surveillance cameras. The datasets involve challenges from object occlusions, low resolution, and/or moving distractors ².

PETS2009 [14]: A crowded video sequence captured in a public space, with 720×576 resolution.

Towncenter [2]: A moderately crowded video sequence of a town center, with 1920×1080 resolution.

PNN-Parking-Lot2/Pizza [29]: Moderately crowded video sequences including groups of pedestrians walking in queues with complex motion and similar appearance, with 1920×1080 resolution. It is challenging due to the large amount of pose variation and occlusion.

CUHK Square [37]: A 60-minute-long video of sparse pedestrians and other moving distractors, e.g., moving vehicles. The resolution of the video is 704×576. The resolution of pedestrian objects is much lower than in the other datasets. As the camera has an approximately 45-degree bird's-eye view, objects have perspective deformation.

24Hours ³: A 24-hour-long video of sparse/dense pedestrians, 24-hour illumination change, and other moving distractors, e.g., moving vehicles, which allows model drift to be assessed. The resolution of the video is 704×576. 6000 frames were uniformly sampled from the long video for learning and 2600 frames for testing.

² A demo video has been included in the supplementary materials.

Figure 4. Model effect. Precision (y-axis) versus recall (x-axis). (a) Panel title truncated in the source ("Effect of ..."); curves: with, without. (b) Progressive optimization; curves: iteration 1, iteration 2, iteration 5, iteration 10, final.

For all datasets except 24Hours, half of the video frames are used for learning while the other annotated frames are used for testing. The proposed approach is evaluated and compared against the following supervised learning, transfer learning, and weakly supervised learning approaches.

Offline-DPM [13]: A DPM detector trained offline on the PASCAL VOC person class.

Supervised-DPM: A supervised DPM detector trained with human-annotated samples on specific scenes and additional negative samples mined from negative images.

Supervised-SLSV [19]: A state-of-the-art scene-specific pedestrian detector learned from virtual pedestrians whose appearance is simulated in the specific scene under consideration. As no source code is publicly available, SLSV is only compared on the Towncenter dataset using author-reported results.

Transfer-DPM [29]: A scene-specific detection approach based on transfer learning. Detections are originally obtained with a DPM detector trained offline on the PASCAL VOC person class and then improved using super-pixel based clustering and classification.

Transfer-SSPD [37]: A state-of-the-art scene-specific pedestrian detector with transfer learning.

Weakly-MIL [7]: A widely used weakly supervised approach based on multi-instance learning. A DPM detector is then learned from the annotated positive samples.

4.2. Model Effect

In Fig. 4a and Fig. 4b, we respectively evaluate the effects of object enforcement and label propagation, showing that the PLM is more effective than the conventional LSVM model.

³ It will be a publicly available dataset.

Figure 5. Validation of learning stability. (a) Monotonic decrease of sample error rates (sample error rate versus iteration, for PETS2009, Towncenter, CUHK Square, PNN-ParkingLot, PNN-Pizza, and 24Hours). (b) Evolution of proposal ranking weights (weight versus iteration, for the detector score, objectness score, and motion score).

Object enforcement: Considering that the objective function in Eq. 3 is non-convex, learning tends to get stuck in a local minimum during optimization. By using the object enforcement procedure, Eq. 5, the performance of the learned detector improves significantly (Fig. 4a). The reason is that pedestrians are more precisely localized and most falsely detected object parts are suppressed. At a recall rate of 0.7, the precision improves by more than 10% when using such a regularization term, which shows that the convex objective function does help the non-convex optimization escape from poor local minima.

Label propagation: Combined with the proposal ranking strategy, label propagation can incrementally annotate pedestrian samples without supervision. Fig. 5b shows that the detection model is iteratively improved, demonstrating the effectiveness of the graph-propagation-based incremental learning. After tens of learning iterations, no additional positives are labeled and the performance is observed to be stable.
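To make the stopping behaviour described above concrete, the following is a minimal sketch of incremental labeling over a proposal graph. It is a simplified stand-in, not the propagation rule of the PLM: the function name `propagate_labels`, the `affinity` dictionary, and the `min_affinity` threshold are all hypothetical, and the threshold is not meant to represent the paper's parameter γ. The sketch only illustrates how positives can be added round by round until no new sample qualifies.

```python
from typing import Dict, List, Set, Tuple


def propagate_labels(
    affinity: Dict[Tuple[int, int], float],
    seeds: Set[int],
    min_affinity: float,
    max_iters: int = 50,
) -> Tuple[Set[int], List[int]]:
    """Grow a set of positive nodes over an undirected affinity graph.

    affinity maps a node pair (i, j) to a similarity in [0, 1] (hypothetical input).
    A node becomes positive when it is linked to an already positive node by an
    edge whose affinity is at least min_affinity (an assumed rule).
    Returns the final positive set and the number of positives after each round.
    """
    labeled = set(seeds)
    history = [len(labeled)]
    for _ in range(max_iters):
        newly_labeled = set()
        for (i, j), value in affinity.items():
            if value < min_affinity:
                continue  # edge too weak to carry a label
            if i in labeled and j not in labeled:
                newly_labeled.add(j)
            elif j in labeled and i not in labeled:
                newly_labeled.add(i)
        if not newly_labeled:
            break  # no additional positives: labeling has stabilized
        labeled |= newly_labeled
        history.append(len(labeled))
    return labeled, history


if __name__ == "__main__":
    # Toy graph: nodes 0-2 are strongly linked, node 3 is only weakly linked.
    toy_affinity = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.2, (3, 4): 0.95}
    positives, counts = propagate_labels(toy_affinity, {0}, min_affinity=0.5)
    print(sorted(positives), counts)
```

In this toy graph the weak edge between nodes 2 and 3 blocks further growth, so the loop ends once a round adds nothing, mirroring the observation that labeling eventually stops adding positives.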

Stability: Fig. 5a shows that the error rates of the labeled training samples decrease almost monotonically, showing the stability of the proposed self-learning approach. Fig. 5b shows the evolution of the proposal ranking weights during learning on the PETS2009 dataset. The weight for the objectness score quickly decays to zero, which implies that the objectness score is not as discriminative as the detection and motion scores. The weight for the detection score keeps increasing during learning, which indicates that the detector is progressively improved. The weight for the motion cue decreases to a value similar to that of the detection cue, which implies that the motion feature is also discriminative.
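The role of the three ranking weights can be illustrated with a small sketch. It assumes, purely for illustration, that the detector, objectness, and motion scores are combined linearly; the `Proposal` class, the `rank_proposals` function, and the weight values below are hypothetical and are not taken from the paper, where the weights are learned.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Proposal:
    """A candidate box with three cue scores (illustrative container)."""
    name: str
    detector_score: float
    objectness_score: float
    motion_score: float


def rank_proposals(
    proposals: List[Proposal], weights: Tuple[float, float, float]
) -> List[Proposal]:
    """Sort proposals by an assumed linear combination of the three cues."""
    w_det, w_obj, w_mot = weights

    def combined(p: Proposal) -> float:
        return (
            w_det * p.detector_score
            + w_obj * p.objectness_score
            + w_mot * p.motion_score
        )

    return sorted(proposals, key=combined, reverse=True)


if __name__ == "__main__":
    static_object = Proposal("static_object", 0.2, 0.9, 0.1)
    walking_person = Proposal("walking_person", 0.6, 0.1, 0.7)
    candidates = [static_object, walking_person]

    # Hypothetical early weights: objectness dominates the ranking.
    print([p.name for p in rank_proposals(candidates, (0.1, 0.8, 0.1))])
    # Hypothetical late weights: the objectness weight has decayed to zero.
    print([p.name for p in rank_proposals(candidates, (0.5, 0.0, 0.5))])
```

With the objectness weight at zero, the ranking depends only on the detector and motion cues, which is the regime the weights in Fig. 5b move towards.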

Tab. 1 shows the largest γ value for each dataset. γ is largest for the Towncenter dataset and smallest for the CUHK and 24Hours datasets. A larger γ implies that the object proposals contain less noise. The Towncenter dataset is a video with little illumination variance and few moving distracters, and therefore uses a larger γ. The CUHK and 24Hours datasets have many moving distracters, so they need a smaller γ.


Table 1. Label propagation parameters on different datasets.

Dataset   PETS   Towncenter   PNN    CUHK   24Hours
γ         0.50   0.70         0.60   0.30   0.30

[Figure 6 plots omitted. Panel titles, axes, and legend entries recovered from the plots:

(a) PETS2009, precision versus recall: Offline DPM AP=0.612; Supervised DPM AP=0.698; Transferred DPM AP=0.678; Multi-instance Learning AP=0.624; PLM AP=0.695.

(b) TownCenter, precision versus recall: Offline DPM AP=0.543; Supervised DPM AP=0.724; Supervised SLSV AP=0.852; Transferred DPM AP=0.934; Multi-instance Learning AP=0.675; PLM AP=0.797.

(c) PNN-Parking-Lot2, precision versus recall: Offline DPM AP=0.570; Supervised DPM AP=0.722; Transferred DPM AP=0.598; Multi-instance Learning AP=0.546; PLM AP=0.744.

(d) PNN-Pizza, precision versus recall: Offline DPM AP=0.633; Supervised DPM AP=0.702; Transferred DPM AP=0.667; Multi-instance Learning AP=0.565; PLM AP=0.694.

(e) CUHK Square, recall versus FPPI: Offline DPM; Supervised DPM; Transferred SSPD; Multi-instance Learning; PLM.

(f) 24Hours, precision versus recall: Offline DPM AP=0.521; Supervised DPM AP=0.600; Transferred DPM AP=0.552; Multi-instance Learning AP=0.554; PLM AP=0.615.]

Figure 6. Performance of our approach and comparisons with weakly supervised, supervised, and transfer learning approaches. On five datasets the Precision-Recall metric is adopted to evaluate the approach and compare it with other approaches. On the CUHK dataset the FPPI-Recall metric is adopted, consistent with the state-of-the-art scene-specific detection approach [37].

4.3. Performance

The Precision-Recall (PR) and FPPI-Recall (FR) curves in Fig. 6 show that our approach significantly outperforms the off-line learned DPM detector on all datasets. It also significantly outperforms the Weakly-MIL approach. On the PETS2009 and PNN-Parking-Lot2 datasets, our approach outperforms all of the compared approaches. On the CUHK dataset, our approach significantly outperforms the scene-specific approach with transfer learning [37], which reports the state-of-the-art performance on this dataset. It is even comparable to the supervised learning approach (Supervised-DPM). On the Towncenter dataset, our approach outperforms the MIL approach as well. However, it shows lower performance than the fully supervised approach SLSV [19] and the transfer learning approach [29]. The reason could be that the pedestrians in that video scene are sparse, so our approach could not label sufficient positive samples. It should be stressed once again that our proposed approach does not use any annotated training samples.
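For readers unfamiliar with the FPPI-Recall metric used on the CUHK dataset, the sketch below shows one common way to trace such a curve from scored detections. It is an illustration only: the function names, the input layout (a list of score and true-positive flag pairs), and the toy numbers are assumptions, and the matching of detections to ground truth (for example by an overlap threshold) is taken as already done.

```python
from typing import List, Tuple


def fppi_recall_curve(
    detections: List[Tuple[float, bool]],
    num_ground_truth: int,
    num_frames: int,
) -> List[Tuple[float, float]]:
    """Return (false positives per image, recall) points.

    detections holds (score, is_true_positive) pairs pooled over all test
    frames. Lowering the score threshold one detection at a time yields one
    curve point per detection.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    false_pos = 0
    curve = []
    for _, is_true_positive in ranked:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        curve.append((false_pos / num_frames, true_pos / num_ground_truth))
    return curve


def recall_at_fppi(curve: List[Tuple[float, float]], target_fppi: float) -> float:
    """Best recall reachable without exceeding target_fppi."""
    best = 0.0
    for fppi, recall in curve:
        if fppi <= target_fppi:
            best = max(best, recall)
    return best


if __name__ == "__main__":
    # Toy data: 5 detections over 2 frames containing 4 pedestrians in total.
    toy = [(0.9, True), (0.8, False), (0.7, True), (0.4, False), (0.3, True)]
    curve = fppi_recall_curve(toy, num_ground_truth=4, num_frames=2)
    print(curve)
    print(recall_at_fppi(curve, 1.0))
```

Reading recall at a fixed FPPI, as `recall_at_fppi` does, is a convenient way to compare detectors at the same false-alarm budget.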

On the 24Hours dataset, the AP (average precision) of our approach is the highest among all compared approaches, Fig. 6f. It is about 6 percentage points higher than that of the transfer learning method (0.615 vs. 0.552), validating our previous analysis: transfer learning suffers from the concept gap problem, e.g., when adapting a model trained on images captured in the daytime to a video sequence with 24-hour illumination changes. By contrast, the proposed self-learning approach just applies detectors learned from the same scene, naturally avoiding the concept gap problem. More surprisingly, using additional motion cues, the proposed approach outperforms the fully supervised approaches on this dataset.
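As a reminder of how an AP value is obtained from a precision-recall curve, here is a minimal sketch. The paper does not state which AP convention it uses, so the all-point interpolation below (area under the monotone precision envelope) is an assumption, and the function names and toy data are hypothetical.

```python
from typing import List, Tuple


def precision_recall(
    scores: List[float], is_true_positive: List[bool], num_ground_truth: int
) -> Tuple[List[float], List[float]]:
    """Precision and recall after each detection, in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    false_pos = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            true_pos += 1
        else:
            false_pos += 1
        precisions.append(true_pos / (true_pos + false_pos))
        recalls.append(true_pos / num_ground_truth)
    return precisions, recalls


def average_precision(precisions: List[float], recalls: List[float]) -> float:
    """Area under the PR curve using the interpolated precision envelope."""
    envelope = list(precisions)
    # Make precision non-increasing as recall grows (right-to-left maximum).
    for k in range(len(envelope) - 2, -1, -1):
        envelope[k] = max(envelope[k], envelope[k + 1])
    area = 0.0
    previous_recall = 0.0
    for precision, recall in zip(envelope, recalls):
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


if __name__ == "__main__":
    # Toy data: 4 detections, 2 ground-truth pedestrians.
    p, r = precision_recall([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2)
    print(average_precision(p, r))
```

Because AP summarizes the whole curve in one number, a difference such as 0.615 versus 0.552 reflects better precision over a range of recall levels rather than at a single operating point.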

In Fig. 7, we use key frames in each row to illustrate the incremental learning procedure. It can be seen that the positive samples are incrementally labeled and noisy samples are reduced. On the crowded PETS2009 dataset and the PNN-Pizza dataset, which has significant occlusions, our approach accurately labels samples, demonstrating that the learned detector has incorporated scene-specific discriminative information. On the Towncenter and CUHK datasets, although there exist moving distractors, e.g., bicycles and vehicles, the proposed approach correctly localizes the pedestrians, demonstrating its robustness in noisy environments. In the 24Hours dataset, some video frames have dense pedestrians (daytime) while others have sparse pedestrians (at night). Learning from the early morning to the middle of the night, our approach progressively improves its performance without model drift. In the last column of Fig. 7, the detection results show that the learned scene-specific detectors are discriminative and robust to occlusions, low resolution, and appearance variations. In Fig. 8, it can be seen that the self-learning approach adapts to view variance and 24-hour illumination changes, whereas transfer learning suffers from them.

5. Conclusions

Supervised learning of detectors for all scenes requires significant human effort on sample annotation. Commonly used transfer learning and semi-supervised learning do not eliminate human supervision, as they require partial object-level annotations. We show that by leveraging extremely weakly annotated video data, it is possible to automatically learn customized pedestrian detectors for specific scenes. A new progressive latent model is proposed by incorporating discriminative and incremental functions. A self-learning


[Figure 7 images omitted. Rows, top to bottom: Pets2009 (crowd); Towncenter (moving distracters); PNN-Parking-Lots2; PNN-Pizza (crowd); CUHK Square (low-resolution video with moving distracters); 24Hours (long video with moving distracters).]

Figure 7. Illustration of learning and detection. First three columns: score maps in the first, fifth, and tenth learning iterations, respectively. Fourth column: annotated positive samples (red boxes). Last column: detection examples in the test sets. (Best viewed in color.)

[Figure 8 images omitted. Panels: our proposed self-learning approach; the transfer-learning approach.]

Figure 8. Detection results on the 24Hours dataset. The self-learned detector correctly detects all pedestrians in the daytime (left) and at night (right), whereas transfer learning produces missed and false detections.

approach is implemented by optimizing the model over spatio-temporal proposals. Experiments demonstrate that the self-learned detectors are comparable to supervised ones, taking a step towards self-learning cameras [16].


Acknowledgement

The partial support of this work by ONR, NGA, ARO, NSF, NSFC under Grants 61271433 and 61671427, and Beijing Municipal Science & Technology Commission under Grant Z161100001616005 is gratefully acknowledged.

References

[1] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. IEEE CVPR, 2008.

[2] B. Benfold and I. D. Reid. Stable multi-target tracking in real-time surveillance video. IEEE CVPR, 2011.

[3] H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with convex clustering. IEEE CVPR, 2015.

[4] Z. Cai, M. Saberian, X. Wang, and N. Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. IEEE ICCV, 2015.

[5] C. Chang and C. Lin. LIBSVM: a library for support vector machines. ACM Trans. Intell. Sys. and Tech., 2(3):27, 2011.

[6] M. Cho, S. Kwak, C. Schmid, and J. Ponce. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. IEEE CVPR, 2015.

[7] R. G. Cinbis, J. J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell., DOI: 10.1109/TPAMI.2016.2535231, 2016.

[8] S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. IEEE CVPR, 2015.

[9] P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell., 36(8):1532–1545, 2014.

[10] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell., 34(4):743–761, 2012.

[11] J. Donahue, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell. Semi-supervised domain adaptation with instance constraints. IEEE CVPR, 2013.

[12] D. E. Knuth. Sorting and searching. The Art of Computer Programming 3 (3rd ed.). Addison-Wesley, 1997.

[13] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.

[14] J. Ferryman and A. Shahrokni. PETS2009: Dataset and challenge. Twelfth IEEE Int'l Workshop on Performance Evaluation of Tracking and Surveillance, 2009.

[15] Y. Fu, T. M. Hospedales, T. Xiang, Z. Y. Fu, and S. Gong. Transductive multi-view embedding for zero-shot recognition and annotation. ECCV, 2014.

[16] A. Gaidon, G. Zen, and J. A. R. Serrano. Self-learning camera: Autonomous adaptation of object detectors to unlabeled video streams. CoRR, 2014.

[17] R. Girshick. Fast R-CNN. IEEE ICCV, 2015.

[18] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE CVPR, 2014.

[19] H. Hattori, V. N. Boddeti, K. Kitani, and T. Kanade. Learning scene-specific pedestrian detectors without real data. IEEE CVPR, 2015.

[20] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell., 34(7):1409–1422, 2012.

[21] W. Ke, Y. Zhang, P. Wei, Q. Ye, and J. Jiao. Pedestrian detection via PCA filters based convolutional channel features. IEEE ICASSP, 2015.

[22] A. Kuznetsova, S. J. Hwang, B. Rosenhahn, and L. Sigal. Expanding object detector's horizon: Incremental learning framework for object detection in videos. IEEE CVPR, 2015.

[23] S. Kwak, M. Cho, I. Laptev, J. Ponce, and C. Schmid. Unsupervised object discovery and tracking in video collections. IEEE ICCV, 2015.

[24] Y. Mao and Z. Yin. Training a scene-specific pedestrian detector using tracklets. IEEE WACV, 2015.

[25] I. Misra, A. Shrivastava, and M. Hebert. Watch and learn: Semi-supervised learning of object detectors from videos. IEEE CVPR, 2015.

[26] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. IEEE ICCV, 2013.

[27] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 2016.

[28] W. Ren, K. Huang, D. Tao, and T. Tan. Weakly supervised large scale object localization with multiple instance learning and bag splitting. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):405–416, 2016.

[29] G. Shu, A. Dehghan, and M. Shah. Improving an object detector and extracting regions using superpixels. IEEE CVPR, 2013.

[30] H. O. Song, R. B. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. ICML, 2014.

[31] S. Stalder, H. Grabner, and L. V. Gool. Exploring context to learn scene specific object detectors. IEEE Workshop on PETS, 2009.

[32] Y. Tian, P. Luo, X. Wang, and X. Tang. Deep learning strong parts for pedestrian detection. IEEE ICCV, 2015.

[33] D. Vazquez, A. M. Lopez, J. Marín, D. Ponsa, and D. Geronimo. Virtual and real world adaptation for pedestrian detection. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):797–809, 2014.

[34] C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervised object localization with latent category learning. ECCV, 2014.

[35] M. Wang and X. Wang. Automatic adaptation of a generic pedestrian detector to a specific traffic scene. IEEE CVPR, 2015.

[36] X. Wang, G. Hua, and T. X. Han. Detection by detections: Non-parametric detector adaptation for a video. IEEE CVPR, 2012.


[37] X. Wang, M. Wang, and W. Li. Scene-specific pedestrian detection for static video surveillance. IEEE Trans. Pattern Anal. Mach. Intell., 36(2):361–374, 2014.

[38] B. Wu and R. Nevatia. Improving part based object detection by unsupervised via online boosting. IEEE CVPR, 2007.

[39] F. Xiao and Y. J. Lee. Track and segment: An iterative unsupervised approach for video object proposals. IEEE CVPR, 2016.

[40] J. Xu, S. Ramos, D. Vazquez, and A. M. Lopez. Domain adaptation of deformable part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 36(12):2367–2380, 2014.

[41] Y. Yang, G. Shu, and M. Shah. Semi-supervised learning of feature hierarchies for object detection in a video. IEEE CVPR, 2008.

[42] Q. Ye, Z. Han, J. Jiao, and J. Liu. Human detection in images via piecewise linear support vector machines. IEEE Transactions on Image Processing, 22(2):778–789, 2013.

[43] C. J. Yu and T. Joachims. Learning structural SVMs with latent variables. Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, pages 1169–1176, 2009.

[44] X. Zeng, W. Ouyang, M. Wang, and X. Wang. Deep learning of scene-specific classifier for pedestrian detection. ECCV, 2014.

[45] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele. How far are we from solving pedestrian detection. IEEE CVPR, 2016.

[46] X. Zhu and A. B. Goldberg. Introduction to semi-supervised learning. MIT Press, 2009.

[47] C. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. ECCV, 2014.