IEEE/OSA/IAPR International Conference on Informatics, Electronics & Vision (ICIEV 2012)
Automatic Multiple Human Detection and Tracking
for Visual Surveillance System
Alok K. Singh Kushwaha Department of Computer Engineering and Applications,
GLA University, Mathura, INDIA [email protected]
Chandra Mani Sharma Department of Computer Applications,
ITS, Ghaziabad, INDIA [email protected]
Manish Khare, Rajneesh Kumar Srivastava and Ashish Khare Department of Electronics & Communication, University of Allahabad, Allahabad, INDIA
e-mail: [email protected]
Abstract- Object tracking is an important task in video processing because of its variety of applications in visual surveillance, human activity monitoring and recognition, traffic flow management, etc. Multiple object detection and tracking in outdoor environments is challenging because of poor lighting conditions and variations in the pose, shape, size, and clothing of human objects. This paper proposes a novel technique for detection and tracking of multiple human objects in a video. A classifier is trained for object detection using Haar-like features from a training image set. Human objects are detected with the help of this trained detector and are tracked using a particle filter. The experimental results show that the proposed technique can detect and track multiple humans in a video adequately fast in the presence of poor lighting conditions and variations in pose, shape, size, and clothing, and that it can handle a varying number of human objects in a video at various points of time.
Keywords- Human detection; Multiple object tracking; Haar-like
features; Machine learning; Particle filter.
I. INTRODUCTION
Detecting humans and analyzing their activities by vision is key for today's computer systems to interact intelligently in human-inhabited environments. Visual surveillance systems are used for the real-time observation of targets, such as humans or vehicles, which can further lead to a description of the objects' activities in that environment. Visual surveillance has been used for security monitoring, anomaly detection, intruder detection, traffic flow measurement, accident detection on highways, and routine maintenance in nuclear facilities [1-4]. Hu et al. [1] provided a good survey of visual surveillance and its various applications. Detection and tracking of humans in a video is an initial step towards analysis and prediction of their behavior and intentions. The problem of multiple object tracking is more complex and challenging than single object tracking [5] because of the sudden appearance and disappearance of objects. Viola et al. [6] proposed an object detection framework and used it for the detection of human faces in an image. The adaptive boosting technique has been used to speed up a binary classifier [7], and this can be used in machine learning for creating real-time detectors. After detection of an object in a video sequence captured by a surveillance camera, the next step is to track these objects (humans, vehicles, etc.) in the subsequent frames of the video stream. Particle filter based tracking techniques are very popular because of their easy implementation and their capability to represent a non-linear object tracking system in the presence of non-Gaussian noise. Various object tracking techniques based on particle filtering are found in the literature [8-12]. These techniques fall under two main categories: single object tracking techniques and multiple object tracking techniques. Lanvin et al. [8] proposed an object detection and tracking technique and solved the non-linear state equations using particle filtering. Algorithms which attempt to find the target of interest without segmentation have been proposed for single target tracking based on cues such as color, edges, and textures [13].
978-1-4673-1154-0/12/$31.00 ©2012 IEEE
The problem of tracking multiple objects using particle filters can be solved in two ways: by creating a separate particle filter for each track, or by having a single particle filter for all tracks. The second approach works fine as long as the tracked objects are not occluded, but when objects come close and occlusion occurs, it fails to track the objects. In [9] a multi-object tracking technique using multiple particle filters has been proposed. Chen et al. [10] proposed a color-based particle filter for object tracking; their technique is based on a Markov Chain Monte Carlo (MCMC) particle filter and the object's color distribution. In [11], object tracking and classification are performed simultaneously. Particle filter based multiple object tracking schemes rely on hybrid sequential state estimation. The particle filter developed in [14] has multiple models for the objects' motion, and comprises an additional discrete state component to denote which motion model is active. The Bayesian multiple-blob tracker [15] presents a multiple tracking system based on a statistical appearance model. Multiple blob tracking is managed by incorporating the number of objects present in the state vector, and the state vector is augmented as in [16] when a new object appears in the view.
The joint task of object detection and tracking comes with a heavy computational overhead, and for visual surveillance applications speed is one of the most important factors. Most of the techniques described above are too slow for real-time visual surveillance and fail to cope with dynamic outdoor environment conditions. This paper presents a novel technique based on machine learning and particle filtering to detect and track humans in video. The humans are detected in video frames using the object detection framework proposed by Viola et al. [6] with Haar-like features and are then tracked in the subsequent video frames using a simple particle filter. The use of binary adaptive boosting [7] reduces the training time of the human detector. Detecting the objects first simplifies the process of tracking and sheds the load of detection from the tracker. In addition, the exhaustive dataset used for training the detector enables the system to detect and track multiple objects under critical lighting conditions and against dynamic backgrounds.
The rest of the paper is organized as follows. Section II discusses the proposed technique for human detection and tracking; Section III presents various experimental results that demonstrate the validity of the method; conclusions are given in Section IV.
II. THE PROPOSED METHOD
Here we solve two sub-problems: detecting the humans in the video, and tracking them in the subsequent video frames. The sub-problem of object detection is solved using a machine learning approach, for which we train our human detector using binary adaptive boosting; for tracking, a particle filter is used.
A. Sample creation and Human Detector Training
We used a machine learning approach for object detection based on Haar-like features, originally proposed by Viola et al. [6]. The training is performed using binary adaptive boosting to speed up the process. We collected positive and negative image samples for training: positive images contain humans, while negative images do not contain any human object. Our positive dataset consists of 2,000 images and the negative dataset consists of 2,700 images. Fig. 1 shows some cropped human images from the positive samples used for training the human detector. This human detector is a binary classifier with two classes: human and non-human. The second step after sample collection is Haar-like feature extraction from these samples. We use rectangular Haar-like features, which bear an intuitive similarity to Haar wavelets. An integral image representation is used for fast feature evaluation. After computing the integral image representation, we use the adaptive boosting technique [7] to train the human detector. The main motivation for using Haar-like features is their easy evaluation: extracting (evaluating) the required features from training data is a challenging task that becomes even more complex and cumbersome when the training dataset is very large, but with Haar-like features the integral image facilitates easy and fast calculation of features and considerably reduces the computational effort. The advantage of using this image dataset at training time is that we can include different images of human beings incorporating various constraints (such as persons in different poses, sizes, clothing, etc.). This training makes the system capable of working in realistic scenarios where these constraints are encountered.
Figure 1. Sample positive human images used for detector training.
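The integral-image trick behind fast Haar-like feature evaluation can be sketched as follows; the function names are ours, not from the paper, and the two-rectangle feature shown is just one of the feature types used in the Viola-Jones framework:

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns; ii[y, x] holds the sum of
    all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y, x, h, w):
    """Sum of pixels in the h-by-w rectangle with top-left corner (y, x),
    computed from the integral image in at most four lookups."""
    total = ii[y + h - 1, x + w - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0 and x > 0:
        total += ii[y - 1, x - 1]
    return total

def haar_edge_feature(ii, y, x, h, w):
    """A two-rectangle ('edge') Haar-like feature: the difference between
    the sums of two horizontally adjacent h-by-w rectangles."""
    return rect_sum(ii, y, x, h, w) - rect_sum(ii, y, x + w, h, w)
```

Each rectangle sum costs a constant number of array lookups regardless of rectangle size, which is what makes exhaustive feature evaluation over a large training set tractable.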
B. Particle Filter creation for tracking
For tracking purposes, a particle filter is used. In the proposed approach, uncertainty about a human's state (position) is represented as a set of weighted particles, each particle representing one possible state. The filter propagates particles from frame i-1 to frame i using a motion model, computes a weight for each propagated particle using an appearance model, and then re-samples the particles according to their weights. The initial distribution for the filter is centered on the location at which the object is first detected. The steps are given below.
(i). Prediction
A human object is tracked from one video frame to the next by predicting its position in frame i given its position in frame i-1. If a_{j,i} and a_{j,i-1} represent the positions of object j in frames i and i-1, then a distribution p(a_{j,i} | a_{j,i-1}) over object j's position in frame i is predicted using the belief about its position in frame i-1. We define a motion model for predicting the position of a human object from frame to frame; specifically, we use a second-order auto-regressive dynamical model. Non-Gaussian noise is incorporated into the motion model through a random noise variable. In this autoregressive model the next state A_t of the system is a function of some previous states and a noise variable ε_t, as given below:

A_t = f(A_{t-1}, A_{t-2}, ..., A_{t-p}, ε_t)    (1)

We construct a second-order linear autoregressive model that estimates the position of an object in the current frame from its positions in the last two frames, as shown in equation (2):

a_{j,i} = 2 a_{j,i-1} - a_{j,i-2} + ε_i    (2)
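Equation (2) amounts to a constant-velocity prediction plus noise. A minimal vectorized sketch (the noise scale in pixels and the function name are our assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_particles(prev, prev2, noise_scale=3.0):
    """Second-order autoregressive prediction per equation (2):
    a_i = 2*a_{i-1} - a_{i-2} + noise.
    `prev` and `prev2` are (K, 2) arrays holding each particle's
    position in the last two frames."""
    noise = rng.normal(scale=noise_scale, size=prev.shape)
    return 2.0 * prev - prev2 + noise
```

With zero noise the prediction simply extrapolates the displacement of the last two frames, so the noise term is what lets the filter recover from direction changes.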
(ii). Likelihood Measurement for Particles
We measure the likelihood p(ψ^(k) | a_{j,i}^(k), ψ_j) for each propagated particle k using a color histogram-based appearance model, where ψ_j is the color histogram in the appearance model of trajectory j and ψ^(k) is the histogram observed at the particle's location. The likelihoods of the propagated particles, treated as weights, are normalized so that their sum equals 1.
(iii). Re-sampling
Over time, the weight of the highest-weighted particle may tend to one while the weights of the other particles tend to zero. To avoid this degeneracy, the particles are re-sampled: many of the low-weight particles are removed and high-weight particles are replicated, yielding a new set of equally-weighted particles. We use the re-sampling technique described in [13].
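One common way to implement this step is systematic re-sampling; we show it here as an illustrative sketch (the paper itself follows the scheme of [13], and the function name is ours):

```python
import numpy as np

def systematic_resample(weights, rng=None):
    """Return indices of the particles to keep. Low-weight particles
    tend to be dropped and high-weight particles replicated, giving a
    new equally-weighted particle set."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(weights)
    # K evenly spaced pointers with one shared random offset
    positions = (rng.random() + np.arange(K)) / K
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)
```

Systematic re-sampling touches each particle at most once per pointer sweep and keeps the replicated set unbiased with respect to the weights.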
C. Appearance Model
Color histograms are used in our appearance model. For every newly detected human object a color histogram ψ_j is computed. Particle likelihoods in future frames are calculated from these histograms, so all computed histograms are saved. A particle's likelihood is computed using the Bhattacharyya similarity coefficient between the model histogram ψ_j and the observed histogram ψ^(k), assuming n bins per histogram. The likelihood p(ψ^(k) | a_{j,i}^(k), ψ_j) of particle k is given as:

p(ψ^(k) | a_{j,i}^(k), ψ_j) = exp(-d(ψ_j, ψ^(k)))    (3)

d(ψ_j, ψ^(k)) = 1 - Σ_{b=1}^{n} sqrt(ψ_{j,b} ψ_b^(k))    (4)

where ψ_{j,b} and ψ_b^(k) denote bin b of ψ_j and ψ^(k), respectively.
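Equations (3)-(4) translate directly into code; this sketch assumes both histograms are normalized to sum to 1 (the function name is ours):

```python
import numpy as np

def bhattacharyya_likelihood(model_hist, observed_hist):
    """Particle likelihood from equations (3)-(4): exp(-d), where
    d = 1 - sum_b sqrt(model_b * observed_b) is the Bhattacharyya
    distance between two normalized color histograms."""
    coeff = np.sum(np.sqrt(model_hist * observed_hist))
    return np.exp(-(1.0 - coeff))
```

Identical histograms give a coefficient of 1, a distance of 0, and hence the maximum likelihood of 1; completely disjoint histograms give exp(-1).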
D. Complete Algorithm
1. Let Z be the input video.
   1.1. In the first frame η_0 of Z, detect humans using the human detector. Let ω be the number of detected humans.
   1.2. Initialize trajectories T_j, 1 ≤ j ≤ ω, with the initial positions a_{j,0} of the detected humans, and set the occlusion count φ_j of each trajectory to 0.
   1.3. Initialize the appearance model ψ_j of each trajectory from the region around a_{j,0}.
2. For each subsequent frame η_i of the input video Z, and for each existing trajectory T_j:
   2.1. Use the motion model to predict the distribution p(a_{j,i} | a_{j,i-1}) over locations of human j in frame i, creating a set of candidate particles a_{j,i}^(k), 1 ≤ k ≤ K.
   2.2. Use the appearance model to compute the color histogram ψ^(k) and likelihood p(ψ^(k) | a_{j,i}^(k), ψ_j) for each particle k.
   2.3. Re-sample the particles according to their likelihoods and obtain k*, the index of the most likely particle.
   2.4. Run the human detector on the location a_{j,i}^(k*). If the location is classified as human, reset φ_j ← 0; otherwise set φ_j ← φ_j + 1.
   2.5. If φ_j exceeds a threshold, remove trajectory j.
3. In frame η_i, search for new human objects and compute the Euclidean distance q_{j,k} between each newly detected human k and each existing trajectory T_j. When q_{j,k} > γ for all j, initialize a new trajectory for detection k, where γ is a threshold in pixels whose value is less than the width of the tracking window.
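The per-trajectory tracking step (2.1-2.3) of the algorithm above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the detector calls of steps 1.1 and 2.4 are omitted, and the parameter values, class name, and grayscale patch histogram are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100  # particles per trajectory (our choice)

class Trajectory:
    def __init__(self, pos, hist):
        self.prev = np.tile(pos, (K, 1)).astype(float)   # positions at frame i-1
        self.prev2 = self.prev.copy()                    # positions at frame i-2
        self.model_hist = hist                           # appearance model ψ_j
        self.phi = 0                                     # occlusion count φ_j

def color_histogram(frame, pos, size=32, bins=8):
    """Normalized intensity histogram of the patch around pos = (x, y)."""
    y, x = int(pos[1]), int(pos[0])
    patch = frame[max(y - size, 0):y + size, max(x - size, 0):x + size]
    h, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def track_step(traj, frame):
    # 2.1 predict: second-order AR motion model, equation (2), with noise
    particles = 2 * traj.prev - traj.prev2 + rng.normal(scale=3, size=(K, 2))
    # 2.2 weight: Bhattacharyya-based likelihood, equations (3)-(4)
    w = np.array([np.exp(-(1 - np.sum(np.sqrt(
        traj.model_hist * color_histogram(frame, p))))) for p in particles])
    w /= w.sum()  # normalize weights to sum to 1
    # 2.3 re-sample according to the weights
    idx = rng.choice(K, size=K, p=w)
    traj.prev2, traj.prev = traj.prev, particles[idx]
    return particles[np.argmax(w)]  # best estimate a_{j,i}^(k*)
```

In the full algorithm, the returned location would then be verified by the trained human detector (step 2.4) before updating the occlusion counter.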
III. EXPERIMENTS AND RESULTS
We have tested the proposed automatic human detection and tracking technique on a number of videos. We created a video repository of some 50 videos; many of them were shot by us and the others were collected from various sources. These videos are representative of various real-world scenarios deemed important for visual surveillance, and have been grouped into the following categories:
1. Single Human Videos - In these videos only a single person appears. The person may randomly change pose and trajectory of movement. Videos in this group have different backgrounds and lighting conditions.
2. Multiple Human Videos with Occlusion - Videos in this category contain multiple human beings, who may move in random directions and occlude each other. The number of objects in the video at a particular instant may also vary: objects may disappear and new objects may appear.
3. Multiple Human Videos without Occlusion - These videos contain multiple human beings, but they move in such a way that occlusion does not occur.
4. Multiple Objects Videos - Multiple classes of objects (moving and stationary) are present in these videos, for example humans, vehicles, and animals. The objects may occlude each other and may appear in any pose, shape, and size.
The human detection and tracking results for some representative videos are given in fig. 2 and fig. 4. Human detection and tracking using the proposed method starts automatically without any initialization parameters, unlike many existing object tracking techniques that require operator intervention [5,17]. Fig. 2 shows the human detection and tracking results for a realistic video of three human objects walking towards the camera. The video was shot in the evening in dim lighting conditions at a resolution of 640x480 and a frame rate of 17 frames per second, and consisted of 500 frames in total. Fig. 2 gives the detection and tracking results for 16 frames, from frame 1 to frame 375, at intervals of 25 frames. One can observe from fig. 2 that the proposed method detects and tracks all three human objects from one frame to another with accuracy.
Figure 2. Detection and tracking results of the proposed method on a real-time video in an outdoor environment (frames 1 to 375, at intervals of 25 frames).
Table I gives the ground truth and calculated values of the centroids of the bounding boxes around the three objects in the video results shown in fig. 2. One can observe from Table I that the maximum centroid distance between the ground truth value and the calculated value is nearly 14 pixels (13.89) for a frame resolution of 640x480, which is small relative to the frame size. Most of the time the centroid distance is either 0 or negligible, showing the algorithm's accuracy, which is also evident from the centroid distances plotted in fig. 3.
TABLE I. Calculated object centroids, ground truth centroids, and centroid distances (in pixels) for the three objects in the video of fig. 2.

Frame  Calculated centroids               Ground truth centroids             Centroid distance
No.    Object 1   Object 2   Object 3     Object 1   Object 2   Object 3     Obj 1   Obj 2   Obj 3
1      (473,307)  (368,297)  (201,283)    (473,307)  (368,295)  (213,290)    0.00    2.00    13.89
25     (471,304)  (367,297)  (200,282)    (461,307)  (364,295)  (212,289)    10.44   3.60    13.89
50     (479,316)  (373,306)  (214,317)    (479,316)  (373,306)  (214,317)    0.00    0.00    0.00
75     (489,310)  (370,314)  (209,308)    (478,305)  (372,319)  (210,307)    12.00   5.38    2.23
100    (481,310)  (381,316)  (218,309)    (479,311)  (381,316)  (218,309)    2.20    0.00    0.00
125    (490,322)  (388,306)  (215,310)    (494,320)  (384,306)  (215,310)    4.47    4.00    0.00
150    (486,340)  (384,312)  (217,300)    (484,340)  (385,311)  (237,309)    2.00    1.41    13.45
175    (489,314)  (372,313)  (232,304)    (487,311)  (375,314)  (231,309)    3.60    3.16    5.09
200    (489,341)  (376,315)  (227,320)    (488,344)  (375,314)  (224,321)    3.16    1.41    3.16
225    (477,312)  (388,334)  (217,303)    (473,310)  (387,335)  (215,304)    4.47    1.41    2.23
250    (478,313)  (390,337)  (220,300)    (478,313)  (390,337)  (220,300)    0.00    0.00    0.00
275    (471,304)  (390,337)  (220,300)    (468,304)  (390,337)  (220,300)    3.00    0.00    0.00
300    (473,307)  (368,334)  (201,283)    (475,308)  (368,334)  (200,281)    2.23    0.00    2.23
325    (470,302)  (368,334)  (220,300)    (468,300)  (368,334)  (219,298)    2.82    0.00    2.23
350    (470,302)  (368,334)  (220,300)    (468,296)  (368,334)  (220,300)    6.32    0.00    0.00
375    (466,298)  (367,332)  (220,300)    (462,290)  (366,331)  (220,300)    8.94    1.41    0.00

Figure 3. Distance between the object centroids estimated using the proposed method and the ground truth. The x-axis shows the video frame number and the y-axis shows the distance between centroids (in pixels) for the three objects of the fig. 2 video results.

Figure 4. The human detection and tracking results with the proposed method when the humans appear in different poses (frames 100, 125, 150, and 175).
Figure 4 shows the human detection and tracking results when the humans appear in different poses. There are four human beings in the video frame sequence: the first one is walking, the second is standing, the third is sitting, and the fourth is bending. The results show that the proposed algorithm can detect and track all four human objects efficiently. After applying the proposed method to a set of several videos, we found that the average detection and tracking accuracy of the proposed technique is 87.44% and the average execution speed is 18 frames per second.
IV. CONCLUSIONS
With the increasing number of crimes and terrorist activities, the need for viable visual surveillance systems is being felt worldwide. Object detection and tracking is the fundamental step in visual surveillance. In this paper we have proposed an automatic multiple human detection and tracking technique using Haar-like features and a simple particle filter. The human detector can easily be trained using Haar-like features extracted from training samples consisting of images of humans and of the background. The operation of the proposed technique is fully automatic and does not require any operator intervention, unlike other methods. Experimental results demonstrate that the technique can detect and track humans in outdoor environments under poor lighting conditions and in multiple poses. The average detection accuracy of 87.44% and the average processing speed of 18 frames per second make the technique suitable for real-time visual surveillance applications.
ACKNOWLEDGMENT
The authors are thankful to the Department of Science and Technology, New Delhi for providing a research grant vide grant no. SR/FTP/ETA-023/2009.
REFERENCES
[I] W. Hu and T. Tan, "A Survey on Visual Surveillance of Object Motion and Behaviors ", IEEE Trans. Systems, Man, and Cybernetics, Vol. 34, No. 3, pp. 334-352,2006.
[2] E.Stringa and C.S. Regazzoni, "Real-time Video Shot Detection for Scene Surveillance Application," IEEE Trans. Image Processing, Vol. 9, pp. 69-79, 2009.
[3] 1. Haritaoglu, D. Harwood, and L. Davis, "W4: Who? When? Where? What? a Real-time System for Detecting and Tracking People," In the Proc. IEEE Conference on Face Gesture Recognition, pp. 222-227, 1998.
[4] D. Koller, J. Weber, T. Huang, J. Malik, B. Rao G. Ogasawara, and S. Russell, 'Toward Robust Automatic Traffic Scene Analysis in Realtime," In the Proc. IEEE Conference on Pattern Recognition, Vol. 1, pp. 126- 13 1, 1994,.
[5] S. Nigam and A. Khare, "Object tracking based on curvelet transform ", accepted to publish in lET Computer Vision, 20 1 1.
[6] P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," In the Proc. of IEEE Int. Conference on Computer vision and Pattern Recognition pp. 5 1 1-5 18, 200 1.
[7] Y.Freund, and R. Schapire, " A decision theoretic generalization of online learning and an application to boosting," Journal of Computer and System Sciences 55, pp. 1 19-139, 2007
[8] P. Lanvin, J-C.Noyer, M. Benjelloun," Object Detection and Tracking using the Particle Filtering," In the Proc. of IEEE 371h Asilomar Conference on Signals, Systems and Computers, Vol. 2,pp. 1595-I599, 2003.
[9] J. Wang, Y. Ma, C. Li, H. Wang, J. Liu, "An Efficient Multi-Object Tracking Method using Multiple Particle Filters " In the Proc. of World Congress on Computer Science and Information Engineering, pp. 568-572,2009.
[ 10] Y. Chen, S. Yu, J. Fan, W. Chen, H. Li, "An Improved Color-based Particle Filter for Object Tracking," In the Proc. of IEEE Second Int. Conference on Genetic and Evolutionary Computing, pp. 360-363, 2008.
[ 1 1] F. Bardet, T. Chateau, D.Ramadasan, "Illumination Aware MCMC Particle Filter for Long-tenn Outdoor Multi-Object Simultaneous Tracking and Classification," In the Proc. of IEEE 12th Int. Conference on Computer Vision, pp. 1623-1630,2009.
[ 12] C. Luo, X. Cai, J. Zhang, "Robust Object Tracking using the Particle Filtering and Level Set Methods: A Comparative Experiment," In the Proc. of IEEE 10lh Workshop on Multimedia Signal Processing, pp. 359-364,2008.
[ 13] P. Brasnett, L. Mihaylova, N. Canagarajah, and D. Bull, "ParticleFiltering for Multiple Cues for Object Tracking in Video Sequences," In the Proc. of the 17th SPIE Annual Symposium on Electronic Imaging, Science and Technology, vol. 5685, pp. 430-440,2005
[ 14] M. Isard and A. Blake, "A Mixed-State Condensation Tracker with AutmaticModel-Switching," in Proc. Int. Conf Computer Vision, pp. I07- 1 12, 1998
[ 15] M. Isard and J. MacCormick, "Bramble: a Bayesian Multiple Blob Tracker," in Proc. Int. Conf. Computer Vision, pp. 34-41,200 1
[ 16] J. Czyz, B. Ristic, and B. Macq, "A Color-Based Particle Filter for Joint Detection and Tracking of Multiple Objects," In the Proc. of the ICASSP, pp. 2 17-220, 2005.
[ 17] N.T. Binh and A. Khare, "Object Tracking Of Video Sequence in Curvelet Domain ", Int. Journal of Image and Graphics, pp. 1-20 ,20 1 1