IEEE/OSA/IAPR International Conference on Informatics, Electronics & Vision (ICIEV 2012)
Automatic Multiple Human Detection and Tracking
for Visual Surveillance System
Alok K. Singh Kushwaha Department of Computer Engineering and Applications,
GLA University, Mathura, INDIA [email protected]
Chandra Mani Sharma Department of Computer Applications,
ITS, Ghaziabad, INDIA [email protected]
Manish Khare, Rajneesh Kumar Srivastava and Ashish Khare Department of Electronics & Communication, University of Allahabad, Allahabad, INDIA
e-mail: [email protected]
Abstract- Object tracking is an important task in video processing because of its variety of applications in visual surveillance, human activity monitoring and recognition, traffic flow management, etc. Multiple object detection and tracking in outdoor environments is challenging because of poor lighting conditions and variations in the pose, shape, size, and clothing of human objects. This paper proposes a novel technique for detection and tracking of multiple human objects in a video. A classifier is trained for object detection using Haar-like features from a training image set. Human objects are detected with the help of this trained detector and are tracked using a particle filter. The experimental results show that the proposed technique can detect and track multiple humans in a video adequately fast in the presence of poor lighting conditions and variations in pose, shape, size, and clothing, and that it can handle a varying number of human objects in a video at various points of time.
Keywords- Human detection; Multiple object tracking; Haar-like
features; Machine learning; Particle filter.
I. INTRODUCTION
Detecting humans and analyzing their activities by vision is key for today's computer systems to interact intelligently in human-inhabited environments. Visual surveillance systems are used for the real-time observation of targets, such as humans or vehicles, which can further lead to a description of the objects' activities in that environment. Visual surveillance has been used for security monitoring, anomaly detection, intruder detection, traffic flow measurement, accident detection on highways, and routine maintenance in nuclear facilities [1-4]. Hu et al. [1] provided a good survey of visual surveillance and its various applications. Detection and tracking of humans in a video is an initial step towards analysis and prediction of their behavior and intentions. The problem of multiple object tracking is more complex and challenging than single object tracking [5] because of the sudden appearance and disappearance of objects. Viola et al. [6] proposed an object detection framework and used it for the detection of human faces in an image. The adaptive boosting technique has been used to speed up a binary classifier [7], and this can be used in machine learning for creating real-time detectors. After detection of an object in a video sequence captured by a surveillance camera, the next step is to track these objects (humans, vehicles, etc.) in the subsequent frames of the video stream. Particle filter based tracking techniques are very popular because of their easy implementation and their capability to represent a non-linear object tracking system in the presence of non-Gaussian noise. Various object tracking techniques based on particle filtering are found in the literature [8-12]. These techniques fall under two main categories: single object tracking techniques and multiple object tracking techniques. Lanvin et al. [8] proposed an object detection and tracking technique and solved the non-linear state equations using particle filtering. Algorithms which attempt to find the target of interest without segmentation have been proposed for single target tracking based on cues such as color, edges, and textures [13].
978-1-4673-1154-0/12/$31.00 ©2012 IEEE
The problem of tracking multiple objects using particle filters can be solved in two ways: by creating a separate particle filter for each track, or by having a single particle filter for all tracks. The second approach works fine as long as the tracked objects are not occluded, but when objects come close and occlusion occurs, it fails to track the objects. In [9] a multi-object tracking technique using multiple particle filters has been proposed. Chen et al. [10] proposed a color-based particle filter for object tracking; their technique is based on a Markov Chain Monte Carlo (MCMC) particle filter and the object's color distribution. In [11], object tracking and classification are performed simultaneously. Particle filter based multiple object tracking schemes rely on hybrid sequential state estimation. The particle filter developed in [14] has multiple models for the objects' motion, and comprises an additional discrete state component to denote which motion model is active. The Bayesian multiple-blob tracker [15] presents a multiple tracking system based on a statistical appearance model. Multiple blob tracking is managed by incorporating the number of objects present in the state vector, and the state vector is augmented as in [16] when a new object appears in the view.
The joint task of object detection and tracking comes with a heavy computational overhead, and for visual surveillance applications speed is one of the most important factors. Most of the techniques described above are too slow for real-time visual surveillance and fail to cope with dynamic outdoor environment conditions. This paper presents a novel technique based on machine learning and particle filtering to detect and track humans in video. The humans are detected in video frames using the object detection framework proposed by Viola et al. [6] with Haar-like features and are then tracked in the subsequent video frames using a simple particle filter. The use of binary adaptive boosting [7] reduces the training time of the human detector. Detecting the objects first simplifies the process of tracking and sheds the load of detection from the tracker. In addition, the exhaustive dataset used for training the detector enables the system to detect and track multiple objects under critical lighting conditions and against dynamic backgrounds.
The rest of the paper is organized as follows. Section II discusses the proposed technique for human detection and tracking; Section III presents various experimental results that demonstrate the validity of the method; conclusions are given in Section IV.
II. THE PROPOSED METHOD
Here we solve two sub-problems: detecting the humans in the video, and tracking them in the subsequent video frames. The sub-problem of object detection is solved using a machine learning approach, for which we train our human detector using binary adaptive boosting; for tracking, a particle filter is used.
A. Sample creation and Human Detector Training
We used a machine learning approach for object detection based on Haar-like features, originally proposed by Viola et al. [6]. The training is performed using binary adaptive boosting to speed up the process. We collected positive and negative image samples for training: positive images contain humans, while negative images do not contain any human object. Our positive dataset consists of 2,000 images and the negative dataset consists of 2,700 images. Fig. 1 shows some cropped human images from the positive samples used for training the human detector. This human detector is a binary classifier with two classes: human and non-human. The second step after sample collection is Haar-like feature extraction from these samples. We use rectangular Haar-like features, which bear an intuitive similarity to Haar wavelets. An integral image representation is used for fast feature evaluation. After computing the integral image representation, we use the adaptive boosting technique [7] to train the human detector. The main motivation for using Haar-like features is their easy evaluation: extracting (evaluating) the required features from training data is a challenging task that becomes even more complex and cumbersome when the training dataset is very large, but with Haar-like features the integral image facilitates easy and fast calculation of features and considerably reduces the computational effort. The advantage of using this image dataset at training time is that we can include different images of human beings incorporating various constraints (such as persons in different poses, sizes, clothing, etc.). This training makes the system capable of working in realistic scenarios where these constraints are encountered.
Figure 1. Sample positive human images used for detector training.
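The integral-image trick behind fast Haar-like feature evaluation can be sketched as follows; the function names are ours, not from the paper, and the two-rectangle feature shown is just one of the feature types used in the Viola-Jones framework:

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns; ii[y, x] holds the sum of
    all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y, x, h, w):
    """Sum of pixels in the h-by-w rectangle with top-left corner (y, x),
    computed from the integral image in at most four lookups."""
    total = ii[y + h - 1, x + w - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0 and x > 0:
        total += ii[y - 1, x - 1]
    return total

def haar_edge_feature(ii, y, x, h, w):
    """A two-rectangle ('edge') Haar-like feature: the difference between
    the sums of two horizontally adjacent h-by-w rectangles."""
    return rect_sum(ii, y, x, h, w) - rect_sum(ii, y, x + w, h, w)
```

Each rectangle sum costs a constant number of array lookups regardless of rectangle size, which is what makes exhaustive feature evaluation over a large training set tractable.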
B. Particle Filter creation for tracking
For tracking purposes, a particle filter is used. In the proposed approach, uncertainty about a human's state (position) is represented as a set of weighted particles, each particle representing one possible state. The filter propagates particles from frame i-1 to frame i using a motion model, computes a weight for each propagated particle using an appearance model, and then re-samples the particles according to their weights. The initial distribution for the filter is centered on the location at which the object is first detected. The steps are given below.
(i). Prediction
A human object is tracked from one video frame to the next by predicting its position in frame i given its position in frame i-1. If a_{j,i} and a_{j,i-1} represent the positions of object j in frames i and i-1, then a distribution p(a_{j,i} | a_{j,i-1}) over object j's position in frame i is predicted using the belief about its position in frame i-1. We define a motion model for predicting the position of a human object from frame to frame; specifically, we use a second-order auto-regressive dynamical model. Non-Gaussian noise is incorporated into the motion model through a random noise variable. In this autoregressive model the next state A_t of the system is a function of some previous states and a noise variable ε_t, as given below:

A_t = f(A_{t-1}, A_{t-2}, ..., A_{t-p}, ε_t)    (1)

We construct a second-order linear autoregressive model that estimates the position of an object in the current frame from its positions in the last two frames, as shown in equation (2):

a_{j,i} = 2 a_{j,i-1} - a_{j,i-2} + ε_i    (2)
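Equation (2) amounts to a constant-velocity prediction plus noise. A minimal vectorized sketch (the noise scale in pixels and the function name are our assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_particles(prev, prev2, noise_scale=3.0):
    """Second-order autoregressive prediction per equation (2):
    a_i = 2*a_{i-1} - a_{i-2} + noise.
    `prev` and `prev2` are (K, 2) arrays holding each particle's
    position in the last two frames."""
    noise = rng.normal(scale=noise_scale, size=prev.shape)
    return 2.0 * prev - prev2 + noise
```

With zero noise the prediction simply extrapolates the displacement of the last two frames, so the noise term is what lets the filter recover from direction changes.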
(ii). Likelihood Measurement for Particles
We measure the likelihood p(ψ^(k) | a_{j,i}^(k), ψ_j) for each propagated particle k using a color histogram-based appearance model, where ψ_j is the color histogram in the appearance model of trajectory j and ψ^(k) is the histogram observed at the particle's location. The likelihoods of the propagated particles, treated as weights, are normalized so that their sum equals 1.
(iii). Re-sampling
Over time, the weight of the highest-weighted particle may tend to one while the weights of the other particles tend to zero. To avoid this degeneracy, the particles are re-sampled: many of the low-weight particles are removed and high-weight particles are replicated, yielding a new set of equally-weighted particles. We use the re-sampling technique described in [13].
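One common way to implement this step is systematic re-sampling; we show it here as an illustrative sketch (the paper itself follows the scheme of [13], and the function name is ours):

```python
import numpy as np

def systematic_resample(weights, rng=None):
    """Return indices of the particles to keep. Low-weight particles
    tend to be dropped and high-weight particles replicated, giving a
    new equally-weighted particle set."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(weights)
    # K evenly spaced pointers with one shared random offset
    positions = (rng.random() + np.arange(K)) / K
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)
```

Systematic re-sampling touches each particle at most once per pointer sweep and keeps the replicated set unbiased with respect to the weights.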
C. Appearance Model
Color histograms are used in our appearance model. For every newly detected human object a color histogram ψ_j is computed. Particle likelihoods in future frames are calculated from these histograms, so all computed histograms are saved. A particle's likelihood is computed using the Bhattacharyya similarity coefficient between the model histogram ψ_j and the observed histogram ψ^(k), assuming n bins per histogram. The likelihood p(ψ^(k) | a_{j,i}^(k), ψ_j) of particle k is given as:

p(ψ^(k) | a_{j,i}^(k), ψ_j) = exp(-d(ψ_j, ψ^(k)))    (3)

d(ψ_j, ψ^(k)) = 1 - Σ_{b=1}^{n} sqrt(ψ_{j,b} ψ_b^(k))    (4)

where ψ_{j,b} and ψ_b^(k) denote bin b of ψ_j and ψ^(k), respectively.
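Equations (3)-(4) translate directly into code; this sketch assumes both histograms are normalized to sum to 1 (the function name is ours):

```python
import numpy as np

def bhattacharyya_likelihood(model_hist, observed_hist):
    """Particle likelihood from equations (3)-(4): exp(-d), where
    d = 1 - sum_b sqrt(model_b * observed_b) is the Bhattacharyya
    distance between two normalized color histograms."""
    coeff = np.sum(np.sqrt(model_hist * observed_hist))
    return np.exp(-(1.0 - coeff))
```

Identical histograms give a coefficient of 1, a distance of 0, and hence the maximum likelihood of 1; completely disjoint histograms give exp(-1).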
D. Complete Algorithm
1. Let Z be the input video.
   1.1. In the first frame η_0 of Z, detect humans using the human detector. Let ω be the number of detected humans.
   1.2. Initialize trajectories T_j, 1 ≤ j ≤ ω, with the initial positions a_{j,0} of the detected humans, and set the occlusion count φ_j of each trajectory to 0.
   1.3. Initialize the appearance model ψ_j of each trajectory from the region around a_{j,0}.
2. For each subsequent frame η_i of the input video Z, and for each existing trajectory T_j:
   2.1. Use the motion model to predict the distribution p(a_{j,i} | a_{j,i-1}) over locations of human j in frame i, creating a set of candidate particles a_{j,i}^(k), 1 ≤ k ≤ K.
   2.2. Use the appearance model to compute the color histogram ψ^(k) and likelihood p(ψ^(k) | a_{j,i}^(k), ψ_j) for each particle k.
   2.3. Re-sample the particles according to their likelihoods and obtain k*, the index of the most likely particle.
   2.4. Run the human detector on the location a_{j,i}^(k*). If the location is classified as human, reset φ_j ← 0; otherwise set φ_j ← φ_j + 1.
   2.5. If φ_j exceeds a threshold, remove trajectory j.
3. In frame η_i, search for new human objects and compute the Euclidean distance q_{j,k} between each newly detected human k and each existing trajectory T_j. When q_{j,k} > γ for all j, initialize a new trajectory for detection k, where γ is a threshold in pixels whose value is less than the width of the tracking window.
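The per-trajectory tracking step (2.1-2.3) of the algorithm above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the detector calls of steps 1.1 and 2.4 are omitted, and the parameter values, class name, and grayscale patch histogram are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100  # particles per trajectory (our choice)

class Trajectory:
    def __init__(self, pos, hist):
        self.prev = np.tile(pos, (K, 1)).astype(float)   # positions at frame i-1
        self.prev2 = self.prev.copy()                    # positions at frame i-2
        self.model_hist = hist                           # appearance model ψ_j
        self.phi = 0                                     # occlusion count φ_j

def color_histogram(frame, pos, size=32, bins=8):
    """Normalized intensity histogram of the patch around pos = (x, y)."""
    y, x = int(pos[1]), int(pos[0])
    patch = frame[max(y - size, 0):y + size, max(x - size, 0):x + size]
    h, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def track_step(traj, frame):
    # 2.1 predict: second-order AR motion model, equation (2), with noise
    particles = 2 * traj.prev - traj.prev2 + rng.normal(scale=3, size=(K, 2))
    # 2.2 weight: Bhattacharyya-based likelihood, equations (3)-(4)
    w = np.array([np.exp(-(1 - np.sum(np.sqrt(
        traj.model_hist * color_histogram(frame, p))))) for p in particles])
    w /= w.sum()  # normalize weights to sum to 1
    # 2.3 re-sample according to the weights
    idx = rng.choice(K, size=K, p=w)
    traj.prev2, traj.prev = traj.prev, particles[idx]
    return particles[np.argmax(w)]  # best estimate a_{j,i}^(k*)
```

In the full algorithm, the returned location would then be verified by the trained human detector (step 2.4) before updating the occlusion counter.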
III. EXPERIMENTS AND RESULTS
We have tested the proposed automatic human detection and tracking technique on a number of videos. We created a video repository of some 50 videos; many of them were shot by us and the others were collected from various sources. These videos are representative of various real-world scenarios deemed important for visual surveillance, and have been grouped into the following categories:
1. Single Human Videos - In these videos only a single person appears. The person may randomly change pose and trajectory of movement. Videos in this group have different backgrounds and lighting conditions.
2. Multiple Human Videos with Occlusion - Videos in this category contain multiple human beings, who may move in random directions and occlude each other. The number of objects in the video at a particular instant may also vary: objects may disappear and new objects may appear.
3. Multiple Human Videos without Occlusion - These videos contain multiple human beings, but they move in such a way that occlusion does not occur.
4. Multiple Objects Videos - Multiple classes of objects (moving and stationary) are present in these videos, for example humans, vehicles, and animals. The objects may occlude each other and may appear in any pose, shape, and size.
The human detection and tracking results for some representative videos are given in fig. 2 and fig. 4. Human detection and tracking using the proposed method starts automatically without any initialization parameters, unlike many existing object tracking techniques that require operator intervention [5,17]. Fig. 2 shows the human detection and tracking results for a realistic video of three human objects walking towards the camera. The video was shot in the evening in dim lighting conditions at a resolution of 640x480 and a frame rate of 17 frames per second, and consisted of 500 frames in total. Fig. 2 gives the detection and tracking results for 16 frames, from frame 1 to frame 375, at intervals of 25 frames. One can observe from fig. 2 that the proposed method detects and tracks all three human objects from one frame to another with accuracy.
Figure 2. Detection and tracking results of the proposed method on a real-time video in an outdoor environment (frames 1 to 375, at intervals of 25 frames).
Table I gives the ground truth and calculated values of the centroids of the bounding boxes around the three objects in the video results shown in fig. 2. One can observe from Table I that the maximum centroid distance between the ground truth value and the calculated value is nearly 14 pixels (13.89) for a frame resolution of 640x480, which is small relative to the frame size. Most of the time the centroid distance is either 0 or negligible, showing the algorithm's accuracy, which is also evident from the centroid distances plotted in fig. 3.
TABLE I. Calculated object centroids, ground truth centroids, and centroid distances (in pixels) for the three objects in the video of fig. 2.

Frame  Calculated centroids               Ground truth centroids             Centroid distance
No.    Object 1   Object 2   Object 3     Object 1   Object 2   Object 3     Obj 1   Obj 2   Obj 3
1      (473,307)  (368,297)  (201,283)    (473,307)  (368,295)  (213,290)    0.00    2.00    13.89
25     (471,304)  (367,297)  (200,282)    (461,307)  (364,295)  (212,289)    10.44   3.60    13.89
50     (479,316)  (373,306)  (214,317)    (479,316)  (373,306)  (214,317)    0.00    0.00    0.00
75     (489,310)  (370,314)  (209,308)    (478,305)  (372,319)  (210,307)    12.00   5.38    2.23
100    (481,310)  (381,316)  (218,309)    (479,311)  (381,316)  (218,309)    2.20    0.00    0.00
125    (490,322)  (388,306)  (215,310)    (494,320)  (384,306)  (215,310)    4.47    4.00    0.00
150    (486,340)  (384,312)  (217,300)    (484,340)  (385,311)  (237,309)    2.00    1.41    13.45
175    (489,314)  (372,313)  (232,304)    (487,311)  (375,314)  (231,309)    3.60    3.16    5.09
200    (489,341)  (376,315)  (227,320)    (488,344)  (375,314)  (224,321)    3.16    1.41    3.16
225    (477,312)  (388,334)  (217,303)    (473,310)  (387,335)  (215,304)    4.47    1.41    2.23
250    (478,313)  (390,337)  (220,300)    (478,313)  (390,337)  (220,300)    0.00    0.00    0.00
275    (471,304)  (390,337)  (220,300)    (468,304)  (390,337)  (220,300)    3.00    0.00    0.00
300    (473,307)  (368,334)  (201,283)    (475,308)  (368,334)  (200,281)    2.23    0.00    2.23
325    (470,302)  (368,334)  (220,300)    (468,300)  (368,334)  (219,298)    2.82    0.00    2.23
350    (470,302)  (368,334)  (220,300)    (468,296)  (368,334)  (220,300)    6.32    0.00    0.00
375    (466,298)  (367,332)  (220,300)    (462,290)  (366,331)  (220,300)    8.94    1.41    0.00

Figure 3. Distance between the object centroids estimated using the proposed method and the ground truth. The x-axis shows the video frame number and the y-axis shows the distance between centroids (in pixels) for the three objects of the fig. 2 video results.

Figure 4. The human detection and tracking results with the proposed method when the humans appear in different poses (frames 100, 125, 150, and 175).
Figure 4 shows the human detection and tracking results when the humans appear in different poses. There are four human beings in the video frame sequence: the first one is walking, the second is standing, the third is sitting, and the fourth is bending. The results show that the proposed algorithm can detect and track all four human objects efficiently. After applying the proposed method to a set of several videos, we found that the average detection and tracking accuracy of the proposed technique is 87.44% and the average execution speed is 18 frames per second.
IV. CONCLUSIONS
With the increasing number of crimes and terrorist activities, the need for viable visual surveillance systems is being felt worldwide. Object detection and tracking is the fundamental step in visual surveillance. In this paper we have proposed an automatic multiple human detection and tracking technique using Haar-like features and a simple particle filter. The human detector can easily be trained using Haar-like features extracted from training samples consisting of images of humans and of the background. The operation of the proposed technique is fully automatic and does not require any operator intervention, unlike other methods. Experimental results demonstrate that the technique can detect and track humans in outdoor environments under poor lighting conditions and in multiple poses. The average detection accuracy of 87.44% and the average processing speed of 18 frames per second make the technique suitable for real-time visual surveillance applications.
ACKNOWLEDGMENT
The authors are thankful to the Department of Science and Technology, New Delhi for providing a research grant vide grant no. SR/FTP/ETA-023/2009.
REFERENCES
[I] W. Hu and T. Tan, "A Survey on Visual Surveillance of Object Motion and Behaviors ", IEEE Trans. Systems, Man, and Cybernetics, Vol. 34, No. 3, pp. 334-352,2006.
[2] E.Stringa and C.S. Regazzoni, "Real-time Video Shot Detection for Scene Surveillance Application," IEEE Trans. Image Processing, Vol. 9, pp. 69-79, 2009.
[3] 1. Haritaoglu, D. Harwood, and L. Davis, "W4: Who? When? Where? What? a Real-time System for Detecting and Tracking People," In the Proc. IEEE Conference on Face Gesture Recognition, pp. 222-227, 1998.
[4] D. Koller, J. Weber, T. Huang, J. Malik, B. Rao G. Ogasawara, and S. Russell, 'Toward Robust Automatic Traffic Scene Analysis in Realtime," In the Proc. IEEE Conference on Pattern Recognition, Vol. 1, pp. 126- 13 1, 1994,.
[5] S. Nigam and A. Khare, "Object tracking based on curvelet transform ", accepted to publish in lET Computer Vision, 20 1 1.
[6] P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," In the Proc. of IEEE Int. Conference on Computer vision and Pattern Recognition pp. 5 1 1-5 18, 200 1.
[7] Y.Freund, and R. Schapire, " A decision theoretic generalization of online learning and an application to boosting," Journal of Computer and System Sciences 55, pp. 1 19-139, 2007
[8] P. Lanvin, J-C.Noyer, M. Benjelloun," Object Detection and Tracking using the Particle Filtering," In the Proc. of IEEE 371h Asilomar Conference on Signals, Systems and Computers, Vol. 2,pp. 1595-I599, 2003.
[9] J. Wang, Y. Ma, C. Li, H. Wang, J. Liu, "An Efficient Multi-Object Tracking Method using Multiple Particle Filters " In the Proc. of World Congress on Computer Science and Information Engineering, pp. 568-572,2009.
[ 10] Y. Chen, S. Yu, J. Fan, W. Chen, H. Li, "An Improved Color-based Particle Filter for Object Tracking," In the Proc. of IEEE Second Int. Conference on Genetic and Evolutionary Computing, pp. 360-363, 2008.
[ 1 1] F. Bardet, T. Chateau, D.Ramadasan, "Illumination Aware MCMC Particle Filter for Long-tenn Outdoor Multi-Object Simultaneous Tracking and Classification," In the Proc. of IEEE 12th Int. Conference on Computer Vision, pp. 1623-1630,2009.
[ 12] C. Luo, X. Cai, J. Zhang, "Robust Object Tracking using the Particle Filtering and Level Set Methods: A Comparative Experiment," In the Proc. of IEEE 10lh Workshop on Multimedia Signal Processing, pp. 359-364,2008.
[ 13] P. Brasnett, L. Mihaylova, N. Canagarajah, and D. Bull, "ParticleFiltering for Multiple Cues for Object Tracking in Video Sequences," In the Proc. of the 17th SPIE Annual Symposium on Electronic Imaging, Science and Technology, vol. 5685, pp. 430-440,2005
[ 14] M. Isard and A. Blake, "A Mixed-State Condensation Tracker with AutmaticModel-Switching," in Proc. Int. Conf Computer Vision, pp. I07- 1 12, 1998
[ 15] M. Isard and J. MacCormick, "Bramble: a Bayesian Multiple Blob Tracker," in Proc. Int. Conf. Computer Vision, pp. 34-41,200 1
[ 16] J. Czyz, B. Ristic, and B. Macq, "A Color-Based Particle Filter for Joint Detection and Tracking of Multiple Objects," In the Proc. of the ICASSP, pp. 2 17-220, 2005.
[ 17] N.T. Binh and A. Khare, "Object Tracking Of Video Sequence in Curvelet Domain ", Int. Journal of Image and Graphics, pp. 1-20 ,20 1 1