Hoip10 articulo counting people in crowded environments_univ_berlin

Counting People in Crowded Environments: An Overview

Michael Pätzold (1) / Rubén Heras Evangelio (1) / Thomas Sikora(1)

(1) Communication Systems Group, Technische Universität Berlin, Germany

ABSTRACT

Counting the number of persons in a crowded scene is of great interest in many applications. Most approaches proposed in the literature tackle the task of counting people in an indirect, statistical way. Recently, we presented a direct, counting-by-detection method based on fusing shape information obtained from an adapted Histogram of Oriented Gradients (HOG) algorithm with temporal information. The use of temporal information reduces false positives by considering the motion characteristics of different human body parts. Subsequent tracking and coherent motion detection of the human hypotheses further enhance the performance of the system. The performance obtained is comparable to state-of-the-art systems, and the method not only counts people but also provides valuable information for a tracking approach. In this paper we present an overview of relevant state-of-the-art methods for counting people in crowded environments, paying special attention to the method proposed by our group and showing results based on standard video sequences.

1. INTRODUCTION

The estimation of the number of people within a scene is an essential component for higher-level video analysis layers. A system that measures the density of a crowd is indispensable for security applications such as the prevention of overcrowding in public places. Furthermore, information gathered by such systems can serve as a data basis for economic applications such as optimizing public transport schedules, reducing waiting times in supermarkets or assessing the effectiveness of advertising. Since static video cameras are available at most public places, there is great interest in solutions for counting people in video. In most cases the cameras are already installed, so algorithms should handle a wide range of perspectives and varying lighting conditions.

A classic people tracking system extracts the foreground pixels of an image by subtracting the current image from a learned statistical background model and aggregates the resulting foreground pixels into objects by means of connected components [16]. The number of objects can then be obtained directly by counting them. Such systems are restricted to areas with slowly changing lighting conditions, and the object count can only be determined if the objects do not interact with or occlude each other.
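The classic pipeline described above can be illustrated with a minimal sketch. It is not the method of [16]: the learned statistical background model is replaced here by a fixed background image and a hypothetical intensity threshold, and all function names are chosen for illustration only.

```python
from collections import deque

def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose absolute difference to the background exceeds a threshold."""
    return [[abs(p - b) > threshold for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

def count_components(mask, min_area=1):
    """Count 4-connected foreground regions with at least min_area pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill of one connected region.
            area, queue = 0, deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                area += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                count += 1
    return count

# Toy example (assumed data): a flat background and two separate bright objects.
background = [[10] * 6 for _ in range(4)]
frame = [row[:] for row in background]
frame[0][0] = frame[1][0] = 200      # first object (2 pixels)
frame[2][4] = frame[2][5] = 200      # second object (2 pixels)
num_objects = count_components(foreground_mask(frame, background))
```

As soon as two objects touch or occlude each other, their foreground pixels merge into a single component and the count drops, which is exactly the restriction noted above.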

Due to these restrictions, algorithms specific to people counting have been developed, which can be divided into two groups. Assuming that it is impossible to isolate every entity of a crowd, the first group of algorithms extracts low-level information (e.g. foreground pixels or moving points) and uses it in a further step to estimate a value for the density of the crowd. The second group of algorithms searches for the objects of interest (e.g. persons) based on a model and directly counts the number of entities found. Object models can be built from different types of information, such as motion or shape. Furthermore, the confidence of a detection can be increased by imposing a temporal consistency constraint. This constraint requires the association of detected objects in consecutive frames of a video sequence. Data association can be a challenging task in crowded environments due to multiple detections and inter-object occlusion.

In Section 2 we elaborate on this grouping in detail and subdivide relevant state-of-the-art approaches according to their type of analysis. Section 3 describes our approach. Experimental results are shown in Section 4, and Section 5 concludes the paper.

2. PEOPLE COUNTING METHODS

2.1. LOW LEVEL CROWD ANALYSIS

Assuming that a crowd is dense enough so that individuals cannot be separated, low-level based methods infer the number of people in the scene by some kind of mapping from data acquired by low level computer vision techniques into an estimate of the crowd density.

Hou and Pang [10] extract foreground pixels of an image by subtracting the current image from a learned statistical background model and map the foreground pixels to the number of people in a scene by using a neural network previously trained for this particular scene. Paragios and Ramesh [12] propose a method that also extracts foreground pixels, but they handle the influence of perspective by explicitly weighting the pixels according to geometric information. Foreground areas computed by subtraction from a statistical background model can be distorted by uncontrolled lighting conditions in open environments. Albiol et al. [2] address this problem by counting only moving corner points and assume that the number of these points is linearly related to the number of people in the scene. Conte et al. [6] claim that the relation between detected points and the number of people in a scene is more complex than a direct linear mapping. Therefore, they propose the use of an epsilon support vector regressor for this task. Furthermore, they achieve a high stability of the tracked points by applying a scale-invariant point descriptor. For all the algorithms of this group it should be noted that the mapping is scene-dependent and thus has to be relearned for every particular scene. This implies the existence of representative training data for every camera setup.
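The scene-dependent mapping can be made concrete with a minimal sketch of the simplest case, a linear relation between the number of moving points and the number of people. The training pairs below are invented for illustration, and the ordinary least-squares fit with an intercept is an assumption of this sketch rather than the exact procedure of [2].

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical training data for ONE camera setup:
# number of moving corner points per frame and annotated people per frame.
points_per_frame = [40, 80, 120, 160]
people_per_frame = [5, 10, 15, 20]
slope, intercept = fit_line(points_per_frame, people_per_frame)

def estimate_count(num_points):
    """Map a low-level measurement to a people count using the learned line."""
    return slope * num_points + intercept
```

The fitted parameters are only meaningful for the camera they were learned on; a different viewpoint changes how many points a person produces, so the fit has to be repeated with new annotated data.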

2.2. FOREGROUND SEGMENTATION MODEL BASED ANALYSIS

Besides the methods based on low-level information, there exists a small number of approaches that tackle the significant challenge of finding the configuration of people within a crowd (including their number, position and articulation) using only the foreground masks of a video sequence [20, 9]. In [20] Zhao and Nevatia develop a person model based on human shapes observed from different perspectives, which defines the number of pixels occupied by a person given their position and articulation. A given configuration of people is evaluated by comparing the area it occupies with the area of the foreground mask. An appropriate solution for a given foreground mask is found by sampling the resulting high-dimensional space of configurations by means of Markov chain Monte Carlo methods. The efficiency of this method depends heavily on the number of samples to process. This number is decreased by incorporating additional information into the proposal probability of the Markov chain (e.g. proposing head positions based on a shape model).

2.3. MOTION MODEL BASED CROWD ANALYSIS

The distinctive motion of individual humans is used by several algorithms for crowd counting [15, 4, 14]. By modeling an entity as a region with coherent motion, it is possible to distinguish entities by analyzing the flow of characteristic points.

Rossi and Bozzoli [15] build trajectories by tracking characteristic points in areas with detected temporal changes and estimate the number of people in an image by agglomeratively clustering these points according to their motion. Since this method does not handle occlusions, they only applied it to ceiling-mounted cameras with a vertical viewing direction. Brostow and Cipolla [4] published a method that tracks characteristic features with the help of optical flow and uses an unsupervised, data-driven Bayesian clustering algorithm. This method achieves good counting results on overhead camera setups with a nearly vertical viewing direction. Applying it to camera setups with a lower tilt angle may give worse results, since from this viewing direction human limbs, whose motion is non-coherent, are more visible, which makes an accurate clustering difficult. Furthermore, a lower tilt angle leads to inter-object occlusions, which complicate the analysis of motion.

2.4. SHAPE MODEL BASED ANALYSIS

It is not always possible to separate people using motion information alone, for instance if they are walking in unison. The shape and appearance of an object can also be analyzed to count objects of interest (e.g. humans) and distinguish them from objects belonging to another class (e.g. bicycles or cars). In the case of humans it is reasonable to prefer models based on shape to models containing color information. In [7] Dalal and Triggs published a method which evaluates gradient information of still images by means of a machine learning algorithm. This method is able to detect humans in still images under a wide range of perspectives with good reliability, but does not provide any means for handling partial occlusions, which are a common issue when detecting persons in crowded environments. Wu and Nevatia [18] tackled this problem by designing human part detectors and combining the detected parts of a human in a joint likelihood model.

2.5. MULTI TARGET TRACKING IN CROWDED SCENES

Temporal association of the targets found in one image with those in the next image can provide information about the tracks of people and, furthermore, the temporal consistency gained improves the confidence in the presence of a person. In very sparse crowds or in camera setups with an ideal viewing direction, the outputs of human detectors can be associated without ambiguity. If these conditions are not met, objects may be occluded by static scene items or temporarily occlude each other. In this case straightforward association of detections by basic techniques is hampered, and sophisticated data association methods and appearance models are required. While our algorithm uses a basic association technique between two time steps, the tracking and thus the counting performance could be increased by using one of the following methods.

Breitenstein et al. propose in [3] a method that accounts for the uncertainty of association by applying a multi-modal particle filter for every target. Furthermore, the association is improved by learning an appearance model for every target by online boosting. This system is able to keep track of multiple persons even under full occlusion. However, initializing a tracker requires that the target has previously been detected with high confidence over multiple frames. This initialization problem can be avoided by tracking based on global data association, which analyzes the full video sequence at once and is hence able to associate the detections globally.

As the search space of global data association methods is combinatorial, a naive search is inappropriate because of its exponential computational complexity. Zhang et al. [19] propose an approach that finds an optimal solution efficiently by formulating the problem as a cost flow network and integrating an explicit occlusion model. To associate smaller track fragments into final trajectories, a suitable affinity measure is required. While these measures and their parameters are chosen manually in most cases, Li et al. develop in [11] a learning-based algorithm that tracks people in crowded scenes and provides efficiency by hierarchically assembling track fragments over multiple stages and automatically choosing appropriate affinity measures for each stage.

3. OUTLINE OF OUR APPROACH

Recently, we published a model-based algorithm for counting persons in crowded scenarios [13]. This method is described in this section. Figure 1 depicts how the various modules and their interaction provide a count of the individuals observed by a single stationary camera. We trained a detector to find the upper body region of a human (1). Since the human head and torso contain only marginal shape information, a detector based only on this cue would, at an acceptable detection rate, create a vast number of false positives. To avoid these false positives we combine the shape model with a uniform motion model (3), generated from optical flow information between consecutive frames (2), which leads to a combined probability map of uniform motion and characteristic shape. By seeking the modes of this probability map (4), discrete detections are obtained. These detections are associated into trajectories by using motion information (5). In parallel, false positives are rejected by enforcing temporal consistency of the detections. Finally, we apply an algorithm which checks trajectories for coherent motion indicating that they belong to the same human body; by keeping only one of the matches, the false detection rate is reduced again (6). The number of people in the scene per frame is obtained by counting the remaining detections.

3.1. SHAPE MODEL

Due to the high intra-class variation of humans with respect to color and texture, most recent detectors use gradient information as feature descriptors to find human-like shapes in images. Dalal and Triggs [7] developed the Histogram of Oriented Gradients method, which is robust to the variable appearance of humans. This robustness is achieved by collecting the gradients within small regions (cells) of an image and representing them as histograms of their orientation. After normalization and concatenation of the histograms of adjacent regions, they obtain a descriptor which is classified by a support vector machine. Dalal and Triggs show results with good detection rates for humans in different articulations and from different perspectives. However, since the detector is learnt on pictures containing the complete human body, it has difficulty detecting partially occluded persons, which are common in a dense crowd scenario.

Figure 1: Overview of system modules and their interaction

In our approach, we overcome these occlusion shortcomings by learning only the head-shoulder region of a human. To this end, the cell dimension is changed and a training database of head-shoulder regions of humans is established. The detector is applied to all areas of the image, and a probability map is built from the resulting confidence values. Using a linear SVM kernel for classifying the samples, the probability map is written as

$$P(\mathbf{x} \mid \text{shape}) = \max\left(\mathbf{w} \cdot \mathbf{d}_{\mathbf{x}} + b,\; 0\right),$$

where $\mathbf{d}_{\mathbf{x}}$ is the HOG descriptor at location $\mathbf{x}$, $\mathbf{w}$ is the trained decision hyperplane and $b$ the trained bias. Figure 2(b) depicts the main active blocks of the descriptor, and Figure 2(c) reveals that mainly the horizontal gradients on the head decide whether a sample contains a head or not. Figure 3(b) shows the probability map computed for a sample input frame. Compared to the HOG detector proposed by Dalal and Triggs, this detector has to make a decision based on less information and thus is not able to classify every test sample reliably. It detects all structures with a so-called omega shape, mainly on heads, but also on feet and background structures. Hence, the shape model has to be combined with other information cues to reduce the false positive rate.
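The shape probability described above, a linear SVM score clipped at zero and evaluated at every detector position, can be sketched as follows. The descriptors here are tiny invented vectors; computing real HOG descriptors is outside the scope of this sketch, and the function and variable names are illustrative assumptions.

```python
def shape_probability(descriptor, weights, bias):
    """Linear SVM score of one descriptor, clipped at zero."""
    score = sum(w * d for w, d in zip(weights, descriptor)) + bias
    return max(score, 0.0)

def shape_probability_map(descriptors, weights, bias):
    """Evaluate the detector at every window position of a 2-D grid of descriptors."""
    return [[shape_probability(d, weights, bias) for d in row] for row in descriptors]

# Hypothetical 3-dimensional descriptors; real HOG descriptors are much longer.
weights, bias = [1.0, -0.5, 2.0], -1.0
descriptors = [[[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]]
prob_map = shape_probability_map(descriptors, weights, bias)
```

Clipping at zero discards everything on the negative side of the hyperplane, so only positions that the classifier considers head-like contribute to later fusion steps.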

3.2. UNIFORM MOTION MODEL

Motion information is the second cue of the framework to detect individual humans in video sequences. This kind of motion information is obtained from a dense optical flow field and is used to identify potential image areas containing a head. By combining this information with the confidence gained from the shape model described in Section 3.1, the detection rate is increased. Figure 3(c) shows the optical flow for two consecutive frames. The direction of motion is color-coded by using different colors for the direction of flow and different brightness values for its magnitude. The significant green spots are mainly related to the upper body of people, while the small regions with varying hue-values correspond to the limbs of the bodies. This observation leads us to assume that a human upper body (torso and head) moves uniformly, while we can observe a non-uniform motion in limb regions. Furthermore, the non-uniformity of flow between people can reveal the borders between them in some cases, as shown in the picture by slightly different green values. By measuring the uniformity of motion inside an image region we can reason about the likelihood of that region to contain a human body.

A hypothesized human body region centered at $\mathbf{x}$ is defined by a binary mask $M_{\mathbf{x}}$ with a head-shoulder shape. By sliding this mask over the original image we define, for every pixel, an area in which we hypothesize a human upper body. Since the dense optical flow field provides a motion vector $\mathbf{v}_i = (u_i, v_i)$ for every pixel location $\mathbf{x}_i = (x_i, y_i)$, we can compute the mean motion vector

$$\bar{\mathbf{v}}_{\mathbf{x}} = \frac{1}{n} \sum_{i \in M_{\mathbf{x}}} \mathbf{v}_i$$

inside the mask $M_{\mathbf{x}}$ located at $\mathbf{x}$, where $n$ is the number of pixels of the mask region. Now we can measure the probability that the region surrounding a pixel contains uniform motion by computing the average endpoint error of every individual vector $\mathbf{v}_i$ with respect to the mean vector $\bar{\mathbf{v}}_{\mathbf{x}}$:

$$P(\mathbf{x} \mid \text{uniform}) = \left( \frac{1}{n} \sum_{i \in M_{\mathbf{x}}} \left\lVert \bar{\mathbf{v}}_{\mathbf{x}} - \mathbf{v}_i \right\rVert \right)^{-1}.$$

By sliding the binary mask over the whole image we build the probability map seen in Figure 3(d). As shown, the limb regions of humans correspond to areas with a low probability of being a head candidate.
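A minimal sketch of the uniformity measure for a single mask position follows: the mean flow vector inside the mask is computed, and the score is the inverse of the average endpoint error to that mean. The square mask, the small constant `eps` that avoids a division by zero, and the toy flow field are assumptions of this sketch; the mask is assumed to lie completely inside the image.

```python
import math

def uniform_motion_score(flow, mask_offsets, cx, cy, eps=1e-6):
    """Inverse of the average endpoint error to the mean flow inside the mask.

    flow[y][x] is a (u, v) motion vector; mask_offsets lists the (dx, dy)
    pixels of the mask relative to its centre (cx, cy).
    """
    vectors = [flow[cy + dy][cx + dx] for dx, dy in mask_offsets]
    n = len(vectors)
    mean_u = sum(u for u, _ in vectors) / n
    mean_v = sum(v for _, v in vectors) / n
    avg_error = sum(math.hypot(u - mean_u, v - mean_v) for u, v in vectors) / n
    # eps is an added assumption: it keeps the score finite for perfectly uniform flow.
    return 1.0 / (avg_error + eps)

# Hypothetical 3x3 square mask instead of a head-shoulder shaped mask.
square_mask = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

# Left 3x3 block: uniform motion (torso-like). Right block: alternating directions (limb-like).
flow = [[(1.0, 0.0) if x < 3 else ((2.0, 0.0) if (x + y) % 2 == 0 else (-2.0, 0.0))
         for x in range(6)] for y in range(3)]
torso_score = uniform_motion_score(flow, square_mask, cx=1, cy=1)
limb_score = uniform_motion_score(flow, square_mask, cx=4, cy=1)
```

The uniformly moving block receives a score several orders of magnitude larger than the block with alternating directions, which mirrors the contrast between upper body and limb regions.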

Figure 2: Upper body HOG detector: (a) An upper body image sample of the training database. (b) The main active blocks of our trained HOG detector are located at the head outline. (c) A sample HOG descriptor weighted by the positive SVM weights.

3.3. INFORMATION FUSION AND MODE SEEKING

In the previous sections, shape and motion information was used to compute probability maps. Considering each cue individually, it is not possible to detect heads reliably, but by fusing the knowledge of both domains we are able to detect most heads. We simply merge the probability maps by a weighted linear combination

$$P(\mathbf{x} \mid \text{fused}) = \alpha \cdot P(\mathbf{x} \mid \text{uniform}) + P(\mathbf{x} \mid \text{shape}),$$

where $\alpha$ is set empirically.

The final step is to detect heads from the fused probability map $P(\mathbf{x} \mid \text{fused})$. We search for detections as maxima in the probability map by mean-shift mode estimation as described in [5]. The mean-shift procedure provides local smoothing of the detections and clusters overlapping detections. Each cluster corresponds to a particular head detection with a confidence which is obtained by accumulating the probabilities within the mean-shift kernel support. A trade-off between miss rate and false positive detections is made by empirically choosing an adequate threshold $T_{\text{detection}}$ for the detection confidence.
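Fusion and mode seeking can be sketched together as follows. This is a simplified illustration, not the procedure of [5]: it assumes a flat circular kernel, starts a mean-shift run from every non-zero pixel, and merges modes that end up closer than the kernel radius. The weight `alpha`, the radius and the threshold are hypothetical values.

```python
import math

def fuse(p_uniform, p_shape, alpha):
    """Weighted linear combination of two equally sized probability maps."""
    return [[alpha * u + s for u, s in zip(urow, srow)]
            for urow, srow in zip(p_uniform, p_shape)]

def mean_shift_mode(prob, start, radius=1.5, max_iter=50, tol=1e-3):
    """Shift a point to the probability-weighted mean of its neighbourhood
    (flat kernel) until it stops moving; return the mode and its support mass."""
    x, y = start
    mass = 0.0
    for _ in range(max_iter):
        mass = sx = sy = 0.0
        for py, row in enumerate(prob):
            for px, p in enumerate(row):
                if math.hypot(px - x, py - y) <= radius:
                    mass += p
                    sx += p * px
                    sy += p * py
        if mass == 0.0:
            break
        nx, ny = sx / mass, sy / mass
        moved = math.hypot(nx - x, ny - y)
        x, y = nx, ny
        if moved < tol:
            break
    return (x, y), mass

def detect_heads(prob, threshold, radius=1.5):
    """Run mean shift from every non-zero pixel, merge coinciding modes and
    keep those whose accumulated confidence reaches the threshold."""
    modes = []
    for sy, row in enumerate(prob):
        for sx, p in enumerate(row):
            if p <= 0.0:
                continue
            (mx, my), conf = mean_shift_mode(prob, (float(sx), float(sy)), radius)
            if conf < threshold:
                continue
            if all(math.hypot(mx - ox, my - oy) > radius for (ox, oy), _ in modes):
                modes.append(((mx, my), conf))
    return modes

def blob(width, height, cx, cy, peak):
    """Toy map with one plus-shaped blob (assumed test data)."""
    grid = [[0.0] * width for _ in range(height)]
    grid[cy][cx] = peak
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        grid[cy + dy][cx + dx] = peak / 2
    return grid

a, b = blob(9, 5, 2, 2, 1.0), blob(9, 5, 6, 2, 0.8)
p_uniform = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]
p_shape = [row[:] for row in p_uniform]
p_shape[0][8] = 0.3                      # weak shape-only response, e.g. background clutter
heads = detect_heads(fuse(p_uniform, p_shape, alpha=1.0), threshold=1.0)
```

In this toy example the isolated shape-only response accumulates too little confidence and is rejected by the threshold, while the two blobs supported by both cues survive as separate detections.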

3.4. VALIDATION BY TRACKING

So far only the information of two subsequent frames has been used. As described in Section 2.5, incorporating tracking information helps to enhance the reliability of an object detector by enforcing coherent detections over several frames. First we propagate the detections into the subsequent frame by displacing them by their corresponding mean motion vector $\bar{\mathbf{v}}_{\mathbf{x}}$, and afterwards we associate the propagated detections with the current ones according to their minimal distance in a greedy manner. If no propagated detection is associated with a current one, a new trajectory is spawned. Every trajectory has a confidence measure which is calculated from the detection confidences and the distance of the associated detections. The measure increases with multiple successful associations and decreases in the case of missing detections. By applying one threshold on this confidence value for labeling a track as a valid person track and another threshold for deleting an unreliable track, we enhance the robustness of the system.
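The validation step can be sketched as a greedy nearest-neighbour association with a per-track confidence. The fixed confidence increments, the gating distance and the two thresholds are assumptions made for this sketch; the confidence measure described above additionally depends on detection confidences and association distances.

```python
import math

def propagate(detection):
    """Displace a detection (x, y) by its mean motion vector (u, v)."""
    (x, y), (u, v) = detection["pos"], detection["motion"]
    return (x + u, y + v)

def greedy_associate(tracks, detections, max_dist=10.0):
    """Match propagated track positions to current detections, closest pair first."""
    pairs = sorted(
        (math.dist(propagate(t["last"]), d["pos"]), ti, di)
        for ti, t in enumerate(tracks)
        for di, d in enumerate(detections)
    )
    used_tracks, used_dets, matches = set(), set(), []
    for dist, ti, di in pairs:
        if dist > max_dist:
            break                        # all remaining pairs are even farther apart
        if ti in used_tracks or di in used_dets:
            continue
        used_tracks.add(ti)
        used_dets.add(di)
        matches.append((ti, di))
    return matches, used_tracks, used_dets

def update_tracks(tracks, detections, gain=1.0, loss=1.0, valid_at=3.0, delete_at=0.0):
    """One time step: associate, update confidences, spawn and delete tracks."""
    matches, used_tracks, used_dets = greedy_associate(tracks, detections)
    for ti, di in matches:
        tracks[ti]["last"] = detections[di]
        tracks[ti]["conf"] += gain       # successful association
    for ti, t in enumerate(tracks):
        if ti not in used_tracks:
            t["conf"] -= loss            # missing detection
    for di, d in enumerate(detections):
        if di not in used_dets:
            tracks.append({"last": d, "conf": gain, "valid": False})
    for t in tracks:
        t["valid"] = t["valid"] or t["conf"] >= valid_at
    tracks[:] = [t for t in tracks if t["conf"] > delete_at]
    return tracks

# Toy sequence (assumed data): one person moving right, one spurious detection in frame 0.
frames = [
    [{"pos": (0.0, 0.0), "motion": (2.0, 0.0)}, {"pos": (50.0, 50.0), "motion": (0.0, 0.0)}],
    [{"pos": (2.0, 0.0), "motion": (2.0, 0.0)}],
    [{"pos": (4.0, 0.0), "motion": (2.0, 0.0)}],
]
tracks = []
for detections in frames:
    update_tracks(tracks, detections)
```

The spurious detection spawns a track that is never confirmed and is deleted after one missed frame, whereas the consistently detected person is labelled valid after three frames.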

3.5. LINKING DETECTIONS USING COHERENT MOTION DETECTION

With the detector described so far, the system detects regions with characteristic shape and uniform motion. Sometimes the detector is distracted by objects such as carried luggage or clothes with a head-shape-like texture, as shown in Figure 4. In these cases multiple trajectories are established on one pedestrian. Following these trajectories over time, we assume that they lie on a rigid body if they stay at a constant distance from each other at every time step. This assumption is based on the observation that different persons change their distance to each other, even if they walk in groups, because of their individual walking cycles. Following an idea published in [4], we assume that coherent motion is more likely if the variance of the distance between two trajectories is small. If we find multiple trajectories with coherent motion characteristics, we keep only the one with maximum height in the image, since we expect the head of the person at this position.
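The linking criterion can be sketched as follows: two trajectories are grouped if the variance of their mutual distance over time is small, and per group only the trajectory highest in the image is kept. The variance threshold, the trajectories and the convention that the image y-axis points downwards (so the highest point has the smallest y) are assumptions of this sketch.

```python
import math
from statistics import pvariance

def distance_variance(track_a, track_b):
    """Variance of the point-to-point distance between two trajectories
    sampled at the same time steps."""
    return pvariance([math.dist(a, b) for a, b in zip(track_a, track_b)])

def link_coherent(tracks, max_variance=1.0):
    """Group trajectories with nearly constant mutual distance and return the
    index of the highest trajectory (smallest mean y) of every group."""
    parent = list(range(len(tracks)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tracks)):
        for j in range(i + 1, len(tracks)):
            if distance_variance(tracks[i], tracks[j]) <= max_variance:
                parent[find(j)] = find(i)

    best = {}
    for i, track in enumerate(tracks):
        root = find(i)
        mean_y = sum(y for _, y in track) / len(track)
        if root not in best or mean_y < best[root][0]:
            best[root] = (mean_y, i)
    return sorted(i for _, i in best.values())

# Toy trajectories (assumed data), given as (x, y) per time step.
head = [(10, 20), (12, 20), (14, 21), (16, 20)]
bag = [(13, 50), (15, 50), (17, 51), (19, 50)]       # same person, constant offset to head
other = [(40, 22), (39, 22), (37, 23), (34, 22)]     # second person, approaching
kept = link_coherent([head, bag, other])
```

Here the bag trajectory is linked to the head trajectory and discarded, while the second person, whose distance to the first changes over time, remains a separate detection.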

Figure 3: Probability maps: A scene with high crowd density shows the concept of information fusion to find head candidates: (a) Input image. (b) Because only the gradient information of one frame is used, there are many regions with a high probability of being heads, $P(\mathbf{x} \mid \text{shape})$. (c) The dense optical flow field is computed between two consecutive frames. (d) Using motion information creates another probability map, $P(\mathbf{x} \mid \text{uniform})$. (e) After fusing both sources of information, the modes of the resulting map $P(\mathbf{x} \mid \text{fused})$ can be distinguished easily and represent the head detections.

4. EXPERIMENTAL RESULTS

The system was evaluated on the sequence S1-L1-Time-13-57, view 1, from the PETS2009 database. The ground truth person count was generated by annotating every person in every frame, including occluded ones. For the training of the HOG detector we cropped 158 heads from the INRIA database [1]. The dense optical flow field was computed with the publicly available implementations of Marzat et al. [11] and Werlberger et al. [17]. The number of people was estimated in a buffered way: if the system labels a trajectory as a valid person, we increase the person count back in time to the frame where the detection was first seen.
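The buffered counting rule can be sketched in a few lines. The track fields `first_frame`, `last_frame` and `valid` are hypothetical names chosen for this illustration, and the tracks are invented data.

```python
def buffered_counts(num_frames, tracks):
    """Per-frame person count: a track contributes to every frame from its first
    detection to its last one, but only if it was eventually labelled valid."""
    counts = [0] * num_frames
    for track in tracks:
        if not track["valid"]:
            continue
        for frame in range(track["first_frame"], track["last_frame"] + 1):
            counts[frame] += 1
    return counts

# Assumed example: two valid person tracks and one track that was never confirmed.
tracks = [
    {"first_frame": 2, "last_frame": 5, "valid": True},
    {"first_frame": 0, "last_frame": 1, "valid": False},
    {"first_frame": 4, "last_frame": 6, "valid": True},
]
counts = buffered_counts(7, tracks)
```

Because a track only becomes valid after several frames, counting it from its first detection avoids a systematic undercount at the moment a person enters the scene, at the price of a delayed output.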

Figure 6 shows the counting results. Regardless of the used optical flow method the system counts people accurately in crowds with different densities. The experiments showed that both optical flow methods have advantages in different modules of the system. The flow obtained by [11] is more suited for detecting uniform motion, while the implementation of [17] allows more accurate tracking results due to smoothing the flow field by a diffusion process.

We submitted the results of the overall system to the counting task of the PETS2009-workshop. This workshop evaluates the algorithms of all participants on the same dataset and thus provides the possibility to compare different approaches. The results indicate that the performance of our system is comparable to state-of-the-art methods [8].

5. CONCLUSION

In this paper we gave an overview of relevant state-of-the-art methods for counting people in crowded environments and described a counting-by-detection method developed by our group, which is based on a model that considers the characteristic shape and motion of a person and is enhanced by means of tracking information. The resulting trajectories of potential detections are analyzed with regard to coherent motion, and thus the number of false positives is decreased. In future work we will apply a more sophisticated data association method in order to tackle the challenge of tracking individuals in crowds.

Figure 6: The estimated person count of the system generated by using different optical flow methods.

Figure 4: Linking detections by coherent motion detection: (a) False positive detections are caused by characteristic shape and motion. (b) Due to their coherent motion, trajectories on the same object are grouped and only one detection is kept. (Linked trajectories are marked with red lines.)

6. REFERENCES

[1] Inria dataset. Available online at http://lear.inrialpes.fr/data.

[2] Antonio Albiol, Maria J. Silla, Alberto Albiol, and Jose Manuel Mossi. Video analysis using corners motion analysis. In Proc. International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2009), pages 31–38, Miami, FL, USA, June 2009.

[3] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In Proc. IEEE 12th Int Computer Vision Conf, pages 1515–1522, 2009.

[4] G.J. Brostow and R. Cipolla. Unsupervised bayesian detection of independent motion in crowds. In Proc. IEEE Computer Society Conference on Computer Vision & Pattern Recognition (CVPR2006), pages I: 594–601, 2006.

[5] Dorin Comaniciu and Peter Meer. Distribution free decomposition of multivariate data. Pattern Analysis and Applications, 2:22–30, 1998.

[6] Donatello Conte, Pasquale Foggia, Gennaro Percannella, Francesco Tufano, and Mario Vento. A method for counting moving people in video surveillance videos. EURASIP Journal on Advances in Signal Processing, page 10 pages, 2010.

[7] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proc. International Conference on Computer Vision & Pattern Recognition (CVPR2005), volume 2, pages 886–893, INRIA Rhône-Alpes, ZIRST-655, av. de l’Europe, Montbonnot-38334, June 2005.

[8] A. Ellis and J. Ferryman. Pets2010 and pets2009 evaluation of results using individual ground truthed single views. Advanced Video and Signal Based Surveillance, IEEE Conference on, 0:135–142, 2010.

[9] Weina Ge and Robert T. Collins. Marked point processes for crowd counting. In Proc. IEEE Computer Society Conference on Computer Vision & Pattern Recognition (CVPR2009), pages 2913–2920, 2009.

[10] Ya li Hou and G. K. H. Pang. Automated people counting at a mass site. In Proc. IEEE Int. Conf. Automation and Logistics (ICAL 2008), pages 464–469, 2008.

[11] J. Marzat, Y. Dumortier, and A. Ducrot. Real-time dense and accurate parallel optical flow using cuda. In Proceedings of the 17th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), pages 105–111, 2009.

[12] N. Paragios and V. Ramesh. A mrf-based approach for real-time subway monitoring. In Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR 2001), volume 1, 2001.

[13] Michael Pätzold, Rubén Heras Evangelio, and Thomas Sikora. Counting people in crowded environments by fusion of shape and motion information. In IEEE Computer Society, editor, Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, PETS 2010 Workshop, pages 157–164, Boston, USA, August 2010. IEEE.

[14] Vincent Rabaud and Serge Belongie. Counting crowded moving objects. Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR 2006), 1:705–711, 2006.

[15] M. Rossi and A. Bozzoli. Tracking and counting moving people. In Proceedings of the IEEE International Conference on Image Processing (ICIP 1994), volume 3, pages 212–216, 1994.

[16] Chris Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR1999), pages 246–252, 1999.

[17] Manuel Werlberger, Werner Trobin, Thomas Pock, Andreas Wedel, Daniel Cremers, and Horst Bischof. Anisotropic huber-l1 optical flow. In British Machine Vision Conference, 2009.

[18] Bo Wu and Ram Nevatia. Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors. Int. J. Comput. Vision, 75(2):247–266, 2007.

[19] Li Zhang, Yuan Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 2008), pages 1–8, 2008.

[20] Tao Zhao and Ram Nevatia. Bayesian human segmentation in crowded situations. Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 2003), 2:459, 2003.
