
A Simple and Fast Surveillance System for Human Tracking and Behavior Analysis

Chen-Chiung Hsieh and Shu-Shuo Hsu Department of Computer Science and Engineering, Tatung University, Taipei, Taiwan, R.O.C.

[email protected]

Abstract

In this paper, we designed a simple and fast visual surveillance system to track human position and to determine whether any abnormal behavior, such as wall climbing or falling, has occurred. By taking both time difference and background difference into consideration, illumination effects can be greatly reduced while calculating motion masks. Refinements including hole filling, shadow removal, and noise reduction are performed to obtain much more reliable motion masks. Motion masks corresponding to occluded moving people, i.e., those wider than a given width, are segmented recursively into smaller ones by bi-modal thresholding. Meanwhile, the background can also be updated using the refined motion masks. Integrated location-based and weighted block-based matching is performed for object tracking, and a similarity measure is defined from the weighted matched blocks for object classification. Finally, a couple of criteria are defined to analyze whether objects stop, disappear, climb, or fall. Experimental results are given to demonstrate the robustness of our system.

KEY WORDS
Video Surveillance; Motion Detection; Object Tracking; Behavior Analysis

1. Introduction

More and more people pay attention to visual surveillance systems for the purpose of security. Many surveillance systems have been deployed in our surrounding environments, such as airports, train stations, shopping malls, and even private residential areas. Motion detection and object tracking are the most significant tasks in a video surveillance system, and many motion detection and object tracking schemes have been proposed. W4 [1], a real-time surveillance system, detects and locates people through a combination of shape analysis and tracking. A video monitoring system designed by Kim and Kim [2] utilizes a region-based motion segmentation method to extract each moving object. Sakbot, proposed by Cucchiara et al. [3], adopts statistics and knowledge of segmented objects to improve background modeling and moving object detection. Moreover, Anderson et al. [4] presented a fall detection system based on silhouette analysis. A fence climbing detection system [5] was also proposed to handle climbing situations by decoding the state sequence of a block-based HMM.

There are two typical approaches for motion detection: background subtraction and temporal differencing [6,7]. Background subtraction relies on a robust background model, while temporal differencing operates on two consecutive frames. Background subtraction can extract complete motion masks, but it usually takes much time to maintain the background model. On the contrary, the drawback of temporal differencing is incomplete motion masks. We integrate these two methods for moving-region detection. Source frame referencing is utilized to fill the holes. For each motion mask, vertical projection analysis is applied to segment each moving object. A fast object tracking method based on location estimation and weighted block-based similarity measurement is proposed to track all the moving objects. Finally, the segmented motion mask corresponding to each moving object is analyzed by size, location, and horizontal projection to classify its behavior, such as stopping, disappearing, climbing, or falling. The overall system architecture is shown in Fig. 1.

Section 2 focuses on extraction and refinement of the motion masks. Object extraction by recursive bi-modal thresholding of the vertical projection is discussed in Section 3. Object tracking by location estimation and weighted block matching is described in Section 4. A couple of criteria are defined in Section 5 for behavior analysis. Experiments are then given in Section 6 to demonstrate the feasibility and robustness of the proposed system. Finally, conclusions are made in the last section.

Third International IEEE Conference on Signal-Image Technologies and Internet-Based System
978-0-7695-3122-9/08 $25.00 © 2008 IEEE. DOI 10.1109/SITIS.2007.128

Figure 1. System block diagram.

2. Motion Mask Refinement

Before tracking objects, possible moving regions are extracted by the following process: frame differencing, hole filling, shadow removal, connected component labeling, and noise removal. Raw motion masks, as shown in Fig. 2(c), are first produced by intersecting the time difference and the background difference [2]. However, vacant areas often appear in the motion masks due to uniform regions within objects.

2.1. Holes Filling

This problem is frequently encountered, especially when people are dressed in a uniform color. Holes exist in motion masks because some motion pixels are misjudged as non-motion ones. In this paper, source frame referencing is utilized to fill these holes. Non-motion pixels (holes) adjacent to explicit motion pixels are re-classified if they have nearly the same intensity as the explicit ones. The algorithm is stated as follows:

Input: Raw motion mask P
Output: Refined motion mask

Step 1. For each motion pixel P(x, y) in the motion mask, check all the adjacent pixels of P(x, y), denoted Padj(i, j). Here, the eight-connected pixels are used.

Step 2. Padj(i, j) is set as a motion pixel if Padj(i, j) is a non-motion pixel and |Padj(i, j) - P(x, y)| is less than a specified threshold.

Step 3. Repeat Step 2 until all Padj(i, j) are visited and re-classified.

Step 4. Repeat Step 1 until no new motion pixels are found.

The hole-filling procedure stops when no new motion pixels are added to the motion masks. The experimental result is shown in Fig. 2(d). Most holes are successfully filled; some regions are not recovered because their sizes are too small.
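As a rough illustration, the hole-filling steps above can be sketched as a breadth-first growth over the mask. The function name, array layout, and the intensity threshold are assumptions for this sketch, not the authors' implementation:

```python
from collections import deque
import numpy as np

def fill_holes(mask, frame, threshold=10):
    """Grow the motion mask into adjacent non-motion pixels whose
    source-frame intensity is close to a neighboring motion pixel
    (Steps 1-4 above), until no new motion pixels are found."""
    h, w = mask.shape
    out = mask.copy()
    # Seed the queue with all current motion pixels.
    queue = deque(zip(*np.nonzero(out)))
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]   # 8-connected
    while queue:
        y, x = queue.popleft()
        for dy, dx in neighbors:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not out[ny, nx]:
                # Re-classify the hole pixel if its intensity matches
                # the adjacent explicit motion pixel.
                if abs(int(frame[ny, nx]) - int(frame[y, x])) < threshold:
                    out[ny, nx] = 1
                    queue.append((ny, nx))
    return out
```

Note that the growth relies on the background differing in intensity from the object; a uniform scene would be flooded entirely, which is why the threshold must be kept small.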

(a) Frame at time t-1. (b) Frame at time t.

(c) Raw motion mask. (d) Refined motion mask.

Figure 2. Motion masks refinement.

2.2. Shadow Removal

Shadows appear where light is cut off by objects. They are frequently encountered, especially in outdoor environments, and make it difficult to extract exact motion masks. Here, we adopt the shadow detection algorithm proposed by Cucchiara et al. [8]. Because the Red-Green-Blue (RGB) color space is sensitive to brightness changes, the Hue-Saturation-Value (HSV) color space is used instead: separating luminance (the V component) from chrominance (the H and S components) makes brightness changes easier to detect.

Assume the HSV components of each pixel in the current frame are PH(x, y), PS(x, y), and PV(x, y), respectively, and those of the background model are BH(x, y), BS(x, y), and BV(x, y). The shadow mask SD is defined as follows:

[Figure 1 components: Time/Background Frame Difference → Motion Mask Refinement → Motion Object Extraction → Object Tracking → Object Behavior Analysis → Output Abnormal Event Messages]


\[
SD(x,y) = \begin{cases}
1, & \text{if } \alpha \le \dfrac{P_V(x,y)}{B_V(x,y)} \le \beta \ \text{and}\ |P_S(x,y)-B_S(x,y)| \le T_S \ \text{and}\ |P_H(x,y)-B_H(x,y)| \le T_H \\
0, & \text{otherwise}
\end{cases} \tag{1}
\]

The value of α depends on the strength of the light source, while β is always less than one, providing the flexibility to tolerate small changes in the background. According to Eq. (1), pixels with SD = 1 are excluded from the motion masks. Fig. 3 demonstrates that the algorithm is effective at removing shadows.
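A minimal per-pixel sketch of Eq. (1), assuming normalized HSV arrays; the values of α, β, T_S, and T_H below are illustrative, since the paper does not fix them:

```python
import numpy as np

def shadow_mask(P_h, P_s, P_v, B_h, B_s, B_v,
                alpha=0.4, beta=0.9, T_s=0.3, T_h=0.2):
    """Per-pixel shadow detection (Eq. 1): a pixel is shadow when its
    luminance drops by a bounded ratio while its chrominance stays
    close to the background model."""
    ratio = P_v / np.maximum(B_v, 1e-6)           # avoid division by zero
    lum_ok = (alpha <= ratio) & (ratio <= beta)   # darker, but not too dark
    sat_ok = np.abs(P_s - B_s) <= T_s             # similar saturation
    hue_ok = np.abs(P_h - B_h) <= T_h             # similar hue
    return (lum_ok & sat_ok & hue_ok).astype(np.uint8)
```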


Figure 3. Motion masks refinement. (a) Original frame. (b) Without shadow removal. (c) With shadow removal.

2.3. Noise Removal

Frame subtraction produces motion masks as well as noise caused by illumination changes. The noise should be removed in order to obtain more accurate motion masks. Morphological opening and closing operations, composed of erosions and dilations, are performed to remove noise. However, morphological operations are not guaranteed to remove all of it. To be more robust, a size filter is additionally applied to remove all small noisy regions.
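One possible realization of this refinement step, using hand-rolled 3×3 morphology and a BFS-based size filter; the function names and the min_size value are assumptions of this sketch:

```python
from collections import deque
import numpy as np

def erode(m):
    """3x3 erosion: a pixel survives only if its whole neighborhood is set."""
    p = np.pad(m, 1)
    out = np.ones_like(m)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= p[1 + dy:1 + dy + m.shape[0], 1 + dx:1 + dx + m.shape[1]]
    return out

def dilate(m):
    """3x3 dilation: a pixel is set if any neighbor is set."""
    p = np.pad(m, 1)
    out = np.zeros_like(m)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= p[1 + dy:1 + dy + m.shape[0], 1 + dx:1 + dx + m.shape[1]]
    return out

def size_filter(m, min_size):
    """Keep only 4-connected components with at least min_size pixels."""
    h, w = m.shape
    seen = np.zeros_like(m, dtype=bool)
    out = np.zeros_like(m)
    for sy in range(h):
        for sx in range(w):
            if m[sy, sx] and not seen[sy, sx]:
                comp, q = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and m[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_size:
                    for y, x in comp:
                        out[y, x] = 1
    return out

def remove_noise(mask, min_size=50):
    m = dilate(erode(mask))   # opening: knocks out speckle
    m = erode(dilate(m))      # closing: fills pinholes
    return size_filter(m, min_size)
```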

3. Object Extraction

3.1. Recursive Bi-modal Thresholding

Motion masks corresponding to two or more moving objects become connected when the objects occlude each other. Thus, a bounding region of a motion mask may include multiple moving objects, for example, when two people walk past each other. To tackle this problem, vertical projection analysis is developed to extract each individual moving object. The vertical projection, formed by projecting a motion mask vertically, is assumed to be bell-shaped for a single person. By modeling it as a normal distribution, the standard deviation can be used to define the boundary of each object. If the standard deviation of a peak is greater than a threshold, the area is regarded as a moving region containing multiple objects. The standard deviation is defined as follows:

\[
\text{Standard Deviation} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - M\right)^2}, \tag{2}
\]

where M is the mean value of all P_i.

Following the bi-modal thresholding proposed in [9], the original vertical projection H is divided into several sub-intervals H_x by sliding a window of size 2k+1. The total pixel number S(H_x) within each H_x is computed. If S(H_x) is smaller than both S(H_{x-k}) and S(H_{x+k}), then x is where a valley, or local minimum, is located. Multiple objects can be separated into individuals when a local minimum is found.

\[
H_x = \begin{cases}
1\ (\text{valley}), & \text{if } S(H_x) < S(H_{x-k}) \text{ and } S(H_x) < S(H_{x+k}),\ \text{for all } k = 1 \text{ to } N \\
0\ (\text{not a valley}), & \text{otherwise}
\end{cases} \tag{3}
\]

where S(H_x) is the total pixel number within H_x, and N, the maximum of k, is approximately half the interval number. In fact, the parameter k determines the level of fault tolerance: false valleys resulting from noise can be eliminated by choosing a proper N. Fig. 4 illustrates the vertical projection analysis for individual object segmentation.

Figure 4. Vertical projection analysis. S(Hx) is the total pixel number within Hx.

In real situations, the vertical projection of a motion mask may contain more than two moving objects. Therefore, multiple objects can be extracted by applying the bi-modal thresholding recursively when the object width is known. A real example is given in Fig. 5 to show its feasibility. Each extracted object is described by its minimum bounding rectangle (MBR), as shown in Fig. 5(d).
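The valley test of Eq. (3) and the recursive splitting can be sketched as follows; max_width and k_max stand in for the object-width and N parameters, whose concrete values are assumptions of this sketch:

```python
import numpy as np

def find_valleys(projection, k_max=5):
    """Mark x as a valley (Eq. 3) when S(H_x) lies strictly below both
    S(H_{x-k}) and S(H_{x+k}) for every k = 1..k_max."""
    S = np.asarray(projection)
    valleys = []
    for x in range(k_max, len(S) - k_max):
        if all(S[x] < S[x - k] and S[x] < S[x + k]
               for k in range(1, k_max + 1)):
            valleys.append(x)
    return valleys

def split_objects(mask, max_width=60, k_max=5):
    """Recursively split a wide motion mask at valleys of its
    vertical projection; returns (start, end) column ranges."""
    segments, stack = [], [(0, mask.shape[1])]
    while stack:
        a, b = stack.pop()
        if b - a <= max_width:
            segments.append((a, b))
            continue
        proj = mask[:, a:b].sum(axis=0)      # vertical projection
        vs = find_valleys(proj, k_max)
        if vs:
            cut = a + vs[len(vs) // 2]       # split at a middle valley
            stack.append((a, cut))
            stack.append((cut, b))
        else:
            segments.append((a, b))
    return sorted(segments)
```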




Figure 5. Object segmentation by analyzing the vertical projection of motion masks. (a) The source image. (b) Extracted motion masks. (c) Vertical projection corresponding to (b). (d) The final result of multiple-object segmentation.

3.2. Background Updating

In real situations, the background image changes over time. For example, tree branches swing slightly in the wind, or a stationary object starts moving. As time passes, the original background image becomes less and less reliable, so it is necessary to update the background over time. The main idea is to find scene changes in the non-motion areas and write the current intensity into the background image. If the refined mask indicates that a pixel is a non-motion one but the criterion D_b = |P_t(x, y) - B(x, y)| ≥ a given threshold holds, the pixel is regarded as a scene change. Its intensity value is then written into the background image to form a new one.

Different from the scheme proposed by Kim and Kim [2], which updates its background immediately after background subtraction, our system updates the background image only after the extracted motion masks have been completely refined. A complete and accurate motion mask can then be combined with the criterion defined above to form the background update function. Assume each pixel within the refined motion mask is denoted as R_t(x, y). The updated background B'(x, y) is constructed by the following equation:

\[
B'(x,y) = \begin{cases}
P_t(x,y), & \text{if } R_t(x,y) = 0 \ \text{and}\ D_b(x,y) = 1 \\
B(x,y), & \text{otherwise}
\end{cases} \tag{4}
\]

where D_b(x, y) indicates the calculated scene change. Figure 6 illustrates how the system updates the background image over time. In the test video stream, a person lay on the ground for a long time and was regarded as part of the background. Then, the person started moving again, as shown in Fig. 6(a). Figure 6(b) shows the recovery of the background image over time. Eventually, the background image was updated correctly, as shown in the last picture of Fig. 6(b).

4. Object Tracking

Each extracted moving object is recorded and matched against all existing models for the purpose of tracking [10,11]. To match objects, the similarity between a moving object and each recorded object model is calculated, and the moving object is identified as the model with the largest similarity. Here, we propose a weighted block-based similarity measurement. Object tracking can, however, be quite simple when only one moving object was found near that location in the previous frame: the two objects X and Y are then recognized as the same one.


Figure 6. Demonstration of background updates. (a) A person lay on the ground for a long time and then moved away; (b) The updated background over time.

4.1. Weighted Block-based Similarity Measurement

An unlabelled moving object is first divided into blocks of size 8 × 8. There are several well-known distance measures, such as MSE (Mean Square Error), MAD (Mean Absolute Difference), and NCCF (Normalized Cross Correlation Function). Considering both computational cost and correctness, we adopted the NMAD (Normalized Mean Absolute Difference) as the distance measure.



\[
\mathrm{NMAD} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{\left|P(x_1+i,\ y_1+j) - P(x_2+i,\ y_2+j)\right|}{255}, \tag{5}
\]

where P(x_1, y_1) and P(x_2, y_2) are the intensities of the pixels located at (x_1, y_1) and (x_2, y_2), respectively. The parameters m and n denote the block size.
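Eq. (5) translates directly into a few lines; this sketch assumes 8-bit intensities, which motivates the division by 255:

```python
import numpy as np

def nmad(frame_a, frame_b, x1, y1, x2, y2, m=8, n=8):
    """Normalized Mean Absolute Difference (Eq. 5) between the m-by-n
    block at (x1, y1) in frame_a and the one at (x2, y2) in frame_b.
    Result is in [0, 1] for 8-bit intensities."""
    a = frame_a[y1:y1 + m, x1:x1 + n].astype(np.float64)
    b = frame_b[y2:y2 + m, x2:x2 + n].astype(np.float64)
    return np.abs(a - b).mean() / 255.0
```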

Experimental results showed that the corresponding blocks distribute uniformly when searching within the correct model. On the contrary, when searching for corresponding blocks in a wrong model, most of the blocks found are scattered disorderly and overlap each other. Therefore, each block is given a weight to represent its reliability. The area of overlapped pixels is counted for each matched block; the greater the overlapped area, the greater the weight, as shown in Eq. (6):

\[
w_{xy} = \frac{\text{Area of overlapped pixels}}{\text{Area of a block}} \tag{6}
\]

The similarity is defined by Eq. (7). Each extracted moving object is assigned to the model with the largest similarity.

\[
\text{Similarity} = \left(\frac{1}{MN}\sum_{x=1}^{M}\sum_{y=1}^{N} w_{xy}\left(1 - D(x,y)\right)\right) \times 100\%, \tag{7}
\]

where D(x, y) represents the minimum NMAD between a block of the moving object and the corresponding block in the model. Fig. 7 gives an example in which both moving objects were accurately labelled by this weighted block-based matching method.
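Given per-block weights (Eq. (6)) and minimum NMAD distances, Eq. (7) reduces to a weighted mean; this sketch assumes the weights and distances are supplied as M×N arrays:

```python
import numpy as np

def similarity(weights, distances):
    """Weighted block-based similarity (Eq. 7): weights w_xy come from
    block overlap (Eq. 6) and distances D(x, y) are the minimum NMAD
    of each block against the candidate model. Returns a percentage."""
    w = np.asarray(weights, dtype=np.float64)
    d = np.asarray(distances, dtype=np.float64)
    return (w * (1.0 - d)).mean() * 100.0
```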


Figure 7. Similarity measurement for object 1. (a) The correct model, with a larger similarity of 49.03. (b) The incorrect model, with a smaller similarity of 42.12.

4.2. Occlusion Detection

The object tracking proposed in the previous section works even when occlusion occurs, because the object models are saved before occlusion. The significant issue, however, is determining the exact time to save objects as models: the models must be saved before the objects overlap. Therefore, an occlusion detector described by Eq. (8) is developed, and an alarm is triggered when two objects overlap. In the equation, the MBRs of the objects at time t are compared with those at time t-1.

For each object x in frame t, C_x = 0 initially;
\[
C_x = C_x + 1, \quad \text{if } MBR_t(x) \supset MBR_{t-1}(y),\ \forall\ \text{object } y \text{ in frame } t-1 \tag{8}
\]
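A sketch of the MBR-containment count in Eq. (8), assuming rectangles are stored as (x_min, y_min, x_max, y_max) tuples and flagging a merge when a current MBR contains at least two previous-frame MBRs; both the representation and the flagging rule are assumptions of this sketch:

```python
def contains(outer, inner):
    """True if rectangle `outer` fully contains `inner`; rectangles are
    (x_min, y_min, x_max, y_max) tuples (an assumed representation)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            outer[2] >= inner[2] and outer[3] >= inner[3])

def occlusion_candidates(mbrs_now, mbrs_prev):
    """For each current MBR, count how many previous-frame MBRs it
    contains (Eq. 8); two or more suggests objects have merged."""
    flags = []
    for box in mbrs_now:
        c = sum(contains(box, prev) for prev in mbrs_prev)
        flags.append(c >= 2)
    return flags
```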

If the value of C_x is greater than 2, occlusion occurs within object x. When occlusion is detected, the tracking system still keeps the model of each object and switches to tracking a temporary overlapping-object model. As soon as the overlapping objects separate again, each of them can be detected and tracked accurately using the models saved in the tracking system.

5. Behavior Analysis

In order to recognize abnormal behaviors, a couple of criteria are defined in our system. Once a suspicious behavior is detected, the system sets off an alarm to the security officers. Several typical abnormal behaviors, including wall climbing, stopping, disappearing, and falling, are discussed in this section.

5.1. Stopping and Disappearing

In real situations, pedestrians are not always moving; they may stop all of a sudden. However, frame differencing does not work well when objects keep still. In our system, each moving object's location is recorded. If a location is occupied by object i but released in the next frame, object i is recognized as a stationary object. Moreover, if object i keeps still for a while and its location is close to the boundaries, the system considers object i to have disappeared.

\[
S_t(i) = \begin{cases}
1\ (\text{Stopping}), & \text{if } i \in L_{t-1} \text{ and } i \notin L_t \\
0\ (\text{Normal}), & \text{otherwise}
\end{cases} \tag{9}
\]
\[
D_t(i) = \begin{cases}
\text{Disappearing}, & \text{if } S_t(i) = 1 \text{ and } MBR(i) \subseteq S_{BD} \\
\text{Normal}, & \text{otherwise}
\end{cases} \tag{10}
\]

where L_t is the set of existing labels at time t and S_BD is a set of predefined boundary areas.
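Eqs. (9) and (10) amount to simple set-membership and containment tests; the label sets and rectangle tuples below are an assumed representation:

```python
def stopping(labels_prev, labels_now, i):
    """Eq. (9): object i is 'Stopping' when its label existed at t-1 but
    is no longer present at t (frame differencing loses still objects)."""
    return i in labels_prev and i not in labels_now

def disappearing(is_stopping, mbr, boundary):
    """Eq. (10): a stopped object whose MBR lies inside a predefined
    boundary area S_BD is considered to have left the scene.
    Rectangles are (x_min, y_min, x_max, y_max) tuples."""
    x0, y0, x1, y1 = mbr
    bx0, by0, bx1, by1 = boundary
    inside = bx0 <= x0 and by0 <= y0 and x1 <= bx1 and y1 <= by1
    return is_stopping and inside
```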

5.2. Wall Climbing

Obviously, the motion vectors tend to be upward when an object is climbing. The center of an object i at time t is defined as C_t(i). By comparing C_t(i) with the previous centers, the motion vector MV_t(i) can be calculated. If MV_t(i) is upward, object i is judged to be a climbing object. To differentiate wall climbing from a small jump, MV_t(i) must be bigger than a threshold TS_c.

822822816816

Page 6: A Simple and Fast Surveillance System for Human Tracking ...hugomcp/visaoComp/docs/Human_tracking_13.pdf · most significant tasks in a video surveillance system. Meanwhile, many

\[
CL_t(i) = \begin{cases}
\text{Climbing}, & \text{if } MV_t(i) = C_{t-k}(i).y - C_t(i).y \ge TS_c \\
\text{Normal}, & \text{otherwise}
\end{cases} \tag{11}
\]

where Ct(i).y is the y coordinate of object i’s center at time t.
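Eq. (11) in code form, assuming image coordinates where y grows downward (so upward motion decreases y) and an illustrative value of TS_c:

```python
def climbing(center_now_y, center_past_y, ts_c=15):
    """Eq. (11): the upward motion vector is C_{t-k}(i).y - C_t(i).y;
    it must reach TS_c (an assumed value here) to distinguish wall
    climbing from a small jump."""
    mv = center_past_y - center_now_y   # positive when moving upward
    return mv >= ts_c
```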

5.3. Falling

As mentioned in the previous section, the vertical projection can be used to extract multiple moving objects. Likewise, the horizontal projection can be applied to detect whether a monitored person falls. The horizontal projection of a falling person, formed by projecting the motion mask horizontally, is assumed to be bell-shaped and can be treated as a normal distribution, which means its standard deviation falls below a threshold.

\[
FD_t(i) = \begin{cases}
\text{Falling}, & \text{if } SD_t(i) \le TS_{fd} \\
\text{Normal}, & \text{otherwise}
\end{cases} \tag{12}
\]

where SD_t(i) is the standard deviation of the horizontal histogram of object i at time t.
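A sketch of Eq. (12) on a binary mask; the threshold TS_fd is illustrative, and the row-sum profile stands in for the horizontal histogram:

```python
import numpy as np

def falling(mask, ts_fd=20.0):
    """Eq. (12): compute the standard deviation of the horizontal
    projection (row sums) of a binary motion mask and compare it
    against the fall threshold TS_fd (an assumed value here)."""
    rows = mask.sum(axis=1).astype(np.float64)   # horizontal projection
    sd = float(np.sqrt(np.mean((rows - rows.mean()) ** 2)))
    return sd <= ts_fd
```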

We tested three different abnormal behaviors: stopping, falling, and wall climbing. As shown in Fig. 8, all abnormal behaviors were detected by the proposed algorithms, and warning messages were shown on the screen when they were detected.


Figure 8. Abnormal behavior detection. (a) Two people walked toward each other, and then stopped to shake hands; (b) A person was walking and fell all of a sudden; (c) Someone climbed over a fence.

6. Experimental Results

A series of scenarios was tested to demonstrate the robustness of the proposed system. Videos were captured outdoors at a resolution of 352×240 in several different environments. Our program ran on an Intel® Pentium® 4 3.4 GHz processor with 512 MB RAM and a 60 GB hard disk drive.

The proposed system is designed to detect and track moving objects in real time. To verify the accuracy and efficiency of the proposed system, five video sequences were tested. Table 1 shows the successful matching rates of location-based estimation and weighted block-based matching. The errors in location estimation were caused by undetected occlusions near edges and by wrong motion masks. The errors in weighted block-based matching were caused by similar objects, such as people wearing clothes of the same color. Table 2 shows the elapsed processing time and the processing frame rate.

Table 1: Accuracy verification.

The above experimental results show that our system can successfully detect and track moving objects in most situations. The accuracy would be higher if object behaviors were taken into consideration. On the other hand, the frame rate is around 6 frames per second, which suffices for most normal environments, and it can still be raised by upgrading the hardware.

Table 2: Efficiency verification.

          Frame Number   Execution Time (s)   Frame Rate (fps)
Video1    439            38                   11.55
Video2    877            85.6                 10.62
Video3    1017           97.6                 10.42
Video4    2790           258                  10.8
Video5    4012           400                  10.03

7. Conclusions and Future Work

In this paper, we designed a simple and fast visual surveillance system that successfully detects moving objects and continuously tracks and locates them. A novel source frame referencing approach was used to refine the binary motion masks, and most vacant areas in the raw masks were patched well by this step. Then, multiple-object segmentation was accomplished by analyzing the vertical projection of the motion masks. Finally, all extracted moving objects were accurately tracked by integrating location estimation with weighted block-based similarity measurement against the saved models in the tracking system.

To demonstrate the feasibility and robustness of our system, video data were captured outdoors in several environments, such as a pathway, a corridor, and an entrance. Experimental results showed that the system efficiently detected and tracked multiple moving objects even when occlusion occurred. Illumination variations and small changes in the background were tolerated as well. Improper behaviors like wall climbing, falling, stopping, and disappearing were also recognized correctly.

However, the system can be extended with future work to deal with people and vehicle counting, suspicious behavior analysis, theft detection, and so on. Tracking rigid objects like cars differs significantly from tracking semi-rigid objects; the appearance-adaptive models proposed by Zhou et al. [12] can be applied to achieve this goal. If all rigid and semi-rigid objects are detected and tracked, the system can count people and vehicles separately, making traffic control applications possible. The system could also be improved to recognize more suspicious behaviors, such as car stealing, bank robbery, firearm activities, or abnormal wandering. More intelligent schemes can be applied to make the system more powerful in behavior analysis; for example, the human star skeletonization motion analysis scheme proposed by Fujiyoshi et al. [13] could be utilized to recognize abnormal behaviors. With more useful features added, the system will become more complete.

Acknowledgments

The authors would like to thank the National Science Council (NSC) of the Republic of China (ROC) for financially supporting this research under project No. NSC 95-2221-E-036-003.

References

[1] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Real-Time Surveillance of People and Their Activities", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, pp. 809-830, August 2000.
[2] J. B. Kim and H. J. Kim, "Efficient Region-Based Motion Segmentation for a Video Monitoring System", Pattern Recognition Letters, Vol. 24, pp. 113-128, 2003.

[3] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting Moving Objects, Ghosts, and Shadows in Video Streams", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 10, pp. 1337-1342, October 2003.
[4] D. Anderson, J. M. Keller, M. Skubic, X. Chen, and Z. He, "Recognizing Falls from Silhouettes", Proceedings of the 28th IEEE EMBS Annual International Conference, pp. 6388-6391, New York, September 2006.
[5] E. Yu and J. K. Aggarwal, "Detection of Fence Climbing from Monocular Video", Proceedings of the 18th International Conference on Pattern Recognition, pp. 375-378, 2006.
[6] S. Joo and Q. Zheng, "A Temporal Variance-Based Moving Target Detector", IEEE VS-PETS, January 2005.
[7] Q. Wu, H. Cheng, and B. Jeng, "Motion Detection via Change-Point Detection for Cumulative Histograms of Ratio Images", Pattern Recognition Letters, Vol. 26, pp. 555-563, 2005.
[8] R. Cucchiara, C. Grana, M. Piccardi, A. Prati, and S. Sirotti, "Improving Shadow Suppression in Moving Object Detection with HSV Color Information", Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC 2001), Oakland, CA, USA, pp. 334-339, August 2001.
[9] H. Shen and C. R. Johnson, "Semi-Automatic Image Segmentation: A Bimodal Thresholding Approach", Technical Report UUCS-94-019, University of Utah, Dept. of Computer Science, 1994.
[10] M. F. Abdelkader, R. Chellappa, and Q. Zheng, "Integrated Motion Detection and Tracking for Visual Surveillance", The Fourth IEEE Conference on Computer Vision Systems, pp. 28-34, January 2006.
[11] A. Gyaourova, C. Kamath, and S. Cheung, "Block Matching for Object Tracking", Technical Report UCRL-TR-200271, Lawrence Livermore National Laboratory, October 2003.
[12] S. K. Zhou and R. Chellappa, "Visual Tracking and Recognition Using Appearance-Adaptive Models in Particle Filters", IEEE Transactions on Image Processing, Vol. 13, No. 11, pp. 1491-1506, November 2004.
[13] H. Fujiyoshi and A. J. Lipton, "Real-time Human Motion Analysis by Image Skeletonization", Proceedings of the IEEE Workshop on Applications of Computer Vision, pp. 15-21, October 1998.
