People Tracker Report


TRACKING PEOPLE FROM A STATIONARY CAMERA

Marc Kelly Robins, Arvind Antonio de Menezes Pereira and Abhay V. Nadkarni
Department of Electrical Engineering, USC

    ABSTRACT

An algorithm for tracking people based on the color statistics of their attire is described. People are identified as moving objects using a combination of background subtraction and edge detection. People are enclosed in tracking boxes based on histograms of large blob masses. The tracking itself is implemented using the unique color statistics of a particular individual. Two major methods of using color statistics are discussed. The system showed excellent results with two people in a frame for Method I. Method II showed better results in terms of speed, robustness and consistency.

    INTRODUCTION

    Applications in tracking people via real-time systems are prevalent in modern society, ranging

    anywhere from surveillance to event recognition. Following the research of McKenna et al.

    [2], we developed an algorithm for real-time people tracking using the C6416 DSK by Texas

    Instruments. First, adaptive background subtraction is performed using both changes in

    intensity and edge detection. Then we aggregate neighboring foreground pixels to constitute

    moving objects. Finally, individual people are identified by their unique chromatic features

    and followed from one frame to the next. During the design of the system certain assumptions

    were made about the environment. Since the algorithm begins with foreground segmentation,

    the camera must remain stationary. Due to the low frame rate, motion must be moderately

slow and continuous. Furthermore, the background should remain essentially static. Implementing

    this system on the C6416 involved certain limitations. For example, the C6416 uses fixed

    point processing, which requires careful attention to numerical precision in parts of the

algorithm. Also, the video equipment available for this project introduced occasional frames of impulsive noise, which we were unable to eliminate. The following report presents our

    algorithm, code optimizations, and final results.

    GUIDING PRINCIPLES USED IN ALGORITHM DESIGN

    1. People moving in the video sequence: In a video sequence, our algorithm identifies people

    by motion detection as opposed to feature detection (Corner detection, SIFT) [4,5] or template

matching [6]. These methods are computationally intensive compared to our simpler approach. In our method, we detect a moving person by noting changes in the mean


    values of pixels corresponding to the entire frame. Once the values cross a particular threshold,

    we consider that a person has moved.

    2. People occupy a significant portion of the frame: Size is used as one criterion to reject non-

    human motion and ambient noise. The algorithm should reject any small movement as noise.

    Only substantial motion is considered as a person moving. For example, leaves moving in a

    video sequence are not considered as motion.

    3. People may stop moving for a while: Our algorithm should not rely completely on motion

    detection. Temporal tracking of people is a desired trait. Tracking in our algorithm is done

using a score-based mechanism wherein a person who stays in the frame keeps or increments his or her score. People exiting the frame gradually get a reduced score and are slowly phased out of the screen.

4. Collected color statistics can be used for further filtering and tracking: A person is uniquely

    identified by what he wears. Each person would have a unique mean chrominance value based

    on what he/she is wearing. This value is used for distinguishing people from one another in the

    frame.

    APPROACH

    We have worked on two methods to perform people tracking. Method I uses information

    available in the entire video sequence. The framework and algorithm is discussed in the next

few sections. Method II is used on a sub-sampled video sequence, since we aimed to improve the tracking response. The initial stages of both are similar; they differ only in the

    methodology of finding the color statistics. Method I uses a simplistic approach of finding the

    color statistics once a person is encompassed within a box. Method II has a more sophisticated

    approach and works outside the box, i.e. it searches for similar chromatic content outside the

    box encompassing a person in a lateral fashion within consecutive frames.


    DESCRIPTION OF THE ALGORITHM

    Shown below is a flowchart of Method I with its corresponding stages. The algorithm is

    explained in detail in a stage based approach in the following sections.

    Fig 1. Flowchart for Method I.


    Stage 1: Adaptive Background Subtraction

Adaptive background subtraction is done to isolate the foreground from the background. Since we are tracking people in this project, we consider them to be a part of the foreground. The mean and variance of each pixel are recursively estimated using the equations from McKenna et al. [2, p. 4]:

μ_{t+1} = α μ_t + (1 - α) z_{t+1}    (1)

σ²_{t+1} = α (σ²_t + (μ_{t+1} - μ_t)²) + (1 - α)(z_{t+1} - μ_{t+1})²    (2)

where μ_{t+1} refers to the weighted mean of the pixel at time t+1, μ_t refers to the mean value of the pixel in the previous frame, and z_{t+1} gives the current value of the pixel. Equation (2) gives a similar recursive update for the variance. The coefficient α is chosen to be 0.8 for our project. The background subtraction in our case is done based on the values of luminance, i.e. in the Y channel. If the difference between the current value of the pixel and its temporal mean, i.e. |z_{t+1} - μ_{t+1}|, is greater than three times σ_{t+1}, then the pixel is considered to be a part of the foreground. Pixels that do not show changes larger than three standard deviations are considered background pixels [2, p. 4]. The adaptive background subtraction also includes edge detection based on a Sobel mask.

The edge detection gives us the silhouette of a person. A similar recursive update of first- and second-order statistics is used to determine moving edges within the sequence of frames. Sobel edge detection was used since we expected it to be the fastest form of edge detection. The background subtraction in [2] is done using the RGB channels. In our method, we had YCbCr as the luminance and chrominance components. We used only the Y channel for background subtraction because we found that using the chroma channels was computationally intensive without providing enough additional information to justify the cost. The results that we achieved were quite good with just the Y channel.
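As a rough illustration of this stage, the C sketch below maintains the per-pixel mean and variance of equations (1) and (2) and applies the three-standard-deviation test on the Y channel. The frame dimensions, buffer names, and the use of floating point (the actual C6416 implementation uses fixed point) are illustrative assumptions rather than our project code, and the Sobel-based moving-edge statistics are omitted.

    #include <math.h>
    #include <stdint.h>

    #define W 320          /* assumed frame width  */
    #define H 240          /* assumed frame height */
    #define ALPHA 0.8f     /* weighting coefficient from equation (1) */

    /* Per-pixel background model: running mean and variance of the Y channel. */
    static float mean[H][W];
    static float var_[H][W];

    /* Update the background model with the new luminance frame y[][] and
     * mark foreground pixels (|z - mu| > 3*sigma) in fg[][].              */
    void background_subtract(const uint8_t y[H][W], uint8_t fg[H][W])
    {
        for (int r = 0; r < H; r++) {
            for (int c = 0; c < W; c++) {
                float z      = (float)y[r][c];
                float mu_old = mean[r][c];

                /* Equation (1): recursive update of the mean. */
                float mu_new = ALPHA * mu_old + (1.0f - ALPHA) * z;

                /* Equation (2): recursive update of the variance. */
                float v_new = ALPHA * (var_[r][c] + (mu_new - mu_old) * (mu_new - mu_old))
                            + (1.0f - ALPHA) * (z - mu_new) * (z - mu_new);

                mean[r][c] = mu_new;
                var_[r][c] = v_new;

                /* A pixel changing by more than three standard deviations
                 * from its temporal mean is treated as foreground.        */
                fg[r][c] = (fabsf(z - mu_new) > 3.0f * sqrtf(v_new)) ? 1 : 0;
            }
        }
    }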

    Stage 2: Finding the Region of Interest

    Our background subtraction routine is not expected to produce a complete silhouette of a

    person. Instead it must combine segments of objects in the foreground, such as a head and


lower torso, to infer the presence of a person. McKenna et al. [2, p. 47] use connected components to aggregate foreground pixels in the image. According to Jain et al. [1, p. 44], connected component algorithms usually form a bottleneck in a binary vision system.

    Therefore in order to maintain low processing overhead we chose to implement a simpler

    method using histograms.

    First, the foreground pixels are projected onto the x-axis to create a simple histogram of

    horizontal motion. Small gaps in the foreground silhouette of a person are common; therefore

    the histogram is smoothed using a window of length 8. Scanning from left to right, the system

looks for separable hills in the histogram to distinguish different people. This yields a

    bound for the left and right sides of each region of interest. Within the horizontal ROI, the

    system creates a vertical histogram of the foreground pixels. In a similar fashion, this

    histogram allows us to compute a lower and upper bound on the ROI.

    The mass of an object, or the total area of foreground pixels in a ROI, serves as the chief

    discriminant between people and other moving objects. In other words, if the mass of an

    object is too small the ROI is discarded. The remaining objects are considered candidates for

moving people. Many authors choose to recognize skin pigment or skeletal forms of the human body to further distinguish between people and other objects; however, in our environment the only moving objects are people.
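The sketch below illustrates the horizontal projection step: foreground pixels are projected onto the x-axis, the histogram is smoothed with a length-8 window, and separable hills above a threshold become candidate regions. The threshold value and function names are illustrative assumptions; the vertical bounds are found by repeating the same procedure inside each horizontal region.

    #include <stdint.h>

    #define W 320
    #define H 240
    #define SMOOTH_LEN 8        /* smoothing window length used in the text        */
    #define MIN_COLUMN_COUNT 5  /* assumed threshold separating hills from noise   */

    /* Project the foreground mask onto the x-axis, smooth the histogram with a
     * length-8 moving average, and report the left/right bounds of each "hill".
     * Returns the number of regions found (at most max_regions).               */
    int find_horizontal_rois(const uint8_t fg[H][W],
                             int left[], int right[], int max_regions)
    {
        static int hist[W], smooth[W];
        int n = 0;

        /* Column-wise count of foreground pixels. */
        for (int c = 0; c < W; c++) {
            hist[c] = 0;
            for (int r = 0; r < H; r++)
                hist[c] += fg[r][c];
        }

        /* Moving-average smoothing closes small gaps in a person's silhouette. */
        for (int c = 0; c < W; c++) {
            int sum = 0, cnt = 0;
            for (int k = c; k < c + SMOOTH_LEN && k < W; k++) { sum += hist[k]; cnt++; }
            smooth[c] = sum / cnt;
        }

        /* Scan left to right for separable hills above the threshold. */
        int in_hill = 0;
        for (int c = 0; c < W && n < max_regions; c++) {
            if (!in_hill && smooth[c] >= MIN_COLUMN_COUNT) {
                in_hill = 1;
                left[n] = c;
            } else if (in_hill && smooth[c] < MIN_COLUMN_COUNT) {
                in_hill = 0;
                right[n++] = c - 1;
            }
        }
        if (in_hill && n < max_regions) right[n++] = W - 1;
        return n;
    }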

    Stage 3: Matching People based on Nearest Neighbors

    After we have a set of candidate target regions (boxes) which we would like to consider as

    people, we can use the color information from the chroma channels (cb and cr) to help

    uniquely identify each person, so that we can track them more accurately. In order to do this,

    we compute two 16-tuple vectors, containing the averaged cb and cr values for 16 smaller

    boxes within the region of interest for each target box.

    Our algorithm requires that we initialize a similar set of vectors for each person we have

    correctly identified in the previous stages. To ensure that we do this reliably, we begin

    computing color-statistics only after at least 20 frames have elapsed. Such a time-frame allows

the background subtraction results to settle down, so that human forms are detected more accurately in the image sequence. We store each person we would like to track in a structure that associates his or her location with the color vectors for that ROI.

    Given a set of people we are interested in tracking, and a set of candidate targets, we can then


perform a nearest neighbors classification, in which we attempt to match each candidate target to the nearest tracked person in the color-statistics sense. We do this using an absolute-difference distance measure between the color statistics of each candidate target and those of each person we have been tracking.

For each candidate box, we compute its distance to each tracked box as the sum of absolute differences between their respective cb and cr vectors.

If the minimum distances d_cb and d_cr between the box and all tracked boxes are smaller than a threshold distance, we assign this box the same id as that of the tracked person it matched. We use this threshold mechanism to ensure that we are reasonably confident that this is the person we have been tracking up to that point in time.

A further improvement we have made attempts to reduce the variance in the color statistics of the persons we are trying to track: here we look more at the center of the person rather than the overall region.
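A minimal sketch of this matching step is given below, assuming the 16 smaller boxes form a 4x4 grid and using a sum of absolute differences as the distance measure; the structure layout, grid arrangement, and threshold value are illustrative assumptions, not the exact project code.

    #include <stdint.h>
    #include <stdlib.h>

    #define GRID 4                    /* 4x4 = 16 sub-boxes per region (assumed)  */
    #define MATCH_THRESHOLD 200       /* assumed SAD threshold for a valid match  */

    typedef struct {
        int cb[GRID * GRID];          /* averaged Cb value of each sub-box */
        int cr[GRID * GRID];          /* averaged Cr value of each sub-box */
    } ColorVec;

    typedef struct {
        int id;
        int x, y, w, h;               /* last known bounding box   */
        int score;                    /* temporal persistence score */
        ColorVec color;
    } Person;

    /* Average Cb/Cr over a 4x4 grid of sub-boxes inside the box (x,y,w,h).
     * cb/cr are full chroma planes of width `stride`; assumes w,h >= GRID. */
    void compute_color_vec(const uint8_t *cb, const uint8_t *cr, int stride,
                           int x, int y, int w, int h, ColorVec *out)
    {
        int bw = w / GRID, bh = h / GRID;
        for (int by = 0; by < GRID; by++) {
            for (int bx = 0; bx < GRID; bx++) {
                long scb = 0, scr = 0;
                for (int r = 0; r < bh; r++) {
                    for (int c = 0; c < bw; c++) {
                        int idx = (y + by * bh + r) * stride + (x + bx * bw + c);
                        scb += cb[idx];
                        scr += cr[idx];
                    }
                }
                int n = bw * bh;
                out->cb[by * GRID + bx] = (int)(scb / n);
                out->cr[by * GRID + bx] = (int)(scr / n);
            }
        }
    }

    /* Sum of absolute differences between two color vectors. */
    int color_sad(const ColorVec *a, const ColorVec *b)
    {
        int d = 0;
        for (int i = 0; i < GRID * GRID; i++)
            d += abs(a->cb[i] - b->cb[i]) + abs(a->cr[i] - b->cr[i]);
        return d;
    }

    /* Return the index of the tracked person nearest to the candidate in the
     * color-statistics sense, or -1 if no one is within the threshold.       */
    int match_candidate(const ColorVec *cand, const Person people[], int n_people)
    {
        int best = -1, best_d = MATCH_THRESHOLD;
        for (int i = 0; i < n_people; i++) {
            int d = color_sad(cand, &people[i].color);
            if (d < best_d) { best_d = d; best = i; }
        }
        return best;
    }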

    Stage 4: Updating the Feature Vectors of Known Individuals

To accommodate the arrival of new people in the frame, and to forget people who leave for more than several seconds, the system must update the feature vectors of known people.

    When a new person arrives he or she is given an initial score. For every successive frame

    containing this person, his or her score is incremented until the maximum score is reached. For

    every frame that does not contain this person, his or her score is decreased. When a score

    reaches zero the feature vector for that person is removed from memory, and that person is

    forgotten. Similarly, when a person arrives who does not match the known feature vectors he

    or she is added to memory. In this way new people are recognized by the system, and absent

    people are forgotten.

    Due to the non-uniform distribution of light across the room, our system must learn to adapt

when the chromatic values of a known person change slightly. Assuming the chromatic values


    of a person are still recognizable, a weighted average is taken with the latest chromatic feature.

    This operation weights the old chromatic features more heavily than the new values. In our

    design of the system we kept this weight parameter on a GEL slider, usually at 0.1 or 0.2

    depending on the lighting conditions.

    People having scores above a particular threshold are tracked and have boxes drawn around

    them with a corresponding color for a person.
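The following sketch, reusing the Person and ColorVec structures from the previous stage, illustrates the scoring and the weighted chroma update; the maximum and initial scores and the interpretation of the slider weight as the weight on the new values are assumptions for illustration, not the project code.

    #define MAX_SCORE   10      /* assumed cap on the persistence score        */
    #define INIT_SCORE   3      /* assumed score given to a newly seen person  */
    #define CHROMA_WEIGHT 0.2f  /* GEL-slider weight applied to the NEW values */

    /* Called once per frame for every known person.
     * `seen` is nonzero when the person was matched in the current frame;
     * `latest` holds the chroma vector measured this frame.                */
    void update_person(Person *p, int seen, const ColorVec *latest)
    {
        if (seen) {
            if (p->score < MAX_SCORE) p->score++;

            /* Weighted average: old statistics dominate, new values are
             * blended in slowly to follow gradual lighting changes.       */
            for (int i = 0; i < GRID * GRID; i++) {
                p->color.cb[i] = (int)((1.0f - CHROMA_WEIGHT) * p->color.cb[i]
                                       + CHROMA_WEIGHT * latest->cb[i]);
                p->color.cr[i] = (int)((1.0f - CHROMA_WEIGHT) * p->color.cr[i]
                                       + CHROMA_WEIGHT * latest->cr[i]);
            }
        } else if (p->score > 0) {
            p->score--;         /* person not seen this frame: decay the score */
        }
        /* When the score reaches zero the caller removes this entry so the
         * person is forgotten; unmatched candidates are added with INIT_SCORE. */
    }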

RESULTS FOR METHOD I

The results we obtained for Method I are shown below. There were two main demos that were displayed. Demo 0 shows the output of the adaptive background subtraction, so that one gets an idea of what the output looks like after this stage. Demo 1 shows the actual output that we obtained from Method I. If there are at most two people in a frame, the algorithm is able to track people quite well. In our final demo, when two people entered the frame, boxes were drawn around them. The boxes had a unique color assigned to them, via which we identified a particular person. When people crossed each other's paths, the algorithm was still able to track them and maintained the same box color assigned to each person. Our results were not good for three people, since the foreground and background became almost indistinguishable, which prevented reliable tracking compared to the two-person case. The frame rate of this algorithm is very slow (1 fps).


    Fig 2. Output of Demo 0.

Fig 3. Output of Demo 1.

METHOD II: A FORWARD-BACKWARD APPROACH

    A fusion of color statistics and adaptive background subtraction

    MOTIVATION

Our previous method relied heavily upon adaptive background subtraction and required that people move within each frame to be detected. While this is a practical method for detecting people, it tends to fail when people stop moving. Often people enter the frame

    and stop moving, thus resulting in a gradual fading out of the outline of each person

    being tracked, until we lose that person. This is obviously a problem, and we would like to

    use a way to work around having to rely solely upon motion detection to locate people.

Using the colors from people's clothing to track them is certainly not a new idea; we did use it to track people from frame to frame even in our preceding tracking method.

    The interesting change which we felt would help us track people involved computing

    color-statistics within the region which we believe contains a person (based on change-

    detection as we have done earlier), and then to use this information for each tracked

    person to look around their current locations to determine their exact locations in the

    frame. This can be done by performing a coarse scan around the last known locations of

    each tracked person in the frame, and then finding the location that produces the

minimum distance between the color-statistics from the scan and those for that person.

    This will naturally speed up the process of searching for the person and produce a much

    needed increase in our output frame-rate.


    Another feature that we might want to add to our algorithm is the ability to update the

    color statistics for each tracked person. As a person moves from a dimly lit area to one

    which is brighter, the color spectrum that is reflected from him/her results in a change in

    both the chrominance and luminance values. This results in a variation in the underlying

    color statistics that we have gathered for that person initially. Merging the changes

    gradually into that model will help us adjust to varying intensities more robustly.

    This brings us to the question - When should we update the color statistics for a person?

    We think a good solution would be to collect statistics from the frame by looking in a

    region which should ideally show smaller variance both spatially and temporally, as well as

    one which is easy to compare with statistics we might gather in the future. The easiest way

    to know we have found a person is to look for a large moving blob. We have already

    discussed how to develop such a detector and we can employ this to locate a person who is

    moving. By looking in the center of this region and averaging the values over a decent area,

    we can form a feature vector with color statistics from the two chroma channels -cb and

    cr. However, we would certainly want to avoid updating these statistics initially because we

would like our background subtraction algorithm to settle down to a stable state, at which point only people (modulo camera measurement/acquisition noise) cause changes

    in our frames.

Is there another situation which we might want to be wary of? We believe the answer is

    yes! Although a large blob is a good indicator of a well localized person in the image

    (assuming it really is a person), things could get complicated if there are two people who

    are moving in the view and cross each other. This results in occlusion, and if we attempt to

    update color-statistics at this point of time, we will be introducing errors in the values for

    both people, which is something we certainly want to avoid.

    Background subtraction is susceptible to trail-effects when a person moves rapidly

between subsequent frames. This results in poorly localized boxes, producing erroneous color averages because we might pick up cb and cr values which are part of the background through which the person just passed. A solution would involve modeling the speed of the person's movement, which would let us know when the motion might be producing a trail effect, during which color updates should be avoided. At the time of writing this report we have not attempted to solve this problem.


    DESCRIPTION OF THE ALGORITHM

Figure 4 contains a flowchart of the algorithm which we employ in this second method of tracking people. The motion detection stage based on adaptive background subtraction has been discussed in the previous algorithm. We reuse the next stage from the previous algorithm, which locates regions of the image containing movement, to detect people who are moving.

    These locations are used to update color statistics for each person as well as to update the

    locations of each tracked person during the fusion stage.

Fig 4. Flowchart for the fused people tracker.

Since many of the initial stages of the new algorithm are essentially the same as those used earlier, we will only cover the sections which are functionally different here.


    Stage 3: Color Updates

    In this stage, we compute the color statistics for each location that has been detected by

    the motion-detection stage. We compare these statistics with those of each tracked

    person whose statistics we have already computed. If we find that these are very similar

    and that the location of this person is very close to the previous known location of the

tracked person, we merge the statistics with those of that person. As a special case, we ensure that we do not merge statistics for boxes that are far apart in the image, so that two people wearing similar clothing are still tracked individually, as they should be. If no match is found between the people we have been tracking and the present one, we add this person to the list of people we are tracking and store the person's color statistics and location in the structure.
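A sketch of this decision logic is shown below; it reuses the ColorVec/Person types and the color_sad() and update_person() helpers sketched for Method I, and the similarity and nearness thresholds are illustrative assumptions rather than the project's values.

    #include <stdlib.h>

    #define COLOR_SIM_THRESHOLD 200   /* assumed: "very similar" color statistics */
    #define NEAR_DIST_THRESHOLD  40   /* assumed: "very close" in pixels          */

    /* Decide what to do with one motion-detected box (x,y,w,h): merge its
     * color statistics into a nearby, similarly colored tracked person, or
     * add it as a new person. Returns the index of the person it was merged
     * into, the index of a newly added person, or -1 if the array is full.  */
    int color_update_stage(Person people[], int *n_people, int max_people,
                           int x, int y, int w, int h, const ColorVec *meas)
    {
        for (int i = 0; i < *n_people; i++) {
            int dx = people[i].x - x, dy = people[i].y - y;
            int near  = (abs(dx) < NEAR_DIST_THRESHOLD && abs(dy) < NEAR_DIST_THRESHOLD);
            int alike = (color_sad(meas, &people[i].color) < COLOR_SIM_THRESHOLD);

            /* Merge only when the box is both similar in color AND close to the
             * person's last location, so two people in similar clothing at
             * opposite ends of the frame keep separate identities.            */
            if (near && alike) {
                people[i].x = x;  people[i].y = y;
                people[i].w = w;  people[i].h = h;
                update_person(&people[i], 1, meas);
                return i;
            }
        }

        /* No match: start tracking a new person. */
        if (*n_people < max_people) {
            int i = (*n_people)++;
            Person *p = &people[i];
            p->id = i + 1;
            p->x = x;  p->y = y;  p->w = w;  p->h = h;
            p->score = INIT_SCORE;
            p->color = *meas;
            return i;
        }
        return -1;
    }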

    Stage 4 : Finding the location of people based on their color statistics

    We scan around and at the last known location of each tracked person, and compute a

    SAD (sum of absolute differences) score between the averaged values of each scanned

    location and those from the tracked person. This gives us a set of scores, and we assume

    that the location with the lowest score is the most likely to be the location of the person

based on a similar color search in the image. Another advantage of this approach is that it constrains the search space to a smaller area, which results in faster algorithm execution. Otherwise, we would have to scan a much larger area of the image, which would increase the computational requirements. To ensure that we are not dealing with occlusions, we do not merge the color statistics of the present tracked person with its previous statistics if another tracked person with very similar color statistics is closer than a certain threshold distance. Occluded color statistics are not very helpful in estimating the

    location of the person in the frame reliably.
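The coarse scan can be sketched as follows, reusing compute_color_vec() and color_sad() from the Method I sketches; the search radius and step size are illustrative assumptions.

    #include <limits.h>
    #include <stdint.h>

    #define SEARCH_RADIUS 24   /* assumed half-width of the search window (pixels) */
    #define SCAN_STEP      8   /* assumed coarse step between scan positions       */

    /* Scan coarsely around a person's last known position and return, through
     * *bx/*by, the location whose local color statistics are closest (lowest
     * SAD) to the person's stored statistics. Returns the best SAD score so
     * the caller can compare it against a threshold.                          */
    int coarse_color_search(const uint8_t *cb, const uint8_t *cr, int stride,
                            int frame_w, int frame_h,
                            const Person *p, int *bx, int *by)
    {
        int best = INT_MAX;
        ColorVec v;

        for (int dy = -SEARCH_RADIUS; dy <= SEARCH_RADIUS; dy += SCAN_STEP) {
            for (int dx = -SEARCH_RADIUS; dx <= SEARCH_RADIUS; dx += SCAN_STEP) {
                int x = p->x + dx, y = p->y + dy;

                /* Stay inside the frame. */
                if (x < 0 || y < 0 || x + p->w > frame_w || y + p->h > frame_h)
                    continue;

                compute_color_vec(cb, cr, stride, x, y, p->w, p->h, &v);
                int d = color_sad(&v, &p->color);
                if (d < best) { best = d; *bx = x; *by = y; }
            }
        }
        return best;
    }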

    Stage 5 : Fusion of color statistics and Background subtraction

The tracked people are maintained in an array; each entry contains the color statistics vectors associated with that person, an id for the person, a score and a flag that indicates whether the person is being tracked. We employ a scoring mechanism like the one described earlier to allow new people to be added to those already being tracked, as well as to forget people who have left the camera's view. The algorithm implemented in the code we are submitting at the time of writing this report fuses the information from the outputs of our motion detection and color-statistics matching modules only implicitly, by updating the locations of tracked people. We would like to use a more intelligent method of performing this fusion in the future.
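For illustration only, the sketch below shows one possible per-frame wiring of the stages above; it reuses the helpers sketched earlier, and the Box type, the MAX_PEOPLE bound, and the explicit score decay are our assumptions rather than the submitted code, which performs this fusion only implicitly.

    #include <stdint.h>

    #define MAX_PEOPLE 8    /* assumed upper bound on simultaneously tracked people */

    typedef struct { int x, y, w, h; } Box;   /* motion-detected bounding box */

    /* One iteration of the fusion stage: motion-detected boxes update color
     * statistics and scores, every person's location is refined by the coarse
     * color search, and people whose score has decayed to zero are dropped.  */
    void fusion_stage(Person people[], int *n_people,
                      const uint8_t *cb, const uint8_t *cr, int stride,
                      int frame_w, int frame_h,
                      const Box boxes[], const ColorVec vecs[], int n_boxes)
    {
        int seen[MAX_PEOPLE] = {0};

        /* 1. Feed every motion-detected box into the color-update stage. */
        for (int i = 0; i < n_boxes; i++) {
            int m = color_update_stage(people, n_people, MAX_PEOPLE,
                                       boxes[i].x, boxes[i].y,
                                       boxes[i].w, boxes[i].h, &vecs[i]);
            if (m >= 0) seen[m] = 1;
        }

        /* 2. Refine each person's location by the coarse color search; people
         *    not seen by the motion detector have their score decayed.        */
        for (int i = 0; i < *n_people; i++) {
            int x, y;
            if (coarse_color_search(cb, cr, stride, frame_w, frame_h,
                                    &people[i], &x, &y) < MATCH_THRESHOLD) {
                people[i].x = x;
                people[i].y = y;
            }
            if (!seen[i] && people[i].score > 0)
                people[i].score--;
        }

        /* 3. Forget people whose score has reached zero. */
        for (int i = 0; i < *n_people; ) {
            if (people[i].score == 0)
                people[i] = people[--(*n_people)];
            else
                i++;
        }
    }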


RESULTS FOR METHOD II

The results we obtained in Method II are shown below. The algorithm performed better in terms of speed and accuracy. The sub-sampling of the frames sped up the entire process quite significantly. The tracking is more consistent, but has a jittery output since we perform a coarse sampling around the tracked location. However, there are very few cases of sporadic boxes belonging to a tracked person appearing on the screen. In general, this method seemed more robust and consistent than Method I. This algorithm is very sensitive to occlusions, which we have not handled well. When people pass each other, their color statistics may get updated with values belonging to the other person, and this prevents the algorithm from subsequently tracking at least one of these people accurately. Also, noisy frames sometimes cause the background itself to be tracked. The biggest advantage of this algorithm is that it can track people who are stationary after they have been captured by the tracker.


    Figure 5: Display for Method II

    OPTIMIZATIONS

    THE SOBEL-EDGE DETECTION ROUTINE

    The Sobel edge transform is a pixel-wise operation that consumes many cycles. Memory

    latencies and matrix operations are two areas we optimized. The benefits of restructuring the

    Sobel routine were noticeable immediately, despite the nondeterministic behavior of memory

    access.

    In external memory, images are stored one row after another. Therefore fetching pixels in row

    order will reduce access time. To improve efficiency, the outer loop iterates across each row

    while the inner loop iterates across each column.

    The Sobel transform operates on each pixel and its eight adjacent pixels. Instead of fetching

    nine array elements on every iteration we want to store the reusable pixels on the stack and

only fetch three new pixels. The following example shows how we restructured the routine

    and unrolled the Sobel transform.


Original Code:

    for y, 0 to Image Height
        for x, 0 to Image Width
            /* Sobel Transform */
            edge1 = 0
            edge2 = 0
            for i, in {0, 1, 2}
                for j, in {0, 1, 2}
                    pixel = Image[x+i][y+j]
                    edge1 += pixel * sobel1[i][j]
                    edge2 += pixel * sobel2[i][j]
                end
            end
            /* More processing */
        end
    end

Better Memory Access and Loop Unrolling:

    for y, 0 to Image Height
        /* Store first nine pixels in local variables */
        a11 = Image[0][y]
        a12 = Image[1][y]
        a13 = Image[2][y]
        ...
        a33 = Image[2][y+2]
        for x, 0 to Image Width
            /* Sobel Transform */
            edge1 = a11 + 2*a12 + a13 - ... - a33
            edge2 = a11 + 2*a21 + a31 - ... - a33
            /* Shift pixel values, fetch new pixel */
            a11 = a12
            a12 = a13
            a13 = Image[x+3][y]
            ...
            a33 = Image[x+3][y+2]
            /* More processing */
        end
    end

Table 1: Optimizations made to improve the memory access pattern and operational cost of the Sobel edge transform.
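In C, the restructured routine corresponds roughly to the sketch below; the function name, buffer types, and the combination of the two gradient responses into a single magnitude are illustrative assumptions.

    #include <stdint.h>
    #include <stdlib.h>

    /* Row-ordered, loop-unrolled Sobel pass in the spirit of Table 1: the outer
     * loop walks rows (matching the row-major layout in external memory), the
     * 3x3 neighbourhood is kept in locals, and only three new pixels are
     * fetched per step.                                                        */
    void sobel_edges(const uint8_t *img, int16_t *mag, int w, int h)
    {
        for (int y = 1; y < h - 1; y++) {
            const uint8_t *r0 = img + (y - 1) * w;   /* row above   */
            const uint8_t *r1 = img +  y      * w;   /* current row */
            const uint8_t *r2 = img + (y + 1) * w;   /* row below   */

            /* Prime the 3x3 window with the first three columns. */
            int a11 = r0[0], a12 = r0[1], a13 = r0[2];
            int a21 = r1[0], a22 = r1[1], a23 = r1[2];
            int a31 = r2[0], a32 = r2[1], a33 = r2[2];

            for (int x = 1; x < w - 1; x++) {
                /* Unrolled Sobel responses for the two gradient directions. */
                int gy = a11 + 2 * a12 + a13 - a31 - 2 * a32 - a33;
                int gx = a11 + 2 * a21 + a31 - a13 - 2 * a23 - a33;

                mag[y * w + x] = (int16_t)(abs(gx) + abs(gy));

                /* Shift the window right and fetch one new column. */
                a11 = a12; a12 = a13; a13 = (x + 2 < w) ? r0[x + 2] : 0;
                a21 = a22; a22 = a23; a23 = (x + 2 < w) ? r1[x + 2] : 0;
                a31 = a32; a32 = a33; a33 = (x + 2 < w) ? r2[x + 2] : 0;
            }
        }
    }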

    SUBSAMPLING OF THE INPUT VIDEO SEQUENCE

The overall efficiency and speed of the output increased greatly when we used a sub-sampled version of the input video sequence in the Y channel. Without sub-sampling we were able to obtain a frame rate of only about 1 frame per second. With sub-sampling, our frame rate went up to almost 4 Hz, giving us much better display feedback. Down-sampling the Y channel to roughly half the number of pixels improved the speed of Method II more than that of Method I.
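A minimal sketch of such a sub-sampling step is given below; the 2:1 horizontal decimation with pair averaging is an assumed pattern, not necessarily the one used in the project.

    #include <stdint.h>

    /* Simple 2:1 horizontal decimation of the Y plane, roughly halving the
     * number of luminance pixels the later stages have to touch.           */
    void subsample_y(const uint8_t *y, uint8_t *y_small, int w, int h)
    {
        for (int r = 0; r < h; r++)
            for (int c = 0; c < w / 2; c++)
                /* Average each horizontal pair of pixels to limit aliasing. */
                y_small[r * (w / 2) + c] =
                    (uint8_t)((y[r * w + 2 * c] + y[r * w + 2 * c + 1] + 1) >> 1);
    }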

    FRAME REJECTION

There was a problem of stray boxes showing up during the display of the demo. This happened because a shift in the frame sequence occurred every 4 frames. This shift created the illusion, as far as the algorithm was concerned, that something was moving in the frame sequence. We tried to reject the bad frames by computing the difference between the edge values in successive frames. Using a sum of absolute differences (SAD) measure and thresholding, we rejected frames that had a SAD value of more than 6500 for a given set of pixels. Unfortunately the method did not work very well, and sometimes even good frames were rejected.
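The frame-rejection test can be sketched as follows; the buffer and function names are illustrative, while the 6500 threshold is the value quoted above.

    #include <stdint.h>
    #include <stdlib.h>

    #define SAD_REJECT_THRESHOLD 6500   /* threshold quoted in the text */

    /* Sum of absolute differences between the edge maps of the current and
     * previous frames over a given set of pixels; a frame whose SAD exceeds
     * the threshold is treated as a corrupted (shifted) frame and skipped.  */
    int frame_should_be_rejected(const int16_t *edges_now,
                                 const int16_t *edges_prev,
                                 int n_pixels)
    {
        long sad = 0;
        for (int i = 0; i < n_pixels; i++)
            sad += abs(edges_now[i] - edges_prev[i]);
        return sad > SAD_REJECT_THRESHOLD;
    }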

    CONCLUSIONS

From the results of Methods I and II, we can conclude that each method has its own set of pros and cons.

Although the output from Method I produces a few stray boxes due to noisy frames, it localizes

    a person who is moving very accurately, and can track people even when they pass each other

    in the frame. We feel this algorithm would give us very promising results if we could get it to

run faster than the appalling 1 fps we currently get from the system. Execution of such an

    algorithm at higher speeds will allow us to constrain the location of tracked boxes (as a person

    cannot be expected to move very far from his previous location if the next frame was captured

    a short interval later). This would enable us to filter out many stray boxes, while also

    producing a more useful and robust tracking output.

If we had to implement this in a practical setting on a more constrained platform, such as one based on the TMS320C6416 DSK, Method II would seem to show better performance, especially in video surveillance and other security applications where the speed and accuracy of the tracking are quite important. The idea of searching laterally outside a box assigned to a

particular person in Method II worked well and gave the added stability required for a tracking

    algorithm. Coarse scanning to localize each tracked person also helps speed up algorithm

    execution. These improvements in speed help us apply temporal filtering to constrain the

    movements of each person, which in turn helps us improve our tracking consistency.

    To be practically feasible however, this algorithm needs to be strengthened with the capability

    to handle occlusions. We have not been able to implement a reliable method of handling these

    problems due to occlusion at this time, and would like to develop a strategy that overcomes this

    crippling problem.

    To sum up, the first algorithm is accurate and quite good at tracking people even when they

    cross over. However, it is susceptible to noisy camera images, has a very slow frame rate and

cannot track people who have stopped moving. The second algorithm overcomes these disadvantages of the first algorithm, but fails to track people who occlude each other. In the

    future we would like to address this issue, which will help us realize a more practical people

    tracking system which can be made to work on relatively constrained real-time systems.

    References

[1] Jain, R., Kasturi, R., Schunck, B. G., Machine Vision, McGraw-Hill Inc., 1995.

[2] McKenna, S. J., Jabri, S., Duric, Z., Wechsler, H., Rosenfeld, A., "Tracking Groups of People", Computer Vision and Image Understanding, Vol. 80, No. 1, pp. 42-56, 2000.

[3] Moeslund, T. B., Granum, E., "A Survey of Computer Vision-Based Human Motion Capture", Computer Vision and Image Understanding, Vol. 81, No. 3, pp. 231-268, 2001.

[4] Bouchrika, I., Nixon, M. S., "People Detection and Recognition using Gait for Automated Visual Surveillance", The Institution of Engineering and Technology Conference on Crime and Security (ICDP 2006), 2006.

[5] Negre, A., Tran, H., Gourier, N., Hall, D., Lux, A., Crowley, J. L., "Comparative Study of People Detection in Surveillance Scenes", Lecture Notes in Computer Science, No. 4109, pp. 100-108, 2006.

[6] Gavrila, D., "Pedestrian Detection from a Moving Vehicle", ECCV 2000, Vol. II, pp. 37-49.