People Tracker Report


TRACKING PEOPLE FROM A STATIONARY CAMERA

Marc Kelly Robins, Arvind Antonio de Menezes Pereira and Abhay V. Nadkarni
Department of Electrical Engineering, USC

    ABSTRACT

An algorithm for tracking people based on the color statistics of their attire is described. People are identified as moving objects using a combination of background subtraction and edge detection. People are enclosed in tracking boxes based on histograms of large blob masses. The tracking itself is implemented using the unique color statistics of a particular individual. Two major methods of using color statistics are discussed. The system showed excellent results with two people in a frame for Method I. Method II showed better results in terms of speed, robustness and consistency.

    INTRODUCTION

    Applications in tracking people via real-time systems are prevalent in modern society, ranging

    anywhere from surveillance to event recognition. Following the research of McKenna et al.

    [2], we developed an algorithm for real-time people tracking using the C6416 DSK by Texas

    Instruments. First, adaptive background subtraction is performed using both changes in

    intensity and edge detection. Then we aggregate neighboring foreground pixels to constitute

    moving objects. Finally, individual people are identified by their unique chromatic features

    and followed from one frame to the next. During the design of the system certain assumptions

    were made about the environment. Since the algorithm begins with foreground segmentation,

    the camera must remain stationary. Due to the low frame rate, motion must be moderately

slow and continuous. Furthermore, the background should remain essentially static. Implementing

    this system on the C6416 involved certain limitations. For example, the C6416 uses fixed

    point processing, which requires careful attention to numerical precision in parts of the

algorithm. Also, the video equipment available for this project introduced occasional frames of impulsive noise, which we were unable to eliminate. The following report presents our

    algorithm, code optimizations, and final results.

    GUIDING PRINCIPLES USED IN ALGORITHM DESIGN

    1. People moving in the video sequence: In a video sequence, our algorithm identifies people

    by motion detection as opposed to feature detection (Corner detection, SIFT) [4,5] or template

matching [6]. These methods are computationally intensive compared to our simpler approach. In our method, we detect a moving person by noting changes in the mean


    values of pixels corresponding to the entire frame. Once the values cross a particular threshold,

    we consider that a person has moved.

    2. People occupy a significant portion of the frame: Size is used as one criterion to reject non-

    human motion and ambient noise. The algorithm should reject any small movement as noise.

    Only substantial motion is considered as a person moving. For example, leaves moving in a

    video sequence are not considered as motion.

    3. People may stop moving for a while: Our algorithm should not rely completely on motion

    detection. Temporal tracking of people is a desired trait. Tracking in our algorithm is done

using a score-based mechanism wherein a person who stays in the frame keeps or increments his or her score. People exiting the frame gradually get a reduced score and are slowly phased out of the screen.

4. Collected color statistics can be used for further filtering and tracking: A person is uniquely

    identified by what he wears. Each person would have a unique mean chrominance value based

    on what he/she is wearing. This value is used for distinguishing people from one another in the

    frame.

    APPROACH

    We have worked on two methods to perform people tracking. Method I uses information

    available in the entire video sequence. The framework and algorithm is discussed in the next

few sections. Method II is used on a sub-sampled video sequence, since we aimed to improve the tracking response. The initial stages of both are similar; they differ only in the

    methodology of finding the color statistics. Method I uses a simplistic approach of finding the

    color statistics once a person is encompassed within a box. Method II has a more sophisticated

    approach and works outside the box, i.e. it searches for similar chromatic content outside the

    box encompassing a person in a lateral fashion within consecutive frames.


    DESCRIPTION OF THE ALGORITHM

    Shown below is a flowchart of Method I with its corresponding stages. The algorithm is

    explained in detail in a stage based approach in the following sections.

    Fig 1. Flowchart for Method I.


    Stage 1: Adaptive Background Subtraction

Adaptive background subtraction is done to isolate the foreground from the background. Since we are tracking people in this project, we consider them to be a part of the foreground. The mean and variance of each pixel are recursively estimated using the equations from McKenna et al. [2, p. 4]:

μ_{t+1} = α μ_t + (1 - α) z_{t+1}    (1)

σ²_{t+1} = α (σ²_t + (μ_{t+1} - μ_t)²) + (1 - α)(z_{t+1} - μ_{t+1})²    (2)

where μ_{t+1} refers to the weighted mean of the pixel at time t+1, μ_t refers to the mean value of the pixel in the previous frame, and z_{t+1} gives the current value of the pixel. Equation (2) gives a similar recursive update for the variance. The coefficient α is chosen to be 0.8 for our project. The background subtraction in our case is done based on the values of luminance, i.e. in the Y channel. If the difference between the current value of the pixel and its temporal mean, i.e. |z_{t+1} - μ_{t+1}|, is greater than three times σ_{t+1}, then the pixel is considered to be a part of the foreground. Pixels that do not show changes larger than three standard deviations are considered background pixels [2, p. 4]. The adaptive background subtraction also includes edge detection based on a Sobel mask.

The edge detection gives us the silhouette of a person. A similar recursive update of first- and second-order statistics is used to determine moving edges within the sequence of frames. Sobel edge detection was used since we expected it to be the fastest form of edge detection. The background subtraction in [2] is done using the RGB channels. In our method, we had YCbCr as the luminance and chrominance components. We used only the Y channel for background subtraction because we found that using the chroma channels was computationally intensive without providing enough additional information to justify the cost. The results that we achieved were quite good with just the Y channel.
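As a rough illustration of this stage, the C sketch below maintains the per-pixel mean and variance of equations (1) and (2) and applies the three-standard-deviation test on the Y channel. The frame dimensions, buffer names, and the use of floating point (the actual C6416 implementation uses fixed point) are illustrative assumptions rather than our project code, and the Sobel-based moving-edge statistics are omitted.

    #include <math.h>
    #include <stdint.h>

    #define W 320          /* assumed frame width  */
    #define H 240          /* assumed frame height */
    #define ALPHA 0.8f     /* weighting coefficient from equation (1) */

    /* Per-pixel background model: running mean and variance of the Y channel. */
    static float mean[H][W];
    static float var_[H][W];

    /* Update the background model with the new luminance frame y[][] and
     * mark foreground pixels (|z - mu| > 3*sigma) in fg[][].              */
    void background_subtract(const uint8_t y[H][W], uint8_t fg[H][W])
    {
        for (int r = 0; r < H; r++) {
            for (int c = 0; c < W; c++) {
                float z      = (float)y[r][c];
                float mu_old = mean[r][c];

                /* Equation (1): recursive update of the mean. */
                float mu_new = ALPHA * mu_old + (1.0f - ALPHA) * z;

                /* Equation (2): recursive update of the variance. */
                float v_new = ALPHA * (var_[r][c] + (mu_new - mu_old) * (mu_new - mu_old))
                            + (1.0f - ALPHA) * (z - mu_new) * (z - mu_new);

                mean[r][c] = mu_new;
                var_[r][c] = v_new;

                /* A pixel changing by more than three standard deviations
                 * from its temporal mean is treated as foreground.        */
                fg[r][c] = (fabsf(z - mu_new) > 3.0f * sqrtf(v_new)) ? 1 : 0;
            }
        }
    }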

    Stage 2: Finding the Region of Interest

    Our background subtraction routine is not expected to produce a complete silhouette of a

    person. Instead it must combine segments of objects in the foreground, such as a head and


lower torso, to infer the presence of a person. McKenna et al. [2, p. 47] use connected components to aggregate foreground pixels in the image. According to Jain et al. [1, p. 44], connected component algorithms usually form a bottleneck in a binary vision system.

    Therefore in order to maintain low processing overhead we chose to implement a simpler

    method using histograms.

    First, the foreground pixels are projected onto the x-axis to create a simple histogram of

    horizontal motion. Small gaps in the foreground silhouette of a person are common; therefore

    the histogram is smoothed using a window of length 8. Scanning from left to right, the system

looks for separable hills in the histogram to distinguish different people. This yields a

    bound for the left and right sides of each region of interest. Within the horizontal ROI, the

    system creates a vertical histogram of the foreground pixels. In a similar fashion, this

    histogram allows us to compute a lower and upper bound on the ROI.

    The mass of an object, or the total area of foreground pixels in a ROI, serves as the chief

    discriminant between people and other moving objects. In other words, if the mass of an

    object is too small the ROI is discarded. The remaining objects are considered candidates for

moving people. Many authors choose to recognize skin pigment or skeletal forms of the human body to further distinguish between people and other objects; however, in our environment the only moving objects are people.
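The sketch below illustrates the horizontal projection step: foreground pixels are projected onto the x-axis, the histogram is smoothed with a length-8 window, and separable hills above a threshold become candidate regions. The threshold value and function names are illustrative assumptions; the vertical bounds are found by repeating the same procedure inside each horizontal region.

    #include <stdint.h>

    #define W 320
    #define H 240
    #define SMOOTH_LEN 8        /* smoothing window length used in the text        */
    #define MIN_COLUMN_COUNT 5  /* assumed threshold separating hills from noise   */

    /* Project the foreground mask onto the x-axis, smooth the histogram with a
     * length-8 moving average, and report the left/right bounds of each "hill".
     * Returns the number of regions found (at most max_regions).               */
    int find_horizontal_rois(const uint8_t fg[H][W],
                             int left[], int right[], int max_regions)
    {
        static int hist[W], smooth[W];
        int n = 0;

        /* Column-wise count of foreground pixels. */
        for (int c = 0; c < W; c++) {
            hist[c] = 0;
            for (int r = 0; r < H; r++)
                hist[c] += fg[r][c];
        }

        /* Moving-average smoothing closes small gaps in a person's silhouette. */
        for (int c = 0; c < W; c++) {
            int sum = 0, cnt = 0;
            for (int k = c; k < c + SMOOTH_LEN && k < W; k++) { sum += hist[k]; cnt++; }
            smooth[c] = sum / cnt;
        }

        /* Scan left to right for separable hills above the threshold. */
        int in_hill = 0;
        for (int c = 0; c < W && n < max_regions; c++) {
            if (!in_hill && smooth[c] >= MIN_COLUMN_COUNT) {
                in_hill = 1;
                left[n] = c;
            } else if (in_hill && smooth[c] < MIN_COLUMN_COUNT) {
                in_hill = 0;
                right[n++] = c - 1;
            }
        }
        if (in_hill && n < max_regions) right[n++] = W - 1;
        return n;
    }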

    Stage 3: Matching People based on Nearest Neighbors

    After we have a set of candidate target regions (boxes) which we would like to consider as

    people, we can use the color information from the chroma channels (cb and cr) to help

    uniquely identify each person, so that we can track them more accurately. In order to do this,

    we compute two 16-tuple vectors, containing the averaged cb and cr values for 16 smaller

    boxes within the region of interest for each target box.

    Our algorithm requires that we initialize a similar set of vectors for each person we have

    correctly identified in the previous stages. To ensure that we do this reliably, we begin

    computing color-statistics only after at least 20 frames have elapsed. Such a time-frame allows

the background subtraction results to settle down, so that human forms are detected more accurately in the image sequence. We store each person we would like to track in a structure that associates his or her location with the color vectors for that ROI.

    Given a set of people we are interested in tracking, and a set of candidate targets, we can then


perform a nearest neighbors classification, in which we attempt to match each candidate target to the nearest tracked person in the color-statistics sense. We do this using an absolute-difference distance measure between the color statistics of each candidate target and those of each person we have been tracking.

For each candidate box, we compute its distance to each tracked box as the sum of absolute differences between their respective cb and cr vectors.

If the minimum distances d_cb and d_cr between the box and all tracked boxes are smaller than a threshold distance, we assign this box the same id as that of the tracked person it matched. We use this threshold mechanism to ensure that we are reasonably confident that this is the person we have been tracking up to that point in time.

A further improvement we have made attempts to reduce the variance in the color statistics of the persons we are trying to track: here we look more at the center of the person rather than the overall region.
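A minimal sketch of this matching step is given below, assuming the 16 smaller boxes form a 4x4 grid and using a sum of absolute differences as the distance measure; the structure layout, grid arrangement, and threshold value are illustrative assumptions, not the exact project code.

    #include <stdint.h>
    #include <stdlib.h>

    #define GRID 4                    /* 4x4 = 16 sub-boxes per region (assumed)  */
    #define MATCH_THRESHOLD 200       /* assumed SAD threshold for a valid match  */

    typedef struct {
        int cb[GRID * GRID];          /* averaged Cb value of each sub-box */
        int cr[GRID * GRID];          /* averaged Cr value of each sub-box */
    } ColorVec;

    typedef struct {
        int id;
        int x, y, w, h;               /* last known bounding box   */
        int score;                    /* temporal persistence score */
        ColorVec color;
    } Person;

    /* Average Cb/Cr over a 4x4 grid of sub-boxes inside the box (x,y,w,h).
     * cb/cr are full chroma planes of width `stride`; assumes w,h >= GRID. */
    void compute_color_vec(const uint8_t *cb, const uint8_t *cr, int stride,
                           int x, int y, int w, int h, ColorVec *out)
    {
        int bw = w / GRID, bh = h / GRID;
        for (int by = 0; by < GRID; by++) {
            for (int bx = 0; bx < GRID; bx++) {
                long scb = 0, scr = 0;
                for (int r = 0; r < bh; r++) {
                    for (int c = 0; c < bw; c++) {
                        int idx = (y + by * bh + r) * stride + (x + bx * bw + c);
                        scb += cb[idx];
                        scr += cr[idx];
                    }
                }
                int n = bw * bh;
                out->cb[by * GRID + bx] = (int)(scb / n);
                out->cr[by * GRID + bx] = (int)(scr / n);
            }
        }
    }

    /* Sum of absolute differences between two color vectors. */
    int color_sad(const ColorVec *a, const ColorVec *b)
    {
        int d = 0;
        for (int i = 0; i < GRID * GRID; i++)
            d += abs(a->cb[i] - b->cb[i]) + abs(a->cr[i] - b->cr[i]);
        return d;
    }

    /* Return the index of the tracked person nearest to the candidate in the
     * color-statistics sense, or -1 if no one is within the threshold.       */
    int match_candidate(const ColorVec *cand, const Person people[], int n_people)
    {
        int best = -1, best_d = MATCH_THRESHOLD;
        for (int i = 0; i < n_people; i++) {
            int d = color_sad(cand, &people[i].color);
            if (d < best_d) { best_d = d; best = i; }
        }
        return best;
    }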

    Stage 4: Updating the Feature Vectors of Known Individuals

To accommodate the arrival of new people in the frame, and to forget people who leave for more than several seconds, the system must update the feature vectors of known people.

    When a new person arrives he or she is given an initial score. For every successive frame

    containing this person, his or her score is incremented until the maximum score is reached. For

    every frame that does not contain this person, his or her score is decreased. When a score

    reaches zero the feature vector for that person is removed from memory, and that person is

    forgotten. Similarly, when a person arrives who does not match the known feature vectors he

    or she is added to memory. In this way new people are recognized by the system, and absent

    people are forgotten.

    Due to the non-uniform distribution of light across the room, our system must learn to adapt

when the chromatic values of a known person change slightly. Assuming the chromatic values


    of a person are still recognizable, a weighted average is taken with the latest chromatic feature.

    This operation weights the old chromatic features more heavily than the new values. In our

    design of the system we kept this weight parameter on a GEL slider, usually at 0.1 or 0.2

    depending on the lighting conditions.

    People having scores above a particular threshold are tracked and have boxes drawn around

    them with a corresponding color for a person.
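The following sketch, reusing the Person and ColorVec structures from the previous stage, illustrates the scoring and the weighted chroma update; the maximum and initial scores and the interpretation of the slider weight as the weight on the new values are assumptions for illustration, not the project code.

    #define MAX_SCORE   10      /* assumed cap on the persistence score        */
    #define INIT_SCORE   3      /* assumed score given to a newly seen person  */
    #define CHROMA_WEIGHT 0.2f  /* GEL-slider weight applied to the NEW values */

    /* Called once per frame for every known person.
     * `seen` is nonzero when the person was matched in the current frame;
     * `latest` holds the chroma vector measured this frame.                */
    void update_person(Person *p, int seen, const ColorVec *latest)
    {
        if (seen) {
            if (p->score < MAX_SCORE) p->score++;

            /* Weighted average: old statistics dominate, new values are
             * blended in slowly to follow gradual lighting changes.       */
            for (int i = 0; i < GRID * GRID; i++) {
                p->color.cb[i] = (int)((1.0f - CHROMA_WEIGHT) * p->color.cb[i]
                                       + CHROMA_WEIGHT * latest->cb[i]);
                p->color.cr[i] = (int)((1.0f - CHROMA_WEIGHT) * p->color.cr[i]
                                       + CHROMA_WEIGHT * latest->cr[i]);
            }
        } else if (p->score > 0) {
            p->score--;         /* person not seen this frame: decay the score */
        }
        /* When the score reaches zero the caller removes this entry so the
         * person is forgotten; unmatched candidates are added with INIT_SCORE. */
    }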

RESULTS FOR METHOD I

The results we obtained for Method I are shown below. There were two main demos that were displayed. Demo 0 shows the output of the adaptive background subtraction, so that one gets an idea of what the output looks like after this stage. Demo 1 shows the actual output that we obtained from Method I. If there are at most two people in a frame, the algorithm is able to track people quite well. In our final demo, when two people entered the frame, boxes were drawn around them. The boxes had a unique color assigned to them, via which we identified a particular person. When people crossed each other's paths, the algorithm was still able to track them and maintained the same box color assigned to each person. Our results were not good for three people, since the foreground and background became almost indistinguishable, which prevented reliable tracking compared to the two-person case. The frame rate of this algorithm is very slow (1 fps).


    Fig 2. Output of Demo 0.

Fig 3. Output of Demo 1.

METHOD II: A FORWARD-BACKWARD APPROACH

    A fusion of color statistics and adaptive background subtraction

    MOTIVATION

Our previous method relied heavily upon adaptive background subtraction and required that people move within each frame to be detected. While this is a practical method for detecting people, it tends to fail when people stop moving. Often people enter the frame

    and stop moving, thus resulting in a gradual fading out of the outline of each person

    being tracked, until we lose that person. This is obviously a problem, and we would like to

    use a way to work around having to rely solely upon motion detection to locate people.

Using the colors from people's clothing to track them is certainly not a new idea; we did use it to track people from frame to frame even in our preceding tracking method.

    The interesting change which we felt would help us track people involved computing

    color-statistics within the region which we believe contains a person (based on change-

    detection as we have done earlier), and then to use this information for each tracked

    person to look around their current locations to determine their exact locations in the

    frame. This can be done by performing a coarse scan around the last known locations of

    each tracked person in the frame, and then finding the location that produces the

minimum distance between the color-statistics from the scan and those for that person.

    This will naturally speed up the process of searching for the person and produce a much

    needed increase in our output frame-rate.


    Another feature that we might want to add to our algorithm is the ability to update the

    color statistics for each tracked person. As a person moves from a dimly lit area to one

    which is brighter, the color spectrum that is reflected from him/her results in a change in

    both the chrominance and luminance values. This results in a variation in the underlying

    color statistics that we have gathered for that person initially. Merging the changes

    gradually into that model will help us adjust to varying intensities more robustly.

    This brings us to the question - When should we update the color statistics for a person?

    We think a good solution would be to collect statistics from the frame by looking in a

    region which should ideally show smaller variance both spatially and temporally, as well as

    one which is easy to compare with statistics we might gather in the future. The easiest way

    to know we have found a person is to look for a large moving blob. We have already

    discussed how to develop such a detector and we can employ this to locate a person who is

    moving. By looking in the center of this region and averaging the values over a decent area,

    we can form a feature vector with color statistics from the two chroma channels -cb and

    cr. However, we would certainly want to avoid updating these statistics initially because we

would like our background subtraction algorithm to settle down to a stable state, at which point only people (modulo camera measurement/acquisition noise) cause changes

    in our frames.

Is there another situation which we might want to be wary of? We believe the answer is

    yes! Although a large blob is a good indicator of a well localized person in the image

    (assuming it really is a person), things could get complicated if there are two people who

    are moving in the view and cross each other. This results in occlusion, and if we attempt to

    update color-statistics at this point of time, we will be introducing errors in the values for

    both people, which is something we certainly want to avoid.

    Background subtraction is susceptible to trail-effects when a person moves rapidly

between subsequent frames. This results in poorly localized boxes, producing erroneous color averages because we might pick up cb and cr values which are part of the background through which the person just passed. A solution would involve modeling the speed of the person's movement, which would let us know when the motion might be producing a trail effect, during which color updates should be avoided. At the time of writing this report we have not attempted to solve this problem.


    DESCRIPTION OF THE ALGORITHM

Figure 4 contains a flowchart of the algorithm which we employ in this second method of tracking people. The motion detection stage based on adaptive background subtraction has been discussed in the previous algorithm. We reuse the next stage from the previous algorithm, which locates regions of the image containing movement, to detect people who are moving.

    These locations are used to update color statistics for each person as well as to update the

    locations of each tracked person during the fusion stage.

Fig 4. Flowchart for the fused people tracker.

Since many of the initial stages of the new algorithm are essentially the same as those used earlier, we will only cover the sections which are functionally different here.


    Stage 3: Color Updates

    In this stage, we compute the color statistics for each location that has been detected by

    the motion-detection stage. We compare these statistics with those of each tracked

    person whose statistics we have already computed. If we find that these are very similar

    and that the location of this person is very close to the previous known location of the

tracked person, we merge the statistics with those of that person. As a special case, we ensure that we do not merge statistics for boxes that are far apart in the image, so that two people wearing similar clothing are still tracked individually, as they should be. If no match is found between the people we have been tracking and the present one, we add this person to the list of people we are tracking and store the person's color statistics and location in the structure.
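A sketch of this decision logic is shown below; it reuses the ColorVec/Person types and the color_sad() and update_person() helpers sketched for Method I, and the similarity and nearness thresholds are illustrative assumptions rather than the project's values.

    #include <stdlib.h>

    #define COLOR_SIM_THRESHOLD 200   /* assumed: "very similar" color statistics */
    #define NEAR_DIST_THRESHOLD  40   /* assumed: "very close" in pixels          */

    /* Decide what to do with one motion-detected box (x,y,w,h): merge its
     * color statistics into a nearby, similarly colored tracked person, or
     * add it as a new person. Returns the index of the person it was merged
     * into, the index of a newly added person, or -1 if the array is full.  */
    int color_update_stage(Person people[], int *n_people, int max_people,
                           int x, int y, int w, int h, const ColorVec *meas)
    {
        for (int i = 0; i < *n_people; i++) {
            int dx = people[i].x - x, dy = people[i].y - y;
            int near  = (abs(dx) < NEAR_DIST_THRESHOLD && abs(dy) < NEAR_DIST_THRESHOLD);
            int alike = (color_sad(meas, &people[i].color) < COLOR_SIM_THRESHOLD);

            /* Merge only when the box is both similar in color AND close to the
             * person's last location, so two people in similar clothing at
             * opposite ends of the frame keep separate identities.            */
            if (near && alike) {
                people[i].x = x;  people[i].y = y;
                people[i].w = w;  people[i].h = h;
                update_person(&people[i], 1, meas);
                return i;
            }
        }

        /* No match: start tracking a new person. */
        if (*n_people < max_people) {
            int i = (*n_people)++;
            Person *p = &people[i];
            p->id = i + 1;
            p->x = x;  p->y = y;  p->w = w;  p->h = h;
            p->score = INIT_SCORE;
            p->color = *meas;
            return i;
        }
        return -1;
    }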

    Stage 4 : Finding the location of people based on their color statistics

    We scan around and at the last known location of each tracked person, and compute a

    SAD (sum of absolute differences) score between the averaged values of each scanned

    location and those from the tracked person. This gives us a set of scores, and we assume

    that the location with the lowest score is the most likely to be the location of the person

based on a similar color search in the image. Another advantage of this approach is that it constrains the search space to a smaller area, which results in faster algorithm execution. Otherwise, we would have to scan a much larger area of the image, which would increase the computational requirements. To ensure that we are not dealing with occlusions, we do not merge the color statistics of the present tracked person with its previous statistics if another tracked person with very similar color statistics is closer than a certain threshold distance. Occluded color statistics are not very helpful in estimating the

    location of the person in the frame reliably.
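The coarse scan can be sketched as follows, reusing compute_color_vec() and color_sad() from the Method I sketches; the search radius and step size are illustrative assumptions.

    #include <limits.h>
    #include <stdint.h>

    #define SEARCH_RADIUS 24   /* assumed half-width of the search window (pixels) */
    #define SCAN_STEP      8   /* assumed coarse step between scan positions       */

    /* Scan coarsely around a person's last known position and return, through
     * *bx/*by, the location whose local color statistics are closest (lowest
     * SAD) to the person's stored statistics. Returns the best SAD score so
     * the caller can compare it against a threshold.                          */
    int coarse_color_search(const uint8_t *cb, const uint8_t *cr, int stride,
                            int frame_w, int frame_h,
                            const Person *p, int *bx, int *by)
    {
        int best = INT_MAX;
        ColorVec v;

        for (int dy = -SEARCH_RADIUS; dy <= SEARCH_RADIUS; dy += SCAN_STEP) {
            for (int dx = -SEARCH_RADIUS; dx <= SEARCH_RADIUS; dx += SCAN_STEP) {
                int x = p->x + dx, y = p->y + dy;

                /* Stay inside the frame. */
                if (x < 0 || y < 0 || x + p->w > frame_w || y + p->h > frame_h)
                    continue;

                compute_color_vec(cb, cr, stride, x, y, p->w, p->h, &v);
                int d = color_sad(&v, &p->color);
                if (d < best) { best = d; *bx = x; *by = y; }
            }
        }
        return best;
    }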

    Stage 5 : Fusion of color statistics and Background subtraction

The tracked people are maintained in an array; each entry contains the color statistics vectors associated with that person, an id for the person, a score and a flag that indicates whether the person is being tracked. We employ a scoring mechanism like the one described earlier to allow new people to be added to those already being tracked, as well as to forget people who have left the camera's view. The algorithm implemented in the code we are submitting at the time of writing this report fuses the information from the outputs of our motion detection and color-statistics matching modules only implicitly, by updating the locations of tracked people. We would like to use a more intelligent method of performing this fusion in the future.
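For illustration only, the sketch below shows one possible per-frame wiring of the stages above; it reuses the helpers sketched earlier, and the Box type, the MAX_PEOPLE bound, and the explicit score decay are our assumptions rather than the submitted code, which performs this fusion only implicitly.

    #include <stdint.h>

    #define MAX_PEOPLE 8    /* assumed upper bound on simultaneously tracked people */

    typedef struct { int x, y, w, h; } Box;   /* motion-detected bounding box */

    /* One iteration of the fusion stage: motion-detected boxes update color
     * statistics and scores, every person's location is refined by the coarse
     * color search, and people whose score has decayed to zero are dropped.  */
    void fusion_stage(Person people[], int *n_people,
                      const uint8_t *cb, const uint8_t *cr, int stride,
                      int frame_w, int frame_h,
                      const Box boxes[], const ColorVec vecs[], int n_boxes)
    {
        int seen[MAX_PEOPLE] = {0};

        /* 1. Feed every motion-detected box into the color-update stage. */
        for (int i = 0; i < n_boxes; i++) {
            int m = color_update_stage(people, n_people, MAX_PEOPLE,
                                       boxes[i].x, boxes[i].y,
                                       boxes[i].w, boxes[i].h, &vecs[i]);
            if (m >= 0) seen[m] = 1;
        }

        /* 2. Refine each person's location by the coarse color search; people
         *    not seen by the motion detector have their score decayed.        */
        for (int i = 0; i < *n_people; i++) {
            int x, y;
            if (coarse_color_search(cb, cr, stride, frame_w, frame_h,
                                    &people[i], &x, &y) < MATCH_THRESHOLD) {
                people[i].x = x;
                people[i].y = y;
            }
            if (!seen[i] && people[i].score > 0)
                people[i].score--;
        }

        /* 3. Forget people whose score has reached zero. */
        for (int i = 0; i < *n_people; ) {
            if (people[i].score == 0)
                people[i] = people[--(*n_people)];
            else
                i++;
        }
    }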


RESULTS FOR METHOD II

The results we obtained in Method II are shown below. The algorithm performed better in terms of speed and accuracy. The sub-sampling of the frames sped up the entire process quite significantly. The tracking is more consistent, but has a jittery output since we perform a coarse sampling around the tracked location. However, there are very few cases of sporadic boxes belonging to a tracked person appearing on the screen. In general, this method seemed more robust and consistent than Method I. This algorithm is very sensitive to occlusions, which we have not handled well. When people pass each other, their color statistics may get updated with values belonging to the other person, and this prevents the algorithm from subsequently tracking at least one of these people accurately. Also, noisy frames sometimes cause the background itself to be tracked. The biggest advantage of this algorithm is that it can track people who are stationary after they have been captured by the tracker.


    Figure 5: Display for Method II

    OPTIMIZATIONS

    THE SOBEL-EDGE DETECTION ROUTINE

    The Sobel edge transform is a pixel-wise operation that consumes many cycles. Memory

    latencies and matrix operations are two areas we optimized. The benefits of restructuring the

    Sobel routine were noticeable immediately, despite the nondeterministic behavior of memory

    access.

    In external memory, images are stored one row after another. Therefore fetching pixels in row

    order will reduce access time. To improve efficiency, the outer loop iterates across each row

    while the inner loop iterates across each column.

    The Sobel transform operates on each pixel and its eight adjacent pixels. Instead of fetching

    nine array elements on every iteration we want to store the reusable pixels on the stack and

only fetch three new pixels. The following example shows how we restructured the routine

    and unrolled the Sobel transform.


Original Code:

    for y, 0 to Image Height
        for x, 0 to Image Width
            /* Sobel Transform */
            edge1 = 0
            edge2 = 0
            for i, in {0, 1, 2}
                for j, in {0, 1, 2}
                    pixel = Image[x+i][y+j]
                    edge1 += pixel * sobel1[i][j]
                    edge2 += pixel * sobel2[i][j]
                end
            end
            /* More processing */
        end
    end

Better Memory Access and Loop Unrolling:

    for y, 0 to Image Height
        /* Store first nine pixels in local variables */
        a11 = Image[0][y]
        a12 = Image[1][y]
        a13 = Image[2][y]
        ...
        a33 = Image[2][y+2]
        for x, 0 to Image Width
            /* Sobel Transform */
            edge1 = a11 + 2*a12 + a13 - ... - a33
            edge2 = a11 + 2*a21 + a31 - ... - a33
            /* Shift pixel values, fetch new pixel */
            a11 = a12
            a12 = a13
            a13 = Image[x+3][y]
            ...
            a33 = Image[x+3][y+2]
            /* More processing */
        end
    end

Table 1: Optimizations made to improve the memory access pattern and operational cost of the Sobel edge transform.
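In C, the restructured routine corresponds roughly to the sketch below; the function name, buffer types, and the combination of the two gradient responses into a single magnitude are illustrative assumptions.

    #include <stdint.h>
    #include <stdlib.h>

    /* Row-ordered, loop-unrolled Sobel pass in the spirit of Table 1: the outer
     * loop walks rows (matching the row-major layout in external memory), the
     * 3x3 neighbourhood is kept in locals, and only three new pixels are
     * fetched per step.                                                        */
    void sobel_edges(const uint8_t *img, int16_t *mag, int w, int h)
    {
        for (int y = 1; y < h - 1; y++) {
            const uint8_t *r0 = img + (y - 1) * w;   /* row above   */
            const uint8_t *r1 = img +  y      * w;   /* current row */
            const uint8_t *r2 = img + (y + 1) * w;   /* row below   */

            /* Prime the 3x3 window with the first three columns. */
            int a11 = r0[0], a12 = r0[1], a13 = r0[2];
            int a21 = r1[0], a22 = r1[1], a23 = r1[2];
            int a31 = r2[0], a32 = r2[1], a33 = r2[2];

            for (int x = 1; x < w - 1; x++) {
                /* Unrolled Sobel responses for the two gradient directions. */
                int gy = a11 + 2 * a12 + a13 - a31 - 2 * a32 - a33;
                int gx = a11 + 2 * a21 + a31 - a13 - 2 * a23 - a33;

                mag[y * w + x] = (int16_t)(abs(gx) + abs(gy));

                /* Shift the window right and fetch one new column. */
                a11 = a12; a12 = a13; a13 = (x + 2 < w) ? r0[x + 2] : 0;
                a21 = a22; a22 = a23; a23 = (x + 2 < w) ? r1[x + 2] : 0;
                a31 = a32; a32 = a33; a33 = (x + 2 < w) ? r2[x + 2] : 0;
            }
        }
    }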

    SUBSAMPLING OF THE INPUT VIDEO SEQUENCE

The overall efficiency and speed of the output increased greatly when we used a sub-sampled version of the input video sequence in the Y channel. Without sub-sampling we were able to obtain a frame rate of only about 1 frame per second. With sub-sampling, our frame rate went up to almost 4 Hz, giving us much better display feedback. Down-sampling the Y channel to roughly half the number of pixels improved the speed of Method II more than that of Method I.
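A minimal sketch of such a sub-sampling step is given below; the 2:1 horizontal decimation with pair averaging is an assumed pattern, not necessarily the one used in the project.

    #include <stdint.h>

    /* Simple 2:1 horizontal decimation of the Y plane, roughly halving the
     * number of luminance pixels the later stages have to touch.           */
    void subsample_y(const uint8_t *y, uint8_t *y_small, int w, int h)
    {
        for (int r = 0; r < h; r++)
            for (int c = 0; c < w / 2; c++)
                /* Average each horizontal pair of pixels to limit aliasing. */
                y_small[r * (w / 2) + c] =
                    (uint8_t)((y[r * w + 2 * c] + y[r * w + 2 * c + 1] + 1) >> 1);
    }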

    FRAME REJECTION

There was a problem of stray boxes showing up during the display of the demo. This happened because a shift in the frame sequence occurred every 4 frames. This shift created the illusion, as far as the algorithm was concerned, that something was moving in the frame sequence. We tried to reject the bad frames by computing the difference between the edge values in successive frames. Using a sum of absolute differences (SAD) measure and thresholding, we rejected frames that had a SAD value of more than 6500 for a given set of pixels. Unfortunately the method did not work very well, and sometimes even good frames were rejected.
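The frame-rejection test can be sketched as follows; the buffer and function names are illustrative, while the 6500 threshold is the value quoted above.

    #include <stdint.h>
    #include <stdlib.h>

    #define SAD_REJECT_THRESHOLD 6500   /* threshold quoted in the text */

    /* Sum of absolute differences between the edge maps of the current and
     * previous frames over a given set of pixels; a frame whose SAD exceeds
     * the threshold is treated as a corrupted (shifted) frame and skipped.  */
    int frame_should_be_rejected(const int16_t *edges_now,
                                 const int16_t *edges_prev,
                                 int n_pixels)
    {
        long sad = 0;
        for (int i = 0; i < n_pixels; i++)
            sad += abs(edges_now[i] - edges_prev[i]);
        return sad > SAD_REJECT_THRESHOLD;
    }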

    CONCLUSIONS

From the results of Methods I and II, we can conclude that each method has its own set of pros and cons.

Although the output from Method I produces a few stray boxes due to noisy frames, it localizes

    a person who is moving very accurately, and can track people even when they pass each other

    in the frame. We feel this algorithm would give us very promising results if we could get it to

run faster than the appalling 1 fps we currently get from the system. Execution of such an

    algorithm at higher speeds will allow us to constrain the location of tracked boxes (as a person

    cannot be expected to move very far from his previous location if the next frame was captured

    a short interval later). This would enable us to filter out many stray boxes, while also

    producing a more useful and robust tracking output.

If we had to implement this in a practical setting on a more constrained platform, such as one based on the TMS320C6416 DSK, Method II would seem to show better performance, especially in video surveillance and other security applications where the speed and accuracy of the tracking are quite important. The idea of searching laterally outside a box assigned to a

particular person in Method II worked well and gave the added stability required for a tracking

    algorithm. Coarse scanning to localize each tracked person also helps speed up algorithm

    execution. These improvements in speed help us apply temporal filtering to constrain the

    movements of each person, which in turn helps us improve our tracking consistency.

    To be practically feasible however, this algorithm needs to be strengthened with the capability

    to handle occlusions. We have not been able to implement a reliable method of handling these

    problems due to occlusion at this time, and would like to develop a strategy that overcomes this

    crippling problem.

    To sum up, the first algorithm is accurate and quite good at tracking people even when they

    cross over. However, it is susceptible to noisy camera images, has a very slow frame rate and

cannot track people who have stopped moving. The second algorithm overcomes these disadvantages of the first algorithm, but fails to track people who occlude each other. In the

    future we would like to address this issue, which will help us realize a more practical people

    tracking system which can be made to work on relatively constrained real-time systems.

    References

[1] Jain, R., Kasturi, R., Schunck, B. G., Machine Vision, McGraw-Hill Inc., 1995.

[2] McKenna, S. J., Jabri, S., Duric, Z., Wechsler, H., Rosenfeld, A., "Tracking Groups of People", Computer Vision and Image Understanding, Vol. 80, No. 1, pp. 42-56, 2000.

[3] Moeslund, T. B., Granum, E., "A Survey of Computer Vision-Based Human Motion Capture", Computer Vision and Image Understanding, Vol. 81, No. 3, pp. 231-268, 2001.

[4] Bouchrika, I., Nixon, M. S., "People Detection and Recognition using Gait for Automated Visual Surveillance", The Institution of Engineering and Technology Conference on Crime and Security (ICDP 2006), 2006.

[5] Negre, A., Tran, H., Gourier, N., Hall, D., Lux, A., Crowley, J. L., "Comparative Study of People Detection in Surveillance Scenes", Lecture Notes in Computer Science, No. 4109, pp. 100-108, 2006.

[6] Gavrila, D., "Pedestrian Detection from a Moving Vehicle", ECCV 2000, Vol. II, pp. 37-49.