
Object Tracking and Face Recognition in Video Streams

Linus Nilsson

June 6, 2012
Bachelor's Thesis in Computing Science, 15 credits

Supervisor at CS-UmU: Niclas Börlin
Supervisor at CodeMill: Martin Wuotila Isaksson

Examiner: Jonny Pettersson

Umeå University
Department of Computing Science

SE-901 87 UMEÅ
SWEDEN


Abstract

The goal of this project was to improve an existing face recognition system for video streams by using adaptive object tracking to track faces between frames. The knowledge of which faces occur and do not occur in subsequent frames was used to filter out false faces and to better identify real ones.

The recognition ability was tested by measuring how many faces were found and how many of them were correctly identified in two short video files. The tests also looked at the number of false face detections. The results were compared to a reference implementation that did not use object tracking.

Two identification modes were tested: the default and strict modes. In the default mode, whichever person is most similar to a given image patch is accepted as the answer. In strict mode, the similarity also has to be above a certain threshold.

The first video file had a fairly high image quality. It had only frontal faces, one at a time. The second video file had a slightly lower image quality. It had up to two faces at a time, in a larger variety of angles. The second video was therefore a more difficult case.

The results show that the number of detected faces increased by 6-21% in the two video files, for both identification modes, compared to the reference implementation.

Meanwhile, the number of false detections remained low. In the first video file, there were fewer than 0.009 false detections per frame. In the second video file, there were fewer than 0.08 false detections per frame.

The number of faces that were correctly identified increased by 8-22% in the two video files in default mode. In the first video file, there was also a large improvement in strict mode, as it went from recognising 13% to 85% of all faces. In the second video file, however, neither implementation managed to identify anyone in strict mode.

The conclusion is that object tracking is a good tool for improving the accuracy of face recognition in video streams. Anyone implementing face recognition for video streams should consider using object tracking as a central component.


Contents

1 Introduction
1.1 Aim
1.2 Detailed goals
1.3 Related work

2 The TLD algorithm
2.1 Object model
2.2 Tracking
2.3 Detection
2.4 Learning
2.5 Summary

3 Implementation
3.1 Terminology
3.2 Wawo
3.3 The reference implementation
3.4 Integrating Faceclip
3.5 Integrating object tracking

4 Experiments
4.1 Simple
4.2 Complex

5 Results
5.1 Simple
5.1.1 Reference
5.1.2 Object tracking
5.1.3 Faceclip
5.1.4 Faceclip with object tracking
5.1.5 Summary
5.2 Complex
5.2.1 Reference


5.2.2 Object tracking
5.2.3 Faceclip
5.2.4 Faceclip with object tracking
5.2.5 Summary

6 Conclusions
6.1 Limitations and future work

References

A Results


List of Figures

2.1 Forward-backward error
2.2 Short-term tracking
2.3 Detection fern
2.4 Detection forest
2.5 Affine transformations
2.6 N-P constraints
2.7 Flow of TLD
3.1 Frame processing
3.2 Trajectory amendment
4.1 Frames from the Simple video
4.2 Frames from the Complex video
5.1 Simple results
5.2 Processed Simple frames
5.3 Complex results
5.4 Processed Complex frames


List of Tables

A.1 Results for the Simple experiment, using the default Wawo mode.
A.2 Results for the Simple experiment, using strict Wawo mode.
A.3 Results for the Complex experiment, using the default Wawo mode.
A.4 Results for the Complex experiment, using strict Wawo mode.


Chapter 1

Introduction

Vidispine[1] is a software system for storing and processing video content. One component in Vidispine is the transcoder, which consists of a set of subcomponents and plugins that together can transform a video stream in various ways. It can for example convert a stream between different formats, or analyse the stream to extract information. CodeMill has created a face recognition plugin that uses the library OpenCV[2] for detecting faces in a video stream and Wawo[3] for identifying them. The purpose is to make it possible to extract and save information about which people occur in which frames so that the information can later be looked up when needed. One usage scenario is with surveillance cameras, where it is often necessary to identify people.

[1] http://www.vidispine.com/
[2] http://opencv.willowgarage.com/
[3] http://www.wawo.com/

The problem is that the face recognition sometimes fails, either by not being able to make an identification, or by making an invalid identification. This project is about improving the success rate by taking more context into account, which is done by tracking how the faces move between frames.

1.1 Aim

The plugin currently used by Codemill does its job by looking at one video frame at a time, trying to detect and identify the faces that occur in that frame. The goal of this project is to integrate a library for tracking a moving object between frames, which can be used to track a face and recognise when the face belongs to the same person, even if Wawo fails to see that. If the object tracker sees that a particular face occurs in ten adjacent frames, but Wawo only recognises the person's face in eight of those frames, then chances are that it is still the same person in the two frames where Wawo failed. It should be possible to use this knowledge to improve the results.

1.2 Detailed goals

Face recognition generally consists of two primary tasks: detecting where there is a face in a frame or picture, and identifying whose face it is (Torres, 2004). Either task can fail for a number of reasons. For example, the system may only know how to recognise someone from a frontal picture, or it may only recognise a face in a particular "ideal", or neutral, state. When the system fails, it can be by identifying the face as belonging to the wrong person, or it can be by failing to make an identification at all.

This project aims to improve the ability to recognise a person in a non-ideal state or angle. By tracking how a face moves, the system has the potential to recognise that it is the same face when it moves between an ideal and a non-ideal state or angle, even if Wawo no longer recognises it. The system can then make a qualified guess that the identity is the same in the two (or more) states. If Wawo makes contradictory identifications for a single face in different frames, the identity that occurs most often can reasonably be assumed to be the right one.

The goal is to use a system called OpenTLD (Kalal, 2011b). It is an implementation of TLD, a set of algorithms that handle object tracking in video streams, i.e. tracking where objects move between frames. TLD has the bonus of being adaptive, making it more flexible than the OpenCV face detector. The original OpenTLD is written in Matlab, but some C++ ports should be explored and used instead if possible.

The integration of OpenTLD with Vidispine will be general, allowing it in theory to track any object. This will be accomplished by writing a general plugin for object tracking. The general object-tracking plugin will then be used by the face recognition plugin to track faces.

An object can be tracked both forwards and backwards in a video stream. Tracking an object forwards is the obvious case, and is the first thing that should be implemented. If the face is only found in the middle of a scene, it may also be worth tracking the face backwards to see how long it has been there. Therefore, support for tracking backwards should be explored.

A library called Faceclip (Rondahl, 2011) will be evaluated to see whether it is better than the current system at detecting faces in a frame, and may take over that task. Like the current code, Faceclip is based on OpenCV. It has, however, been adjusted to do a better job, mainly by performing a larger number of tests. Before making the decision to use Faceclip, it should be evaluated; both the results and the performance should be taken into account to some degree.

The current Wawo-based code will be used to do the actual face identification. A bonus feature is to add support for detecting and reporting the direction of a face in a given frame.

The two transcoder plugins, object tracking and face detection/identification, will be combined in a higher-level face-recognition transcoder plugin.

The current plugin will be used as the reference system when evaluating the recognition results and performance of the project.

In short, the goals for this project are:

G1 Create a transcoder plugin for object tracking, using OpenTLD, and integrate it with the face recognition plugin. The first goal is to track objects forward.

G2 Support tracking objects backwards

G3 Implement and evaluate Faceclip as an alternative to the current face detection code.

G4 Support finding the direction of detected faces.


1.3 Related work

Kalal et al. (2010a) had positive results when using a modified version of their TLD system for face tracking. In this case, they used an existing face detector. Instead of building an object detector, they modified the learning process of TLD to build a face validator, which validates whether an image patch, given by the tracker or the face detector, is the sought face.

Nielsen (2010) did research into face recognition, using continuity filtering to find false face detections, and Active Appearance Models to identify faces. He found that the continuity filtering reduced the number of false faces to be identified.


Chapter 2

The TLD algorithm

TLD (Tracking, Learning, Detection) is a set of algorithms that together try to achieve good long-term object tracking in video streams (Kalal et al., 2011).

TLD has three main processes that it uses to achieve its goal: short-term tracking for tracking small movements from one frame to the next, object detection for re-detecting lost objects, and training to teach the object detector what the object looks like.

The goal of the short-term tracker is to follow an object as it moves in a trajectory in the frame sequence. The tracker may lose track of the object if the object disappears for a few frames, becomes partly occluded, or moves a large distance between two successive frames, such as when the scene changes.

The object detector can re-detect the location of the object, so that tracking can continue from the new, detected location. By doing this, an object can potentially be tracked through any video sequence, regardless of the smoothness of its movements, and regardless of whether the object is occluded at any stage. This is called long-term tracking.

TLD can track arbitrary objects. Given an image patch (a part of the image defined by a bounding box) from the first frame, TLD learns the appearance of the object defined by the image patch, and starts tracking it.

The following sections describe TLD in more detail, with regard to how it is implemented in OpenTLD. Parts of OpenTLD are described by Kalal et al. (2010b, 2011). Other parts are only described in the source code, which may be found at Kalal (2011a).

2.1 Object model

A central component of the TLD algorithms is the object model: a model of what the tracked object looks like. The object model is defined by a current set of positive and negative examples. An image patch that is similar to the positive examples and dissimilar to the negative examples is considered to be similar to the model.

Similarity is defined using normalised cross-correlation (NCC) (Gonzales and Woods, 2008, Ch. 12.2). The NCC is calculated between the image patch and each positive and negative example, which gives a value between -1 and 1 for each example. An NCC of 1 means that it is a perfect match, -1 means that it is an inverse match, and 0 means that there is no correlation found between the two images. To better match the usage scenario, the NCC values are moved into the interval [0, 1]: NCC' = (NCC + 1)/2. The highest NCC' max_P among the positive examples and the highest NCC' max_N among the negative examples are used to calculate the similarity value as in Equation 2.1.


distance_P = 1 - max_P,
distance_N = 1 - max_N, and
similarity = distance_N / (distance_N + distance_P).    (2.1)

The equations say that the similarity value depends on the similarity distance to the positive and the negative examples. If the distance to the most similar negative example is larger than the distance to the most similar positive example, the similarity value will be high.
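To make the similarity measure concrete, the following Python sketch computes Equation 2.1 for one image patch against the positive and negative examples of the object model. It assumes that all patches have already been resampled to a common size and are given as grey-scale NumPy arrays; the function names are illustrative and not taken from OpenTLD.

```python
import numpy as np

def ncc(a, b):
    """Normalised cross-correlation between two equal-sized grey-scale patches,
    giving a value in [-1, 1]."""
    a = (a - a.mean()) / (a.std() + 1e-10)
    b = (b - b.mean()) / (b.std() + 1e-10)
    return float((a * b).mean())

def similarity(patch, positives, negatives):
    """Relative similarity of a patch to the object model (Eq. 2.1)."""
    ncc_prime = lambda example: (ncc(patch, example) + 1.0) / 2.0   # shift into [0, 1]
    dist_p = 1.0 - max(ncc_prime(p) for p in positives)
    dist_n = 1.0 - max(ncc_prime(n) for n in negatives)
    return dist_n / (dist_n + dist_p)
```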

2.2 Tracking

Tracking is implemented using a median-flow tracker based on the Lucas-Kanade method (LK) (Kalal et al., 2011; Lucas and Kanade, 1981). A set of feature points p_{f,k} are selected from a rectangular grid in the currently tracked image patch in the current frame f. In the next frame, the new position of each selected point is estimated using LK, which gives a new set of points p_{f+1,k}.

To determine which of the points are reliable, the NCC value is calculated from a 2x2 pixel area around each pair of points p_{f,k} and p_{f+1,k}. Furthermore, the forward-backward error (FBE) is measured (Kalal et al., 2010b). Each point p_{f+1,k} is tracked in reverse from frame f+1 to frame f, using LK. This produces a point p'_{f,k}, whose distance to p_{f,k} is computed to get an error estimate; the further p'_{f,k} is from p_{f,k}, the larger the error. If the median of the FBE for all points is too large, the result from the tracker is considered invalid (see Figure 2.1). Otherwise, all points with an NCC above the median NCC and an FBE below the median FBE are selected.

The motion m of the object is calculated from the median of the estimated point motions (see Figure 2.2):

m_k = p_{f+1,k} - p_{f,k}, for all k, and
m = median_k(m_k).    (2.2)

The new size of the object is calculated from a scale-change factor s:

d_{i,j} = distance(p_{f,i}, p_{f,j}),
d'_{i,j} = distance(p_{f+1,i}, p_{f+1,j}), and
s = median_{i ≠ j}(d'_{i,j} / d_{i,j}).    (2.3)

The confidence of the tracker is calculated as the similarity between the object model and the image patch defined by the new bounding box.
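The following Python sketch illustrates one step of the median-flow tracker (Equations 2.2 and 2.3), using OpenCV's pyramidal Lucas-Kanade implementation. The grid size, the maximum forward-backward error and the omission of the NCC-based point filter are simplifications of my own; the real OpenTLD code differs in several details.

```python
import numpy as np
import cv2

def median_flow_step(prev_gray, next_gray, bbox, grid=10, max_fbe=10.0):
    """One short-term tracking step (Eqs. 2.2 and 2.3) for an (x, y, w, h) box.
    Returns (motion, scale) or None if the tracker result is invalid.
    The NCC-based point filter described in the text is omitted for brevity."""
    x, y, w, h = bbox
    # Feature points on a regular grid inside the current bounding box.
    xs, ys = np.linspace(x, x + w, grid), np.linspace(y, y + h, grid)
    pts = np.array([[px, py] for py in ys for px in xs], np.float32).reshape(-1, 1, 2)

    # Forward and backward Lucas-Kanade tracking.
    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(next_gray, prev_gray, fwd, None)
    fbe = np.linalg.norm((pts - bwd).reshape(-1, 2), axis=1)   # forward-backward error

    if np.median(fbe) > max_fbe:
        return None                       # tracker result considered invalid (Figure 2.1)

    # Keep points successfully tracked both ways with FBE below the median.
    keep = (st_f.ravel() == 1) & (st_b.ravel() == 1) & (fbe <= np.median(fbe))
    p0, p1 = pts.reshape(-1, 2)[keep], fwd.reshape(-1, 2)[keep]

    motion = np.median(p1 - p0, axis=0)                        # Eq. 2.2

    # Scale change: median ratio of pairwise point distances (Eq. 2.3).
    i, j = np.triu_indices(len(p0), k=1)
    d0 = np.linalg.norm(p0[i] - p0[j], axis=1)
    d1 = np.linalg.norm(p1[i] - p1[j], axis=1)
    scale = np.median(d1 / (d0 + 1e-10))
    return motion, scale
```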


Figure 2.1: The images depict two subsequent frames, showing two feature points in an object. The points are tracked both forward and backward. One point is properly tracked back to its origin, while the other ends up somewhere else.

Figure 2.2: The images depict two subsequent frames, showing three feature points in an object. The tracker finds the three feature points in both frames. The estimated movement of each one is indicated by the arrows. The movement of the whole object is estimated as the median of all feature point movements. The scale-change is calculated from the change in internal distance between points in the two frames.


Figure 2.3: Example of applying a fern with three object features (upper left rectangle) on an image patch of size 10x7 pixels (lower left rectangle). The image patch depicts a grey object on a white background. The bit value generated by each object feature is 1 if the pixel marked "a" is brighter than the pixel marked "b", otherwise 0. The fern bit-sequence for this image patch is 101. The corresponding leaf node contains the estimated probability 0.92, which means that of all the training data where this fern has generated the bit-sequence 101, 92% has been positive data.

2.3 Detection

Object (re-)detection in TLD is a three-stage process (Kalal et al., 2011). The first stage removes image patches with too small a variance in pixel intensity, based on the variance of the initial tracked patch. This is a relatively quick operation, and works as an efficient initial filter. The second stage is based on a randomised forest of ferns (Breiman, 2001) that attempts to find image patches in the current frame that are similar to the tracked object. Like the first stage, this is a relatively quick operation. In the third stage, the patches that passed the previous stages are compared to the object model, which is a slower but more accurate operation. The first stage is based on well-known statistical variance, while the third stage was described in Section 2.1. Therefore, this section focuses on describing the second stage, which is also the most involved and novel operation of the three.

A fern is a simple tree with only leaf nodes and a set of associated object features. A fern with n object features will have 2^n leaf nodes. An object feature is a boolean operation on the intensity difference between two pixel positions within an image patch. The pixel pair of each feature is initialised at startup to random relative positions within the image patch.

Each leaf node contains an estimated probability that an image patch depicts the tracked object. The node has the value p/(p + n), where p and n are the number of positive and negative image patches that have corresponded with that leaf node during training (see Section 2.4).

Applying an object feature to an image patch produces one bit, either 1 or 0, depending on the outcome of the intensity comparison. The output bits from all object features in the fern together make up a feature vector, which is used to uniquely select a leaf node (see Figure 2.3).
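A fern can be sketched as follows in Python: each of the n features compares two fixed pixel positions, the resulting bits select one of the 2^n leaves, and each leaf keeps the positive and negative counts that define its probability p/(p + n). The class layout and parameter choices are illustrative only, not taken from the OpenTLD source.

```python
import numpy as np

class Fern:
    """A single fern: n random pixel-pair features, 2**n leaves, each leaf
    storing the positive/negative counts seen during training (a sketch; the
    feature layout in OpenTLD differs in detail)."""

    def __init__(self, n_features, patch_shape, rng=np.random.default_rng(0)):
        h, w = patch_shape
        # Each feature compares the intensity at pixel "a" with pixel "b".
        self.a = rng.integers(0, h * w, n_features)
        self.b = rng.integers(0, h * w, n_features)
        self.pos = np.zeros(2 ** n_features)
        self.neg = np.zeros(2 ** n_features)

    def code(self, patch):
        """Feature vector of the patch, packed into one integer leaf index."""
        flat = patch.ravel()
        bits = flat[self.a] > flat[self.b]
        return int(bits.dot(1 << np.arange(len(bits))))

    def train(self, patch, is_positive):
        leaf = self.code(patch)
        if is_positive:
            self.pos[leaf] += 1
        else:
            self.neg[leaf] += 1

    def probability(self, patch):
        """Estimated probability p/(p + n) stored in the selected leaf."""
        leaf = self.code(patch)
        total = self.pos[leaf] + self.neg[leaf]
        return self.pos[leaf] / total if total > 0 else 0.0
```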


Figure 2.4: A forest has a number of independent trees. The image is compared with each tree, which produces one probability measure per tree. If the mean of all probabilities is above a certain threshold, the forest accepts the image patch as the object.

Detection is done using a sliding window of varying size, where each window defines an image patch. At each position, the detector asks each fern for the probability that the given image patch is a match. As illustrated by Figure 2.4, after all ferns are processed, the image patch is accepted by the forest if the mean probability is high enough (≥ 50%).

For each image patch that passes the first two stages, the similarity (Eq. 2.1) to the object model is calculated. The patch with the highest similarity value is accepted as the detection result.
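Put together, the three detection stages can be sketched like this, assuming Fern-like objects with a probability() method (as in the sketch above) and an object model exposing the similarity of Equation 2.1. The canonical patch size and the variance threshold are placeholders, not the values used in OpenTLD.

```python
import cv2
import numpy as np

def detect(frame_gray, windows, ferns, model, var_threshold, patch_size=(15, 15)):
    """Three-stage detection cascade over sliding-window positions (x, y, w, h).
    `ferns` are objects with probability(patch); `model.similarity` is Eq. 2.1."""
    best_box, best_sim = None, 0.0
    for (x, y, w, h) in windows:
        patch = frame_gray[y:y + h, x:x + w]
        # Stage 1: discard patches whose pixel-intensity variance is too small.
        if patch.var() < var_threshold:
            continue
        # Stage 2: the mean probability over the fern forest must reach 50%.
        small = cv2.resize(patch, patch_size)   # resample to the ferns' canonical size
        if np.mean([f.probability(small) for f in ferns]) < 0.5:
            continue
        # Stage 3: compare against the object model; keep the most similar patch.
        sim = model.similarity(small)
        if sim > best_sim:
            best_box, best_sim = (x, y, w, h), sim
    return best_box, best_sim
```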

2.4 Learning

The short-term tracker calculates a position based primarily on geometric data, and will usually produce a position close to the previous position. In contrast, the detector calculates probabilities based on image data only, as determined by the detection forest, for a number of positions. TLD uses the two position sets to train the object detector and to find new examples to add to the object model.

The learning algorithm used in TLD is based on P-N Learning, which uses Positive and Negative constraints for finding training data (Kalal et al., 2011). Based on the position sets reported, two constraints are applied on each frame: a P-constraint that finds false negatives, and an N-constraint that finds false positives. False negatives are used as positive training data, and false positives are used as negative training data. Both constraints can make errors, but the idea is that their errors should cancel each other out to a sufficient degree, leading to positive learning.

The detector reports probabilities for a number of possible object positions. The positions with a probability above zero, but with too little spatial overlap with the position reported by the tracker, are marked as false positives, and are used as negative training data.

The image patch defined by the position reported by the short-term tracker is used to generate positive training data. The patch is expanded if necessary, so that it corresponds exactly to a position of the detector's sliding window. The resulting patch is used as positive training data, together with a number of affine transformations of the patch (Angel, 2009, Ch. 4.6); Figure 2.5 shows the transformations used.

The generated training examples are candidates for being used in training, but each example will only be used if the component to be trained (object detector or object model) is incorrect about the example. This is determined by testing each example against the detector forest and against the object model, so that each component calculates its likelihood or similarity value, as described in Sections 2.3 and 2.1 respectively. For each component, positive examples are only used if the component does not think the patch depicts the object, i.e. if the calculated value is below a certain threshold. Similarly, negative examples are only used if the component thinks the patch does depict the object, i.e. if the calculated value is above a certain threshold. In other words, the area reported by the tracker is assumed to be the correct patch, so components that disagree try to learn from it. Figure 2.6 illustrates the concept.

Figure 2.5: The affine transformations applied on an image patch during training: (a) image patch reported by the short-term tracker; (b) expanded to grid points (in this case, the object is centred in the expanded patch); (c) translation; (d) scaling; (e) rotation. The patch reported by the tracker is first expanded to match one of the detector's sliding window positions. A number of positive examples are then generated from the patch, where each example randomly combines the affine transformations.
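The following sketch shows how the P- and N-constraints could select training data for a single component (the detection forest or the object model). The component is modelled as a function returning its likelihood or similarity for a patch, the overlap measure is a plain intersection-over-union, and the thresholds are illustrative; OpenTLD's actual criteria are more involved.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes, used here as the
    spatial-overlap measure (OpenTLD's exact measure may differ)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / float(aw * ah + bw * bh - inter)

def pn_constraints(tracker_box, tracker_patch, detections, component,
                   accept_threshold=0.5, overlap_threshold=0.6):
    """One application of the P- and N-constraints for a single component.
    `component(patch)` returns that component's likelihood/similarity for a
    patch; `detections` is a list of (box, patch) reported by the detector
    with a probability above zero. Thresholds are illustrative."""
    positives, negatives = [], []

    # P-constraint: the tracker's patch is assumed correct, so it becomes
    # positive training data if the component currently rejects it.
    if component(tracker_patch) < accept_threshold:
        positives.append(tracker_patch)

    # N-constraint: detections the component accepts, but which overlap too
    # little with the tracker's position, become negative training data.
    for box, patch in detections:
        if component(patch) >= accept_threshold and iou(box, tracker_box) < overlap_threshold:
            negatives.append(patch)
    return positives, negatives
```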

Since the training sets are based on the location reported by the tracker, the result of the tracker must be considered sufficiently good if the training stage is to be applied: the tracker result must be valid, and the tracker must be more confident than the detector. Additionally, if the tracker was not deemed good enough in the previous frame, the confidence of the tracker must be above a certain threshold, i.e. the confidence of the tracker must be large enough for training to start up again.

Initial training examples are generated from the first frame, where the bounding box of the object is known. As such, there is no need to take into account the confidence or the forward-backward error. Positive examples are generated from the image patch defined by the bounding box, and negative examples are generated from other parts of the frame.

2.5 Summary

The overall flow of TLD is shown in Figure 2.7. The short-term tracker and the object detector are both run on the current frame. If neither component reports a valid result, the object is considered lost until it is re-detected. Otherwise, the result of the more confident component is reported. Furthermore, if the tracker is more confident than the detector, the learning stage is applied.
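As a summary, the per-frame flow of Figure 2.7 can be sketched as follows, with the tracker, detector and learner represented by assumed objects; the structure is simplified compared to the real OpenTLD code.

```python
def tld_frame(tracker, detector, learner, frame):
    """Per-frame TLD flow (Figure 2.7) with assumed tracker/detector/learner
    interfaces; a simplification of the real OpenTLD control flow."""
    t_box, t_conf, t_valid = tracker.track(frame)    # short-term tracking
    d_box, d_conf = detector.detect(frame)           # sliding-window detection

    if not t_valid and d_box is None:
        return None                                   # object lost until re-detected

    if t_valid and (d_box is None or t_conf >= d_conf):
        learner.update(frame, t_box)                  # learning only when the tracker leads
        return t_box
    return d_box                                      # otherwise report the detection
```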


Figure 2.6: Illustration of a one-dimensional trajectory over time. The blue line represents the trajectory reported by the tracker. The red marks are detections from the detector. The green areas around each tracker position are considered positive, and may be used as positive training data. A detection outside of a green area may be used as negative training data.

Figure 2.7: The overall flow of TLD.


Chapter 3

Implementation

Codemill has previously written a face recognition plugin for the transcoder in Vidispine. It uses OpenCV for face detection and Wawo for face identification. This plugin is used as a reference system with which to compare the results of this project.

The main goal of this project was to integrate object tracking in the face recognition plugin. The secondary goal was to make the plugin use Faceclip for face detection. The integrations are described below.

3.1 Terminology

The face recognition system has two primary subtasks. The first task is to detect the faces in each frame. A detected face is defined by its position, given as a bounding box. The second task is to identify the faces, thus giving each one a probable person ID, or PID. If a likely PID cannot be decided, the face is given the unknown PID. A face state, or state for short, is defined by a timestamp, a position, and a PID.

A face trajectory, or trajectory for short, represents a face that is visible in a number of frames. A trajectory consists of states. If p is the PID that has occurred most often among the states in a trajectory, the trajectory can be amended by changing all states to have p as PID.

A state is reported when the system is ready to output the information of that state. In this case, reporting a state means to save the timestamp, position and PID to a file.

3.2 Wawo

Wawo is a central part of the implementation, as it handles all face identifications. Given an image patch for a detected face, Wawo outputs the most likely PID for that face.

In order for Wawo to be able to identify faces, it must be trained. Training involves giving Wawo one or more face images for each known person, which the face recognition plugin does at startup. A set of images is loaded for each person, where each training image is assumed to contain a single face. For each image, the system detects the approximate position of the face before giving the image to Wawo, in order to exclude any surrounding background.

Wawo can operate in one of two modes. In the default mode, Wawo always outputs the PID of the person most similar to the given image patch, regardless of how similar it is. In strict mode, Wawo only outputs the PID if the similarity is above a given threshold; it will otherwise output unknown. The threshold is a floating-point number from 0 to 1.
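The difference between the two modes can be summarised in a few lines of Python. The similarity scores stand in for whatever Wawo computes internally; the 0.3 threshold is the value later used in the experiments (Chapter 4).

```python
def identify(similarities, strict=False, threshold=0.3):
    """Return the most likely PID for one face patch. `similarities` maps
    PID -> similarity in [0, 1]; in strict mode a weak best match is rejected."""
    pid, best = max(similarities.items(), key=lambda kv: kv[1])
    if strict and best < threshold:
        return "unknown"          # strict mode rejects weak matches
    return pid                    # default mode always returns the best match
```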

3.3 The reference implementation

The reference system performs two tasks: face detection and face identification. For each frame, a set of faces is detected using OpenCV. Each face is identified using Wawo. Performing the two tasks produces a set of face states that are reported immediately.
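For reference, the per-frame behaviour of the reference system can be sketched as follows in Python (the actual plugin is written in C++). The Haar-cascade detector and the identify_with_wawo callback are stand-ins for the OpenCV and Wawo calls used by Codemill; they are not the plugin's real API.

```python
import cv2

# Stock OpenCV frontal-face detector, used here only as an illustrative stand-in.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def process_frame(frame_bgr, timestamp, identify_with_wawo):
    """Detect faces in one frame, identify each, and report the states immediately."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    states = []
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        pid = identify_with_wawo(gray[y:y + h, x:x + w])
        states.append({"time": timestamp, "box": (x, y, w, h), "pid": pid})
    return states
```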

3.4 Integrating Faceclip

The Faceclip integration was performed by replacing the relevant calls to OpenCV with the corresponding calls to Faceclip.

Using Faceclip turned out not to be feasible when training Wawo, as it found some false faces in the training pictures. For this reason, the old detection code is still used for the initial training stage.

3.5 Integrating object tracking

In this implementation, there are two components that may report face positions: the face detector and the object tracker. The system also tries to identify faces detected by the face detector, as in the reference implementation. This means that the output from the face detector is both a position and a PID, while the output from the object tracker is just a position.

The implementation is based around face trajectories. The basic idea is that once the face detector has found a new face, the face detector and the object tracker together try to keep track of that face in subsequent frames. Saving the face states to a single trajectory produces a history of states for that face.

When the face has been out of picture for some frames and/or at the end of the video stream, the trajectory will be amended and reported. States are dropped from the trajectory once they have been reported, and will therefore have no effect on future amendment processes.

Each trajectory has its own object tracker. The trajectory and the corresponding tracker are used interchangeably in this report. When the detector finds a face that does not have a trajectory, both a trajectory and a tracker are created.

A counter C_pt counts how many times there has been a detection with PID p that has overlapped with the tracker t. This includes when the tracker is first created. Only defined values of p are counted, i.e. not unknown.

Since there are two components that may each find a set of face positions in a given frame, the two sets must be combined to give a meaningful result. The procedure first finds the set of overlapping positions, where a detection d overlaps with a tracker t. Each such position is added to the trajectory t, using the PID and exact coordinates from d. Each t and d can occur in at most one such overlap. The tracker positions that did not overlap with a detection are added to their respective trajectories, using the unknown PID.

The detections that did not overlap with a tracker are saved, given certain conditions. If the PID p is unknown, or if there is no C_pt defined yet, a new trajectory is created, and the position is added there. Otherwise, given the trajectory t with the highest C_pt, the detector's position is added to t if and only if t has not yet been updated during this frame.

To prevent two faces being reported at the same position in a given frame, an initial filter is applied on the tracker positions before combining them with the detector positions. The trackers are compared pairwise. If they have overlapping positions, the less confident tracker position is discarded. No filtering is necessary for the face detector, since it does not report overlapping faces.
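A sketch of the whole per-frame combination step, under assumed trajectory and counter interfaces, could look as follows. The box representation, the overlap test and the handling of lost trackers (box set to None) are all simplifying assumptions, not the plugin's actual data structures.

```python
def overlaps(a, b):
    """True if two (x, y, w, h) boxes intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def combine_frame(detections, trajectories, counters, new_trajectory):
    """One frame's combination of detector and tracker positions (Section 3.5).
    `detections` is a list of (box, pid); each trajectory has a current `box`
    (None if its tracker lost the face) and an `add(box, pid)` method;
    `counters[(pid, traj)]` holds C_pt; `new_trajectory(box, pid)` creates a
    trajectory with its own tracker. All interfaces are illustrative."""
    updated, unmatched = set(), []

    # 1. Detections overlapping a not-yet-updated tracker update that
    #    trajectory with the detector's exact box and PID.
    for box, pid in detections:
        match = next((t for t in trajectories
                      if t not in updated and t.box is not None and overlaps(box, t.box)), None)
        if match is None:
            unmatched.append((box, pid))
            continue
        match.add(box, pid)
        updated.add(match)
        if pid != "unknown":
            counters[(pid, match)] = counters.get((pid, match), 0) + 1

    # 2. Tracker positions without a matching detection keep their own box,
    #    but with the unknown PID.
    for t in trajectories:
        if t not in updated and t.box is not None:
            t.add(t.box, "unknown")
            updated.add(t)

    # 3. Remaining detections either start a new trajectory, or are attached to
    #    the trajectory with the highest C_pt for that PID if it is still free.
    for box, pid in unmatched:
        candidates = [t for (p, t) in counters if p == pid]
        if pid == "unknown" or not candidates:
            t = new_trajectory(box, pid)
            trajectories.append(t)
            if pid != "unknown":
                counters[(pid, t)] = 1      # C_pt includes the creating detection
        else:
            best = max(candidates, key=lambda t: counters[(pid, t)])
            if best not in updated:
                best.add(box, pid)
                updated.add(best)
```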

During the amendment process, at least pid_min = 50 percent of the states in the trajectory have to have the majority PID. If less than pid_min percent has it, all states are set to unknown instead of the majority PID. This process is meant to get rid of false identifications, using the fact that Wawo is unsure about the identity.

In addition, to reduce the number of false face positions reported, a minimum number of det_min = 2 positions from the face detector must overlap with the trajectory for it to be considered valid. If a trajectory has too few detections when it is time to amend it, the trajectory is instead removed, together with its current states. Figure 3.1 shows how each frame is processed. Figure 3.2 shows the amendment process.
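The amendment rules can be sketched directly from the pid_min and det_min definitions above; the state representation is an assumption for illustration.

```python
from collections import Counter

def amend_and_report(states, detector_hits, pid_min=0.5, det_min=2):
    """Sketch of the amendment step (Figure 3.2). `states` is the list of
    (timestamp, box, pid) in one trajectory and `detector_hits` the number of
    face-detector positions that overlapped it."""
    if detector_hits < det_min:
        return []                     # too few detections: drop the trajectory

    pids = [pid for _, _, pid in states if pid != "unknown"]
    if pids:
        majority, count = Counter(pids).most_common(1)[0]
        # The majority PID is only trusted if it covers at least pid_min of the states.
        final = majority if count >= pid_min * len(states) else "unknown"
    else:
        final = "unknown"
    return [(t, box, final) for t, box, _ in states]
```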


Figure 3.1: How a frame is processed. The numbers by the arrows leading out from the Find overlapping node represent the order in which they are executed; what is important is that the third step comes after the other two.


Figure 3.2: The amendment process.


Chapter 4

Experiments

The implementation has been tested using two different video streams. The experiments are meant to evaluate the improvement of using Faceclip, and the improvement of using object tracking. Each experiment involves four main test configurations: one with the reference system, one with Faceclip, one with object tracking, and one with both Faceclip and object tracking. Each test configuration has been run with Wawo set first to default and then strict mode. The threshold used in strict mode was 0.3.

The video in each experiment has first been analysed manually, to find all face positions, and to mark the positions that are of known faces. The face positions include both frontal views and side-views of faces. A known face is defined as one that has a training picture given to the face identifier (Wawo) to learn from; only known faces can be correctly identified. A face position of an unknown face is marked as unknown.

Two things are analysed for each test run: the reported face positions and the corresponding PID values. Each face position and corresponding PID are compared to the known data for that frame. A face position is correct if it overlaps with a known face position. The numbers of correct and incorrect reports are counted.
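The per-frame scoring can be sketched as follows; the overlap predicate is assumed to be the same kind of box-intersection test as in Section 3.5, and the exact bookkeeping used in the real evaluation may differ.

```python
def score_frame(reported, ground_truth, overlaps):
    """Count per-frame results. `reported` and `ground_truth` are lists of
    (box, pid); `overlaps(a, b)` is a box-overlap predicate. A report is a
    correct detection if its box overlaps any known face, and a correct ID
    if the PIDs also agree."""
    correct_det = false_det = correct_id = incorrect_id = 0
    for box, pid in reported:
        match = next((gt_pid for gt_box, gt_pid in ground_truth
                      if overlaps(box, gt_box)), None)
        if match is None:
            false_det += 1
            if pid != "unknown":
                incorrect_id += 1     # IDs reported for false faces count as incorrect
        else:
            correct_det += 1
            if pid != "unknown":
                if pid == match:
                    correct_id += 1
                else:
                    incorrect_id += 1
    return correct_det, false_det, correct_id, incorrect_id
```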

4.1 Simple

The Simple experiment involves a video of six people doing presentations on a stage, one person at a time. All faces are known, all face views are frontal or close to frontal, and there are no sudden movements. There are a total of 3768 face positions, which means that a face is visible in all frames.

The video has a run-length of 2 minutes and 30 seconds, a resolution of 720x576, and a frame-rate of 25 frames per second. The source is a high-definition video file. Despite the down-conversion from the source, the image is quite clear. Figure 4.1 shows a few example frames.

This video is used partly because a potential client showed interest in seeing the results, but also because the results should be relatively easy to analyse given its simplicity. The latter point makes it a good starting point for figuring out which parts of the implementation can be improved.


Figure 4.1: Some example frames from the Simple video.

Figure 4.2: Some example frames from the Complex video.

4.2 Complex

The Complex experiment involves a video containing six people, of which three are unknown, in a dynamic setting. People move in and out of the picture, and more than one person can often be seen in the same frame. There are a total of 353 face positions, of which 205 are known faces. The video has 284 frames, of which 254 have at least one face visible.

The video has a run-length of 30 seconds, a resolution of 1280x960, and a frame rate of 7.5 frames per second. It is recorded at Codemill using a webcam. The image is slightly blocky and not of very high quality. Figure 4.2 shows a few example frames.

Some use cases for the system may involve more than one person at a time, so it is good to know how adding object tracking affects the results in those cases. The results of this video should give an indication of that.


Chapter 5

Results

This chapter summarises the results of the experiments. The details are listed in Appendix A.

There are two numbers presented for each test run regarding detected faces: the number of correct detections and the number of false detections. There are also two numbers shown regarding the identifications done: the number of faces correctly identified and the number of incorrect identifications. The incorrect identifications include identifications done for false detections, unless they were identified as unknown.

5.1 Simple

Figure 5.1 shows the results of the Simple experiment, and Figure 5.2 shows a few processed frames.

5.1.1 Reference

The number of detected faces was 3541 in both default and strict mode. The number of false detections was 6, or 0.00159 per frame, in both modes. There were 2525 correct identifications and 1021 incorrect identifications in default mode. In strict mode, there were 493 correct and zero incorrect identifications.

5.1.2 Object tracking

The number of faces detected increased by 6% in both Wawo modes compared to the reference system. The false detections disappeared in default mode, but increased by about 450% in strict mode, leaving the latter at 0.0087 false detections per frame. The faces correctly identified increased by 8% in default mode, and by 550% in strict mode. The false identifications decreased by 11% in default mode, and remained at zero in strict mode.

5.1.3 Faceclip

The number of faces detected increased by 3% in both Wawo modes compared to the reference system. The false detections increased by a factor of almost 600 in both modes, leaving them at 0.91 false detections per frame. The faces correctly identified decreased by about 25% in both modes. The false identifications increased by over 300% in default mode, and remained at zero in strict mode.

Figure 5.1: The results for the Simple experiment. The number of faces that can be correctly detected and the number of faces that can be correctly identified are both 3768. The test combining Faceclip and object tracking in strict mode did not run to completion.

Figure 5.2: Some processed frames from the Simple video. Each detected face is marked by a bounding box. The real names of the six people in this video are not known, so the names "A" through "F" have been used instead, assigned to the faces in the order they appear in the video. In these particular frames, the first face is correctly identified as A, the second face is incorrectly identified as F (correct would be B), and the third face is correctly identified as C.

5.1.4 Faceclip with object tracking

When combining Faceclip and object tracking in strict Wawo mode, the computer ran out of memory before the test could finish. The higher memory usage is a result of the large number of false detections, together with the fact that false detections are identified as unknown in strict mode. It means that new trackers have to be created more often, since the counters C_pt will be undefined in those cases (see Section 3.5). There are therefore no result numbers for this configuration in strict mode.

The number of faces detected increased by 5% compared to the reference in the default Wawo mode. The number of false detections is almost 2000 times higher than the reference numbers, leaving it at 2.7 false detections per frame. The faces correctly identified increased by 18%. The false identifications increased by almost 900%.

5.1.5 Summary

Object tracking without Faceclip is an improvement over the reference system, in both default and strict Wawo mode. Using object tracking in strict mode gave a particularly large improvement.

Using Faceclip was not a clear improvement. Faceclip found a larger number of face positions than the original detector, but it also found a very large number of false face positions. Combining Faceclip with object tracking gave a particularly bad result.

5.2 Complex

Figure 5.3 shows the results of the Complex experiment, and Figure 5.4 shows a few processed frames.

5.2.1 Reference

The number of detected faces was 163 in both default and strict mode. The number of false detections was 9, or 0.0317 per frame, in both modes. There were 79 correct identifications and 87 incorrect identifications in default mode. In strict mode, there were no identifications at all.

5.2.2 Object tracking

The number of faces detected increased by 12% in default mode and by 21% in strict mode compared to the reference. The false detections increased by 22% in default mode and by 144% in strict mode, leaving them at 0.039 and 0.077 false detections per frame respectively. The faces correctly identified increased by 32% in default mode, and remained at zero in strict mode. The false identifications increased by 1% in default mode, and remained at zero in strict mode.


Figure 5.3: The results for the Complex experiment. The number of faces that can be correctly detected is 352, while the number of faces that can be correctly identified is 205.

Figure 5.4: Some processed frames from the Complex video. Each detected face is marked by a bounding box. The faces in the first frame are properly identified as Johan and Sandra, and the detected face in the second frame is correctly identified as Rickard. Three of the faces are not detected in these frames.


5.2.3 Faceclip

The number of faces detected increased by 21% compared to the reference in both Wawo modes. The false detections increased by 122% in both modes, leaving them at 0.039 false detections per frame. The faces correctly identified decreased by 8% in default mode, and remained at zero in strict mode. The false identifications increased by 51% in default mode, and remained at zero in strict mode.

5.2.4 Faceclip with object tracking

The number of faces detected increased by 37% in default mode and by 21% in strict mode compared to the reference. The false detections increased by 278% in default mode and by 122% in strict mode, leaving them at 1.12 and 0.703 false detections per frame respectively. The faces correctly identified increased by 14% in default mode, and remained at zero in strict mode. The false identifications increased by 65% in default mode, and remained at zero in strict mode.

5.2.5 Summary

Unlike in the Simple experiment, Wawo did not manage to identify anyone at all in the Complex video in strict mode. Object tracking increased the number of faces detected in both Wawo modes, although it also increased the number of false detections. In default mode, object tracking increased the number of correct identifications, with only a slight increase in the number of false identifications. The object tracker managed to track some faces that were in profile, even when the face detector could not see them.

Faceclip managed to detect more faces than the reference implementation, but it also reported a larger number of invalid faces and identities. Combining Faceclip and object tracking had a slightly positive effect on the correct detections, but also affected the incorrect results in a negative way.


Chapter 6

Conclusions

All in all, using object tracking seems to improve the face recognition abilities of the system. More faces were found in the tests, and a larger percentage of the faces were correctly identified. Unless Faceclip was used, the number of false detections did not increase by much, and even went down in one test.

The object tracker managed to track some faces that were in profile, even when the face detector could not see them. It also failed to track some such faces though, which might be ascribed to the low frame-rate; if a face moves too far between two successive frames, the short-term tracker of TLD may fail to track it. If neither the face detector nor the TLD detector can detect the face, it will be lost in that frame.

Using Faceclip yielded less positive results. While Faceclip found more faces than the original detector code, it also found a large number of false faces. Rondahl (2011) mentions that Faceclip does worse with high-resolution images, so it may be that these videos fall in that category, in particular the Simple video, which has a lower resolution but higher image quality than the Complex video.

Combining Faceclip with object tracking gave particularly bad results. Some of the numbers were positively affected by the combination, but the negative effects were greater. This suggests that the object tracker continued tracking the false positions from Faceclip for an extended period of time, and therefore amplified the wrong results. Faceclip may still be a good tool if the system is improved to better get rid of false detections while keeping true detections.

Using strict Wawo mode had varying results. In the Simple video, it led to a great improvement when used together with object tracking. It did not give any identifications at all in the Complex video though, which might be because of the lower image quality.

This project had four goals:

G1 Create a transcoder plugin for object tracking, using OpenTLD, and integrate it with the face recognition plugin. The first goal is to track objects forward.

G2 Support tracking objects backwards

G3 Implement and evaluate Faceclip as an alternative to the current face detection code.

G4 Support finding the direction of detected faces.

Goals G1 and G3 have been implemented, while G2 and G4 have not. Goal G1 is fulfilled by the intended object tracking plugin, integrated with the face recognition code. Goal G3 is fulfilled by having integrated Faceclip, and having compared it to the original code. Because of negative results, Faceclip is not used in the current implementation. While it would have been nice to have goals G2 and G4 implemented too, the results of the current system are enough to give an indication of the effect of using object tracking in a face recognition system.

The conclusion drawn from this project is that object tracking is a good tool for improving the accuracy of face recognition in video streams. Anyone implementing face recognition for video streams should consider using object tracking as a central component.

6.1 Limitations and future work

Testing was done by checking whether a reported face position overlaps with a known face position. It would, for example, mark a position covering only a person's nose as correct, since it overlaps with the whole face. Inspecting the results shows that there are few such occasions, but it still means that the result numbers are not fully accurate.

The current object tracking implementation cannot handle a large number of false detections with the unknown PID, as it uses up too much memory.

The current way of filtering false detections easily fails; a trajectory will be accepted as long as the face detector reports a particular image patch twice within a short timespan. This is particularly apparent when combining object tracking with Faceclip, as Faceclip often reports the same invalid image patch twice or more.

One type of false detection that occurs is when the system tracks a face properly for a while, and then finds the face at the wrong position for a few frames, and then goes back to tracking it at the correct position. A solution based on continuity filtering (Nielsen, 2010) may be a way to get rid of the false intermediate positions.

It may also be useful to try identifying faces reported by the object tracker, instead of the current way of only identifying faces reported by the face detector. Doing this would increase the number of faces for potential identification.

It would be useful to test different settings, for example other values of pid_min and det_min. That may be enough to get rid of some of the false detections and identifications.

The system currently only does forward tracking. As mentioned in Section 1.2, one possible improvement would be to also implement backward tracking.

The optional goal of detecting the direction of each face has not been implemented. A future research idea is to look into techniques for doing that.

In the Complex experiment, the object tracker failed to track some faces. As mentioned in Section 5.2, it may be due to the low frame rate. One way to test whether it is caused by the low frame rate may be to down-sample the frame rate of a video clip with a higher frame rate, and see how the object tracking is affected. If the low frame rate is the cause of the failure, getting good results when using the implementation may require a video source with a high enough frame rate.

In general, Wawo seems to be quite sensitive to what training pictures are included; they all have to be of similar size and quality. Because of the high sensitivity, using the implementation in practice may require the user to put some effort into producing training pictures of high enough, and consistent, quality.


Acknowledgements

I would like to thank my supervisors Niclas Börlin at the Computing Science department at Umeå University, and Martin Isaksson Wuotila at Codemill, for their assistance in producing this thesis. I would also like to thank the people at Codemill who took part in creating some of my test data.


References

Angel, E. (2009). Interactive Computer Graphics. Pearson Education, 5th edition.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32. DOI: 10.1023/A:1010933404324.

Gonzales, R. C. and Woods, R. E. (2008). Digital Image Processing. Pearson Prentice Hall, 3rd edition.

Kalal, Z. (2011a). OpenTLD Git repository. https://github.com/zk00006/OpenTLD/, commit 8a6934de6024d9297f6da61afb4fcee01e7282a2.

Kalal, Z. (2011b). TLD web page. http://info.ee.surrey.ac.uk/Personal/Z.Kalal/tld.html, visited at 2011-11-29.

Kalal, Z., Matas, J., and Mikolajczyk, K. (2010a). Face-TLD: Tracking-Learning-Detection applied to faces. In Proceedings of the 17th International Conference on Image Processing, pages 3789-3792. DOI: 10.1109/ICIP.2010.5653525, http://www.ee.surrey.ac.uk/CVSSP/Publications/papers/Kalal-ICIP-2010.pdf.

Kalal, Z., Matas, J., and Mikolajczyk, K. (2010b). Forward-backward error: Automatic detection of tracking failures. In Proceedings of the 20th International Conference on Pattern Recognition, pages 2756-2759. DOI: 10.1109/ICPR.2010.675, http://www.ee.surrey.ac.uk/CVSSP/Publications/papers/Kalal-ICPR-2010.pdf.

Kalal, Z., Matas, J., and Mikolajczyk, K. (2011). Tracking-Learning-Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99(Preprint). DOI: 10.1109/TPAMI.2011.239, http://kahlan.eps.surrey.ac.uk/featurespace/tld/Publications/2011_tpami.pdf.

Lucas, B. D. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 2, pages 674-679.

Nielsen, J. B. (2010). Face detection and recognition in video-streams. Bachelor Thesis IMM-B.Sc.-2010-14, Department of Informatics and Mathematical Modeling, Image Analysis and Computer Graphics, Technical University of Denmark, Lyngby. http://orbit.dtu.dk/getResource?recordId=263847&objectId=1&versionId=1.

Rondahl, T. (2011). Face detection in digital imagery using computer vision and image processing. Bachelor Thesis UMNAD-891, Department of Computing Science, Umeå University, Sweden. URN:NBN: urn:nbn:se:umu:diva-51406.


Torres, L. (2004). Is there any hope for face recognition? In 5th International Workshop on Image Analysis for Multimedia Interactive Services, Lisboa, Portugal. http://www.face-rec.org/interesting-papers/General/cr1182.pdf.


Appendix A

Results

Tables A.1 and A.2 show the result numbers for the Simple experiment in default mode and strict mode respectively. Tables A.3 and A.4 show the result numbers for the Complex experiment in default mode and strict mode respectively.

A few numbers are calculated for each test run:

– Total detections is the number of face positions that were reported.

– Correct detections is the number of correct positions.

– False detections is the number of incorrect positions.

– Missed faces is the number of known face positions that were not reported.

– Total IDs is the total number of times an ID other than unknown was reported.

– Correct IDs is the number of correct IDs.

– Incorrect IDs is the number of incorrect IDs. This number includes identifications done for false faces.

The numbers that specify correctness also have one or two associated percentage numbers:

– % of reported is the percentage of the total reported number.

– % of known is the percentage of the total known number.

Additionally, each number that is about correctness has, within parentheses, the difference compared to the corresponding reference result. The default tests are compared to the default Reference, while the strict tests are compared to the strict Reference.


                     n               % of reported    % of known
Total detections     3547            -                -
Correct detections   3541            99.83            93.98
False detections     6               0.17             -
Missed faces         227             -                6.02
Total IDs            3546            -                -
Correct IDs          2525            71.21            67.01
Incorrect IDs        1021            28.79            -

(a) Reference

                     n               % of reported    % of known
Total detections     7068            -                -
Correct detections   3638 (+97)      51.47 (-48.36)   96.55 (+2.57)
False detections     3430 (+3424)    48.53 (+48.46)   -
Missed faces         130 (-93)       -                3.45 (-2.57)
Total IDs            6399            -                -
Correct IDs          2168 (-357)     33.88 (-37.33)   57.54 (-9.47)
Incorrect IDs        4231 (+3210)    66.12 (+37.33)   -

(b) Faceclip

                     n               % of reported    % of known
Total detections     3756            -                -
Correct detections   3756 (+215)     100.00 (+0.17)   99.68 (+5.7)
False detections     0 (-6)          0.00 (-0.17)     -
Missed faces         12 (-215)       -                0.32 (-5.7)
Total IDs            3624            -                -
Correct IDs          2720 (+195)     75.06 (+3.85)    72.19 (+5.18)
Incorrect IDs        904 (-117)      24.94 (-3.85)    -

(c) Object tracking

                     n               % of reported    % of known
Total detections     15209           -                -
Correct detections   3729 (+188)     24.52 (-75.31)   98.96 (+4.98)
False detections     11480 (+11474)  75.48 (+75.31)   -
Missed faces         39 (-188)       -                1.04 (-4.98)
Total IDs            12220           -                -
Correct IDs          2081 (-444)     17.03 (-54.18)   55.23 (-11.78)
Incorrect IDs        10139 (+9118)   82.97 (+54.18)   -

(d) Object tracking + Faceclip

Table A.1: Results for the Simple experiment, using the default Wawo mode.


                     n               % of reported    % of known
Total detections     3547            -                -
Correct detections   3541            99.83            93.98
False detections     6               0.17             -
Missed faces         227             -                6.02
Total IDs            493             -                -
Correct IDs          493             100.00           13.08
Incorrect IDs        0               0.00             -

(a) Reference (strict)

                     n               % of reported    % of known
Total detections     7068            -                -
Correct detections   3638 (+97)      51.47 (-48.36)   96.55 (+2.57)
False detections     3430 (+3424)    48.53 (+48.36)   -
Missed faces         130 (-97)       -                3.45 (-2.57)
Total IDs            365             -                -
Correct IDs          365 (-128)      100.00           9.69 (-3.39)
Incorrect IDs        0               0.00             -

(b) Faceclip (strict)

                     n               % of reported    % of known
Total detections     3794            -                -
Correct detections   3761 (+220)     99.13 (-0.7)     99.81 (+5.83)
False detections     33 (+27)        0.87 (+0.7)      -
Missed faces         7 (-220)        -                0.19 (-5.83)
Total IDs            3204            -                -
Correct IDs          3204 (+2711)    100.00           85.03 (+71.95)
Incorrect IDs        0               0.00             -

(c) Object tracking (strict)

Table A.2: Results for the Simple experiment, using strict Wawo mode.


                     n               % of reported    % of known
Total detections     172             -                -
Correct detections   163             94.77            46.31
False detections     9               5.23             -
Missed faces         189             -                53.69
Total IDs            166             -                -
Correct IDs          79              47.59            38.73
Incorrect IDs        87              52.41            -

(a) Reference

                     n               % of reported    % of known
Total detections     217             -                -
Correct detections   197 (+34)       90.78 (-3.99)    55.97 (+9.66)
False detections     20 (+11)        9.22 (+3.99)     -
Missed faces         155 (-34)       -                44.03 (-9.66)
Total IDs            194             -                -
Correct IDs          73 (-6)         37.63 (-9.96)    35.78 (-2.95)
Incorrect IDs        121 (+44)       62.37 (+9.96)    -

(b) Faceclip

                     n               % of reported    % of known
Total detections     194             -                -
Correct detections   183 (+20)       94.33 (-0.44)    51.99 (+5.68)
False detections     11 (+2)         5.67 (+0.44)     -
Missed faces         169 (-20)       -                48.01 (-5.68)
Total IDs            192             -                -
Correct IDs          104 (+25)       54.17 (+6.58)    50.98 (+12.25)
Incorrect IDs        88 (+1)         45.83 (-6.58)    -

(c) Object tracking

                     n               % of reported    % of known
Total detections     256             -                -
Correct detections   222 (+59)       86.72 (-8.05)    63.07 (+16.76)
False detections     34 (+25)        13.28 (+8.05)    -
Missed faces         130 (-59)       -                36.93 (-16.76)
Total IDs            256             -                -
Correct IDs          90 (+11)        35.16 (-12.34)   44.12 (+5.39)
Incorrect IDs        166 (+79)       64.84 (+12.34)   -

(d) Object tracking + Faceclip

Table A.3: Results for the Complex experiment, using the default Wawo mode.


                     n               % of reported    % of known
Total detections     172             -                -
Correct detections   163             94.77            46.31
False detections     9               5.23             -
Missed faces         189             -                53.69
Total IDs            0               -                -
Correct IDs          0               0                0
Incorrect IDs        0               0                -

(a) Reference (strict)

                     n               % of reported    % of known
Total detections     217             -                -
Correct detections   197 (+34)       90.78 (-3.99)    55.97 (+9.66)
False detections     20 (+11)        9.22 (+3.99)     -
Missed faces         155 (-34)       -                44.03 (-9.66)
Total IDs            0               -                -
Correct IDs          0               0.00             0.00
Incorrect IDs        0               0.00             -

(b) Faceclip (strict)

                     n               % of reported    % of known
Total detections     220             -                -
Correct detections   198 (+35)       90.00 (-4.77)    56.25 (+9.94)
False detections     22 (+13)        10.00 (+4.77)    -
Missed faces         154 (-35)       -                43.75 (-9.94)
Total IDs            0               -                -
Correct IDs          0               0.00             0.00
Incorrect IDs        0               0.00             -

(c) Object tracking (strict)

                     n               % of reported    % of known
Total detections     217             -                -
Correct detections   197 (+34)       90.78 (-3.99)    55.97 (+9.66)
False detections     20 (+11)        9.22 (+3.99)     -
Missed faces         155 (-34)       -                44.03 (-9.66)
Total IDs            0               -                -
Correct IDs          0               0.00             0.00
Incorrect IDs        0               0.00             -

(d) Object tracking + Faceclip (strict)

Table A.4: Results for the Complex experiment, using strict Wawo mode.