
Face Extraction From Live Video
SITE Technical Report TR-2006

Adam Fourney

School of Information Technology and Engineering
University of Ottawa

Ottawa, Canada, K1N 6N5

Project supervised by: Dr. Robert Laganière


Table of Contents

Introduction
Technology
    The Open Source Computer Vision Library (OpenCV)
    Microsoft's DirectShow
    The Background Segmentation Component
Architecture
Graphical User Interface
    FACE Dialog
    Input Configuration Dialog
    Output Configuration Dialog
    Graph Manager
Face Extractor
    Face Export Rules
    Modes of Operation
    Mechanism By Which Face Images are Exported
    Possible Improvements
Face Detector
    The OpenCV Face Detector
    Measurements Used to Assess Image Quality
        Inferring About Image Quality Using Haar-Classifiers
        Gaze Direction
        Motion, Skin, and “Motion & Skin” Content
            Pixel Motion Detection
            Skin Detection
        Quality of Lighting
            Measuring the Width of a Histogram
        Image Sharpness
        Image Dimensions and Area
    Possible Improvements
Pedestrian Tracker
    Possible Improvements
Appendix A: Building From Source
References


Introduction

The first step in any biometric face identification process is recognizing, with a high degree of accuracy, the regions of input video frames that constitute human faces. There has been much research focused on this particular task. Thankfully, this has resulted in some very robust solutions for detecting faces in digital images.

However, frames from live video streams typically arrive at a rate of between 15 and 30 frames per second. Each frame may contain several faces. This means that faces might be detected at a rate which is much higher than the original video frame rate. As mentioned above, these faces are destined to be input into a biometric face identification software package. This software is likely complex, and certainly requires some finite amount of time to process each face. It is very possible that the high rate of input could overwhelm the software.

Even if the face identification software is very efficient, and can keep up with the high rate of incoming faces, much of the processing is wasteful. Many of the faces will belong to individuals who have already been accounted for in previous frames.

The software described in this paper aims to alleviate the situation by detecting faces as fast as possible, but only exporting select faces for post-processing. In fact, the software aims at exporting one image for every pedestrian that enters the camera's field of view. This is accomplished by associating face images to individual pedestrians. Each time a new image is associated to a pedestrian, it must be compared to the best image previously associated to the individual. If the new image is an improvement, then it replaces this best image. When a pedestrian leaves the camera's field of view, the pedestrian's best image is exported.

Technology

The current implementation of the project relies very heavily on three important technologies. Without these technologies, the project would not have been possible. The following section lists the technologies that were used and discusses why each was so invaluable.

The Open Source Computer Vision Library (OpenCV)

The Open Source Computer Vision Library is a development library written in the C/C++ programming language. The library includes over 300 functions ranging from basic image processing routines all the way up to state-of-the-art computer vision operations. As the OpenCV documentation describes:


Example applications of the OpenCV library are Human-Computer Interaction (HCI); Object Identification, Segmentation and Recognition; Face Recognition; Gesture Recognition; Motion Tracking, Ego Motion, Motion Understanding; Structure From Motion (SFM); and Mobile Robotics.

(“What is OpenCV”, 2006)

The importance of OpenCV to this project cannot be stressed enough; the representation of all images processed by the face extraction software is defined by a structure located in one of the OpenCV libraries. OpenCV routines are used in almost every instance where image processing is done. Finally, without OpenCV's object detection routines, none of this project would have been possible; the object detection routines are used to detect faces in the video sequences, and the results are truly amazing.

Microsoft's DirectShow

Microsoft's DirectShow is an application programming interface that allows developers to manipulate multimedia data in various useful ways. Microsoft describes the DirectShow API as follows:

The Microsoft® DirectShow® application programming interface is a media-streaming architecture for the Microsoft Windows® platform. Using DirectShow, your applications can perform high-quality video and audio playback or capture.

(Microsoft, 2006)

DirectShow is also occasionally known by its original codename “Quartz”, and was designed to replace Microsoft's earlier Video For Windows technology (“DirectShow”, 2006). Like Video For Windows, DirectShow provides a standardized interface for working with video input devices as well as with multimedia files. It also provides a technology called “Intelligent Connect” which makes it even easier to program for a wide range of input devices and video encodings.

DirectShow is a very complicated API, with an equally complicated history. It is often criticized as being overly complex. Perhaps the Wikipedia article describes this situation best:

DirectShow is infamous for its complexity and is often regarded by many people as one of Microsoft's most complex development libraries/APIs. A long-running semi-joke on the "Microsoft.public.win32.programmer.directx.video" newsgroup is "see you in 6 months" whenever someone wants to develop a new filter for DirectShow.

(“DirectShow”, 2006)

Thankfully, this project did not require the development of any new DirectShow filters, and in general, the technology seemed relatively manageable.


The Background Segmentation Component

The final technology used for the project was a background segmentation component, contributed by Dr. Robert Laganière. This component is a C++ class that, among other things, is able to determine which pixels of a video frame constitute the foreground. This is accomplished by building a statistical model of a scene's background, and comparing each video frame to this model. This project uses background segmentation for motion detection and object tracking. Both the face detector and pedestrian tracker components require the segmented image, which is output from the background segmentation component.

Architecture

The face extraction software is composed of six main components. Each component is the subject of a lengthy discussion in the sections that follow.


Graphical User Interface

The graphical user interface (GUI) of the face extractor software is arguably the least important component of the entire project. For this reason, the level of detail in this section will be quite minimal. Additionally, this document focuses more on design than on usability. Thus, the following discussion will not cover the individual interface widgets, nor will it serve as a manual for anyone operating the software. Instead, it will simply discuss where certain aspects of the implementation can be found, and what functionality should be expected.

The GUI for the face export application is composed of four main classes. There is the main dialog, the input configuration dialog, the output configuration dialog, and the graph manager. The graph manager is the most important (and complex) sub-component of this part of the system. In addition to the aforementioned classes, there are a few other classes that simply provide some custom controls to the GUI.

FACE Dialog

Header File: ./FaceDlg.h
C++ File: ./FaceDlg.cpp
Namespace: <None>
C++ Class Name: CFaceDlg

The entire application was originally designed as a Microsoft Foundation Classes (MFC) dialog project. Every dialog application project – including this project – begins by displaying a single main dialog window. The face extractor project uses the FACE Dialog for exactly this purpose.

From this dialog, users can:

1. Configure the input settings
2. Configure the output settings
3. Start and stop video capture
4. Configure the individual settings for DirectShow capture graph pins and filters.

Input Configuration Dialog

Header File: ./ConfigInputDlg.h
C++ File: ./ConfigInputDlg.cpp
Namespace: <None>
C++ Class Name: CConfigInputDlg

The input configuration dialog allows users to select a video input device, or a file to which video has been previously saved. The list of available input devices includes all DirectShow filters that are in the CLSID_VideoInputDeviceCategory category. These typically include web cameras, TV tuner cards, and video capture cards.

Output Configuration Dialog

Header File: ./ConfigOutputDlg.h
C++ File: ./ConfigOutputDlg.cpp
Namespace: <None>
C++ Class Name: CConfigOutputDlg

Unlike input configuration, output configuration is entirely optional. These settings allow users to save the processed video sequences to a file. They also allow users to specify a directory where the exported face images can be saved (currently, all images are saved in the JPEG format). If a user decides to save the video to a file, then the user is prompted for a valid file name. They may also select and configure a video compressor.

Graph Manager

Header File: ./GraphManager/GraphManager.h
C++ File: ./GraphManager/GraphManager.cpp
Namespace: <None>
C++ Classes: GraphManager, FilterDescriptor

The GraphManager class is one of the largest and most complicated classes of the entire project. This class is responsible for the construction and destruction of the Microsoft DirectShow capture graphs used by the application. The graph manager makes heavy use of the Intelligent Connect technology (by constructing graphs using the CaptureGraphBuilder2 interface). Therefore, it supports many different video capture devices and multimedia file encodings. In fact, the face extractor software has been tested with various brands of web cameras and at least one brand of TV tuner card (Hauppauge WinTV). Interestingly, when using the TV tuner card, Intelligent Connect is wise enough to include all filters required to control the TV tuner.

The graph manager was inspired by the SequenceProcessor class included in Dr. Laganière's OpenCV / DirectShow tutorial. In fact, both the SequenceProcessor and the GraphManager use functions defined in the file “filters.h”, which was also included with the tutorial. There are, however, some major differences between these two classes. The first major difference is the use of the CaptureGraphBuilder2 interface, which was described above. The second difference is that the GraphManager uses display names (rather than friendly names) to identify the filters internally; this allows the system to distinguish between many physical devices that each have the same friendly name. The final difference is that the graph manager class provides support for displaying a filter's properties or an output pin's properties. For example, to change the channel on a TV tuner device, one simply displays the properties of the TV tuner filter, and then selects the appropriate channel. Different devices have different properties, and these property sheets are built into the filters themselves.


So far, the discussion has focused on how the graph manager controls video input. Not surprisingly, it also controls the video output. In all cases, video is rendered directly to the screen; however, the graph manager also allows users to save the output to disk. Video can be saved in one of many supported video encodings, and in most cases, the user can specify the level of quality and compression of the video encoder.

At the time of writing, there are some outstanding issues regarding the graph manager. In particular, the graph manager has trouble constructing filter graphs when using an uncompressed video file as the source. Additionally, the ability to pause playback has not been entirely implemented. As a result, the face extractor software does not have a pause button on the main interface. Another issue that needs consideration involves the installation of filter graph event handlers. At this time there is no clean way to forward events to the controlling window. The final issue is that there are currently no provisions for seeking within video files. Despite these issues, the graph manager was able to provide some very powerful functionality. Also, the above issues will likely be resolved in the near future.

Face Extractor

Header File: ./FaceExtractor.h
C++ File: ./FaceExtractor.cpp
Namespace: <None>
C++ Classes: Face, FaceGroup, ExtractorObserver, Extractor

The face extractor component is the next highest level of abstraction below the GUI. Essentially, it is the only interface that any programmer using the system is likely to need. The face extractor receives input video frames, and exports face images according to a well-defined set of rules. All image processing is accomplished by three of the face extractor's subcomponents: the face detector, the background segmentation component, and the pedestrian tracker. The face extractor merely interprets the results of its subcomponents, and uses this information to associate faces to pedestrians. This association is achieved by assigning an identifier to each instance of a face. If two face images have the same identifier, then it implies that both images are from the same individual.

The face extractor is also responsible for determining when a face image should be exported. As mentioned in the introduction, the idea of the entire project is to export only the best face captured for each pedestrian. The face extractor, however, provides slightly more flexibility. It exports a face image if any one of the following three conditions is met:

1. No previous face image has been captured for a given pedestrian.
2. The current face image is an improvement over all previously exported images.
3. If a pedestrian leaves the scene, then the best face ever captured is re-exported.

In addition to identifiers, exports are also labeled with an event. Events describe which of the above three conditions caused the export to take place.

Finally, each exported face is also given a score value. These scores are simply numerical values that indicate the quality of the exported image. As a consequence of the export conditions, the score values of a sequence of images exported for a single individual are monotonically increasing.

Using the output of the face extractor component, developers can devise several high-level post-processing rules. For example, developers can decide to process a face as soon as its score crosses some pre-determined threshold. An alternative rule would be to process the best available face as soon as the pedestrian leaves the scene.

Face Export Rules

The three rules described in the previous section assume that faces can be uniquely associated to pedestrians. Unfortunately, the relationship between the faces returned by the face detector and the pedestrians returned by the pedestrian tracker is not usually one-to-one. For example, two people might be walking side-by-side, and the pedestrian tracker might incorrectly assume that they are a single moving object. In this case, the face detector might associate two faces to one “pedestrian”. Worse yet, the face detector might locate one face in some frames, and two faces in other frames. Therefore, the face export rules must be slightly more complex than previously stated.

In total, five rules are used in order to determine when faces are exported. These rules operate on face groups rather than on individual face images. Face groups are simply unordered collections of face images that are all from the same video frame, and are all associated to the same pedestrian. At any given time, two face groups are maintained for each pedestrian: the current face group and the historical best face group. The current face group represents faces detected in the current video frame. The historical best face group represents the best faces ever associated to the pedestrian.

The rules given below describe the circumstances under which the historical best face group is updated to reflect the current face group. Whenever such an update occurs, all faces in the current face group are exported. In addition to this simple behavior, the face extractor re-exports a pedestrian's historical best face group whenever the pedestrian tracker indicates that the individual has left the camera's field of view. The five rules are as follows:

1. If a pedestrian is considered new, then the first face group associated to the pedestrian is considered the historical best.

2. If the current face group contains a single face image, and the historical best face group also contains a single image, then the historical best group is updated only if the new face is an improvement over the previous best image. The exported face image is assigned the same unique identifier as the image it is destined to replace.

3. If the historical best face group contains more than one image, and this number does not increase with the current face group, then update the historical best face group when the best image from the current group is better than the worst image in the historical group. This rule is necessary because it is impossible to determine the association of new faces to old faces. Therefore, it is impossible to determine which of the many faces may have improved. For this same reason, all face images in the group are given new unique identifiers.

4. If the current face group contains more faces than the historical best face group, then automatically update the historical face group. Of all rules, this one is perhaps the most obscure. The reason that the historical best face group is updated is that it causes the faces to be exported. This is important because it is impossible to determine which face in the group is new (and thus not yet exported). The current faces are all given new unique identifiers in order to ensure that none of the previously exported faces are replaced.

5. If none of the previous rules are applicable, then take no action.

These rules are designed so that the face extractor errs on the side of caution. Whenever a pedestrian confuses the face extractor, the software exports every face image that might be associated to the pedestrian. In order for these rules to capture all possible scenarios, the pedestrian tracker must also be programmed to err on the side of caution; if the tracker ever gets confused about a pedestrian, it must assign the pedestrian a new identifier, and treat it as a new entity.
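To make the decision logic concrete, the sketch below restates the five rules in code. It is an illustration only: the Face and FaceGroup structures and the score helpers are hypothetical stand-ins for the classes declared in FaceExtractor.h, and the actual implementation may order its checks differently.

#include <vector>

// Hypothetical stand-ins for the project's Face and FaceGroup classes.
struct Face { double score; int id; };
struct FaceGroup { std::vector<Face> faces; };

static double bestScore(const FaceGroup& g)
{
    double s = 0;
    for (size_t i = 0; i < g.faces.size(); ++i)
        if (g.faces[i].score > s) s = g.faces[i].score;
    return s;
}

static double worstScore(const FaceGroup& g)
{
    double s = 1e30;
    for (size_t i = 0; i < g.faces.size(); ++i)
        if (g.faces[i].score < s) s = g.faces[i].score;
    return s;
}

// Returns true when the historical best group should be replaced by the current
// group; replacing it also triggers an export of every face in the current group.
bool shouldUpdateHistoricalBest(const FaceGroup* best, const FaceGroup& current)
{
    if (best == 0)                                      // rule 1: new pedestrian
        return true;
    if (current.faces.size() > best->faces.size())      // rule 4: group grew
        return true;
    if (best->faces.size() == 1 && current.faces.size() == 1)
        return bestScore(current) > bestScore(*best);   // rule 2: single face improved
    if (best->faces.size() > 1)                         // rule 3: best vs. worst image
        return bestScore(current) > worstScore(*best);
    return false;                                       // rule 5: no action
}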

Modes of Operation

The face exporter component has two modes of operation. These modes simply control the sequencing of operations that are applied to the video frames. In the current version of the software, these modes are not accessible programmatically; instead, they are specified using compiler constants. For example, to search for faces before locating pedestrians, one would compile the application with the OPERATION_MODE constant set to the value FIND_FACES_THEN_PEDESTRIANS. However, to locate pedestrians and then search for faces, the constant should be set to FIND_PEDESTRIANS_THEN_FACES; a minimal sketch of this compile-time selection is shown below. The following sections describe these modes in detail.
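For illustration only, the selection might look like the following in a project header. The #define scaffolding is an assumption; only the OPERATION_MODE, FIND_FACES_THEN_PEDESTRIANS, and FIND_PEDESTRIANS_THEN_FACES names come from the report.

// Hypothetical compile-time selection of the processing order.
#define FIND_FACES_THEN_PEDESTRIANS   1
#define FIND_PEDESTRIANS_THEN_FACES   2
#define OPERATION_MODE FIND_FACES_THEN_PEDESTRIANS

#if OPERATION_MODE == FIND_FACES_THEN_PEDESTRIANS
    // Detect faces first, then feed their locations to the tracker as hints.
#else
    // Track pedestrians first, then search each pedestrian region for faces.
#endif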

FIND_FACES_THEN_PEDESTRIANS: (Suggested mode of operation)

The find faces then pedestrians mode of operation searches for faces before searching for (and tracking) pedestrians. The location of each face is then input into the pedestrian tracker in the form of a “hint”. Currently, the tracker uses hints to ensure that all faces are associated to a pedestrian. For example, if a face is detected, and no pedestrian is nearby, then a new pedestrian record is created. This new pedestrian is described by the smallest rectangular region that contains the orphaned face. Under normal circumstances, faces are associated to whichever pedestrian yields the largest intersection with the rectangular region describing the face. This mode of operation ensures that all faces detected by the face detector have the opportunity to be processed.

FIND_PEDESTRIANS_THEN_FACES:

The find pedestrians then faces mode of operation is based on the idea that faces should only be located in regions where pedestrians are found. This idea seems rather sound; however, the pedestrian tracker occasionally returns regions that do not encompass the entire pedestrian. In this case, the face detector may fail to detect faces that are cut off by a pedestrian's overly-small bounding rectangle. This problem can be remedied by expanding the search region around each pedestrian. However, if there are several pedestrians in the scene, then the search regions may overlap; portions of the frame may be searched twice. Finally, this mode of operation is guaranteed to call the face detection routines once per pedestrian per frame. Without careful attention to detail, the overhead of multiple calls may be significant.

All of the above issues can be resolved, but the sources of error are numerous and the benefits are not significant. For this reason, this mode of operation is not recommended. Selecting this mode will work, but the various algorithms still need to be tuned to address the above issues.

Mechanism By Which Face Images are Exported

Up until now, the discussion has simply mentioned that face images should be exported, but did not explain the mechanism by which this occurs. The face extractor component implements the observer/observable design pattern. Developers interested in the output of the face extractor simply register their classes as observers. Whenever a face image is to be exported, the face extractor class simply notifies the observers by calling their updateFace() method. The arguments provided to this method are as follows:

color: This is the RGB color of the rectangle drawn around the face in the output video. This information is not required for the main functionality, but it improves the usability of the GUI (it allows users to easily associate exported images with faces outlined in the video).

previousFace: This is a pointer to the face record being replaced by the newFace. If this value is NULL, then no face has previously been exported on behalf of the pedestrian (i.e., indicating a new pedestrian).

newFace: This is a pointer to the face record being exported. If this value is NULL, then it indicates that the pedestrian has left the scene. In this case, the previousFace is provided as a record of the best face exported on behalf of the pedestrian.

The individual observers are responsible for determining which of the export events are important, and which ones to ignore.
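The sketch below shows how an observer might be written against this interface. It is illustrative only: the exact declarations of ExtractorObserver, Face, and the registration call are assumptions standing in for the real classes declared in FaceExtractor.h.

#include <cstdio>

// Assumed stand-ins for the project's types.
struct Face { int id; double score; /* image data omitted */ };
struct Color { unsigned char r, g, b; };

class ExtractorObserver {
public:
    virtual ~ExtractorObserver() {}
    virtual void updateFace(Color color, Face* previousFace, Face* newFace) = 0;
};

// Example observer that simply logs every export event.
class FaceLogger : public ExtractorObserver {
public:
    virtual void updateFace(Color, Face* previousFace, Face* newFace)
    {
        if (newFace == 0 && previousFace != 0) {
            // The pedestrian left the scene; previousFace is its best exported image.
            std::printf("pedestrian %d left, best score %.2f\n",
                        previousFace->id, previousFace->score);
        } else if (previousFace == 0 && newFace != 0) {
            // First face ever exported for this pedestrian.
            std::printf("new pedestrian %d, score %.2f\n", newFace->id, newFace->score);
        } else if (newFace != 0) {
            // An improved face replaces a previously exported one.
            std::printf("face %d improved, score %.2f\n", newFace->id, newFace->score);
        }
    }
};

// Hypothetical registration call: extractor.addObserver(new FaceLogger());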

Possible Improvements

The five export rules used by the face extractor may be overly cautious; when many individuals are erroneously grouped into a single pedestrian, images are exported at a very high rate. There are at least two ways to correct this problem:

• one can improve the pedestrian tracker so that there are fewer grouping mistakes, or
• a mechanism can be developed to help pair faces across face groups. For example, it can use a person's shirt color to help match faces belonging to the same individual.

Additionally, the face extractor should offer better support for the find pedestrians then faces mode of operation. This would allow the software to function on much higher resolution images, provided that the pedestrians do not occupy the entire viewing area.

Finally, the face extractor occasionally fails to export faces when pedestrians leave the scene. It is not known if this problem is the result of a bug with the face extractor, or with the pedestrian tracker. This issue will hopefully be resolved in the near future.

Face Detector

Header File: ./FaceDetector/FaceDetect.h
C++ File: ./FaceDetector/FaceDetect.cpp
Namespace: FaceDetector
C++ Classes: Face, Detector

The face detector is the most important – and most complex – of all the project components. The face detector is responsible not only for detecting faces in image sequences, but also for assessing their quality. The OpenCV library provided the mechanism by which faces can be detected, but the mechanism used to assess image quality needed to be built from the ground up. Measuring image quality was certainly the most challenging aspect of the entire project.

The OpenCV Face Detector

In OpenCV, face detection is accomplished by invoking a single library function: cvHaarDetectObjects. This function uses a technique known as cascading Haar classifiers in order to recognize certain objects (in this case, faces). A tutorial on the OpenCV documentation Wiki describes this technique as follows:

First, a classifier (namely a cascade of boosted classifiers working with haar-like features) is trained with a few hundreds of sample views of a particular object (i.e., a face or a car), called positive examples, that are scaled to the same size (say, 20x20), and negative examples - arbitrary images of the same size.

After a classifier is trained, it can be applied to a region of interest (of the same size as used during the training) in an input image. The classifier outputs a "1" if the region is likely to show the object (i.e., face/car), and "0" otherwise. To search for the object in the whole image one can move the search window across the image and check every location using the classifier. The classifier is designed so that it can be easily "resized" in order to be able to find the objects of interest at different sizes, which is more efficient than resizing the image itself. So, to find an object of an unknown size in the image the scan procedure should be done several times at different scales.

(“Face Detection using OpenCV”, 2006)

Currently, the face detector component uses several of these classifiers to identify faces that are facing in different directions. In order to improve the runtime of the face detector, the classifiers are applied to a half-scale copy of each input frame. This means that faces smaller than 40x40 pixels will not be detected. However, the software does allow developers to decide if scaling should take place.
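As a rough sketch of this step, the fragment below converts a frame to grayscale, downsamples it by a factor of two, runs a single cascade over it, and scales the detections back to full-frame coordinates. It uses the OpenCV beta 5 era C API; the parameter values and the final min_size argument (which some older releases omit) are illustrative assumptions rather than the project's actual settings.

#include <cv.h>
#include <vector>

// Detect faces on a half-scale copy of `frame` (a 3-channel BGR IplImage).
// Returned rectangles are expressed in full-frame coordinates.
void detectFaces(IplImage* frame, CvHaarClassifierCascade* cascade,
                 CvMemStorage* storage, std::vector<CvRect>& out)
{
    IplImage* gray = cvCreateImage(cvGetSize(frame), IPL_DEPTH_8U, 1);
    cvCvtColor(frame, gray, CV_BGR2GRAY);

    IplImage* half = cvCreateImage(cvSize(frame->width / 2, frame->height / 2),
                                   IPL_DEPTH_8U, 1);
    cvResize(gray, half, CV_INTER_LINEAR);

    cvClearMemStorage(storage);
    CvSeq* faces = cvHaarDetectObjects(half, cascade, storage,
                                       1.2,  // scale factor between scans
                                       2,    // minimum neighboring detections
                                       CV_HAAR_DO_CANNY_PRUNING,
                                       cvSize(20, 20));  // 40x40 in the full frame

    for (int i = 0; faces && i < faces->total; ++i) {
        CvRect* r = (CvRect*)cvGetSeqElem(faces, i);
        out.push_back(cvRect(r->x * 2, r->y * 2, r->width * 2, r->height * 2));
    }

    cvReleaseImage(&half);
    cvReleaseImage(&gray);
}

// Typical setup, using the default cascade path assumed in Appendix A:
//   CvHaarClassifierCascade* cascade = (CvHaarClassifierCascade*)
//       cvLoad(DEFAULT_FRONTAL_CLASSIFIER_PATH, 0, 0, 0);
//   CvMemStorage* storage = cvCreateMemStorage(0);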

Measurements Used to Assess Image Quality

In order to assess the quality of face images, a series of metrics is used to measure various aspects of the input. These measurements are then fed into a linear function that returns the final score. The scores increase in value as the image quality improves; therefore, larger scores are better than smaller scores. The following list enumerates all of the measurements used for this purpose:

1. The particular Haar-classifier that detected the face
2. Gaze direction
3. Motion and skin content
4. Quality of lighting
5. Sharpness of the image
6. The size of the detected face

Each of the above criteria will be discussed in great detail in this section.
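For concreteness, the scoring function can be pictured as the weighted sum sketched below. The structure (a linear combination of the six measurements) follows the description above, but the field names and weight values are hypothetical; as noted later in this report, the real weights were hand-tuned.

// Hypothetical weighted-sum score; metric values are assumed normalized to [0, 1].
struct QualityMetrics {
    double classifierWeight;   // which Haar classifier detected the face
    double gazeDirection;
    double motionAndSkin;
    double lighting;
    double sharpness;
    double faceSize;
};

double scoreFace(const QualityMetrics& m)
{
    return 2.0 * m.classifierWeight
         + 3.0 * m.gazeDirection
         + 1.0 * m.motionAndSkin
         + 1.5 * m.lighting
         + 1.5 * m.sharpness
         + 1.0 * m.faceSize;   // illustrative weights only
}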

Inferring About Image Quality Using Haar-Classifiers

The face detector uses several classifiers to detect faces in the video frames. Some classifiers are more accurate than others, and some classifiers detect certain gaze directions, but not others. For example, results returned from a frontal face classifier are more desirable than those returned by a profile face classifier.

Gaze Direction

In the introduction to this paper, a biometric face recognition scenario was used to introduce the concept of face extraction from live video. Face recognition packages perform best when the individuals are facing towards the camera. For this reason, gaze direction is an enormously important metric in determining the quality of a face image. Since gaze direction is simply defined as the direction in which an individual is looking, it can be estimated by locating an individual's eyes in relation to the rest of their head. For example, if an individual is looking directly forward, the midpoint between the eyes should be horizontally centered on the head. However, if the individual is facing slightly to the left or to the right, then the midpoint between the eyes will be slightly off-center.

Measurements of gaze direction, as mentioned above, are highly sensitive to error. Experimentation has shown that gaze direction is the least reliable metric, and leads to vast inconsistencies between results: on some occasions it works wonderfully, and in other cases the method fails outright. For this reason, the gaze direction measurements are considered accurate only when they fall within a very particular range of values.

Despite the inaccuracies, eye detection and gaze direction measurements are quite worthwhile. Many of the operations needed for eye detection are also needed for other metrics, so partial results can be reused. For example, the OpenCV face detector returns square regions around people's faces. In many cases, the faces are better represented by a rectangular region, not by a square; the square regions include too much background. The first step in eye detection is locating the side edges of the face. The result is a rectangular region that cuts off much of the unwanted background. This new rectangle is then used in all other metrics.

Additionally, if the eyes can be successfully located, a wealth of other information immediately becomes available. For example, locating the eyes will also reveal the vertical axis of symmetry of the face. This measurement can be used to test various hypotheses about a candidate face image.

The mechanism by which eye detection is accomplished is described in the paper “A Robust Algorithm for Eye Detection on Gray Intensity Face without Spectacles” written by Kun Peng et al. The method uses the horizontal gradient of the input image in order to detect the vertical location of the eyes, and the horizontal locations of the sides of the face. It then estimates a region where the eyes are likely to be found, and searches this region for the brightest point. The brightest point is assumed to be the region of flesh between the individual's eyes, and directly above the nose. Thus, the horizontal coordinate of this point can be assumed to describe the axis of symmetry of the face. This method only works when faces are oriented in the usual way (i.e., the faces are not upside down or otherwise rotated).
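The final step of that search can be sketched as follows: scan an estimated between-the-eyes region for the brightest pixel and treat its column as the face's axis of symmetry. The eyeBand rectangle is assumed to come from the gradient-based analysis described above; only the brightest-point search itself is shown.

#include <cv.h>

// Given a grayscale face image and an estimated eye-band rectangle, return the
// x coordinate (in face-image coordinates) of the brightest pixel, taken as the
// face's vertical axis of symmetry. Smoothing the image beforehand would make
// the result less sensitive to single-pixel noise.
int estimateSymmetryAxis(IplImage* grayFace, CvRect eyeBand)
{
    cvSetImageROI(grayFace, eyeBand);

    double minVal = 0, maxVal = 0;
    CvPoint minLoc, maxLoc;
    cvMinMaxLoc(grayFace, &minVal, &maxVal, &minLoc, &maxLoc);

    cvResetImageROI(grayFace);

    // maxLoc is relative to the ROI, so shift it back to image coordinates.
    return eyeBand.x + maxLoc.x;
}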

Motion, Skin, and “Motion & Skin” Content

The OpenCV face detector is not always perfect; in many cases it returns regions that are not human faces. In order to detect false positives, it is assumed that human faces are part of the foreground; will move from frame to frame; and will be represented by pixels having colors that are normally associated with human skin.

Motion content is defined to be the percentage of pixels in the face region that have recently experienced motion. Similarly, skin content is defined to be the percentage of pixels in the face region that are considered likely to be human skin. Finally, “motion and skin” content is the percentage of pixels that have recently moved and are likely to represent human skin (in other words, the percentage of pixels that exhibit both qualities at the same time). In all of the above cases, the face region is not the region returned by the OpenCV face detector, but is instead the improved region provided by the gaze direction analysis.


The above discussion assumes that pixels exhibiting motion can be easily detected. It also assumes that pixels representing human skin are equally simple to identify. Thankfully, this is in fact the case. The following sub-sections will describe how this is accomplished.

Pixel Motion Detection

The face detector constructs a motion history image to determine which pixels have recently experienced motion. Rather than identifying a color or a shade of gray, each pixel in a motion history image encodes the most recent time that the pixel was considered part of the foreground. This is accomplished by numbering input video frames, and using the output of the background segmentation component to selectively update the pixels of the motion history image. For the purpose of this project, pixels that have experienced motion are exactly those that are/were recently considered part of the foreground. These pixels can be easily identified by a simple thresholding operation applied to the motion history image.
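A minimal sketch of this bookkeeping is shown below: the current frame number is stamped into a floating-point history image wherever the segmentation mask marks foreground, and a recency threshold then yields the motion mask. The image layout and the RECENT_FRAMES window are assumptions rather than the project's actual values.

#include <cv.h>

// mhi:        32-bit float, single channel, initialized to zero, same size as the frames.
// fgMask:     8-bit foreground mask produced by the background segmentation component.
// motionMask: 8-bit output; 255 where the pixel was foreground within the recent window.
const int RECENT_FRAMES = 15;  // assumed window, about one second at 15 fps

void updateMotion(IplImage* mhi, IplImage* fgMask, IplImage* motionMask, int frameNo)
{
    // Stamp the current frame number into every foreground pixel.
    cvSet(mhi, cvRealScalar((double)frameNo), fgMask);

    // A pixel is "in motion" if it was foreground within the last RECENT_FRAMES frames.
    cvCmpS(mhi, (double)(frameNo - RECENT_FRAMES), motionMask, CV_CMP_GE);
}

// The motion content of a face region is then cvCountNonZero() over that region
// of motionMask, divided by the region's area.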

Skin Detection

The skin detector used by the face extraction software is based on the research of Margaret M. Fleck and David A. Forsyth as described in their paper “Naked People Skin Filter”. This filter uses texture and color information to determine pixels that likely represent human skin. The main idea is that pixels representing skin are generally tightly confined in a small region of the hue-saturation color space; in particular, skin tends to range from red to yellow in hue, and it tends to only be moderately saturated. Additionally, skin tends to be rather smooth in texture. This is well represented by areas where the variance in pixel intensity is low (although, texture is not considered in the current implementation of the face detector).

Currently, the skin detector uses the Hue/Saturation/Luminosity color space, in which the hue ranges from 0º (red) to 360º (red again), and in which the saturation ranges from 0 to 1. Hues between 0º and 38º tend to be described as reddish-yellow, while values between 330º and 360º are considered reddish-blue. During informal experimentation, pixels representing skin fell within one of these two ranges. Additionally, pixels representing skin tended to be more saturated as the hue approached what might be considered yellow. Of course, these results closely agree with the results described in the aforementioned paper, although a different color space and pair of regions was used. The particular regions used by the face detector are described in the following table:

Region           Hue           Saturation
Reddish-Yellow   0º – 38º      0 – 1.0
Reddish-Blue     330º – 359º   0 – 0.6

The above values were heavily influenced by an article entitled “Skin Color Analysis” authored by Jamie Sherrah and Shaogang Gong.

Once the pixels representing skin are identified (producing the skin mask image), a second filter is applied to all neighboring pixels. This second filter uses a less-strict set of rules in an attempt to intelligently close gaps that might occur in the original skin mask.

Interestingly, images illuminated by natural or fluorescent lighting tended to be shifted toward the blue portion of the color spectrum, while images illuminated by incandescent lighting were shifted towards reddish-orange. For this reason, the above regions are slightly larger than would be necessary if the light source could be controlled.
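The per-pixel test can be sketched with OpenCV's HLS conversion as shown below (in 8-bit HLS images OpenCV stores hue as degrees divided by two, and saturation as 0-255). The thresholds mirror the table above, but the function itself is an illustrative reconstruction rather than the project's skin detector, and it omits the second, less-strict gap-closing pass.

#include <cv.h>

// Produce an 8-bit mask (255 = likely skin) from a 3-channel BGR frame.
void skinMask(IplImage* bgr, IplImage* mask)
{
    IplImage* hls = cvCreateImage(cvGetSize(bgr), IPL_DEPTH_8U, 3);
    cvCvtColor(bgr, hls, CV_BGR2HLS);
    cvZero(mask);

    for (int y = 0; y < hls->height; ++y) {
        unsigned char* row = (unsigned char*)(hls->imageData + y * hls->widthStep);
        unsigned char* out = (unsigned char*)(mask->imageData + y * mask->widthStep);
        for (int x = 0; x < hls->width; ++x) {
            int h = row[3 * x + 0] * 2;   // hue back to degrees (0..360)
            int s = row[3 * x + 2];       // saturation is the third HLS channel (0..255)

            bool reddishYellow = (h <= 38);                            // saturation 0 - 1.0
            bool reddishBlue   = (h >= 330 && h <= 359) && (s <= 153); // saturation <= 0.6

            if (reddishYellow || reddishBlue)
                out[x] = 255;
        }
    }
    cvReleaseImage(&hls);
}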

Quality of Lighting

When developing the face detector component, one of the most frustrating phenomena occurred when testing the application after a long night of programming; at night, the software was carefully tuned, and measurements like skin detection and face feature detection worked wonderfully. In daylight, however, the lighting was harsh, the colors were shifted, and the carefully adjusted measurements needed to be re-calibrated. The harsh lighting also greatly confused the edge detector (horizontal gradient map) used when detecting face features. For this reason, the quality of lighting is an important metric for assessing image quality. Even if the face extraction software could cope with poor lighting, it is not known how biometric face recognition software (or other post-processing) might cope with such images.

The first step in assessing the quality of lighting is converting the color input frames to gray scale intensity images. Once a gray scale image has been acquired, a histogram of the image's pixel intensities is computed. The general assumption is that the quality of lighting is directly proportional to the width of this histogram. This is not always a valid assumption; it fails when a subject is not evenly illuminated.

In order to help address the problems caused by uneven illumination, one can rely on the assumption that faces are symmetric across a central-vertical axis. This central axis is determined when the face detector locates the face features. If the lighting is soft and even, then the distribution of gray scale values on one side of an individual's face should be similar to the distribution of values on the other side of the face. This comparison is done by computing the histogram of the left and right halves of the face, normalizing each of these histograms, and then computing their intersection. The final lighting score is computed by multiplying the weight of this intersection with the width of the original histogram.
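A sketch of that comparison using OpenCV's histogram utilities follows. The bin count and the use of CV_COMP_INTERSECT are assumptions; only the overall structure (left and right histograms, normalization, intersection, multiplication by the whole-face histogram width) follows the description above.

#include <cv.h>

// grayFace:  8-bit grayscale face image.
// histWidth: width of the whole-face histogram, as defined in the next subsection.
double lightingScore(IplImage* grayFace, double histWidth)
{
    int bins = 256;
    float range[] = {0, 256};
    float* ranges[] = {range};

    CvHistogram* left  = cvCreateHist(1, &bins, CV_HIST_ARRAY, ranges, 1);
    CvHistogram* right = cvCreateHist(1, &bins, CV_HIST_ARRAY, ranges, 1);

    int halfW = grayFace->width / 2;

    cvSetImageROI(grayFace, cvRect(0, 0, halfW, grayFace->height));
    cvCalcHist(&grayFace, left, 0, 0);
    cvSetImageROI(grayFace, cvRect(halfW, 0, grayFace->width - halfW, grayFace->height));
    cvCalcHist(&grayFace, right, 0, 0);
    cvResetImageROI(grayFace);

    cvNormalizeHist(left, 1.0);
    cvNormalizeHist(right, 1.0);

    // The intersection is 1.0 for identical distributions and smaller otherwise.
    double symmetry = cvCompareHist(left, right, CV_COMP_INTERSECT);

    cvReleaseHist(&left);
    cvReleaseHist(&right);
    return symmetry * histWidth;
}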

Measuring the Width of a Histogram

In the above discussion, histogram width was not well defined. For the purpose of this application, a histogram's width is defined as the smallest number of consecutive histogram bins that account for 95% of the pixels in the input image. To compute this value, a greedy algorithm is used. This algorithm locates the bin containing the histogram's mean, which is the first bin added to a region called the histogram body. Histogram bins with indices higher than the largest index in the body are said to be in the head of the histogram. Similarly, bins with lower indices are said to be in the tail.


The greedy algorithm iteratively grows the body of the histogram by claiming the lowest index from the head, or the largest index from the tail. If the head of the histogram accounts for more pixels than the tail, the body's expansion is in the direction of the head. Otherwise, the body expands in the direction of the tail. This expansion continues until the body accounts for 95% of all pixels.
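The following self-contained function sketches this greedy expansion over a histogram stored as a plain array of bin counts; it is a reconstruction from the description above rather than the project's code.

#include <vector>

// Returns the smallest number of consecutive bins, grown outward from the bin
// containing the mean intensity, that covers `coverage` (e.g. 0.95) of all pixels.
int histogramWidth(const std::vector<double>& hist, double coverage)
{
    double total = 0, weighted = 0;
    for (size_t i = 0; i < hist.size(); ++i) {
        total += hist[i];
        weighted += (double)i * hist[i];
    }
    if (total <= 0) return 0;

    int lo = (int)(weighted / total);  // bin containing the mean; the initial body
    int hi = lo;
    double body = hist[lo];

    // Pixel mass remaining above the body (head) and below it (tail).
    double head = 0, tail = 0;
    for (int i = 0; i < (int)hist.size(); ++i) {
        if (i < lo) tail += hist[i];
        else if (i > hi) head += hist[i];
    }

    while (body < coverage * total) {
        // Expand toward whichever side still accounts for more pixels.
        bool growHead = (head >= tail) && (hi + 1 < (int)hist.size());
        if (!growHead && lo == 0) growHead = true;  // nothing left in the tail
        if (growHead) { ++hi; body += hist[hi]; head -= hist[hi]; }
        else          { --lo; body += hist[lo]; tail -= hist[lo]; }
    }
    return hi - lo + 1;
}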

Image Sharpness

In addition to the quality of lighting, image sharpness is another good indicator of image quality. In this sense, the adjective “sharp” is used as the antonym of blurry. Images can be blurred for several reasons, including motion blur or a camera that is incorrectly focused. In all cases, blurred images are certainly less desirable than sharp images. The challenge is determining a viable method for measuring image sharpness.

Currently, the face extraction software attempts to measure the amount of high-frequency content contained in an image in order to judge its sharpness. In images, high-frequency content can be defined as content that encodes edges, lines, and areas where the pixel intensities change significantly over short distances. With faces, the high-frequency content tends to concentrate around the eyes, lips, and other face features.

In order to find high-frequency content, the software uses the Laplacian operator as a highpass filter. Pixels that survive the highpass filter (have values greater than some pre-determined threshold) are counted, and the result is divided by the total image area. Thus, the current measure of sharpness is simply the percentage of pixels that are considered to encode edges, lines, and other high-frequency content.
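Sketched with the OpenCV C API, the measurement might look like the function below; the aperture size and the EDGE_THRESHOLD value are illustrative assumptions.

#include <cv.h>

// Fraction of pixels in a grayscale face image whose Laplacian response exceeds
// a fixed threshold; used as a crude measure of sharpness.
double sharpness(IplImage* grayFace)
{
    const double EDGE_THRESHOLD = 40;  // assumed value

    IplImage* lap   = cvCreateImage(cvGetSize(grayFace), IPL_DEPTH_16S, 1);
    IplImage* edges = cvCreateImage(cvGetSize(grayFace), IPL_DEPTH_8U, 1);

    cvLaplace(grayFace, lap, 3);          // 3x3 Laplacian acts as a highpass filter
    cvConvertScaleAbs(lap, edges, 1, 0);  // absolute response as 8-bit values
    cvThreshold(edges, edges, EDGE_THRESHOLD, 255, CV_THRESH_BINARY);

    double result = (double)cvCountNonZero(edges)
                  / (double)(grayFace->width * grayFace->height);

    cvReleaseImage(&lap);
    cvReleaseImage(&edges);
    return result;
}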

For images in general, this approach may not always be valid; one perfectly focused image may contain fewer edges than another complex, but blurry, image. Thankfully, the face export software should only ever compare face images to similar images acquired in previous frames. Thus, any change in the high-frequency content can generally be attributed to such things as motion blur.

Image Dimensions and Area


The final assumption about face image quality is that larger images are better than smaller images. Now, it should be noted that the OpenCV face detector returns regions that are square. If the face regions were not square, then it would likely be the case that certain aspect ratios would be better than others.

Unfortunately, one cannot always assume that large images are best. In fact, large images often indicate a false positive returned from the OpenCV face detector. Often, only a certain range of sizes will be reasonable for a given scene. For example, a security camera responsible for observing a large lobby will expect smaller face images than a personal web camera sitting on top of a user's computer monitor. This suggests that the range of acceptable image sizes should be configurable by the end-user of the system. Unfortunately, at this time, these parameters can only be changed by modifying constants defined in the face detector source code.

Another option worth mentioning is that it might be possible for the software to learn on its own about the range of acceptable image sizes. For example, an average face size can be determined by considering the dimensions of all face regions that have scored well in the other face quality metrics. Once this is accomplished, candidate face regions that are significantly different from this model can be discarded. At this time, this option has not been explored.

Possible Improvements

The main problem with the face detector is that an image's score is not always a reasonable indicator of quality. This is not the fault of the individual quality metrics, but is the result of the function which combines these individual values into the final score. As mentioned earlier, this function is nothing more than a simple weighted sum of the aforementioned quality metrics. The weights associated to each metric were chosen almost arbitrarily. These weights were then hand-tuned over a series of tests until the results seemed adequate. There is almost certainly a better approach that needs to be taken.

Pedestrian Tracker

Header File: ./PedestrianTracker/PedestrianTracker.h
C++ File: ./PedestrianTracker/PedestrianTracker.cpp
Namespace: PedestrianTracker
C++ Classes: Pedestrian, Tracker

The pedestrian tracker is an important component of the system, but could be the subject of an entire project. For this reason, the tracker was kept as simple as possible, while still producing acceptable results. The current implementation processes each frame of the video sequence. This processing operates in four distinct phases:

1. The first phase uses the background segmentation component to identify the foreground pixels. It then locates all of the connected foreground components, and their bounding rectangles.


2. At this point the second phase attempts to associate each of the connected components to a pedestrian detected in the previous video frame. A pedestrian is nothing more than a rectangular region of interest, and the association is achieved by determining which pedestrian's region each component intersects with the greatest area. If no association is possible, then the component is considered a new pedestrian.

3. The third phase of the tracker groups the connected components based upon the pedestrian to which each is associated. A bounding rectangle is computed for each of these groups. These large bounding rectangles are then assigned to the appropriate pedestrian (replacing their bounding rectangle from the previous frame).

4. The fourth, and final, phase determines the foreground-pixel density of each of the resultant pedestrians. This density is simply the percentage of pixels within the pedestrian's current bounding rectangle that are foreground pixels. If this density is low, then it is assumed that the results are incorrect. In this case, the tracker re-divides the pedestrian into its individual components.

The above algorithm is based upon the assumption that pedestrians from the current frame should be associated to nearby pedestrians from the previous frame. It also assumes that pedestrians may be composed of several distinct foreground components.
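The association rule used in the second phase can be sketched as follows. The Pedestrian structure and the helper functions are assumed stand-ins for the real Tracker internals, not code from the project.

#include <cv.h>
#include <vector>

// Hypothetical stand-in for the tracker's pedestrian record.
struct Pedestrian { CvRect box; int id; };

// Area of intersection between two rectangles (0 if they do not overlap).
static int intersectionArea(CvRect a, CvRect b)
{
    int x1 = a.x > b.x ? a.x : b.x;
    int y1 = a.y > b.y ? a.y : b.y;
    int x2 = (a.x + a.width)  < (b.x + b.width)  ? (a.x + a.width)  : (b.x + b.width);
    int y2 = (a.y + a.height) < (b.y + b.height) ? (a.y + a.height) : (b.y + b.height);
    return (x2 > x1 && y2 > y1) ? (x2 - x1) * (y2 - y1) : 0;
}

// Phase 2: return the index of the previous-frame pedestrian that a connected
// component overlaps most, or -1 to signal that a new pedestrian should be created.
int associate(CvRect component, const std::vector<Pedestrian>& previous)
{
    int best = -1;
    int bestArea = 0;
    for (size_t i = 0; i < previous.size(); ++i) {
        int area = intersectionArea(component, previous[i].box);
        if (area > bestArea) { bestArea = area; best = (int)i; }
    }
    return best;
}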

Possible Improvements

As mentioned above, the current generation of the pedestrian tracker is very simple. It can, and probably should, be replaced at a later date. At present, the tracker is slightly over-zealous about merging objects. It also has difficulty tracking fast-moving objects, but pedestrians are not usually moving very fast. Finally, the tracker fails to recognize when objects merge, although this latter issue could probably be easily resolved.


Appendix A: Building From Source

Assumptions:
The following discussion assumes that OpenCV beta 5 is installed in the following default location on Windows XP Service Pack 2:

“C:\Program Files\OpenCV”

This implies that the Haar classifier data sets are located as follows:

“C:\Program Files\OpenCV\data\haarcascades”

Finally, it assumes that OpenCV has been properly installed, and that the system path has been modified to include:

“C:\Program Files\OpenCV\bin”

It also assumes that developers are using Microsoft Visual Studio 2003.

If the Assumptions Fail:
If any of the above assumptions fail, then let <OpenCV_Install_Path> be the location where OpenCV was actually installed. The following modifications are then necessary:

Changes to “./FaceDetector/FaceDetect.h”

The following constants need to be modified to point to the appropriate Haar datasets:

#define DEFAULT_FRONTAL_CLASSIFIER_PATH \
"<OpenCV_Install_Path>\\data\\haarcascades\\haarcascade_frontalface_default.xml"

#define DEFAULT_PROFILE_CLASSIFIER_PATH \
"<OpenCV_Install_Path>\\data\\haarcascades\\haarcascade_profileface.xml"

Changes to the Visual Studio Project

“Project -> Properties -> C/C++ -> General -> Additional Include Directories” must be set to:

"<OpenCV_Install_Path>\otherlibs\highgui";"<OpenCV_Install_Path>\filters\ProxyTrans";"<OpenCV_Install_Path>\cxcore\include";"<OpenCV_Install_Path>\cvaux\include";"<OpenCV_Install_Path>\cv\include"

Also,


“Project -> Properties -> Linker -> General -> Additional Library Directories” must be set to:

"<OpenCV_Install_Path>\lib";

Finally, just to be thorough, make sure that “Project -> Properties -> Linker -> Input -> Additional Dependencies” is set to:

“strmiids.lib quartz.lib cv.lib cxcore.lib highgui.lib”

Running the Binaries on Systems without OpenCV

If one wishes to run the face extractor software on a system where OpenCV is not installed, then the following files must be included in the same directory as the binary executable FACE.exe:

cv097.dll
cv097d.dll
cvaux097.dll
cxcore097.dll
cxcore097d.dll
haarcascade_frontalface_default.xml
haarcascade_profileface.xml
highgui096d.dll
highgui097.dll
proxytrans.ax

Additionally, the constants DEFAULT_FRONTAL_CLASSIFIER_PATH and DEFAULT_PROFILE_CLASSIFIER_PATH defined in ./FaceDetector/FaceDetect.h must be modified to load the classifiers from the local ./ directory.

Finally, the proxy transform filter must be registered. This can be achieved by executing the following command at the command shell:

“regsvr32 proxytrans.ax”


References

Fleck, M. & Forsyth, D. (n.d.). Naked People Skin Filter. Berkeley-Iowa Naked People Finder. Retrieved April 17, 2006, from http://www.cs.hmc.edu/~fleck/naked-skin.html

Laganière, R. (2003). A step-by-step guide to the use of the Intel OpenCV library and the Microsoft DirectShow technology. Retrieved April 17, 2006, from http://www.site.uottawa.ca/~laganier/tutorial/opencv+directshow/

Microsoft Corporation (2006). Microsoft DirectShow 9.0. MSDN Library. Retrieved April 17, 2006, from http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directshow/htm/directshow.asp

OpenCV Community (2006). Face Detection using OpenCV. OpenCV Library Wiki. Retrieved April 17, 2006, from http://opencvlibrary.sourceforge.net/FaceDetection

OpenCV Community (2006). What is OpenCV? OpenCV Library Wiki. Retrieved April 17, 2006, from http://opencvlibrary.sourceforge.net/

Peng, K., et al. (2005). A Robust Algorithm for Eye Detection on Gray Intensity Face without Spectacles. Journal of Computer Science and Technology. Retrieved April 17, 2006, from http://journal.info.unlp.edu.ar/Journal/journal15/papers/JCST-Oct05-3.pdf

Sherrah, J. & Gong, S. (2001). Skin Color Analysis. CVonline. Retrieved April 17, 2006, from http://homepages.inf.ed.ac.uk/cgi/rbf/CVONLINE/entries.pl?TAG288

Wikipedia contributors (2006). DirectShow. Wikipedia, The Free Encyclopedia. Retrieved April 17, 2006, from http://en.wikipedia.org/w/index.php?title=DirectShow&oldid=48688926