University of Central Florida
STARS
Electronic Theses and Dissertations, 2004-2019
2006
Real-time Monocular Vision-based Tracking For Interactive Augmented Reality
Lisa Spencer, University of Central Florida
Part of the Computer Sciences Commons, and the Engineering Commons
Find similar works at: https://stars.library.ucf.edu/etd
University of Central Florida Libraries http://library.ucf.edu
This Doctoral Dissertation (Open Access) is brought to you for free and open access by STARS. It has been accepted
for inclusion in Electronic Theses and Dissertations, 2004-2019 by an authorized administrator of STARS. For more
information, please contact STARS@ucf.edu.
STARS Citation
Spencer, Lisa, "Real-time Monocular Vision-based Tracking For Interactive Augmented Reality" (2006). Electronic Theses and Dissertations, 2004-2019. 975. https://stars.library.ucf.edu/etd/975
REAL-TIME MONOCULAR VISION-BASED TRACKING FOR INTERACTIVE AUGMENTED REALITY
by
LISA G. SPENCER
B.S. University of Arizona, 1984
M.S. University of Central Florida, 2002
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
in the School of Electrical Engineering and Computer Science in the College of Engineering and Computer Science
at the University of Central Florida
Orlando, Florida
Spring Term 2006
Major Professor: Ratan K. Guha
© 2006 Lisa G. Spencer
ABSTRACT
The need for real-time video analysis is rapidly increasing in today’s world. The
decreasing cost of powerful processors and the proliferation of affordable cameras, combined
with needs for security, methods for searching the growing collection of video data, and an
appetite for high-tech entertainment, have produced an environment where video processing is
utilized for a wide variety of applications. Tracking is an element in many of these applications,
for purposes like detecting anomalous behavior, classifying video clips, and measuring athletic
performance. In this dissertation we focus on augmented reality, but the methods and
conclusions are applicable to a wide variety of other areas. In particular, our work deals with
achieving real-time performance while tracking with augmented reality systems using a
minimum set of commercial hardware. We have built prototypes that use both existing
technologies and new algorithms we have developed. While performance improvements would
be possible with additional hardware, such as multiple cameras or parallel processors, we have
concentrated on getting the most performance with the least equipment.
Tracking is a broad research area, but an essential component of an augmented reality
system. Tracking of some sort is needed to determine the location of scene augmentation. First,
we investigated the effects of illumination on the pixel values recorded by a color video camera.
We used the results to track a simple solid-colored object in our first augmented reality
application. Our second augmented reality application tracks complex non-rigid objects, namely
human faces.
In the color experiment, we studied the effects of illumination on the color values
recorded by a real camera. Human perception is important for many applications, but our focus
is on the RGB values available to tracking algorithms. Since the lighting in most environments
where video monitoring is done is close to white (e.g., fluorescent lights in an office,
incandescent lights in a home, or direct and indirect sunlight outside), we looked at the response
to “white” light sources as the intensity varied. The red, green, and blue values recorded by the
camera can be converted to a number of other color spaces that have been shown, using models
of the physical properties of reflection, to be invariant to various lighting conditions, such as
view angle, light angle, light intensity, and light color. Our experiments show how well
these derived quantities actually remained constant with real materials, real lights, and real
cameras, while still retaining the ability to discriminate between different colors. This color
experiment enabled us to find color spaces that were more invariant to changes in illumination
intensity than the ones traditionally used.
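For instance, the chromaticity (normalized RGB) space referenced above divides each channel by the total brightness, so a uniform change in illumination intensity cancels out. A minimal sketch of this standard conversion (the helper name is ours, not from the dissertation):

```python
def to_chromaticity(r, g, b):
    """Convert RGB to chromaticity (normalized RGB).

    Each channel is divided by R+G+B, so scaling all three channels
    by the same factor (a uniform intensity change) leaves the
    result unchanged.
    """
    total = r + g + b
    if total == 0:
        return (0.0, 0.0, 0.0)  # undefined for black; return zeros by convention
    return (r / total, g / total, b / total)

# The same surface under full and half illumination maps to one point:
bright = to_chromaticity(200, 100, 50)
dim = to_chromaticity(100, 50, 25)
print(bright == dim)  # -> True
```

This intensity invariance is exactly what the experiments above quantify with real cameras and lights, where sensor noise and saturation keep the ideal cancellation from being perfect.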
The first augmented reality application tracks a solid colored rectangle and replaces the
rectangle with an image, so it appears that the subject is holding a picture instead. Tracking this
simple shape is both easy and hard; easy because of the single color and the shape that can be
represented by four points or four lines, and hard because there are fewer features available and
the color is affected by illumination changes. Many algorithms for tracking fixed shapes do not
run in real time or require rich feature sets. We have created a tracking method for simple solid
colored objects that uses color and edge information and is fast enough for real-time operation.
We also demonstrate a fast deinterlacing method to avoid "tearing" of fast-moving edges when
recorded by an interlaced camera, and optimization techniques that usually achieved a speedup
of about 10x over an implementation that already used optimized image processing library
routines.
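One of the optimization techniques alluded to, replacing per-pixel division with a precomputed reciprocal table, works because for 8-bit pixels the sum R+G+B can only take 766 distinct values. The following is an illustrative reconstruction under our own naming, not the dissertation's exact code:

```python
# Precompute 255/s for every possible 8-bit channel sum s = R+G+B (0..765).
# The table is built once; each pixel then costs one lookup and three
# multiplies instead of three divisions.
RECIP = [0.0] + [255.0 / s for s in range(1, 766)]

def chromaticity_lut(r, g, b):
    """Chromaticity scaled to 0..255, using the reciprocal table."""
    k = RECIP[r + g + b]
    return (r * k, g * k, b * k)

def chromaticity_div(r, g, b):
    """Reference version with explicit divisions, for comparison."""
    s = r + g + b
    return (255.0 * r / s, 255.0 * g / s, 255.0 * b / s) if s else (0.0, 0.0, 0.0)
```

In scalar Python the lookup buys little; the point is that in a tight C loop over every pixel of a video frame, a table fetch plus multiply is far cheaper than a division.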
Human faces are complex objects that differ between individuals and undergo non-rigid
transformations. Our second augmented reality application detects faces, determines their initial
pose, and then tracks changes in real time. The results are displayed as virtual objects overlaid
on the real video image. We used existing algorithms for motion detection and face detection.
We present a novel method for determining the initial face pose in real time using symmetry.
Our face tracking uses existing point tracking methods as well as extensions to Active
Appearance Models (AAMs). We also give a new method for integrating detection and tracking
data and leveraging the temporal coherence in video data to mitigate the false positive
detections. While many face tracking applications assume exactly one face is in the image, our
techniques can handle any number of faces.
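The symmetry measure used for initial pose estimation can be illustrated with a simplified sketch: reflect the point set about a candidate axis, match each reflected point to its nearest original point, and average the match distances. (The full method also searches over candidate axis angles; this version fixes a vertical axis for brevity, and the function name is ours.)

```python
import math

def symmetry_distance(points, axis_x):
    """Score how mirror-symmetric a 2D point set is about the vertical
    line x = axis_x. Each point is reflected across the axis and matched
    to the nearest original point; the mean match distance is returned,
    so 0.0 means perfectly symmetric."""
    total = 0.0
    for (x, y) in points:
        rx, ry = 2 * axis_x - x, y  # reflect across the axis
        total += min(math.hypot(rx - px, ry - py) for (px, py) in points)
    return total / len(points)

# Four corners of a rectangle centered on x = 0 are perfectly symmetric:
corners = [(-1.0, 0.0), (1.0, 0.0), (-1.0, 2.0), (1.0, 2.0)]
print(symmetry_distance(corners, 0.0))        # -> 0.0
print(symmetry_distance(corners, 0.5) > 0.0)  # -> True
```

Minimizing this score over candidate axes yields the axis of bilateral symmetry, which for a face constrains the in-plane rotation of the head.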
The color experiment, together with the two augmented reality applications, improves our
understanding of how illumination intensity changes affect recorded colors, and provides better
real-time methods for detecting and tracking solid shapes and human faces for augmented
reality. These techniques can also be applied to other real-time video analysis tasks, such as
surveillance and video classification.
TABLE OF CONTENTS
LIST OF TABLES.............................................................................................................. x
LIST OF FIGURES ........................................................................................................... xi
LIST OF ACRONYMS/ABBREVIATIONS ................................................................ xviii
1 INTRODUCTION ...................................................................................................... 1
1.1 Augmented Reality Overview............................................................................. 1
1.2 Virtual Looking Glass......................................................................................... 2
1.3 Contributions ...................................................................................................... 4
2 BACKGROUND ........................................................................................................ 7
2.1 Current AR Applications .................................................................................... 7
2.2 Augmented Reality Challenges ........................................................................ 13
2.3 Augmented Reality Implementation ................................................................. 14
2.3.1 Sensing the Real Environment...................................................................... 14
2.3.2 Adding Virtual Elements .............................................................................. 16
2.3.3 Presenting the Combined Result................................................................... 17
2.3.4 Tools ............................................................................................................. 18
2.4 Augmented Looking Glass System Description ............................................... 19
2.5 Related Work .................................................................................................... 23
2.5.1 Color Constancy ........................................................................................... 23
2.5.2 Virtual Mirror Applications .......................................................................... 33
2.5.3 Rectangle Tracking ....................................................................................... 36
2.5.4 Face Detection in Images.............................................................................. 38
2.5.5 Face Detection in Video ............................................................................... 41
2.5.6 Real-time Face Tracking............................................................................... 43
2.5.7 Face Pose Estimation .................................................................................... 49
3 COLOR ANALYSIS WITH VARIABLE LIGHT INTENSITY............................. 51
3.1 Color Models .................................................................................................... 52
3.1.1 YIQ ............................................................................................................... 53
3.1.2 HSV............................................................................................................... 53
3.1.3 HLS............................................................................................................... 54
3.1.4 CIELAB ........................................................................................................ 55
3.1.5 Chromaticity (Normalized RGB).................................................................. 56
3.1.6 c1c2c3 ............................................................................................................. 57
3.1.7 l1l2l3 ............................................................................................................... 57
3.1.8 Derivative...................................................................................................... 57
3.1.9 Log Hue ........................................................................................................ 59
3.2 Experimental Setup........................................................................................... 59
3.3 Results............................................................................................................... 61
3.3.1 Indoors Manual Change Light Sequence...................................................... 62
3.3.2 Indoor Manual Change Iris Sequence........................................................... 69
3.3.3 Indoor Manual Flashlight Sequence ............................................................. 76
3.3.4 Outdoor Automatic Change Light Sequence ................................................ 82
3.3.5 Outdoor Manual Change Light Sequence..................................................... 89
3.4 Analysis ............................................................................................................ 95
3.4.1 Indoors Manual Change Light Sequence...................................................... 96
3.4.2 Indoor Manual Change Iris Sequence........................................................... 99
3.4.3 Indoor Manual Flashlight Sequence ........................................................... 102
3.4.4 Outdoor Automatic Change Light Sequence .............................................. 105
3.4.5 Outdoor Manual Change Light Sequence................................................... 107
3.5 Conclusions..................................................................................................... 109
4 AUGMENTING A SOLID RECTANGLE ............................................................ 113
4.1 Color Representations..................................................................................... 113
4.2 Proposed Method ............................................................................................ 114
4.2.1 Color Model ................................................................................................ 115
4.2.2 Minimum Bounding Quadrilateral.............................................................. 118
4.2.3 Quadrilateral Refinement............................................................................ 119
4.2.4 Deinterlacing............................................................................................... 122
4.2.5 Displaying the Result.................................................................................. 124
4.3 Optimization Methods .................................................................................... 125
4.4 Discussion....................................................................................................... 135
4.5 Conclusions..................................................................................................... 136
5 AUGMENTING HUMAN FACES AND HEADS................................................ 137
5.1 Integration of Detection and Tracking............................................................ 137
5.1.1 Face Detection and Localization................................................................. 138
5.1.2 Face Tracking ............................................................................................. 140
5.1.3 Integration of Detection and Tracking........................................................ 144
5.1.4 Results......................................................................................................... 148
5.2 Initial Pose Estimation .................................................................................... 149
5.2.1 Pose from Skin Color Detection ................................................................. 150
5.2.2 Proposed Point Symmetry Method ............................................................. 154
5.2.3 Point Symmetry Results.............................................................................. 158
5.2.4 Conclusion .................................................................................................. 161
5.3 Extension of the Active Appearance Model for Face Tracking ..................... 162
5.3.1 Building an Active Appearance Model ...................................................... 162
5.3.2 Tracking with an Active Appearance Model .............................................. 170
5.3.3 AAM Experimental Results........................................................................ 181
5.3.4 Optimization ............................................................................................... 184
5.3.5 Conclusion .................................................................................................. 188
5.4 Augmentation.................................................................................................. 189
5.5 Conclusions..................................................................................................... 190
6 CONCLUSIONS..................................................................................................... 193
6.1 Contributions .................................................................................................. 193
6.2 Future Directions ............................................................................................ 195
REFERENCES ............................................................................................................... 199
LIST OF TABLES
Table 1: Original code to calculate chromaticity. Functions prefixed by "cv" reference the
OpenCV library................................................................................................................... 129
Table 2: Code for table lookup to replace division (three cvDiv lines) in Table 1 .................... 130
Table 3: Chromaticity calculation after integrating scale and sum into loop. ............................ 131
Table 4: Chromaticity calculation after all OpenCV functions have been replaced. ................. 132
Table 5: Final optimized code for chromaticity calculation. ...................................................... 134
Table 6: Results from integrating detection and tracking........................................................... 149
Table 7: Roll angle calculation results........................................................................................ 160
LIST OF FIGURES
Figure 1: Milgram's virtuality continuum....................................................................................... 1
Figure 2: Pepper's Ghost ................................................................................................................. 8
Figure 3: Augmented first down line in football. ........................................................................... 9
Figure 4: Virtual advertising......................................................................................................... 10
Figure 5: Example of an aircraft HUD. ........................................................................................ 11
Figure 6: Image guided surgery. ................................................................................................... 12
Figure 7: Overview of the AR application.................................................................................... 21
Figure 8: Screen capture from "Magic Mirror" project. Green highlights show areas where
motion is detected. ................................................................................................................ 33
Figure 9: Sample display from the virtual mirror of Darrell et al. which distorts detected faces. 34
Figure 10: Screenshots from the face augmentation demo of Lepetit et al. ................................. 36
Figure 11: A sample frame from the experiment. The labels were added later to accommodate
printing in black and white. .................................................................................................. 60
Figure 12: RGB color space for the Indoor Manual Change Light Sequence.............................. 62
Figure 13: RGB color space in 3D plot for the Indoors Manual Change Light Sequence ........... 63
Figure 14: YIQ color space for the Indoors Manual Change Light Sequence ............................. 63
Figure 15: HSV color space for the Indoors Manual Change Light Sequence............................. 64
Figure 16: HLS color space for the Indoors Manual Change Light Sequence ............................. 65
Figure 17: CIELAB color space for the Indoors Manual Change Light Sequence ...................... 65
Figure 18: Chromaticity space for the Indoors Manual Change Light Sequence......................... 66
Figure 19: c1c2c3 color space for the Indoors Manual Change Light Sequence ........................... 67
Figure 20: l1l2l3 color space for the Indoors Manual Change Light Sequence ............................. 67
Figure 21: Derivative color space for the Indoors Manual Change Light Sequence.................... 68
Figure 22: Log Hue for the Indoors Manual Change Light Sequence.......................................... 69
Figure 23: RGB color space for the Indoor Manual Change Iris Sequence ................................. 70
Figure 24: RGB color space in 3D for the Indoors Manual Change Iris Sequence...................... 70
Figure 25: YIQ color space for the Indoor Manual Change Iris Sequence .................................. 71
Figure 26: HSV color space for the Indoor Manual Change Iris Sequence ................................. 71
Figure 27: HLS color space for the Indoor Manual Change Iris Sequence.................................. 72
Figure 28: CIELAB color space for the Indoor Manual Change Iris Sequence ........................... 72
Figure 29: Chromaticity color space for the Indoor Manual Change Iris Sequence .................... 73
Figure 30: c1c2c3 color space for the Indoor Manual Change Iris Sequence................................. 73
Figure 31: l1l2l3 color space for the Indoor Manual Change Iris Sequence ................................... 74
Figure 32: Derivative color space for the Indoors Manual Change Iris Sequence....................... 75
Figure 33: Log Hue for the Indoor Manual Change Iris Sequence .............................................. 75
Figure 34: RGB color space for the Indoor Manual Flashlight Sequence.................................... 76
Figure 35: RGB color space in 3D for the Indoor Manual Flashlight Sequence.......................... 77
Figure 36: YIQ color space for the Indoor Manual Flashlight Sequence..................................... 77
Figure 37: HSV color space for the Indoor Manual Flashlight Sequence .................................... 78
Figure 38: HLS color space for the Indoor Manual Flashlight Sequence .................................... 78
Figure 39: CIELAB color space for the Indoor Manual Flashlight Sequence.............................. 79
Figure 40: Chromaticity color space for the Indoor Manual Flashlight Sequence....................... 80
Figure 41: c1c2c3 color space for the Indoor Manual Flashlight Sequence................................... 80
Figure 42: l1l2l3 color space for the Indoor Manual Flashlight Sequence..................................... 81
Figure 43: Derivative color space for the Indoor Manual Flashlight Sequence ........................... 81
Figure 44: Log Hue for the Indoor Manual Flashlight Sequence ................................................. 82
Figure 45: RGB color space for the Outdoor Automatic Change Light Sequence....................... 83
Figure 46: RGB color space in 3D for the Outdoor Automatic Change Light Sequence............. 83
Figure 47: YIQ color space for the Outdoor Automatic Change Light Sequence........................ 84
Figure 48: HSV color space for the Outdoor Automatic Change Light Sequence....................... 84
Figure 49: HLS color space for the Outdoor Automatic Change Light Sequence ....................... 85
Figure 50: CIELAB color space for the Outdoor Automatic Change Light Sequence ................ 85
Figure 51: Chromaticity color space for the Outdoor Automatic Change Light Sequence.......... 86
Figure 52: The c1c2c3 color space for the Outdoor Automatic Change Light Sequence.............. 86
Figure 53: The l1l2l3 color space for the Outdoor Automatic Change Light Sequence ................ 87
Figure 54: Derivative color space for the Outdoor Automatic Change Light Sequence.............. 88
Figure 55: Log Hue for the Outdoor Automatic Change Light Sequence.................................... 88
Figure 56: RGB color space for the Outdoor Manual Change Light Sequence ........................... 89
Figure 57: RGB color space in 3D for the Outdoor Manual Change Light Sequence ................. 90
Figure 58: YIQ color space for the Outdoor Manual Change Light Sequence ............................ 90
Figure 59: HSV color space for the Outdoor Manual Change Light Sequence ........................... 91
Figure 60: HLS color space for the Outdoor Manual Change Light Sequence............................ 91
Figure 61: CIELAB color space for the Outdoor Manual Change Light Sequence ..................... 92
Figure 62: Chromaticity color space for the Outdoor Manual Change Light Sequence .............. 93
Figure 63: The c1c2c3 color space for the Outdoor Manual Change Light Sequence................... 93
Figure 64: The l1l2l3 color space for the Outdoor Manual Change Light Sequence..................... 94
Figure 65: Derivative color space for the Outdoor Manual Change Light Sequence .................. 94
Figure 66: Log Hue for the Outdoor Manual Change Light Sequence ........................................ 95
Figure 67: Standard Deviation for the Indoor Manual Change Light Sequence .......................... 97
Figure 68: Discriminative power for the Indoor Manual Change Light Sequence ...................... 98
Figure 69: Standard Deviation for all frames of the Indoor Manual Change Iris Sequence ...... 100
Figure 70: Standard deviation for frames with no saturation for the Indoor Manual Change Iris
Sequence ............................................................................................................................. 100
Figure 71: Indoor Manual Change Iris Sequence Discriminative Power for all frames............. 101
Figure 72: Indoor Manual Change Iris Sequence Discriminative Power excluding frames with
saturation............................................................................................................................. 102
Figure 73: Standard Deviation for all frames of the Indoor Manual Flashlight Sequence......... 103
Figure 74: Standard Deviation for frames with no saturation for the Indoor Manual Flashlight
Sequence ............................................................................................................................. 103
Figure 75: Discriminative Power for the Indoor Manual Flashlight Sequence with all frames . 104
Figure 76: Discriminative power for the Indoor Manual Flashlight Sequence with no saturated
frames.................................................................................................................................. 105
Figure 77: Standard deviations for all frames of the Outdoor Automatic Change Light Sequence
............................................................................................................................................. 106
Figure 78: Discriminative power for the Outdoor Automatic Change Light Sequence ............. 107
Figure 79: Standard Deviation for all frames of the Outdoor Manual Change Light Sequence 108
Figure 80: Discriminative power for the Outdoor Manual Change Light Sequence.................. 109
Figure 81: Original and augmented video frame. ....................................................................... 113
Figure 82: Processing stages: original frame, converted to chromaticity space, detected object
color pixels, bounding quadrilateral, saturation, horizontal gradient, vertical gradient, and
difference between subsequent frames. .............................................................................. 117
Figure 83: Area calculation for finding the bounding quadrilateral ........................................... 118
Figure 84: Edge refinement process ........................................................................................... 120
Figure 85: Refining the quadrilateral. The rough bounding quadrilateral is computed from the
color detection result, which misses the bottom edge. The refinement process results in a
more accurate outline.......................................................................................................... 121
Figure 86: Tearing caused by interlacing ................................................................................... 123
Figure 87: Fast deinterlacing algorithm results .......................................................................... 124
Figure 88: Sample frames from the video sequence showing occlusion, a corner off the screen,
and challenging lighting conditions. The tracked boundary is drawn in red..................... 136
Figure 89: Integrating detection and tracking data ..................................................................... 146
Figure 90: Face detection results. ............................................................................................... 151
Figure 91: Skin detection results using local histogram. Pixels detected as skin for each face are
shown in magenta. The orientation calculated for each face is shown with the horizontal
axis in red and the vertical axis in green............................................................................. 152
Figure 92: Global skin detection results. .................................................................................... 153
Figure 93: Face orientation using global skin detection. ............................................................ 154
Figure 94: Symmetry distance. (a) is the original point set, (b) shows the original filled points
and the reflected hollow points, (c) shows the closest filled point to each hollow point. The
average of these line lengths is the symmetry distance. ..................................................... 156
Figure 95: Measuring symmetry. (a) shows the corners in black, (b) and (c) show two different
rotations, with the reflected points in white, and lines between matched points................ 158
Figure 96: Test images. The original image is shown with a black triangle that connects the
ground truth coordinates for the eyes and mouth center..................................................... 159
Figure 97: Algorithm results. The box shows the region detected as a face, small green circles
show the high contrast points found in that region, and an axis shows the orientation found,
with red pointing in the positive x direction and green in the positive y. ........................... 160
Figure 98: Face orientation results using our point symmetry method. The red line indicates the
horizontal axis, the green line indicates the vertical axis, and the high contrast points are
shown in blue. ..................................................................................................................... 161
Figure 99: Offline generic face model fitting process. (a) shows the points identified so far, (b)
shows the prompt for the last point, and (c) shows the mesh that has been fit overlaid on the
image................................................................................................................................... 165
Figure 100: The first three shape modes. The circles show the mean shape, and the lines show
the magnitude and direction of the positive (green) and negative (red) displacements. .... 168
Figure 101: The mean appearance and two appearance modes.................................................. 170
Figure 102: X and Y gradients of the mean appearance............................................................. 174
Figure 103: Warp Jacobians for the first three 2D shape modes. The top row is the x component,
and the bottom row is the y component. ............................................................................. 175
Figure 104: The steepest descent images obtained by multiplying the gradients in Figure 102 by
the warp Jacobians in Figure 103. ...................................................................................... 175
Figure 105: Iterations 0, 5, and 12 of fitting the AAM. Column (a) is the 2D shape on the
original image with the face detection result shown by a rectangle. Column (b) is the 3D
shape on the original image. The warped input image I(N(W(x; p); q)) with the mesh
overlaid is shown in Column (c), and the error image I(N(W(x; p); q)) − A0(x) is shown in
column (d). .......................................................................................................................... 183
Figure 106: Video frame augmented with cap and glasses. ....................................................... 190
LIST OF ACRONYMS/ABBREVIATIONS
2D Two Dimensional
3D Three Dimensional
AAM Active Appearance Model
API Application Program Interface
AR Augmented Reality
CCD Charge-Coupled Device
DOF Degree of Freedom
DP Discriminative Power
HLS Hue Lightness and Saturation
HMD Head Mounted Display
HSV Hue Saturation and Value
Hz. Hertz
ICIA Inverse Compositional Image Alignment
MR Mixed Reality
NTSC National Television System Committee
PC Personal Computer
PCA Principal Component Analysis
RANSAC RANdom SAmple Consensus
RGB Red, Green and Blue
SDK Software Development Kit
SVD Singular Value Decomposition
VR Virtual Reality
1 INTRODUCTION
1.1 Augmented Reality Overview
Augmented Reality (AR) is a situation where real objects are augmented by adding
virtual elements. It is related to the more commonly known Virtual Reality (VR), in which
everything is synthesized. The Webopedia [102] defines virtual reality as “An artificial
environment created with computer hardware and software and presented to the user in such a
way that it appears and feels like a real environment.” Milgram and Kishino [68] defined a
continuum relating augmented reality and virtual reality under the umbrella of Mixed Reality
(MR), which is shown in Figure 1. This structure categorizes applications by their degree of
virtual and real elements, with no synthesized elements at one extreme and no real elements at
the other. Our research focuses on augmented reality, which is near the “real” end of the
spectrum, introducing a few virtual items attached to objects tracked in a real environment. This
is a rapidly expanding sector that uses both computer vision and computer graphics to achieve an
interactive result that seamlessly mixes real and virtual elements.
Figure 1: Milgram's virtuality continuum.
The biggest challenges for augmented reality are speed and accuracy. Speed is necessary
for an interactive environment, requiring processing frame rates of at least 10 Hz., and preferably
closer to video input rates of 25 to 30 Hz. Many video tracking algorithms either run in “batch”
mode, which uses information about all the frames when processing each frame, or they require
seconds or minutes to process each frame. While good results are achieved, these algorithms are
unsuitable for augmented reality. Fast algorithms that do not require future frames are needed
for interactive applications such as augmented reality.
Tracking accuracy is necessary to maintain the illusion that virtual objects are attached to
real objects. The human eye is sensitive to relative motion between objects, and errors or noise
in the position of virtual objects will be apparent to the observer. The high standards imposed by
these two challenges make augmented reality applications difficult to do well.
To meet these challenges, many augmented reality systems use specialized environments
or additional hardware to create realistic results. Carefully lit green screens are often used to
identify where to substitute virtual elements. Tracking devices may be used to pinpoint the
location and orientation of real objects. Markers may be applied to real objects to make them
easier to track. Knowledge about the environment may be collected beforehand, in the form of
3D models or locations of static objects. Multiple cameras may be used to get depth information
or to deal with occlusions in a single camera view.
1.2 Virtual Looking Glass
Our work lies at the minimalist end of the equipment spectrum, with no specialized
environment. We use a standard PC, a consumer video camera,
and a display to create a virtual looking glass. Real objects visible to the camera are mirrored on
the display, and virtual objects are added that augment some of the real objects.
This type of application is worthwhile on its own, both for the technical challenges and
the entertainment value of the result. Beyond entertainment, the augmentation could be used to
convey information, e.g., showing a hat appropriate for the day’s weather. In addition, the
modules needed are useful in other situations. For example, real time face detection and tracking
are needed in automated video surveillance systems.
In order to better understand the relationship between the tracked object color and the
color recorded by the camera under different lighting conditions, we performed an experiment to
see if the theoretical changes predicted for the color components in various color spaces matched
the results measured with real objects, lights, and cameras. We evaluated the constancy of a
number of color spaces as the intensity of a nearly white light changed, as well as their ability to
discriminate between different colors. The results showed that the traditional color spaces used
for illumination-invariant analysis did not work as well as some others, but all fell short of the
ideal.
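The theoretical prediction that the experiment tests can be sketched with a toy check (Python for illustration only; the sample RGB values are hypothetical). Under an ideal pure intensity change, a hue-based space such as HSV predicts that hue and saturation stay fixed while value scales with the light:

```python
import colorsys
import math

# The same surface seen at full and at half illumination (hypothetical values).
full = colorsys.rgb_to_hsv(0.8, 0.4, 0.2)   # sample at full brightness
dim = colorsys.rgb_to_hsv(0.4, 0.2, 0.1)    # same sample at half brightness

print(math.isclose(full[0], dim[0]))        # hue unchanged: True
print(math.isclose(full[1], dim[1]))        # saturation unchanged: True
print(math.isclose(dim[2], full[2] / 2))    # value halved: True
```

Real cameras deviate from this ideal, which is precisely what the experiment measures.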
We created two applications as part of our research. The first one augments a solid
colored rectangle. The object is detected and tracked in real time, and replaced with a prestored
image. While the simplicity of the object makes it easier to handle in some aspects, the limited
number of features available and the change in the recorded color due to illumination variations
create their own challenges. The findings of the color experiment were used to make the rectangle
tracking more robust.
The second application tracks and augments human faces. While face detection and
tracking have both been widely researched, robust and fast algorithms are still rare, especially
under our equipment restrictions, which require allowing new faces to appear while another is
being tracked and handling faces of various sizes and orientations. Since we are accustomed to
viewing our own faces in a
mirror, the virtual mirror setup is a natural platform for creating a real-time interactive
augmented face system.
1.3 Contributions
We have made several contributions while creating these virtual looking glass
applications, including
• Color space selection experiments. Theory and practice for color constancy in the
presence of varying illumination in different color spaces do not always agree. We have
performed experiments measuring the color values recorded by consumer cameras in
varying illumination. From this, a number of color spaces were compared to see how
well they remained constant as the light changed, yet still discriminated between
different colors. We found that some less well-known color spaces perform better than
the traditional ones.
• A real-time algorithm for tracking simple objects in monocular video. Many common
algorithms that track object silhouettes only work for objects with complex feature sets
and run too slowly for interactive use. Our tracking method uses color, edge, and
motion information to achieve real-time rates.
• Optimization methods for real time video processing. Careful optimization can make
the difference between an implementation being “fast enough” or not. We have
achieved speedups of two to ten on various functions involved in tracking a rectangle
and speedups close to 50 for motion detection and face tracking, even starting with an
optimized image processing library.
• Fast video deinterlacing. Consumer level video cameras are interlaced, with the even
and odd lines captured at different times. Moving objects appear to tear when displayed
on a noninterlaced computer display. Our deinterlacing algorithm is fast, maintains the
detail in stationary parts of the frame, and removes the tearing in the moving parts of the
frame.
• An extended continuous symmetry measure that handles not only shapes and graphs but
general 2D point sets as well.
• Real-time initial face pose estimation in video. After a face is detected, its orientation
must be determined. This is often done manually, which is not suitable for a general
augmented reality application. Other methods perform costly correlations to find facial
features. We present a novel method for recovering face orientation using generic high
contrast points. Using the coordinates of these points results in a faster algorithm than
using appearance comparisons, with sufficient accuracy, outperforming methods based
on skin detection.
• Enhanced Active Appearance Models (AAMs) for face tracking. Existing methods for
AAMs require more than a hundred manually marked points on each of several hundred
images to create the 2D and 3D models used for tracking, and the parameters extracted
during tracking do not have meaningful labels. By incorporating a parameterized 3D
nonrigid head model, we reduced the manual labeling effort to 30 points on tens of
images, and derive meaningful parameters from the result, such as “jaw drop” or
“eyebrows raised”, while still maintaining the robustness afforded by AAMs.
• Robust face tracking for augmented reality. We integrate motion detection, face
detection, and tracking to create a system that can track multiple faces, filter out
spurious detections in dynamic regions, eliminate detections in static regions, and verify
the track of faces outside the native range of the face detector.
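The motion-adaptive idea behind the deinterlacing contribution can be sketched as follows (a toy illustration, not the dissertation's implementation; the threshold and frame values are hypothetical). Pixels that agree with their vertical neighbors are assumed stationary and kept, preserving detail; pixels that disagree strongly are assumed to be tearing and are replaced by a vertical interpolation:

```python
THRESH = 20  # hypothetical motion threshold

def deinterlace(frame):
    # frame is a list of rows of grayscale pixel values
    out = [row[:] for row in frame]
    for y in range(1, len(frame) - 1):
        for x in range(len(frame[0])):
            interp = (frame[y - 1][x] + frame[y + 1][x]) // 2
            # a large disagreement with the neighboring field lines
            # suggests motion tearing, so interpolate vertically
            if abs(frame[y][x] - interp) > THRESH:
                out[y][x] = interp
    return out

moving = [[10, 10], [200, 200], [10, 10]]
print(deinterlace(moving))  # tear in the middle row is smoothed away
```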
2 BACKGROUND
2.1 Current AR Applications
Many examples of augmented reality exist today. In some cases, they are so well
accepted that many people don’t realize that they are artificial. The ideal augmented reality
system should be one in which the virtual elements are indistinguishable from the real elements,
and cannot be identified without outside knowledge of the situation. A sample of these
applications will be explored here.
One of the earliest examples of augmented reality is known as “Pepper’s Ghost”, named
after John Henry Pepper, a chemistry professor at London Polytechnic Institute. In the 1860s,
audiences saw the startling effect of a transparent, three-dimensional ghost interacting freely
with a live actor on a seemingly ordinary stage. The effect is achieved by a piece of glass
mounted at an angle. The audience sees both the real scene through the glass and a transparent
image reflected by the glass. The arrangement is illustrated in Figure 2. The effect was used for
ghosts in Shakespeare’s plays, and is still used today in places like Disney’s Haunted Mansion
ride.
Figure 2: Pepper's Ghost
Local weather broadcasts have been using augmented reality for years, with the help of a
blue (or green) screen. The person appears to be standing in front of graphics showing
temperatures or radar, when he is actually in front of a solid color screen. A chroma-keying
system replaces the pixels that are in a certain color range with an alternate image. The person
must look at a monitor to see what he is pointing at, and adjust accordingly. While
straightforward, this system has stringent lighting requirements for the screen, since any shadows
or changes in lighting will change the color viewed by the camera, which will result in holes in
the blended image.
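The chroma-keying substitution amounts to a per-pixel color test (a toy sketch; the "green" rule and threshold below are hypothetical, not a broadcast system's actual calibration):

```python
def chroma_key(fg_pixel, bg_pixel):
    """Return the graphic pixel where the foreground is screen-colored."""
    r, g, b = fg_pixel
    # strongly green pixels are treated as "screen" and keyed out
    if g > 120 and g > 1.5 * r and g > 1.5 * b:
        return bg_pixel
    return fg_pixel

print(chroma_key((20, 200, 30), (90, 60, 40)))    # green screen -> graphic
print(chroma_key((180, 160, 140), (90, 60, 40)))  # skin tone -> unchanged
```

A shadow on the screen shifts pixels out of the keyed range, which is exactly why this approach produces holes under uneven lighting.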
During football broadcasts, augmented reality is used to draw the first down line on a
football field for each play. With a system called “1st & Ten” from SportVision [86], the pan,
tilt, zoom and focus for each camera are tracked and registered with a model of the field to
determine the mapping of each pixel to a point on the field. The desired location of the line is
input manually. With this information, the line can be positioned at a fixed location on the field.
A color model is calibrated prior to each game to learn the grass, dirt, and uniform colors, so that
the line covers up the ground, and is in turn covered by the players, as shown in Figure 3.
Figure 3: Augmented first down line in football.
Another augmented reality application used in sports broadcasting is the insertion of
virtual advertisements that appear to be attached to the stadium wall. One such system is LVIS
(pronounced “Elvis”) from Princeton Video Image (PVI) [73]. It uses various methods,
including instrumented cameras, a vision-based tracking system and pattern recognition, to
position the advertisement, then modifies the video as it is broadcast. An example is shown in
Figure 4.
Figure 4: Virtual advertising.
The Head-Up Display (HUD) in many military aircraft overlays graphics showing the
horizon, the instantaneous flight path, and the locations of tracked targets over the real world, in
addition to information such as altitude and airspeed. Avionics on board the aircraft calculate
the information to display. The symbols are projected on a see-through combiner glass, giving
the appearance that they are superimposed on the world outside. Figure 5 shows an example.
The Apache helicopter goes one step further with its Integrated Helmet and Display Sight
System (IHADSS). Video showing infrared imagery is reflected off a small lens in front of the
pilot’s right eye. The infrared sensor is mounted on a turret which moves in synchronization
with the pilot’s head, so the pilot merely has to look in a particular direction to see the infrared
view in that direction. This allows the pilot to fly with one eye seeing the real world and the
other eye seeing a correlated sensor view of the world.
Figure 5: Example of an aircraft HUD.
In the medical world, images such as X-Ray, CT or MRI scans must be registered with
the patient’s anatomy before and during surgery. One such application is MIT’s project on
Image Guided Surgery [1]. Three-dimensional reconstructions of the internal anatomy are
obtained from scans and projected on live video views of the patient. This eliminates positioning
errors and allows the surgeon to concentrate on the task without constantly reorienting the scan
data mentally. One example is shown in Figure 6.
Figure 6: Image guided surgery.
Virtual sets are a step up from basic green screen chroma-keying. Instead of
broadcasting from an expensive studio, the anchor is filmed in front of a backdrop, which may
contain calibrated markers. By tracking the location of these markers, the camera locations and
parameters can be determined. With this knowledge, the background can be replaced with a
three dimensional virtual studio, with accurate perspective changes as the camera moves.
Less well-known applications involve overlaying additional information on a scene, such
as labels for buildings, names of engine parts, or pipes and ducts hidden behind walls. These
provide a more intuitive presentation of information than the traditional means of maps and
blueprints.
This sampling of augmented reality applications shows the diversity of situations in
which it can be applied. A complete list would require a book on its own, and is bounded only
by the imagination. The possibilities range from mere entertainment, to a more useful display of
existing information, to the ability to accomplish tasks that weren’t possible without the added
understanding provided by the augmented reality.
2.2 Augmented Reality Challenges
The diverse sampling of augmented reality applications presented in the previous section
may appear to have little in common. However, all augmented reality applications have two
common requirements: real-time execution and accurate registration between the real and virtual
worlds.
The real-time requirement is derived from the interactivity inherent to augmented reality.
The virtual objects must react to the users immediately. Movies use many of the same
techniques as augmented reality for creating special effects, like filming actors safely in front of
a green screen and then adding an explosion in the background, but they have the luxury of
unlimited time to mix the two. Movie effects can also be tweaked by hand or tried several
different ways to see which achieves the best effect for each individual situation. In augmented
reality, the reaction must be immediate, or the illusion is lost. NTSC cameras update at 30
frames per second, so ideally, the augmented reality display should run at 30 Hz. as well.
Anything less than about 10 Hz. will have unacceptable delay.
The registration between the virtual and real objects in the scene is critical to maintain
the illusion that both sets of objects exist in the same world. The virtual first down line could not
be mistaken for a real line if it didn’t stay firmly stuck to the ground. Virtual objects must be
stable, without jitter or drift, and they must be accurate – a stable object won’t look right if it is
in the wrong place. Registration error and uncertainty are also factors in augmented reality
interfaces, where the user must be able to unambiguously select a single virtual object [19].
2.3 Augmented Reality Implementation
Most of the work in augmented reality is focused on the visual presentation. Other
aspects needed for a totally immersive experience include sound and touch (like tactile feedback,
wind, and heat). Since our work is in the visual arena, these other elements will not be discussed
further here.
The process for generating the augmented view can be described in three basic steps:
1. Sense the real environment
2. Add virtual elements
3. Present the combined result
2.3.1 Sensing the Real Environment
There are many ways that the system can obtain the required knowledge about the real
world. Prior knowledge can be provided, such as the location and dimensions of objects and
cameras, light sources, 3D models of objects, colors and materials of objects, or motion
characteristics of objects. Using prior knowledge of the environment has the advantage that the
data given is stable and accurate, and processing time is not needed to determine this information
while running. The disadvantage is that this makes the application less flexible and limited to a
single location. At times, it is impossible to proceed without prior knowledge. For example, a
system to navigate around a city would require a city map as input.
Another method for determining the location and orientation of various objects is to use
specialized sensors such as magnetic or acoustic trackers. These sensors may require a
transmitter to be mounted on the object to be tracked, and a receiver elsewhere in the
environment. These sensors generally do a reasonable job of tracking the objects within a finite
range, whether or not the object is visible in the scenario. They also make it easy to get the
position and orientation of objects with a simple interface. Drawbacks may include the need to
add sensors and wires to the tracked objects, possibly hindering their motion, and a limited area
of operation. The accuracy of the measurements varies with the device, and may or may not
provide the precision necessary for the application.
Since augmentation of the observer’s view usually involves sampling that view with a
camera, that video can be used to learn about the environment. In fact, since registration with
this view is more important than physical accuracy for making the virtual objects look like they
belong in the real world, use of this view is essential for creating a convincing environment. The
other sources may be used to supplement the visual information, but fine tuning based on the
visual data is needed for a seamless blend. Unfortunately, extracting useful information from the
visual input is not trivial, and tends to involve significant processing resources. Combined with
the real-time requirement, this makes the use of vision techniques for augmented reality a challenge.
Computer vision techniques have been developed to sense the real environment by
performing tasks including finding the camera pose in a static environment, tracking rigid and
nonrigid moving objects, determining the depth of objects in the scene, and object recognition,
but many of the existing algorithms run too slowly to be used in a real-time environment.
Moore’s Law has increased processing speed to the point that many algorithms are possible
today that were too complex to run a few years ago, but speed is still lacking in most algorithm
implementations. One of the focuses of our research in augmented reality is finding algorithms
that run fast enough to meet the real time requirement.
2.3.2 Adding Virtual Elements
In order to make the virtual elements added to a real environment appear like they belong
there, care must be taken to ensure that they are properly registered, occluded, and lit. Lack of
any of these elements will result in a mismatch between the real and virtual parts of the scene.
As discussed earlier with augmented reality challenges, virtual objects must be registered
with the real scene so that they are stable and properly positioned. If a virtual object is attached
to a real object, its position relative to that object must remain consistent, regardless of object
and camera motion. If it jumps around or drifts, the observer will be distracted. If the camera
moves past a static scene made up of both real and virtual objects, both types of objects must
appear to remain stationary. The accuracy of the registration required depends on the resolution
of the display presented to the observer. The error in the registration should be less than the
smallest perceivable difference on the display device.
Occlusion is necessary when a virtual object appears to be behind a real object. One
example is when a football player passes in front of the virtual first down line. If the line did not
disappear at that point, it would appear to be floating in space instead of sticking to the ground.
In the case of the first down line, the occlusion is detected when the pixel that should be part of
the line does not match the grass or dirt colors. In other cases, occlusion determination must
compare the depth (distance from the camera) of the real and virtual objects so that the closer
object occludes the further object.
The other major factor in making virtual objects appear real is lighting. In some cases,
like labels for scene objects, constant lighting may be acceptable, but in most cases the virtual and
real lighting should match. A virtual billboard must change when shadows fall on it, or when the
sun goes behind a cloud. More elaborate lighting involves casting shadows from virtual objects,
and using virtual lights to illuminate real objects.
2.3.3 Presenting the Combined Result
There are a wide variety of methods for presenting the augmented reality to the observer.
We will focus on visual methods, although a complete system can include aural, tactile and
olfactory cues as well. The displays vary in complexity, cost, mobility, and degree of
immersion.
Monitors and projectors are the simplest display devices. The immersion can range from
limited, with a single monitor, to complete, with a dome or displays on all sides. Multiple
people can usually view this type of display at once. The screen or monitor is typically
stationary, but can be tracked in orientation, providing a window into the augmented reality
world.
A head mounted display (HMD) is more complex, since the view must change with the
viewer’s head position, but more flexible as well. There are two main techniques for displaying
the virtual elements combined with the real scene. The first is optical see-through HMDs, where
the visor is transparent. Additional graphics can be projected on the visor to supply the virtual
elements. The second method is video see-through HMDs, where the viewer sees only an
opaque screen in front of his eyes. The real world view is supplied by one (for monocular view)
or two (for stereo view) cameras mounted on the helmet. Graphics can be added to this video
stream for the virtual objects. It is easier to register the real and virtual objects with the video
see-through HMD, since the actual observer’s view is available, but this process introduces a
time delay. The optical see-through HMD has no delay in viewing the real world, which makes
it a challenge to add graphics that are properly synchronized. In addition, the observer’s view is
not directly available with this system. Head trackers may be used with either type of HMD to
aid in determining the viewer’s position and orientation. HMDs tend to provide more immersion
than monitors or projection screens, but can only be viewed by one observer at a time.
2.3.4 Tools
There are several tools and APIs that can be useful for augmented reality applications.
AR Toolkit [2] is an open source software library that can be used to calculate the camera
position and orientation relative to physical markers. It requires that fiducial markers be placed on
the tracked object. If these markers are occluded, the lighting conditions are poor, or the
background has low contrast, stable tracking is not possible [40].
Canon Inc. and the Japanese government teamed to create the Mixed Reality System
Laboratory Inc. to conduct mixed reality research [93]. They developed a package called “MR
Platform” which includes a stereo video see-through HMD and an SDK for a Linux PC
environment [98]. A hardware tracking device determines the rough head position and
orientation, and this is refined using vision based methods to register the locations of markers.
Their AR2Hockey (Augmented Reality AiR Hockey) system uses color segmentation to locate
the markers, and infrared LEDs on the mallets are sensed with an infrared camera to get the
mallet positions on the plane of the table. Their MR Living Room uses infrared landmarks for
video registration. This requirement to modify the environment makes the tool incompatible
with our goals.
OpenCV is an open source library from Intel [14]. It seeks to “aid commercial uses of
computer vision in human-computer interface, robotics, monitoring, biometrics and security by
providing a free and open infrastructure where the distributed efforts of the vision community
can be consolidated and performance optimized.” We have made extensive use of this library in
our work. The routines are generally fast. Most of the functions perform low level image
processing, but there are some high level algorithms as well, including face detection code that
will be described later.
DirectShow is the subset of Microsoft's DirectX for handling multimedia for Windows. It is
based on the concept of a “filter graph”. The three basic filter types are source, transform and
render. Source filters have only output pins, and generate video or audio content, typically from
a file or from an external device such as a camera. Transform filters have both input and output
pins, and are used to modify the content, for example by changing the color depth, resolution,
compression, brightness or adding overlays. Render filters have only input pins and typically
render the result to devices like a screen or speakers or to a file. These filters can be connected
under software control to perform almost any operation on multimedia streams. Filters exist to
handle most cameras and file formats, so this API allows applications to deal with a wide variety
of devices with a minimum of development effort. We have created an introduction to both
DirectShow and OpenCV [90], which is widely used.
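The source → transform → render chain can be mimicked with ordinary generators (a Python stand-in for illustration only; DirectShow itself is a COM-based C++ API, and the functions below are hypothetical):

```python
def source():
    """Source filter: output pins only; generates frames."""
    yield from ([1, 2], [3, 4])

def transform(frames):
    """Transform filter: input and output pins; e.g. brightness adjust."""
    for f in frames:
        yield [p + 10 for p in f]

def render(frames):
    """Render filter: input pins only; consumes and 'displays' frames."""
    return [f for f in frames]

print(render(transform(source())))  # → [[11, 12], [13, 14]]
```

Connecting filters in software then reduces to composing these stages, which is the flexibility the filter-graph model provides.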
2.4 Augmented Looking Glass System Description
We have developed interactive augmented reality systems that simulate a mirror. At first
glance, users see themselves and their surroundings. A second look reveals that enhancements
have been added, such as a picture on a blank placard or hats on the people. The users can
manipulate the virtual objects and interact with them.
The goal is to use minimal hardware and setup, with no special environment. The
hardware consists of a single fixed camera for input, a standard single processor PC for
processing, and a stationary monitor for display. The system should work in all but the most
extreme lighting. (There is not much that can be done if the camera senses all black or all white
pixels.) No special room, fiducial markers or blue screen will be used, nor will the room setup
be known in advance. Any training or calibration is simple and fast, and the goal is to require
none.
The application receives the video stream from the camera, processes it, then renders the
original video plus the augmentation in real time. Microsoft’s DirectShow is used to get the
video frames from the camera. The processing is done with the aid of Intel’s OpenCV library.
The rendering is done either with OpenGL, to take advantage of hardware acceleration of
rendering tasks such as texture mapping, or simple bit-mapped graphics within the DirectShow
transform filter. This overview is shown in Figure 7. The processing requires three basic steps:
1. Detect object
2. Fit model (pose estimation)
3. Augment
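These three steps can be sketched as a per-frame loop (all function bodies below are toy stand-ins, not the dissertation's implementation):

```python
def detect(frame):
    # toy detector: the brightest pixel stands in for the tracked object
    return max(range(len(frame)), key=lambda i: frame[i])

def fit_model(frame, location):
    # toy pose estimate: report the location as a 1-D "pose"
    return {"x": location}

def augment(frame, pose):
    # toy augmentation: mark the augmentation site in a copy of the frame
    out = frame[:]
    out[pose["x"]] = -1
    return out

frame = [3, 9, 1]
print(augment(frame, fit_model(frame, detect(frame))))  # → [3, -1, 1]
```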
Figure 7: Overview of the AR application
Detecting the object to be augmented and localizing it in the video frame is the first step.
Depending on what object we are tracking, there are several techniques that can be used for this
step, which may include color detection and background subtraction. If the color of the object is
known, the object can be located by finding large concentrations of pixels with the given color.
Since the RGB values recorded by the camera vary with lighting, care must be taken to handle
various lighting conditions. Background subtraction can be used to detect moving objects.
Although there are many variations, the basic idea in background subtraction is to build a model
for the fixed background, then mark pixels that differ from this background as belonging to
moving objects. More specialized methods can be used for detecting certain objects, such as
faces, in video.
Once an object is located, its pose (position and orientation) must be determined in order
to augment it. The precision required for this step depends on the application. For example, if
the goal is to put a box around the object, the orientation is not needed and the position can be
approximate. On the other hand, tasks like adding a hat or glasses to a person’s head require that
both the position and orientation be precisely known, or else the augmented portions of the scene
will not appear to be attached to the person’s head. The method for determining the pose
depends on the type of object being tracked. For tracking a solid shape like a rectangle, the
locations of the corners (or equivalently, the edges) must be determined. This is sufficient and
necessary for overlaying a perspective corrected image, and with knowledge of the rectangle size
and camera focal length, the Euclidean position and orientation can be determined. For more
complex objects such as faces, pose can be determined by tracking features like eye and mouth
corners, or by comparing the input image with a database of faces with known poses. For any
method, the task becomes more challenging when parts of the object being tracked are occluded
by other objects or go outside the camera field of view.
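For the rectangle case, the depth recovery mentioned above follows the basic pinhole relation Z = fW/w for a fronto-parallel rectangle of known real width W imaged w pixels wide by a camera with focal length f pixels (a simplified sketch; the numbers are hypothetical):

```python
def distance_from_width(f_pixels, real_width_m, imaged_width_px):
    # pinhole projection: imaged_width = f * real_width / distance,
    # so distance = f * real_width / imaged_width
    return f_pixels * real_width_m / imaged_width_px

print(distance_from_width(800, 0.30, 120))  # → 2.0 (metres)
```

The full Euclidean pose additionally needs the corner locations to recover orientation, which this fronto-parallel sketch omits.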
With the pose known, augmentation is primarily a graphics rendering task. An accurate
pose estimate leads to proper registration of virtual items in the scene. Depending on the
application, the rendering may need to take occlusions and lighting into account. When
rendering with OpenGL, occlusions may be handled by initializing the depth buffer with
appropriate values so that occluded portions of the virtual objects are not rendered. Descriptions
of occluding objects may be supplied to the application, learned with training, or calculated from
the input video. Finding occlusions from the input video is generally done by either tracking all
objects in the scene and their relative depth, or finding parts of the object that don’t fit the model.
For example, we could track all the people in the scene and deal with occlusion when their
silhouettes overlap, or mark all the non-red pixels within the borders of a red object as
occlusions. Likewise, lighting can be given beforehand, learned from training, or determined
from the input video.
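The depth-based occlusion handling amounts to a per-pixel comparison, which the OpenGL depth buffer performs in hardware; a toy 1-D sketch with hypothetical values:

```python
def composite(real_rgb, real_depth, virt_rgb, virt_depth):
    # the virtual pixel wins only where it is closer to the camera
    # than the real scene; None marks "no virtual content here"
    out = []
    for rc, rd, vc, vd in zip(real_rgb, real_depth, virt_rgb, virt_depth):
        out.append(vc if vd is not None and vd < rd else rc)
    return out

# a virtual line at depth 5: visible over grass at depth 9, but hidden
# where a player at depth 2 passes in front of it
print(composite(["grass", "player"], [9, 2], ["line", "line"], [5, 5]))
```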
2.5 Related Work
In this section, we will present an overview of representative work in areas covered by
our work, including color constancy, virtual mirror applications, shape tracking, face detection,
face tracking, and face pose estimation.
2.5.1 Color Constancy
Variable illumination is a factor in most image processing applications using images
recorded by cameras. Two images of the same static scene will not be identical due to noise in
the camera and lighting changes. Any outdoor application must deal with light changes,
especially on cloudy days. Many indoor environments derive some of their light from windows,
resulting in lighting changes similar to outdoor environments. Even without windows (or at
night), reflections from the walls cause lighting changes when there is motion outside the
camera view, so two images of an identical scene can still differ.
Any application that compares successive images, including motion detection for surveillance,
motion compensation, color-based object detection and video indexing, must deal with changes
in the recorded pixels due to illumination changes.
The field of color has been widely studied from several perspectives, including human
perception, computer graphics, and computer vision. Human perception researchers seek to
understand and model how humans convert a spectrum of light input to an image and assign
color labels, as well as other aspects like the emotional response of humans to different colors.
Of particular relevance to the subject at hand is the topic of color constancy, or the
human ability to determine an object’s color correctly most of the time in widely differing
lighting conditions. A camera may record snow with a bluish tint if it erroneously uses indoor
light settings, but a human never sees blue snow. We can tell that a banana is yellow (and not an
unripe green) with various indoor and outdoor lights, in shadow or sunlight, during midday or
with the reddish rays of sunset. Getting a machine to do the same is not trivial. Efforts in this
area will be described in more detail below.
In the field of computer graphics researchers seek to answer the question, “Given the
color and material properties of an object and the lighting conditions, what is the proper color to
display?” This decision takes into account the physics of light reflection and refraction, the
response of rods and cones in the human eye, and the properties of the display device in order to
produce RGB values that will stimulate the eye to produce the same response as the real scene.
Computer vision approaches the problem from the other direction, trying to answer
questions like, “Given the color recorded by the camera, what is the likely object color?” and “Is
the color of the object imaged by this pixel the same or different than the one in the previous
video frame?” Understanding how our eyes interpret the visible spectrum provides insight into
the problem, but it is not necessary for computers to arrive at the answer the same way.
Our interest is in applications tracking a moving object. Even with a fixed light source
and fixed camera settings, the RGB values of the object recorded by a camera will change as the
object moves, due to changes in the surface normal of the object and shadows. Since color is an
important feature in object detection and tracking, we seek to find a color representation that will
remain constant as the object moves as well as when the illumination changes in intensity (such
as the sun going behind a cloud), or the camera adjusts to changing scene brightness.
For this reason, we are not interested in estimating the illuminant, unless that helps to
find an invariant color representation. We are also not very concerned about changes in
illuminant color, since for most tracking applications the light source is a uniform color that is
close to white (like daylight or incandescent lamps). We are, however, extremely interested in
illumination intensity changes, whether from shadows, changes in the surface normal, or varying
light levels.
Many color constancy algorithms assume that the illumination is uniform (or at least
continuous) across the scene, and the scene content does not vary between images. While this is
true in many of our experiments, we do not take advantage of this because we seek a color space
that remains constant for a moving object within the scene.
In the discussion that follows, we will present representative work from the large volume
of research that has been done in this area.
Although the uniform lighting and static scene assumptions do not apply to our
application, there has been much work that uses these premises. One longstanding model,
attributed to von Kries [101], is that illumination change can be modeled by scaling each
channel. That is,
(R’, G’, B’) = (αR, βG, γB), (1)
where α, β, and γ are constant across the image. This is also known as the coefficient rule.
Since this is equivalent to multiplying the original vector by a diagonal matrix, it is also called
the diagonal model.
The simplest category of color constancy methods is Gray World algorithms. They
compute a statistic for the entire image, like the mean or the max, and find α, β, and γ such that
Equation 1 scales each channel so that the statistic matches a preset value. There are many
variations on what statistic to measure and what preset value to use. For example, if the scene
contains a white object, scaling each channel so that the maximum is 255 (for 8-bit color values)
will force the object to be white in the corrected image and hopefully correct the remainder of
the scene appropriately. Buchsbaum [15] scaled each channel so that the mean value for each
channel was 50% of the maximum possible value.
This group of algorithms assumes that the lighting changes equally throughout a fairly
static scene. This may fit well with an outdoor camera monitoring a relatively quiet area, but
clearly would not work if a large white object entered or exited the scene, as this would be
misinterpreted as a change in illumination. Local changes, such as a moving shadow, would not
be corrected.
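A minimal Gray World sketch in Python, following the Buchsbaum-style variant above (the synthetic image and the simulated blue cast are invented for the demonstration):

```python
import numpy as np

# Gray World sketch: scale each channel (Equation 1's diagonal model) so
# that its mean equals 50% of the 8-bit maximum.
def gray_world(img, target=127.5):
    """img: H x W x 3 float array; returns the corrected image."""
    means = img.reshape(-1, 3).mean(axis=0)      # per-channel means
    scale = target / means                       # alpha, beta, gamma of Eq. 1
    return np.clip(img * scale, 0.0, 255.0)

img = np.random.default_rng(0).uniform(20.0, 180.0, (8, 8, 3))
img[..., 2] *= 1.25                              # simulate a bluish cast
out = gray_world(img)
# After correction, all three channel means agree at the target value.
```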
To handle localized changes, Gershon et al. [41] used this same idea on smaller regions,
segmenting the image into planar surfaces. A problem with this approach is that improper
segmentation may introduce errors. Edges in the segmentation can be caused by geometric
discontinuities, shadow edges, or specular highlights.
Finlayson et al. [28][29] added sensor sharpening, which maps the original sensors to a
new set by multiplying the original vector by a 3x3 matrix. The performance of coefficient-
based color constancy algorithms is improved by using the transformed sensors. Three methods
for finding the transformation matrix were proposed. Data-based sharpening uses knowledge of
the surface reflectances and illuminants in a particular scene to find the optimal transformation.
Sensor-based sharpening calculates the transformation that will produce the narrowest band
sensors. The third operates on a limited set of illuminants and reflectances to achieve perfect
color constancy.
Barnard et al. [8] evaluated the effect of sharpening on several color constancy
algorithms and found that sometimes it helped and other times it did not. Based on these
observations, they proposed multiple-illuminant-with-positivity sharpening. This method
minimizes the mapping error over a set of plausible illuminants, rather than between two
specific illuminants as in the original approach, since the second illuminant is unknown in
color constancy problems.
Land (of Polaroid fame) and his colleagues made a major contribution to color constancy
with the Retinex theory [59][60][61]. The name comes from a combination of retina and cortex.
Land was trying to model the human visual system, but didn’t know where the color constancy
operation took place. Land also coined the term “Mondrian”, used extensively in the color
constancy literature, because the abstract collections of colored rectangles he used for his tests
reminded him of a painting by Piet Mondrian. In this simplified world of uniform lighting and
sharp borders between colors, he performed experiments that changed our understanding of how
humans perceive color.
One experiment involved illuminating the Mondrian with three independently controlled
light sources, with low, medium and high frequencies. The lights were individually adjusted so
that the energy reflected from a white patch in the Mondrian for the three wavelengths were
equal (1, 1, 1). Human subjects reported the observed color of the patch to be white. The lights
were then changed so that the energy reflected from a yellow patch was (1, 1, 1) in the same
units as before. The human response was expected to be “white”, since the spectral energy from
the yellow patch was the same as the white patch before, but the observer reported “yellow”.
The process was repeated numerous times, with different colors and light settings, and the
observer consistently reported the correct intrinsic color with only minor deviations. The
experiment was also done with the subject demonstrating the ability to match a patch in the
Mondrian to a standardized set of color samples illuminated with a reference white light.
Further experiments showed that humans appear to process inputs from the cones that
sense red, green, and blue wavelengths independently. This correlates with the coefficient rule
that scales each channel separately. Land and his colleagues also demonstrated that human
perception of color is relative across the field of view. The borders between adjacent colors
affect our ability to compare them.
These findings were combined to create the retinex algorithm. In retinex methods, small
changes in image intensity are assumed to be caused by illumination variation, and large changes
are assumed to be from geometry changes. The initial algorithm [60][61] traced paths through
the image, computing ratios between regions, assuming that white would be encountered
somewhere in the image. Subsequent work [59] removed the requirement for a white patch by
calculating the average relative reflectance in a small area using paths from surrounding pixels.
Each path to the designated area is traversed, accumulating ratios of neighboring pixels that
differ by more than a threshold.
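A simplified one-dimensional sketch of this path-based ratio idea follows (the threshold and intensity values are illustrative, and real retinex implementations differ in detail):

```python
import numpy as np

# Step along a path of pixel intensities, keeping ratios only when neighbors
# differ by more than a threshold (a reflectance edge); small changes are
# attributed to illumination and ignored.
def retinex_path(intensities, threshold=0.1):
    """Return relative reflectance along the path, anchored at 1.0."""
    rel = [1.0]
    for prev, cur in zip(intensities, intensities[1:]):
        ratio = cur / prev
        if abs(np.log(ratio)) < threshold:   # small step: assume lighting
            ratio = 1.0
        rel.append(rel[-1] * ratio)
    return rel

# A smooth lighting ramp (small steps) is suppressed, while a sharp
# material edge (a factor of 2) survives in the recovered reflectance.
path = [1.0, 1.05, 1.1, 1.15, 2.3]
rel = retinex_path(path)
```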
The method is effective for Mondrians, with no lighting discontinuities, like shadows or
sudden depth changes, but has trouble when it misses an edge. Implementations of retinex in
MATLAB have recently been published [38][18].
Barnard et al. [9] improved the edge detection in retinex and recovered the illumination
variation across the image. Once the illumination is known, it can be removed. The image is
segmented into regions roughly corresponding to surfaces. A relative illumination map is
calculated by finding the ratio between each point in a patch and the centroid of the patch, then
solving for the relative strengths of the patch centers, assuming that illumination is smooth
across region boundaries.
Gamut mapping algorithms were introduced by Forsyth [36]. Assuming that the scene
contains a wide variety of reflectances, the illuminant is estimated by observing the gamut of
sensor responses. Constraints on the possible illuminants that could produce the observed image
are used. For example, if there are strong red sensor responses, the light is not blue. The
feasible set of illuminants is those diagonal transforms which map the observed gamut to a
subset of the possible gamut under a canonical white light. The results for Mondrians were
reasonable when a diverse set of colors was present. Ambiguities caused by multiple light
sources, varying illumination, shadows, specular highlights, or orientation effects were not
addressed.
Finlayson et al. [30] introduced color by correlation, which computes multiple possible
illuminants, then uses likelihood estimates to choose a single illuminant. Due to the ambiguity
in determining light intensity (is it darker because the light is dimmer or because the objects have
lower reflectivities?) they reduce the sensor responses and the illumination from three
components (R, G, B) to two (R/B, G/B). First, information is collected about which illuminants
produce which image colors. Second, this data is correlated with the colors present in a given
image to calculate likelihoods. Finally, these likelihoods are used to estimate the illuminant.
Unfortunately, since this idea ignores light intensity, it cannot be used for the most common
case, where only the light intensity changes over time.
An extensive evaluation of color constancy algorithms was done by Barnard et al.
[7][10]. The first part analyzes synthetic data, and the second installment uses real non-
Mondrian images. The evaluation includes a wide range of color constancy algorithms,
including gray world, retinex, gamut-mapping, and color by correlation. The experiments tested
the ability of the various algorithms to recover the illuminant color and intensity and the error in
the corrected color image. Due to the difficulties in changing only the light color and intensity
between images, the corrected image color was only evaluated for the synthetic scenes, and
automatic camera aperture was simulated even on the real images. Since our desire is to measure
how constant a color remains while the light intensity varies with real world scenes and cameras,
this study does not provide any useful data towards this end.
In color constancy, the image with unknown illumination is transformed to its equivalent
under a canonical light. Invariants take a different approach, finding quantities in which the
lighting terms cancel out, making them independent of illumination.
Funt and Finlayson [39] use ratios of adjacent colors as an invariant in the task of color
indexing (finding a matching image from a database based on color histograms). These ratios
are actually implemented as the derivative of the logarithm of the image. Assuming that the
unknown lighting is the same at two adjacent pixels, this operation should produce the same
result, independent of light color or intensity. The continuous lighting assumption was also used
in retinex. However, this method does not handle illumination discontinuities (such as those
caused by a geometric discontinuity), and does not discriminate well between colors. For
example, a solid red object and a solid blue object of the same shape would both have log
derivatives of zero (ratios of one) across the interior of the object, and the ratios around the edge
would depend on the background color as much as the object color.
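The invariance of this ratio-based representation is easy to see in a small sketch (the pixel values are invented):

```python
import numpy as np

# Sketch of the color-ratio invariant: the derivative of the log image.
# Scaling a channel by any constant (a diagonal-model lighting change) adds
# a constant in log space, which the derivative removes.
row = np.array([10.0, 20.0, 40.0, 40.0, 5.0])   # one channel of one image row
log_deriv = np.diff(np.log(row))

brighter = 3.7 * row                            # uniform intensity change
log_deriv_bright = np.diff(np.log(brighter))
# log_deriv_bright equals log_deriv: the lighting factor cancels.
```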
Healey and Slater [50] show that changes in illumination cause affine transformations of
a color histogram. They use Taubin and Cooper’s [94] affine invariant moments to describe the
color histograms. The results are claimed to improve on previous methods for color indexing
while using fewer parameters to describe the color pixel distribution. The ideas here are specific
to the object recognition task, and don’t apply to pixel-wise computations or localized
illumination changes. The work is extended in [84] by computing the invariants for smaller
areas instead of the whole image. This allows matches to be found in the presence of partial
occlusion, but still doesn’t help in our quest for pixel-wise invariants.
Gevers and Smeulders [44] analyze several color spaces for invariance to image
conditions, such as viewing orientation, illumination intensity, or highlights. These properties
were derived from a dichromatic reflectance model. Hue, saturation, and normalized RGB color,
as well as two newly proposed color models, c1c2c3 and l1l2l3, were shown to be invariant to a
change in viewing direction, object geometry, and illumination for white light. Hue and l1l2l3
should also be invariant to specular highlights. They also introduced a third new color space,
m1m2m3, which is based on ratios between pixels in two locations. The m1m2m3 space should be
invariant to changes in light color as well. The performance of the color spaces were then tested
in a color-based object recognition experiment. The c1c2c3 and l1l2l3 model improved on the
recognition success with white light, and the m1m2m3 space worked the best when the light color
changed. Since ratios do not distinguish between different colors (e.g., constant red and constant
green will both have ratios between adjacent pixels of 1.0), m1m2m3 is not useful for object
detection or tracking based on color, but may be used to find color edges. We used the c1c2c3
and l1l2l3 models in our experiments.
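A sketch of the c1c2c3 computation and its intensity invariance follows (the formulas are as commonly stated for this model; treat the code as an illustration rather than a reference implementation):

```python
import numpy as np

# Each c1c2c3 component depends only on ratios of channels, so a uniform
# intensity scale -- shading or a light-intensity change -- cancels out.
def c1c2c3(r, g, b):
    return (np.arctan2(r, max(g, b)),
            np.arctan2(g, max(r, b)),
            np.arctan2(b, max(r, g)))

original = c1c2c3(60.0, 120.0, 30.0)
dimmed = c1c2c3(0.4 * 60.0, 0.4 * 120.0, 0.4 * 30.0)   # 60% darker
# original and dimmed agree component-wise.
```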
Finlayson and Schaefer [31] evaluated color indexing using a variety of color spaces,
color indexing methods, and color constancy algorithms that are theoretically invariant to
various types of illumination changes and concluded that none of the techniques performs well
enough for the color indexing task.
Finlayson and Schaefer [32] point out that although hue is generally considered to be
invariant to brightness, many cameras apply gamma correction to adjust for varying scene
brightness. This nonlinear change modifies the hue, as it is usually computed. They propose a
new method for computing hue using logarithms that is invariant to gamma as well as
illumination changes. We used this Log Hue in our experiments.
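The gamma invariance can be sketched as follows (the formula is as commonly stated for this log-hue method, and the channel values are invented): a gamma exponent multiplies every log channel value, and a brightness scale adds the same constant to every log value; both cancel in the arctangent of the two differences.

```python
import numpy as np

# Hue computed from differences of log channel values, invariant to gamma
# and to uniform brightness scaling.
def log_hue(r, g, b):
    lr, lg, lb = np.log(r), np.log(g), np.log(b)
    return np.arctan2(lr - lg, lr + lg - 2.0 * lb)

h_linear = log_hue(100.0, 60.0, 20.0)
h_gamma = log_hue(100.0 ** 0.45, 60.0 ** 0.45, 20.0 ** 0.45)  # gamma = 0.45
# h_linear and h_gamma agree.
```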
Geusebroek et al. [42] derive a complete set of geometrical invariants for various color
invariant sets. Basically, they use these invariants to segment images by finding regions that
have the same material color, even though the RGB color recorded by the camera varies due to
changing surface normal, illumination intensity, or illumination direction across the object.
2.5.2 Virtual Mirror Applications
In this section, we review augmented reality applications that use the concept of a virtual
mirror.
A project called “Magic Mirror” [24] is similar to ours in that it uses a camera and a
screen to simulate a mirror interactively, but it is designed for a large audience, like a theater.
Computer generated effects are overlaid, but the image processing involved is solely motion
detection. The areas where there is motion are highlighted, and used to affect a virtual object,
such as a ball. Figure 8 shows an example from this application. The reported frame rate is 15
Hz.
Figure 8: Screen capture from "Magic Mirror" project. Green highlights show areas where motion is detected.
In another system, Darrell et al. [23] detect faces in real time using stereo cameras, and
distort them on a display. A sample of the output of this system is shown in Figure 9. A half-
silvered mirror is used so that two stereo cameras can be aligned with the optical axis of the
display. Stereo information obtained with the help of special purpose hardware is used to
segment the users, and then skin color detection is applied to identify likely body parts. Face
detection is used to eliminate hands and other non-face parts from further consideration. The
detected faces are then distorted on the output display. The system was reported to run at 12 Hz.
This system localizes faces, but does not need to find the orientation of the faces nor identify
features, such as the location of the eyes. It also uses two cameras, three computer systems and a
dedicated stereo computation board, which is significantly more equipment than our minimal
setup (although computational power has increased significantly since this was done in 1998).
Figure 9: Sample display from the virtual mirror of Darrell et al. which distorts detected faces.
Lepetit et al. [63] demonstrated a face augmentation system that does not require
specialized hardware or markers. It accurately recovers the full 3D pose of the face, allowing
virtual additions, such as glasses or a mustache at a rate of 25 Hz. The demo used the authors’
tracking method [62], which uses a calibrated camera and a 3D model and image library of the
object being tracked. The stability of the tracked objects comes from tracking many points on
the object, so will not work for simple objects, such as a solid rectangle, that only have a few
unique points. A single face fills most of the screen in all of the sample images, such as in
Figure 10. Example movies were shown that had other people in the background, changed the
lighting, and occluded parts of the face without losing track. Sudden, large motion appears to
cause tracking to fail, invoking a recovery scheme. Only one face at a time is tracked. It is not
clear whether nonrigid motion such as facial expressions will cause problems. More details of
the tracking method will be discussed with other face tracking methods.
Figure 10: Screenshots from the face augmentation demo of Lepetit et al.
2.5.3 Rectangle Tracking
Tracking simple objects means few features are available. Tracking methods based on
corners or interest points, such as [69], [48], [88] and [72] are common, but a rectangle only has
four corners, and all four must be accurately located to get the outline. If a single corner is
occluded, an interest point-based tracking method will fail to recover the rectangle from the
video. Tracking of individual corners also tends to be noisy. For tracking a rectangle, three
quarters of the neighborhood around the corner is background, which will likely change as the
rectangle is moved, further complicating the task of tracking the corners.
Edges are more promising features for this task, and have long been used in tracking.
More edge pixels are available than corner pixels, so tracking can still be successful if some are
occluded. Canny [17] is probably the best known edge detection method. The process for
finding the edges uses image gradients, which in turn typically require a convolution of the
image with at least a 3x3 filter for the horizontal and vertical directions, in each of three color
planes, to get color gradients. This operation alone may make edge detection in the whole image
too slow for real-time operation on high resolution images. Methods for tracking edges include
[105] and [99].
Once edges are available, the problem of finding a rectangle first involves finding
straight lines. Popular methods include the Hough Transform [27] and Burns line detector [16].
In the Hough Transform, every edge pixel votes for all possible lines that could go through it.
The line candidates with the most votes are detected as lines. Continuity and endpoints are not
checked. The Burns line detector finds line segments by grouping adjacent edge pixels on a line
with gradients perpendicular to the line.
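The Hough voting scheme can be sketched in a few lines (the resolution, accumulator size, and synthetic edge map are all illustrative):

```python
import numpy as np

# Minimal Hough transform sketch: every edge pixel votes for each
# (rho, theta) line through it, and the most-voted cell wins.
def hough_peak(edge_pixels, n_theta=180, rho_max=100):
    thetas = np.deg2rad(np.arange(n_theta))           # 1-degree steps
    acc = np.zeros((2 * rho_max + 1, n_theta), dtype=int)
    for x, y in edge_pixels:
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + rho_max, np.arange(n_theta)] += 1  # one vote per theta
    r_idx, t_idx = np.unravel_index(acc.argmax(), acc.shape)
    return r_idx - rho_max, t_idx                     # (rho, theta in degrees)

# Edge pixels of the vertical line x = 10; the winning cell corresponds to
# that line (up to the usual (rho, theta) sign ambiguity).
pixels = [(10, y) for y in range(20)]
rho, theta_deg = hough_peak(pixels)
```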
Even if line detection were fast enough for real time, the problem of weeding out the four
lines belonging to the rectangle from the background clutter still remains. For many
environments, there may be few straight lines, but a background like a bookshelf can yield an
abundance of lines.
Color is the other obvious feature, since the rectangle is specified to have a solid color.
Much of the work in color-based tracking and detection has been in the context of skin color
detection, which will be discussed with face detection.
Shape can also be used for tracking a rectangle. One popular technique for tracking a
general curve in dense visual clutter is Isard and Blake’s Condensation [56], an example of a
particle filter. A particle filter tracks multiple hypotheses simultaneously. Each “particle”
represents a possible object state, and has a weight corresponding to the probability of the
current observation (the current frame) given that object state. The hypotheses are stochastically
propagated to the next frame and the weights adjusted based on the new observation. Impressive
results are shown, including tracking the outline of a leaf on a bush in the wind. The algorithm
was reported to run in “near real-time” in 1998. The validity of many hypotheses must be
measured every frame, and each may require samples at numerous points on the curve. As an
example, they show tracking of a human head-and-shoulders silhouette. The probability of a
hypothesis is based on the distance from the curve to a high gradient pixel along the normal to
the curve at 20 to 30 locations. Invariant image features have also been used for tracking general
objects [89].
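The predict-weight-resample cycle of a particle filter can be sketched in one dimension (the motion model, noise levels, and observations below are all invented; Condensation itself uses curve measurements, not a scalar position):

```python
import numpy as np

# Minimal 1-D particle-filter sketch: each particle is a hypothesized object
# state; weights come from the likelihood of the current observation; the
# set is resampled and stochastically propagated to the next frame.
rng = np.random.default_rng(1)
n = 500
particles = rng.uniform(0.0, 100.0, n)       # initial position hypotheses

true_pos = 40.0
for frame in range(10):
    true_pos += 1.0                          # the object drifts right
    measurement = true_pos                   # noiseless observation, for brevity
    w = np.exp(-0.5 * ((particles - measurement) / 5.0) ** 2)
    w /= w.sum()                             # observation likelihoods
    particles = rng.choice(particles, size=n, p=w)   # resample by weight
    particles += rng.normal(0.0, 1.0, n)     # stochastic propagation

estimate = particles.mean()                  # should be near true_pos = 50
```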
2.5.4 Face Detection in Images
Face detection is the problem of determining whether or not an image contains at least
one face. Face localization finds the location and size of each face. However, the distinction
between these two problems is often blurred, with face detection covering both whether or not a
face exists and finding the bounds for each face. We will use the term face detection to include
face localization unless otherwise specified. There are several related research areas that we will
not be exploring. These include face recognition, which looks for matches between the current
face image and a database of faces; face authentication, which validates the claim that a face
image belongs to a given individual; and facial expression recognition, which identifies the
affective state of the given face.
Yang et al. [109] surveyed face detection techniques in images. They listed the
challenges to face detection as pose, presence or absence of structural components, facial
expression, occlusion, image orientation, and imaging conditions. The general categories they
used to characterize the methods are knowledge-based methods, invariant feature approaches,
template matching methods, and appearance-based methods. The detection rates were compared
for ten representative appearance-based methods, but no execution times were given.
Knowledge-based methods use rules defined by human knowledge of what constitutes a
face. Typical rules specify relationships between facial features, like relative position and
symmetry. These methods tend to work well to find frontal faces in uncluttered scenes, but it is
difficult to define rules that are neither so general that many non-faces are detected nor so
specific that faces are missed. Yang and Huang [108] used a multiresolution method using
knowledge-based rules. They found face candidates by identifying patterns of similar colored
pixels at coarse resolutions, when the face occupies only a few pixels. These candidates were
evaluated by looking for facial features in finer resolutions.
Invariant feature methods look for facial features, such as eyebrows, eyes, nostrils, or
mouth, and infer the presence of a face from the location of these features. Lighting, occlusion,
and background clutter can make it challenging to identify these features. Also in this category
are methods that use other low level image components like texture and skin color. A
representative example from this group is from Yow and Cipolla [112]. They detected interest
points using a filter. Edges and interest point characteristics were used to group the interest
points into regions. Each region was labeled as a feature based on a comparison between a
feature vector obtained from the region and a training database. A Bayesian network was used
to evaluate features and groupings for face detection. The method handles faces with different
orientations, but requires a relatively large (60x60 pixels) face image.
Skin color detection is commonly used as a component of face detection. Skin color by
itself is not sufficient for face detection since it also occurs in other body parts, such as hands,
and skin colors may occur in other places in the scene. Even though skin tones vary
considerably, many pixels can be eliminated from consideration because no skin is, for example,
purple or green. One example of color-based skin detection is Kjeldsen and Kender [57]. They
trained a color predicate to segment skin pixels belonging to a hand from background pixels
using both positive and negative training samples.
Texture is also a useful component in a face detection scheme. Augusteijn and Skufca
[2] classified 16x16 pixel subimages as skin, hair, or other. The presence of hair and skin
textures indicates the presence of a face. This technique can handle any face pose, but depends
on the uniqueness of these textures.
Template matching involves correlating a predefined face template with an input image.
In its simplest form, this method does not handle changes in size, shape and face orientation.
Numerous modifications have been proposed to deal with these variations. Templates are often
comprised of edges, including the subtemplates for eyes, nose, mouth and face contour used by
Sakai et al. in 1969 [79]. Govindaraju [45] used curves defined by the hairline and left and right
sides of a face as a template, and then linked contour segments as a basis for face detection.
Templates have also been built from face silhouettes [80] and relative brightness of facial
regions [83].
Appearance-based methods differ from templates primarily in that the patterns are
learned from training data instead of defined by an expert. One of the best known methods for
face recognition has been dubbed “Eigenfaces” because it decomposes the training set of aligned
and normalized images into eigenvectors. The eigenvector decomposition of the test image is
matched with the training database to find the best match. Turk and Pentland [97] applied this
idea to face detection by looking at the different clusters formed when face and nonface images
are projected into the subspace spanned by the eigenvectors. Various machine learning
techniques have been applied to face detection, including neural networks [78], Support Vector
Machines (SVMs) [71], Sparse Network of Winnows (SNoW) [110], Bayes classifier [81] [82]
[54], Hidden Markov Model (HMM) [74], and AdaBoost [100].
2.5.5 Face Detection in Video
Many of the existing face detection algorithms operate on a single image. While an
interactive system that uses video as an input has the disadvantage of requiring real time
operation, it also has the advantage of providing motion information and frame-to-frame
coherency. By limiting the tests for faces to foreground (moving) pixels, the search space is
greatly decreased. As a consequence, static faces like a portrait on the wall may be missed, but
this is acceptable for our augmented reality application. In fact, since the people in front of the
camera are not likely to be upside down, the tops of moving groups of pixels are good candidates
to check for faces. Once a face is found, it can be tracked in subsequent frames.
A similar approach was used by Foresti et al. [35]. They used change detection to find
moving objects, and then analyzed the silhouette of each blob to locate areas where there was a
high probability of finding a human head. Skin color and principal component analysis were
used to find human faces. The system was reported to localize the face successfully 98% of the
time and to work with multiple faces, but no frame rates were given.
Viola and Jones’ [100] AdaBoost cascade provides one of the few single image methods
fast enough for interactive video. Boosting is the general process of combining multiple weak
classifiers to create a single strong classifier. The weak classifiers must be better than random
chance, but don’t have to be much better. AdaBoost [37] was introduced by Freund and
Schapire as the first boosting method to achieve good results. Its name is due to the adaptive
weighting that takes place when building the classifier. The classifier is trained using a set of
labeled samples. The weights for each sample are equal at the beginning, but weights for
correctly classified samples are decreased, while weights for incorrectly classified samples are
increased. This makes later classifiers focus on the “harder” samples.
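The reweighting step can be made concrete with a tiny sketch (the data set and the decision stump are invented for illustration):

```python
import numpy as np

# AdaBoost reweighting: after a weak classifier is chosen, correctly
# classified samples are down-weighted and misclassified samples up-weighted,
# so later rounds focus on the "harder" samples.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-1, -1, 1, 1])                 # true labels
w = np.full(4, 0.25)                         # uniform initial weights

h = np.where(x > 3.5, 1, -1)                 # weak stump, wrong only on x = 3
err = w[h != y].sum()                        # weighted error = 0.25
alpha = 0.5 * np.log((1 - err) / err)        # weak classifier's vote weight
w = w * np.exp(-alpha * y * h)               # shrink hits, grow misses
w = w / w.sum()
# The one misclassified sample now carries half of the total weight.
```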
Viola and Jones used the presence or absence of simple rectangular features as their weak
classifiers. They also introduced a method for combining classifiers in a cascade for more
efficient processing. While standard AdaBoost decides on the classification of a sample based
on the majority vote of all the weak classifiers, cascaded AdaBoost uses multiple stages.
AdaBoost is performed at each stage, and samples that fail a stage are not processed further.
This allows many negative samples to be eliminated with little processing. For this to work, the
“false negative” rate of the earlier stages must be low. The detector is reported to run at 15 Hz on a
Pentium III processor for images that are 384x288 pixels, 15 times faster than the Rowley-
Baluja-Kanade detector [78] (considered the fastest detector at the time, in 2001), and 600 times
faster than the Schneiderman-Kanade detector [82]. Examples given included images with
single and multiple faces. The method is limited to frontal faces and does not use color
information.
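The early-rejection logic of the cascade can be sketched as follows. This is a hedged toy illustration: the stage structure follows the description above, but the scalar "window" and the lambda features are invented stand-ins for Viola and Jones' rectangular features.

```python
def cascade_classify(window, stages):
    """Run one image window through a classifier cascade.

    stages: list of (weak_classifiers, stage_threshold) pairs, where each
    weak classifier is a (weight, feature_fn) pair and feature_fn(window)
    returns +1 or -1.  A window must pass every stage to be accepted, so
    most negative windows are discarded cheaply by the early stages.
    """
    for weak, stage_threshold in stages:
        score = sum(weight * feature(window) for weight, feature in weak)
        if score < stage_threshold:
            return False          # rejected: later stages never run
    return True                   # survived every stage: report a detection

# Toy two-stage cascade over a scalar "window" value:
stages = [
    # stage 1: one cheap test that rejects most negatives
    ([(1.0, lambda w: 1 if w > 10 else -1)], 0.0),
    # stage 2: a weighted vote of two tests
    ([(0.7, lambda w: 1 if w > 50 else -1),
      (0.3, lambda w: 1 if w % 2 == 0 else -1)], 0.2),
]
```

Here `cascade_classify(5, stages)` returns False after evaluating only the first stage, which is exactly how the cascade saves work on the many negative windows in an image.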
Huang and Lai [53] combined these ideas with skin color detection and position
prediction to create a face detection system reported to run at 4 times the video rate at 320x240
pixel resolution on a Pentium IV.
2.5.6 Real-time Face Tracking
Once a face is localized, its position and orientation must be determined in order to
augment it. For example, if a hat is added, it must appear to move and rotate with the head. The
label “face tracking” is applied to everything from tracking the centroid of the face, to tracking
the bounding box of the face, to analyzing facial expressions. For our work, we adopt Toyama’s
definition [96]: “The 3D Face-Pose Tracking Task is the real-time recovery of the six-degree-of-
freedom (6-DOF) pose of a target structure that has a rigid geometric relationship to the skull of
a particular individual, given a stream of live black-and-white or color images.”
Toyama [96] also lists face tracking systems reported in the literature through May 1998.
These systems range from tracking only position to the full six degrees of freedom at various
speeds. They are classified by a number of characteristics, including the algorithm types used for
tracking and recovery, the execution speed, and robustness. While numerous papers on face
tracking have been published since 1998, the basic categories still suffice. These features are:
• Color – Color-based algorithms tend to be fast, but can only be used to find position, not
orientation. They are good for recovery (finding a face that is not being tracked), but are
easily confused by similar colored objects in the background.
• Motion – Pixels with similar motion are clustered.
• Depth – Pixels with similar depth are clustered.
• Edge – Uses the contour of the head.
• Feature – Uses specific facial features. Most of the methods in Toyama’s survey that
tracked all six degrees of freedom used some form of feature tracking.
• Template – Uses a template that covers the whole face. Methods using small templates
are classified with “Feature”. The larger number of pixels involved in tracking with this
method (vs. Feature) provides more robust tracking, but is computationally more
expensive.
• Optic flow – Uses optical flow for tracking.
A representative method that uses color and feature methods is from Bakic and Stockman
[6]. They extracted a region matching the skin color model as a possible face. In this region,
intensity variations were used to find the eyes and nose. These three points were then used to
estimate the head pose. Frame rates of 10 to 30 Hz were reported on an SGI Indy 2 for 320x240
images with accuracy sufficient to determine the region of a computer screen that the user is
looking at.
Optical flow was used by Basu et al. [11] to track rigid head motion in video. They used
a 3D ellipsoidal head model and found the head motion that best matched the observed optical
flow. The ellipsoid was chosen as a compromise between the too-simple plane model and the
complexity of a full head model. The method is claimed to be robust to large angular and
translational motion and to small variations in the initial fit, as well as to remain stable over
extended sequences.
To obtain not only the six parameters describing the position and orientation of the head,
but the 3D location of any point on the head, a 3D face model is often used. One commonly
used generic 3D face model is the CANDIDE model [12]. The Candide-3 model has about 100
vertices. There are shape units to modify the rigid structure to fit a specific face instance and
action units to animate facial expressions. If the wireframe face model can be matched to the
video such that it appears to be painted on the face, then augmentation is simply a matter of
adding more polygons to the model. The challenge is in matching the shape of the generic
model to the specific face, then updating the position and orientation in real time.
Toyama [96] described a face tracking system that integrates several of the available
features using an incremental focus of attention, that is, the tracking is performed at the highest
layer possible given the available information. Layer 1 detected skin color pixels, layer 2
looked for motion near the last location of the object, layer 3 detected the approximate size and
shape of clusters of skin colored pixels, layer 4 detected the principal axis of these clusters, layer
5 matched live images to a stored template, and layer 6 tracked five point features on the face.
When conditions were good, layer 6 determined the correct pose. When tracking failed, the
layers were traversed to recover the face location. The algorithm was reported to run at 30 Hz on
a Pentium II, with recovery in less than 200 ms when tracking was lost.
La Cascia et al. [58] modeled the head as a texture mapped cylinder. Tracking was a
matter of finding the pose that provides the best image registration with the video frame.
Illumination models were used to account for lighting variation. A 2D face detector initialized
the system, and the implementation was reported to run at 15 Hz.
Cootes et al. introduced the Active Appearance Model (AAM). AAM uses an analysis-
by-synthesis approach, evaluating differences between a synthesized model with the
hypothesized pose and the current image. Variations in shape and intensity are combined to get
a model for an object, such as a face. The shape is modeled by the 2D location of vertices on an
image. For faces, they used 122 points. After normalization, principal component analysis
(PCA) was done on the deviation of the shape of the training images from the mean, producing a
set of orthogonal modes of variation, sorted in importance order. Any face shape in the set could
then be reconstructed by applying shape vectors with appropriate weights. Likewise, the
appearance could be modeled by warping each input shape to the mean shape, one triangle at a
time, normalizing the brightness and contrast, then applying PCA to the pixels inside the shape
model to get a mean appearance and a set of appearance modes. PCA was then applied to the
combined shape and appearance data to capitalize on correlations between shape and appearance
for a final set of parameters. Finding the set of parameters to make the model match a test image
would appear to be a difficult high-dimensional optimization, but the authors proposed a way to
guide the optimization process. They perturbed 2D position, scale, orientation, and all of the
model parameters and noted how each affected the error image (the difference between the
model and the test image). For example, the parameter that controls “smile” caused changes in
the mouth region. Using a linear model, estimates for changes in all of the parameters were
computed from the error image, and the image with the minimum error was produced in
relatively few iterations. On faces that were about 200 pixels wide, the process was reported to
take an average of 4.1 seconds on a Sun Ultra. This method recovers facial shape and expression
in addition to 2D pose.
Dornaika and Ahlberg [25] used an Active Appearance Model (AAM) as their face
template, but used the 3D CANDIDE shape instead of the 2D shape. They determined 12 shape
parameters and 6 animation parameters in addition to the global 6-DOF pose. They presented an
alternate faster search method as well as adaptations from using the 3D shape instead of the 2D
shape. They claimed that each video frame can be processed in less than 10 ms. It is not clear
whether this process will work with small faces, since all of the examples showed the face filling
most of the frame.
Matthews and Baker [67] applied recent computational advances in image alignment to
AAMs. In traditional Lucas-Kanade image alignment [5], the goal is to find a set of parameters,
p, which minimizes the error between the original image warped by p and a template. At each
iteration, the remaining error is used to calculate increments for the parameters (Δp). The
parameter estimate used for the next iteration is p + Δp. This is referred to as the forwards
additive approach. Solving for the parameter increments requires the gradient and Jacobian,
which both depend on p, so they must be recomputed at each iteration, making the method
computationally expensive. The detailed cost analysis is in [5]. Instead of warping the image by
p + Δp, it can be warped by Δp, then by p. This is called the forwards compositional approach.
The advantage is that since the warp by Δp is only applied to the original image, the Jacobian
only needs to be computed once at initialization. The gradient must still be computed each
iteration. For the inverse compositional approach, the roles of the image and template are
swapped. The template is warped by Δp and the original image is warped by p. The gradient of
the template need only be computed at initialization, and the Jacobian of the warp is still only
computed once. When these ideas are applied to AAMs, the computational burden of the high
dimensional nonlinear optimization problem can be overcome. The result allows a 2D mesh to
be overlaid on a face in video, but does not provide 3D information about head orientation.
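The forwards additive update described above can be made concrete with a one-parameter example: pure 1D translation. This is an illustrative sketch (numpy and the function name are assumptions of the example, and a real AAM fit optimizes many parameters, not one), but it shows exactly which quantities must be recomputed inside the loop.

```python
import numpy as np

def lk_translation_1d(image, template, p=0.0, iters=25):
    """Forwards additive Lucas-Kanade for a single 1D translation parameter.

    Every iteration re-warps `image` by the current estimate p and
    recomputes the gradient of the warped signal; moving that gradient
    (and the Jacobian) out of the loop is what the compositional
    variants described above achieve.
    """
    x = np.arange(len(template), dtype=float)
    xs = np.arange(len(image), dtype=float)
    for _ in range(iters):
        warped = np.interp(x + p, xs, image)   # image warped by p
        error = template - warped
        grad = np.gradient(warped)             # recomputed each iteration
        dp = np.sum(grad * error) / np.sum(grad * grad)
        p += dp                                # forwards additive: p <- p + dp
        if abs(dp) < 1e-8:
            break
    return p
```

With a smooth signal and a template that is the signal shifted by a fraction of a sample, the loop converges to the true shift in a few iterations.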
Xiao et al. [106] built a non-rigid 3D shape model from the 2D AAM using [107], and
expanded the 2D AAM idea to include a constraint that the vertices of the 2D AAM match a
projection of the 3D model. Although each iteration takes longer, the algorithm is claimed to be
faster because fewer iterations are needed. One of the advantages of this extension is that the 3D
orientation is available as parameters of the 3D model projection. We used a similar idea in our
work and will present details in Chapter 5.
The tracking method used by Petit et al. [62] for their face augmentation demo is not
specifically for faces. It works with any rigid object that has a sufficient number of interest
points. It combines the use of keyframes, which prevents drift over time, with concatenated
transformations, which yield smooth tracking. The system requires a rough 3D model and a few
keyframes of the object to be tracked. In this method, a keyframe is an image of the object and
its corresponding pose. In each keyframe, the interest points that lie on the object are found in
the 2D image, and their 3D coordinates are calculated from the 3D model. A small image of the
neighborhood around each point along with the surface normal is stored for use during tracking.
Several variations of each image patch are created by rerendering the patch from different
angles, using a planar assumption. For initialization, interest points found in the video frame are
matched with the interest points from the keyframes using eigen images that combine the
variations as descriptors. Using eigen images to quickly compare image patches was developed
in [65]. From the matches, the pose can be calculated. The matching process can be repeated for
each keyframe, then the results from the keyframe with the most matches are used. Tracking is
similar to initialization. The keyframe that had the most object area visible in the previous
frame’s pose is used, and each patch in that keyframe is rerendered to match the previous
frame’s pose. Interest points are extracted from the current video frame and matched with the
keyframe patches. The rotation and translation are calculated using numerical minimization with
an M-estimator. The last step adjusts the pose and the 3D point location to minimize the
reprojection error in the current and previous frames. This compensates for initialization errors
and adds frame-to-frame stability.
2.5.7 Face Pose Estimation
Face pose estimation is the problem of finding the head position and orientation relative
to the camera, given a face image. This bridges the gap between face detection, which finds the
region occupied by a face, and face tracking, which requires an initial pose. Approaches include
appearance-based and feature-based methods. Appearance-based methods find a transformation
that maps the observed image to a model known to be in a frontal pose. This is made more
difficult with the variations between individuals, occlusions, lighting conditions, and facial
expressions.
Feature-based methods recover the positions of facial features, such as eye and mouth
corners. The geometry between these features can be used to determine the head position and
orientation. We focus here on methods that recover the in-plane rotation angle.
Yilmaz and Shah [111] automatically detected the eyes, eyebrows, and mouth, and then
used these points to determine the pose. Candidate locations were found by using training
templates for the eyes and eyebrows and an edge map for the mouth, then using likelihoods to
match features with candidates. The feature locations were then used to calculate the orientation
angles.
Fleuret and Geman [33] found the in-plane rotation angle and scale of a detected head by
creating a hierarchy of classifiers that successively restricted the positions of the eyes and mouth
within the detected face region, so that the finest level classifier isolated a particular pose. This
method requires training data for every possible face pose.
Romdhani and Vetter [77] fit a 3D morphable model to an image using an Inverse
Compositional Image Alignment (ICIA) algorithm. They obtained correspondences with errors
less than a pixel, but this took 30 seconds per image on a 2.8 GHz P4 and was not automated.
Hu et al. [51] detected facial components, such as eyes and mouth corners, and then used
the confidences of these detections to compute the face pose. The basic idea is that the
confidence increases as the difference between the actual and reference poses decreases.
3 COLOR ANALYSIS WITH VARIABLE LIGHT INTENSITY
In order to track objects viewed by a camera, it is important to know which changes in
the color of a pixel are due to changes in the reflectance (e.g., a different object or different part
of the object) and which are caused by lighting changes (e.g., shadows or changes in light
intensity). Large changes in light color are certainly possible, but infrequent. The most common
environments for tracking applications are daylight, or in a house or office. While the spectral
content of these light sources varies between different types as well as while varying the power
of a single light, these are all in the broad category of “white light”. In this chapter, we will
analyze the effect of light intensity changes on different diffuse reflectances in order to find a
color space in which color changes are only due to object color, not lighting, assuming that the
light is nearly white.
For this problem, we do not seek to recover the color of the illuminant or even the object
color. Instead, we only want to distinguish between different object colors. To further narrow
our focus, our interest is only in the camera’s perception, not what humans would observe.
This color knowledge is also required for the inverse problem – creating a graphics image
based on scene content that resembles one from a camera. Much theoretical research has been
done in this area: many assumptions are commonly made based on the models created, and a
number of color conversions have been invented to deal with illumination changes, but the
assumptions are rarely tested. In this chapter, we will explore how well those assumptions hold.
3.1 Color Models
The color spaces used in our experiments are defined in this section. Most images and
video originate in RGB, which can be viewed directly on most display systems. The remaining
color spaces are calculated from this RGB data. For clarity, the following equations assume that
R, G, and B are in the range [0, 1], although in the experiments the pixels were integers in the
range [0, 255] and the derived quantities were also scaled to be in the same range.
In computer graphics, OpenGL uses simple models for lighting [104]. For a single point light
with only diffuse reflection from the object, the formula is
$$\text{vertex color} = \left(\frac{1}{k_c + k_l d + k_q d^2}\right)(\vec{L} \cdot \vec{n})\;\mathit{diffuse}_{light} \times \mathit{diffuse}_{material} \qquad (2)$$

where $d$ is the distance between the object and the light; $k_c$, $k_l$, and $k_q$ are the constant, linear, and quadratic terms of an attenuation factor; $\vec{L}$ is the unit vector that points from the vertex to the light position; $\vec{n}$ is the unit normal vector; $\mathit{diffuse}_{light}$ is the RGB light intensity; and $\mathit{diffuse}_{material}$ is the RGB reflectivity of the object. If only $\mathit{diffuse}_{light}$ changes, this matches the coefficient rule from Equation 1. If the change is only intensity ($\mathit{diffuse}_{light}' = \text{scalar} \times \mathit{diffuse}_{light}$), α, β, and γ will be the same, so

$$(R', G', B') = (\alpha R, \alpha G, \alpha B) = \alpha\,(R, G, B) \qquad (3)$$
Specular reflection is included in most models as well, but is beyond the scope of the
current experiment.
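The diffuse model of Equation 2 and the scaling rule of Equation 3 can be checked numerically with a minimal sketch. The function name and argument layout are invented for illustration; the dissertation's own experiments used MATLAB.

```python
import math

def diffuse_color(light_rgb, material_rgb, light_pos, vertex, normal,
                  kc=1.0, kl=0.0, kq=0.0):
    """Diffuse term of the OpenGL point-light model (Equation 2)."""
    # vector from the vertex to the light, its length d, and the unit vector L
    Lvec = [lp - v for lp, v in zip(light_pos, vertex)]
    d = math.sqrt(sum(c * c for c in Lvec))
    Lhat = [c / d for c in Lvec]
    n_dot_l = max(0.0, sum(a * b for a, b in zip(Lhat, normal)))
    attenuation = 1.0 / (kc + kl * d + kq * d * d)
    # per-channel product of light intensity and material reflectivity
    return [attenuation * n_dot_l * li * mi
            for li, mi in zip(light_rgb, material_rgb)]
```

Halving every component of `light_rgb` halves every component of the returned RGB, which is exactly the uniform scaling that Equation 3 predicts.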
3.1.1 YIQ
YIQ is used for U.S. TV broadcasting. The Y component is luminance, while the
chromaticity is encoded in I and Q. For black-and-white televisions, only the Y component is
shown. The transformation is [34]
$$\begin{bmatrix} Y \\ I \\ Q \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.275 & -0.321 \\ 0.212 & -0.523 & 0.311 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \qquad (4)$$
Multiplying R, G, and B by a scalar will result in Y, I, and Q being similarly scaled; however,
adding a constant to R, G, and B will only produce a change in Y, since the coefficients in the I
and Q rows each sum to zero.
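Both properties can be verified directly from the matrix in Equation 4. This is a minimal sketch (the function name is invented for illustration):

```python
def rgb_to_yiq(r, g, b):
    """Equation 4: YIQ from RGB, with all components in [0, 1]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.275 * g - 0.321 * b
    q = 0.212 * r - 0.523 * g + 0.311 * b
    return y, i, q

y1, i1, q1 = rgb_to_yiq(0.8, 0.4, 0.2)
y2, i2, q2 = rgb_to_yiq(0.4, 0.2, 0.1)      # same color at half intensity
assert abs(y2 - 0.5 * y1) < 1e-9            # all three components scale
assert abs(i2 - 0.5 * i1) < 1e-9
assert abs(q2 - 0.5 * q1) < 1e-9
y3, i3, q3 = rgb_to_yiq(0.9, 0.5, 0.3)      # constant 0.1 added per channel
assert abs(i3 - i1) < 1e-9 and abs(q3 - q1) < 1e-9  # only Y moves
```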
3.1.2 HSV
HSV is a more intuitive color space. H is the hue, measured as an angle. S is saturation, which ranges
from 0 to 1 and measures the purity of a color. Adding white pigment to a color will lower its
saturation. The V component is value or intensity and measures the brightness of a color. Thus
shades of gray have S = 0, with black at V = 0 and white at V = 1. Hue is an angle, with
red at 0°, green at 120°, teal at 180°, and blue at 240°. The conversion from RGB to HSV is
[34]:

$$mx = \max(R, G, B), \qquad mn = \min(R, G, B)$$

$$H = 60^\circ \times \begin{cases} \dfrac{G - B}{mx - mn} & \text{if } mx = R \\[1ex] 2 + \dfrac{B - R}{mx - mn} & \text{if } mx = G \\[1ex] 4 + \dfrac{R - G}{mx - mn} & \text{if } mx = B \end{cases} \qquad \text{(clamped between 0° and 360°)}$$

$$S = \frac{mx - mn}{mx}, \qquad V = mx \qquad (5)$$
With the angle clamped in the range [0°, 360°) as listed above, colors near red may jump
from 0° to 359°. With hue angles between -180° and +180°, the wraparound will be colors near
teal instead of red. Since our experiments used red but not teal samples, we used hues between
±180°. Scaling R, G, and B by the same factor will change V by the same factor, but the scale
factor cancels out of H and S. The hue is expected to be unstable when R = G = B, since this
will cause division by zero. Likewise, for dark colors, mx will be near zero, so small changes in
RGB values will cause large changes in saturation.
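These invariance claims are easy to check with Python's standard `colorsys` module, which implements the same conversion with every component scaled to [0, 1] (so H below is the angle of Equation 5 divided by 360°). This is a check sketch, not part of the experimental pipeline:

```python
import colorsys

r, g, b = 0.8, 0.4, 0.2                      # an arbitrary orange-ish sample
h1, s1, v1 = colorsys.rgb_to_hsv(r, g, b)
h2, s2, v2 = colorsys.rgb_to_hsv(0.5 * r, 0.5 * g, 0.5 * b)

assert abs(h1 - h2) < 1e-9                   # hue survives the intensity change
assert abs(s1 - s2) < 1e-9                   # so does saturation
assert abs(v2 - 0.5 * v1) < 1e-9             # value scales with the light
```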
3.1.3 HLS
HLS is very similar to HSV. The three components are hue, lightness, and saturation.
The hue is identical to HSV (including clamping it between -180° and +180° for this experiment,
instead of 0° to 360°). The saturation and lightness capture the same concepts as HSV’s
saturation and value, but the calculation is slightly different. The conversion from RGB to HLS
is [34]:
$$mx = \max(R, G, B), \qquad mn = \min(R, G, B)$$

$$H = 60^\circ \times \begin{cases} \dfrac{G - B}{mx - mn} & \text{if } mx = R \\[1ex] 2 + \dfrac{B - R}{mx - mn} & \text{if } mx = G \\[1ex] 4 + \dfrac{R - G}{mx - mn} & \text{if } mx = B \end{cases} \qquad \text{(clamped between 0° and 360°)}$$

$$L = \frac{mx + mn}{2}$$

$$S = \begin{cases} \dfrac{mx - mn}{mx + mn} & \text{if } L \le 0.5 \\[1ex] \dfrac{mx - mn}{2 - (mx + mn)} & \text{otherwise} \end{cases} \qquad (6)$$
As in HSV, only the L component of HLS should change when the light intensity changes.
3.1.4 CIELAB
The CIELAB color space was created by the Commission Internationale de l'Eclairage
(CIE). The L* component measures lightness, and the hues are changed by varying a* and b*.
The nonlinear relationships are intended to mimic the response of the human eye. It is defined
as:
$$L^* = \begin{cases} 116\left(\dfrac{Y}{Y_n}\right)^{1/3} - 16 & \text{if } \dfrac{Y}{Y_n} > 0.008856 \\[1ex] 903.3\,\dfrac{Y}{Y_n} & \text{if } \dfrac{Y}{Y_n} \le 0.008856 \end{cases} \qquad (7)$$

$$a^* = 500\left[f\!\left(\frac{X}{X_n}\right) - f\!\left(\frac{Y}{Y_n}\right)\right], \qquad b^* = 200\left[f\!\left(\frac{Y}{Y_n}\right) - f\!\left(\frac{Z}{Z_n}\right)\right] \qquad (8)$$

where

$$f(t) = \begin{cases} t^{1/3} & \text{if } t > 0.008856 \\[1ex] 7.787\,t + \dfrac{16}{116} & \text{if } t \le 0.008856 \end{cases} \qquad (9)$$
Since L*, a*, and b* are defined in terms of the CIE XYZ tristimulus values of the measured
point (X, Y, Z) and the reference white point (Xn, Yn, Zn), we also need to convert the measured
RGB value into XYZ. For an unknown generic white light, we can use [22]
$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 1.24 & 0.22 & 0.30 \\ 0.75 & 0.91 & 0.03 \\ 0.00 & 0.02 & 1.76 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \qquad (10)$$
and assume the reference white value is (1, 1, 1).
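Equations 7–9 translate directly into a short sketch (function names invented for illustration). With the reference white assumed to be (1, 1, 1), the XYZ values from Equation 10 can be passed straight in:

```python
def f(t):
    """Equation 9."""
    return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0

def xyz_to_lab(x, y, z, xn=1.0, yn=1.0, zn=1.0):
    """Equations 7 and 8: CIELAB from XYZ tristimulus values."""
    yr = y / yn
    L = 116.0 * yr ** (1.0 / 3.0) - 16.0 if yr > 0.008856 else 903.3 * yr
    a = 500.0 * (f(x / xn) - f(yr))
    b = 200.0 * (f(yr) - f(z / zn))
    return L, a, b
```

At the white point itself, `xyz_to_lab(1.0, 1.0, 1.0)` gives L* = 100 with a* = b* = 0, as the definition requires.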
3.1.5 Chromaticity (Normalized RGB)
In chromaticity space, also called normalized RGB, the vector representing the color is
normalized so the resulting vector components sum to 1.
BGR
BbBGR
GgBGR
Rr++
=++
=++
= , , (11)
Since r + g + b = 1, this is really only a two dimensional color space. A change in light
intensity should not affect this model. Dark colors will result in unstable values as the
denominator approaches zero.
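A minimal sketch of Equation 11 (the function name and the convention chosen for pure black are invented for illustration):

```python
def rgb_to_chromaticity(r, g, b):
    """Equation 11: normalized RGB; the components sum to 1."""
    total = r + g + b
    if total == 0:
        return (1.0 / 3.0,) * 3        # arbitrary convention for pure black
    return r / total, g / total, b / total
```

Scaling R, G, and B by the same factor scales the numerator and denominator alike, so the chromaticity coordinates are unchanged.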
3.1.6 c1c2c3
The $c_1 c_2 c_3$ model was proposed by Gevers and Smeulders [44] to be invariant under
white illumination on matte, dull surfaces, and is defined

$$c_1 = \arctan\!\left(\frac{R}{\max(G, B)}\right), \quad c_2 = \arctan\!\left(\frac{G}{\max(R, B)}\right), \quad c_3 = \arctan\!\left(\frac{B}{\max(R, G)}\right) \qquad (12)$$
Scaling R, G, and B by a constant should not affect this model.
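A minimal sketch of Equation 12 (the function name is invented; `atan2` stands in for the arctangent of the ratio so that black, where both arguments are zero, yields 0 instead of a division by zero):

```python
import math

def rgb_to_c1c2c3(r, g, b):
    """Equation 12: the c1c2c3 invariant color space."""
    # atan2(a, b) == arctan(a / b) for b > 0, but is defined at (0, 0)
    return (math.atan2(r, max(g, b)),
            math.atan2(g, max(r, b)),
            math.atan2(b, max(r, g)))
```

Scaling all three channels by the same positive constant scales both arguments of each `atan2` equally, so the angles, and hence the model, are unchanged.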
3.1.7 l1l2l3
The $l_1 l_2 l_3$ color space was also proposed by Gevers and Smeulders [44] to be invariant
under white illumination on shiny as well as matte, dull surfaces. It is defined:

$$l_1 = \frac{(R - G)^2}{(R - G)^2 + (R - B)^2 + (G - B)^2}, \quad l_2 = \frac{(R - B)^2}{(R - G)^2 + (R - B)^2 + (G - B)^2}, \quad l_3 = \frac{(G - B)^2}{(R - G)^2 + (R - B)^2 + (G - B)^2} \qquad (13)$$

Scaling R, G, and B by a constant should not affect this model. For shades of gray, where
R = G = B, the denominator goes to zero, causing instability.
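A minimal sketch of Equation 13 (function name and the gray-pixel convention invented for illustration):

```python
def rgb_to_l1l2l3(r, g, b):
    """Equation 13; the three components sum to 1."""
    rg, rb, gb = (r - g) ** 2, (r - b) ** 2, (g - b) ** 2
    denom = rg + rb + gb
    if denom == 0:                     # shades of gray: the model is unstable
        return (1.0 / 3.0,) * 3        # arbitrary convention
    return rg / denom, rb / denom, gb / denom
```

Scaling R, G, and B by a constant k multiplies every squared difference by k², which cancels between numerator and denominator.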
3.1.8 Derivative
The invariants used by Geusebroek et al. [42] included three color derivatives in addition
to detectors for various types of edges. The first is H, which is related to the hue of the material:
$$H = \frac{E_\lambda}{E_{\lambda\lambda}} \qquad (14)$$

where $E$ is shorthand for $E(\lambda, \vec{x})$, the energy of a particular wavelength $\lambda$ received from a
scene location $\vec{x}$, and the subscript denotes the variable of differentiation. They show that,
using the dichromatic reflection model with uniform illumination and white light, H is independent of
viewpoint, illumination direction, illumination intensity, and the Fresnel reflectance coefficient.
The second property, Cλ, is interpreted as describing object color regardless of intensity:
$$C_\lambda = \frac{E_\lambda}{E} \qquad (15)$$
With white light on matte surfaces, they show Cλ to be invariant to viewpoint, surface
orientation, illumination direction and illumination intensity. Differentiating once more by
wavelength yields
$$C_{\lambda\lambda} = \frac{E_{\lambda\lambda}}{E} \qquad (16)$$
which shares the same invariant properties.
To get from RGB camera inputs to wavelength derivatives, they assume a Gaussian color
model [43]; that is, measurements are integrated over a range of wavelengths using a
Gaussian aperture function. With a Taylor expansion around λ0 = 520 nm and σ0 = 55 nm for
compatibility with the human vision system, the observed Gaussian spectral derivatives $\hat{E}$, $\hat{E}_\lambda$,
and $\hat{E}_{\lambda\lambda}$ are a good match with the CIE 1964 XYZ basis. The transformation from camera RGB
inputs to spectral derivatives is therefore
$$\begin{bmatrix} \hat{X} \\ \hat{Y} \\ \hat{Z} \end{bmatrix} = \begin{bmatrix} 0.62 & 0.11 & 0.19 \\ 0.30 & 0.56 & 0.05 \\ 0.01 & 0.03 & 1.11 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}, \qquad \begin{bmatrix} \hat{E} \\ \hat{E}_\lambda \\ \hat{E}_{\lambda\lambda} \end{bmatrix} = \begin{bmatrix} -0.48 & 1.2 & 0.28 \\ 0.48 & 0 & -0.4 \\ 1.18 & -1.3 & 0 \end{bmatrix} \begin{bmatrix} \hat{X} \\ \hat{Y} \\ \hat{Z} \end{bmatrix} = \begin{bmatrix} 0.06 & 0.63 & 0.27 \\ 0.30 & 0.04 & -0.35 \\ 0.34 & -0.60 & 0.17 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \qquad (17)$$
3.1.9 Log Hue
Finlayson and Schaefer [32] derived a formula for hue that uses logarithms to cancel the
effects of gamma. It is:
$$H = \arctan\!\left(\frac{\log R - \log G}{\log R + \log G - 2\log B}\right) \qquad (18)$$
If R, G, and B are multiplied by a constant, logarithms of these quantities will make the
gain additive. The differences in both numerator and denominator were designed to cancel out
the extra terms, even before the ratio, so this model should be invariant to changes in light
intensity. For shades of gray, when R = G = B, the denominator will become zero, resulting in
instability.
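Both cancellations can be checked with a minimal sketch of Equation 18 (the function name is invented; `atan2` is used so the quadrant is well defined for r, g, b > 0 with the channels not all equal):

```python
import math

def log_hue(r, g, b):
    """Equation 18, for r, g, b > 0."""
    return math.atan2(math.log(r) - math.log(g),
                      math.log(r) + math.log(g) - 2.0 * math.log(b))

h = log_hue(0.8, 0.4, 0.2)
# a gain k adds log(k) to each log term: it cancels in the numerator and
# contributes 2*log(k) - 2*log(k) = 0 to the denominator
assert abs(log_hue(0.4, 0.2, 0.1) - h) < 1e-9
# a gamma exponent multiplies numerator and denominator by the same factor,
# which leaves the angle unchanged
assert abs(log_hue(0.8 ** 2.2, 0.4 ** 2.2, 0.2 ** 2.2) - h) < 1e-9
```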
3.2 Experimental Setup
To learn how the various color spaces really react to changes in illumination intensity, we
assembled samples of eight different colors, and then videotaped them while the light intensity
varied. Figure 11 shows the first frame from one trial. The colors are orange, red, green, blue,
yellow, brown, beige, and black. The palette includes the three graphics primaries (red, green,
and blue), different “brightnesses” of the same color (brown and beige), and extremes in
intensity (beige and black). A Panasonic PV-DV951 digital video camera (which has 3 CCD
chips) was used. Several of the trials were simultaneously captured with another camera with
only one CCD chip. The results are not shown here, but were very similar.
Figure 11: A sample frame from the experiment. The labels were added later to accommodate printing in black and white.
The experiment was repeated several times, varying the light source (a halogen lamp or
daylight from a window), camera settings (automatic or manual), and the way the illumination
changed (change the light source or change the camera).
The video clip from each trial was captured as an AVI file with as little modification as
possible. This resulted in a 720x480 video file with DV compression for each trial.
An application was created to sample the RGB values at a specified pixel location at each
frame of a video file and write the results to a text file. This application was executed eight
times for each video, sampling a pixel at the center of each color region, creating eight text files
for each trial.
Each text file was used as input to a MATLAB script. The script converted the RGB
values in the file to each of the color spaces listed in Section 3.1 and plotted the results. These
plots will be shown in Sections 3.3 and 3.4.
3.3 Results
Each sequence name has three parts:
1) Indoors or outdoors – This refers to the type of light, not the location. For indoor
lighting, the test was done at night, so no sunlight was present. All the light
comes from an artificial white source. The light was changed using a dimmer
switch. For outdoor lighting, the only light source was indirect sunlight from a
window. The light intensity was changed by opening and closing blinds on the
window.
2) Manual or automatic – This references the camera settings. On automatic, the
camera is free to make adjustments in response to changes in lighting. In manual,
the settings are fixed for the duration of the trial unless otherwise stated.
3) Change light or change camera – The apparent scene illumination was changed
either by varying the light source or the camera. When the camera was changed
during the sequence, the camera was using manual settings and the iris was
opened and closed to change the amount of light at the sensor.
Qualitative results will be discussed in this section. Quantitative analysis will follow.
3.3.1 Indoors Manual Change Light Sequence
In this sequence, no outdoor light was present. The lamp started at a fairly dim level, got
dimmer, and then returned to the original level. Figure 12 shows the RGB levels for each of the
color samples on the vertical axis with time on the horizontal axis. On this and all the rest of the
plots, the eight plots are shown in the same relative position as the eight color tiles in Figure 11.
As expected, all three components at each sampled point decrease with the light, with the
amount of decrease proportional to the signal strength. The black sample in the lower right does
show a change, albeit a small one, since the RGB values were low to begin with.
[Eight panels, one per color sample, plotting the red, green, and blue components against frame number.]
Figure 12: RGB color space for the Indoor Manual Change Light Sequence
The same data is plotted in 3D in Figure 13, using the red, green, and blue axes instead of
time. Since the scene was generally dark, the data is near the origin. The plots are roughly
linear, but curves can be seen, especially with yellow and brown on the bottom row.
[Eight panels, one per color sample, plotting each sample's trajectory along the red, green, and blue axes.]
Figure 13: RGB color space in 3D plot for the Indoors Manual Change Light Sequence
The plots after these RGB values were converted to YIQ are shown in Figure 14. The
luminance (Y), shown in red, reflects the change in the light intensity as expected. Ideally, the I
and Q components should remain constant, since the color did not change, but some variation is
visible, especially in the green sample (top row, third from left). This set of plots also shows that
there is not much difference between the I and Q components of the different colors.
[Eight panels, one per color sample, plotting Y (luminance), I, and Q against frame number.]
Figure 14: YIQ color space for the Indoors Manual Change Light Sequence
The HSV color space is shown in Figure 15. The V component, shown in blue, is the
intensity measure in this color space and reacts to the illumination change as expected. The hue
(red) and saturation (green) should ideally remain constant as the illumination changes, but the
saturation changes, most noticeably for beige. Hue shows fewer changes, but several of the
colors appear to have the same hue, making it not very useful for distinguishing different colors.
For black, hue becomes noisy because the R, G, and B values are equal, and saturation becomes
noisy as the R, G, and B values near zero.
[Eight panels, one per color sample, plotting H (hue), S (saturation), and V (value) against frame number.]
Figure 15: HSV color space for the Indoors Manual Change Light Sequence
The HLS color space is shown in Figure 16. In this color space, the lightness (L)
component shown in green follows the illumination level. Theoretically, the hue and saturation
should remain unchanged, just like in HSV. In fact, the hue in HLS is identical to the hue in
HSV. Even though the formula is slightly different, the saturation (blue) in HLS is very similar
to the saturation (green) in HSV, with the most variation from illumination in the beige sample.
Hue does not distinguish well between six of the colors.
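That the two hues are identical while lightness and saturation differ can be checked directly with the standard-library conversions (variable names are mine):

```python
import colorsys

def hsv_and_hls(r, g, b):
    """Compare hexcone HSV with double-hexcone HLS for one 0-255 pixel."""
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
    h_hsv, s_hsv, v = colorsys.rgb_to_hsv(rn, gn, bn)
    h_hls, l, s_hls = colorsys.rgb_to_hls(rn, gn, bn)
    return h_hsv, h_hls, v, l, s_hsv, s_hls

h_hsv, h_hls, v, l, s_hsv, s_hls = hsv_and_hls(200, 100, 50)
# Hue is computed the same way in both spaces; V = max(R,G,B) while
# L = (max + min) / 2, and the two saturation formulas differ slightly.
```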
[Figure 16: HLS color space for the Indoors Manual Change Light Sequence. One panel per color sample plots H (hue), L (lightness), and S (saturation) against frame number.]
The CIELAB color space is shown in Figure 17. The brightness is in the L* component
(red). The other two components should ideally remain constant, but small changes can be seen.
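CIELAB is reached from RGB through CIE XYZ. The sketch below assumes linear sRGB primaries and a D65 reference white; the exact matrix and white point used in these experiments may differ:

```python
def rgb_to_lab(r, g, b):
    """RGB (0-255) -> CIELAB, assuming linear sRGB primaries and D65 white."""
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
    # RGB -> XYZ (sRGB matrix, D65)
    x = 0.4124 * rn + 0.3576 * gn + 0.1805 * bn
    y = 0.2126 * rn + 0.7152 * gn + 0.0722 * bn
    z = 0.0193 * rn + 0.1192 * gn + 0.9505 * bn
    # Normalize by the reference white, then apply the CIE cube-root curve.
    xn, yn, zn = 0.9505, 1.0, 1.089
    def f(t):
        return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0
    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    L = 116.0 * fy - 16.0
    a_star = 500.0 * (fx - fy)
    b_star = 200.0 * (fy - fz)
    return L, a_star, b_star

# A neutral gray maps to a* = b* = 0; only L* responds to intensity.
L, a_star, b_star = rgb_to_lab(128, 128, 128)
```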
[Figure 17: CIELAB color space for the Indoors Manual Change Light Sequence. One panel per color sample plots the L*, a*, and b* components against frame number.]
In the chromaticity color space shown in Figure 18, all three components are expected to
remain constant, since it is a matte surface under white illumination. We see much more
consistency than with RGB, but it is still obvious from the plots when the illumination changed,
especially in the beige sample (bottom row, third from left). The black sample (lower right) gets
noisy as the RGB values approach zero, and the difference between red and orange is subtle.
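The chromaticity (normalized rgb) computation is just a per-pixel normalization, which also explains the noise at black (the function name and the zero-denominator convention are mine):

```python
def chromaticity(r, g, b):
    """Normalized rgb: each component divided by R + G + B.
    Invariant to a uniform intensity scaling, but unstable near black."""
    s = r + g + b
    if s == 0:
        return 0.0, 0.0, 0.0  # undefined at black; pick a convention
    return r / s, g / s, b / s

# Doubling the illumination leaves the chromaticities unchanged...
c_bright = chromaticity(200, 100, 50)
c_dim = chromaticity(100, 50, 25)
# ...while near black, one-count sensor noise swings them wildly.
c_noise1 = chromaticity(2, 1, 1)
c_noise2 = chromaticity(1, 2, 1)
```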
[Figure 18: Chromaticity space for the Indoors Manual Change Light Sequence. One panel per color sample plots the normalized red, green, and blue components (0 to 1) against frame number.]
The c1c2c3 color space is plotted in Figure 19, and looks indistinguishable from
chromaticity.
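The resemblance is no surprise: in the usual definition (after Gevers and Smeulders), each c component is an arctangent of a channel ratio, so a common illumination factor cancels just as it does in chromaticity. A sketch, assuming that standard definition and degrees for the output:

```python
import math

def c1c2c3(r, g, b):
    """c1c2c3 color space (after Gevers and Smeulders): each component is
    the arctangent of one channel over the max of the other two, here in
    degrees. atan2 keeps the black pixel (0, 0, 0) finite."""
    return (math.degrees(math.atan2(r, max(g, b))),
            math.degrees(math.atan2(g, max(r, b))),
            math.degrees(math.atan2(b, max(r, g))))

# A common illumination scale factor cancels inside each ratio, so the
# c1c2c3 plots end up tracking the chromaticity plots almost exactly.
bright = c1c2c3(200, 100, 50)
dim = c1c2c3(100, 50, 25)
```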
[Figure 19: c1c2c3 color space for the Indoors Manual Change Light Sequence. One panel per color sample plots c1, c2, and c3 against frame number.]
The l1l2l3 color space shown in Figure 20 should also be invariant to illumination
changes, since this is white illumination. We see changes with lighting in the graphs, especially
for green and yellow. Black and blue get extremely noisy, corresponding to times when the R,
G, and B values are equal.
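The noise at R = G = B follows directly from the usual l1l2l3 definition (after Gevers and Smeulders), where the normalizing denominator vanishes for gray pixels; a sketch assuming that definition:

```python
def l1l2l3(r, g, b):
    """l1l2l3 color space (after Gevers and Smeulders): squared channel
    differences normalized by their sum. Undefined when R = G = B, which
    is exactly where the black and blue samples turned noisy."""
    d_rg, d_rb, d_gb = (r - g) ** 2, (r - b) ** 2, (g - b) ** 2
    denom = d_rg + d_rb + d_gb
    if denom == 0:
        return 0.0, 0.0, 0.0  # gray pixel: pick a convention
    return d_rg / denom, d_rb / denom, d_gb / denom

vals = l1l2l3(200, 100, 50)
# The three components always sum to 1, so they live on a 0-1 scale,
# and a common illumination scale factor cancels in the ratios.
```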
[Figure 20: l1l2l3 color space for the Indoors Manual Change Light Sequence. One panel per color sample plots l1, l2, and l3 (0 to 1) against frame number.]
The Derivative color space is plotted in Figure 21. The hue, shown in red, shows some
deviation, especially with the blue and black samples, but Cλ and Cλλ stay fairly constant during
the trial.
[Figure 21: Derivative color space for the Indoors Manual Change Light Sequence. One panel per color sample plots H, Cλ, and Cλλ against frame number.]
The last color space is Log Hue, shown in Figure 22. While this uses only one
parameter, it remains fairly constant for most of the samples. Blue is the notable exception, and
black gets noisy. These correspond to the times when the R, G, and B values were equal. The
discriminative capability of log hue is not exceptional, since brown and beige as well as red and
orange have very similar values.
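The exact log-hue formula used in these experiments is not reproduced here; one common log-opponent formulation (after Fleck and Forsyth) illustrates why a single hue angle can be nearly illumination-invariant yet degenerate at gray:

```python
import math

def log_hue(r, g, b):
    """Log-opponent hue in degrees (a Fleck-and-Forsyth-style sketch; the
    exact variant used in the experiments may differ). Logs turn a common
    illumination scale factor into an additive constant, which the
    opponent differences then largely cancel."""
    lr, lg, lb = (math.log(c + 1.0) for c in (r, g, b))
    rg = lr - lg                # red-green opponent
    by = lb - (lr + lg) / 2.0   # blue-yellow opponent
    # For gray (R = G = B) both opponents are 0 and the angle is degenerate.
    return math.degrees(math.atan2(rg, by))

h_bright = log_hue(200, 100, 50)
h_dim = log_hue(100, 50, 25)   # same color under half the light
```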
[Figure 22: Log Hue for the Indoors Manual Change Light Sequence. One panel per color sample plots log hue against frame number.]
3.3.2 Indoor Manual Change Iris Sequence
This sequence was also indoors, with no daylight present. The lighting was fixed for the
duration of the trial, with the iris on the camera opening and then closing. The RGB values
recorded during this clip are shown in Figure 23. For all the colors except black, one or more
components saturated and were clipped at 255. While in saturation, four of the samples (green,
blue, yellow and beige) were recorded as white (255, 255, 255). The same data is plotted in 3D
in Figure 24. The inflection points are caused by clipping. For example, the beige sample clips
in red first, then green, before saturating all three colors.
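The staged clipping is easy to reproduce: scaling a color whose channels differ and clipping each at 255 bends the trajectory toward white one channel at a time (the RGB values below are illustrative, not the measured sample):

```python
def scaled_and_clipped(rgb, gain):
    """Simulate opening the iris: scale the sensor response by a gain,
    then clip each 8-bit channel at 255 the way the camera does."""
    return tuple(min(255, round(c * gain)) for c in rgb)

beige = (180, 160, 120)  # illustrative values, not the measured sample
samples = [scaled_and_clipped(beige, g) for g in (1.0, 1.5, 2.0, 2.5)]
# Red clips first, then green, then blue, so the recorded color bends
# toward white in stages: the inflection points seen in the 3D plots.
```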
[Figure 23: RGB color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots the red, green, and blue components (0 to 255) against frame number.]
[Figure 24: RGB color space in 3D for the Indoors Manual Change Iris Sequence. One panel per color sample plots its trajectory along the red, green, and blue axes.]
In the YIQ space shown in Figure 25, the luminance in Y (red) shows the change in
illumination, and gets saturated in the same four samples. The worst of the fluctuations in I and
Q occur during saturation, but the signals are not constant outside the saturation time.
[Figure 25: YIQ color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots Y (luminance), I, and Q against frame number.]
Both the HSV color space in Figure 26 and HLS in Figure 27 show hue that is fairly
constant except when saturation occurs, although the hue does not discriminate well between the
samples. The value (blue) in HSV and the lightness (green) in HLS show the expected
illumination change. The saturation component (green in HSV and blue in HLS) changes
considerably even when the RGB values are not saturated.
[Figure 26: HSV color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots H (hue), S (saturation), and V (value) against frame number.]
[Figure 27: HLS color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots H (hue), L (lightness), and S (saturation) against frame number.]
The CIELAB color space in Figure 28 shows the change in light intensity in the L*
component (red). Change can be seen in the other two components as well, most notably in
the b* component (blue) for the yellow sample (lower left). Also, the a* and b* components are
close to zero for several of the samples, making it difficult to distinguish between colors.
[Figure 28: CIELAB color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots the L*, a*, and b* components against frame number.]
Once again, the chromaticity in Figure 29 and the c1c2c3 color space in Figure 30 look
nearly identical. The flat part in the middle of most of the plots corresponds to one or more of
the RGB components being in saturation, but the signals vary considerably outside of saturation.
[Figure 29: Chromaticity color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots the normalized red, green, and blue components (0 to 1) against frame number.]
[Figure 30: c1c2c3 color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots c1, c2, and c3 against frame number.]
The l1l2l3 space is displayed in Figure 31. The saturation effects are dramatic, but the
signals are remarkably constant when not saturated. The black sample is very noisy, due to the
R, G, and B values being equal.
[Figure 31: l1l2l3 color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots l1, l2, and l3 (0 to 1) against frame number.]
The derivative color space is shown in Figure 32. Once again the hue component (red)
varies significantly, while the other two remain more constant, even during saturation.
[Figure 32: Derivative color space for the Indoors Manual Change Iris Sequence. One panel per color sample plots H, Cλ, and Cλλ against frame number.]
Finally, the Log Hue in Figure 33 is relatively constant when not in saturation, but there
is very little discriminating power on the bottom row of samples.
[Figure 33: Log Hue for the Indoor Manual Change Iris Sequence. One panel per color sample plots log hue against frame number.]
3.3.3 Indoor Manual Flashlight Sequence
Color detection algorithms need to deal with light changes that affect only part of the
scene as well as global illumination changes. There are schemes that adjust the entire frame to
maintain a constant maximum, average, or range, which may cancel out some of the variations seen
in the previous experiments, but there is still a need to recognize a color when the light changes
in a subset of the image. To address this case, we used fixed manual camera settings and fixed
indoor lighting with the exception of a flashlight, which traced a path from right to left along the
bottom row, then left to right along the top row. The RGB plots in Figure 34 show a spike as the
flashlight beam crossed each sample. Orange, green, yellow and beige saturated in at least one
component. There were no cases where the RGB values were zero, and the only time the RGB
values were all equal was during saturation. The same data is plotted in 3D in Figure 35.
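Why a whole-frame adjustment cannot fix a local light change can be sketched with a simple gray-world-style correction (illustrative only; the function name and target value are mine):

```python
def normalize_to_average(frame, target=128.0):
    """Global gray-world-style correction: scale every pixel so that the
    frame's mean channel value hits a fixed target (illustrative only)."""
    values = [c for pixel in frame for c in pixel]
    gain = target / (sum(values) / len(values))
    return [tuple(min(255.0, c * gain) for c in pixel) for pixel in frame]

# A global gain undoes a global light change, but when a flashlight
# brightens only one sample the same gain is applied everywhere, so the
# untouched pixels get darkened instead of the lit one being restored.
dim_out = normalize_to_average([(50, 50, 50), (50, 50, 50)])
spot_out = normalize_to_average([(200, 200, 200), (50, 50, 50)])
```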
[Figure 34: RGB color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots the red, green, and blue components (0 to 255) against frame number.]
77
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-orange.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-red.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-green.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-blue.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-yellow.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-brown.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-beige.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-black.txt
green
blue
Figure 35: RGB color space in 3D for the Indoor Manual Flashlight Sequence
The YIQ color space in Figure 36 shows the expected spike in Y (red), but also shows
changes in I and Q, even when there was no saturation.
[Figure 36: YIQ color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots Y (luminance), I, and Q against frame number.]
HSV in Figure 37 and HLS in Figure 38 show the expected spike in value (blue in HSV)
and lightness (green in HLS). The hue is consistent in both except when saturated, but the
saturation changes even in samples that were not saturated.
[Figure 37: HSV color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots H (hue), S (saturation), and V (value) against frame number.]
[Figure 38: HLS color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots H (hue), L (lightness), and S (saturation) against frame number.]
The CIELAB color space is shown in Figure 39. The L* component (red) shows the
increase in brightness when the flashlight beam moves across each sample, and the a* and b*
components appear stable. There isn’t much difference in the a* and b* components between
the brown, beige, and black samples.
[Figure 39: CIELAB color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots the L*, a*, and b* components against frame number.]
For both the chromaticity space in Figure 40 and the c1c2c3 space in Figure 41, there are only small bumps where the flashlight beam crossed each sample, except where the signal saturated.
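The insensitivity of both spaces to the beam follows from their normalization; a minimal sketch (assuming the c1c2c3 definition of Gevers and Smeulders, the arctangent of each channel over the maximum of the other two; the pixel values are hypothetical):

```python
import math

# Sketch: normalized chromaticity and c1c2c3 are (ideally) invariant to a
# uniform intensity scale, so the flashlight beam produces only small bumps
# until the sensor saturates.

def chromaticity(r, g, b):
    s = r + g + b
    return (r / s, g / s, b / s)

def c1c2c3(r, g, b):
    # atan2 equals arctan(x / y) for positive arguments and also
    # handles the max(...) == 0 case safely.
    return (math.degrees(math.atan2(r, max(g, b))),
            math.degrees(math.atan2(g, max(r, b))),
            math.degrees(math.atan2(b, max(r, g))))

base = (180, 90, 30)                          # a hypothetical orange pixel
lit = tuple(1.3 * v for v in base)            # flashlight adds 30%, no clipping
print(chromaticity(*base), chromaticity(*lit))  # identical up to rounding
print(c1c2c3(*base), c1c2c3(*lit))              # identical up to rounding
```

Once any channel clips at 255, the scaled values no longer share a common factor, which is why bumps appear exactly at saturation.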
[Plot panels omitted: chromaticity traces (r, g, b vs. frame number) for all eight color samples.]
Figure 40: Chromaticity color space for the Indoor Manual Flashlight Sequence
[Plot panels omitted: c1, c2, c3 traces vs. frame number for all eight color samples.]
Figure 41: The c1c2c3 color space for the Indoor Manual Flashlight Sequence
The l1l2l3 space in Figure 42 shows consistent values except when the RGB signal was
saturated, although the blue and black samples are quite noisy.
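The noise in the dark samples follows from the form of l1l2l3 (assuming the usual definition as normalized squared channel differences); a small sketch with hypothetical pixel values:

```python
import random

# Sketch: l1l2l3 divides squared channel differences by their sum, so for
# near-gray pixels (R ~ G ~ B) the denominator is tiny and sensor noise
# dominates the result.

def l1l2l3(r, g, b):
    d = (r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2
    if d == 0:
        return (1 / 3, 1 / 3, 1 / 3)   # undefined; an arbitrary convention
    return ((r - g) ** 2 / d, (r - b) ** 2 / d, (g - b) ** 2 / d)

random.seed(0)
def noisy(rgb, sigma=2.0):
    return tuple(v + random.gauss(0, sigma) for v in rgb)

strong_red = (200, 40, 40)
near_black = (12, 12, 12)
print([round(l1l2l3(*noisy(strong_red))[0], 3) for _ in range(3)])  # stable near 0.5
print([round(l1l2l3(*noisy(near_black))[0], 3) for _ in range(3)])  # erratic
```

For the saturated red the channel differences dwarf the noise, so l1 stays near 0.5; for the near-black sample the same noise swings l1 across its whole range, matching the noisy blue and black traces.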
[Plot panels omitted: l1, l2, l3 traces vs. frame number for all eight color samples.]
Figure 42: The l1l2l3 color space for the Indoor Manual Flashlight Sequence
All three components of the derivative space in Figure 43 remain fairly constant,
although the hue (red) shows some spikes during saturation.
[Plot panels omitted: derivative-space traces (H, Cλ, Cλλ vs. frame number) for all eight color samples.]
Figure 43: Derivative color space for the Indoor Manual Flashlight Sequence
Log Hue in Figure 44 shows flat plots even when the RGB signal was saturated. Again,
there is not much discrimination between the samples in the bottom row.
[Plot panels omitted: log hue traces vs. frame number for all eight color samples.]
Figure 44: Log Hue for the Indoor Manual Flashlight Sequence
3.3.4 Outdoor Automatic Change Light Sequence
The next trial used only outdoor light, which was indirect light from a window. The
amount of light was changed by closing then opening mini-blinds on the window, in addition to
any uncontrolled variation from clouds. This time, the camera was in “automatic” mode, free to
make adjustments for the changing light level. The RGB plots in Figure 45 show that all three
components of all the colors changed with the illumination level. No saturation was present in
this trial. The R, G, and B components were equal for the black sample, so we expect noisy results for hue and l1l2l3. The ending light level was approximately the same as the beginning light level, yet the RGB values at the end differed from those at the beginning, especially in the green and blue samples. The probable cause is that the camera's automatic white balance setting changed while the light was dim, resulting in different recorded colors at the beginning and end of the sequence. This can be avoided by using manual camera settings.
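The white balance effect can be sketched as per-channel gains applied inside the camera; the gain values below are hypothetical, chosen only to illustrate how re-adaptation changes the recorded color at an unchanged light level:

```python
# Sketch: automatic white balance applies per-channel gains. If the camera
# settles on different gains after a dim period, the same scene color is
# recorded differently even when the light returns to its original level.
# The gain values are hypothetical, for illustration only.

def record(scene_rgb, gains):
    return tuple(min(255, c * g) for c, g in zip(scene_rgb, gains))

scene_green = (60, 140, 70)
gains_start = (1.0, 1.0, 1.0)        # white balance before the dim period
gains_end = (1.25, 0.875, 1.125)     # white balance after re-adapting

print(record(scene_green, gains_start))  # (60.0, 140.0, 70.0)
print(record(scene_green, gains_end))    # (75.0, 122.5, 78.75)
```

The second recording is redder and bluer than the first for an identical scene color, producing the second line of points visible in the 3D plots.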
The white balance change can be seen more clearly in the 3D plots in Figure 46, where several
of the samples show two lines of points instead of one.
[Plot panels omitted: RGB traces (red, green, blue vs. frame number) for all eight color samples.]
Figure 45: RGB color space for the Outdoor Automatic Change Light Sequence
[Plot panels omitted: 3D RGB scatter plots (red, green, blue axes) for all eight color samples.]
Figure 46: RGB color space in 3D for the Outdoor Automatic Change Light Sequence
The YIQ plots in Figure 47 show that the luminance in Y (red) varies with light level, and
the I and Q components only changed slightly. The white balance change is not obvious here.
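Since YIQ is linear in RGB, an intensity change scales all three components by the same factor; I and Q appear steadier mainly because, as weighted channel differences, they are small to begin with. A sketch using Python's standard-library conversion (the sample color is hypothetical):

```python
import colorsys

# Sketch: YIQ is linear in RGB, so halving the intensity halves Y, I, and Q
# alike; but I and Q are channel differences, small for muted colors, so
# their absolute change under an illumination shift is small.

beige = (0.80, 0.72, 0.60)            # hypothetical sample, RGB in [0, 1]
dim = tuple(0.5 * c for c in beige)   # same sample at half the light

y1, i1, q1 = colorsys.rgb_to_yiq(*beige)
y2, i2, q2 = colorsys.rgb_to_yiq(*dim)
print(f"Y: {y1:.3f} -> {y2:.3f}")     # halves: a large absolute change
print(f"I: {i1:.3f} -> {i2:.3f}")     # also halves, but already near zero
print(f"Q: {q1:.3f} -> {q2:.3f}")
```

This matches the plots: Y swings with the light level while the small I and Q traces barely move on the same axis scale.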
[Plot panels omitted: YIQ traces (Y, I, Q vs. frame number) for all eight color samples.]
Figure 47: YIQ color space for the Outdoor Automatic Change Light Sequence
For HSV in Figure 48 and HLS in Figure 49, the value (blue in HSV) and lightness (green in HLS) components reacted as expected to the illumination change. The hue (red in both HSV and HLS) was fairly constant, although it was very noisy at times (when R = G = B) and drifted somewhat due to the camera's white balance change. The saturation was not very consistent.
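The hue noise at R = G = B can be demonstrated directly: hue is undefined for gray and ill-conditioned near it, so one-count sensor noise swings it across the whole color circle. A sketch using the standard-library HSV conversion (the gray level is hypothetical):

```python
import colorsys

# Sketch: for a gray pixel (R = G = B) hue is undefined (colorsys returns 0),
# and near-gray pixels flip hue wildly under tiny perturbations, matching
# the noisy hue traces for the black sample.

gray = (0.30, 0.30, 0.30)
print(colorsys.rgb_to_hsv(*gray))    # hue 0, saturation 0: hue carries no info

# One-count (1/255) bumps in different channels land on very different hues.
for bump in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
    r, g, b = (c + d / 255 for c, d in zip(gray, bump))
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    print(f"hue = {h * 360:.0f} deg, saturation = {s:.3f}")
```

A single-count bump lands the hue at 0, 120, or 240 degrees depending on which channel it hits, even though the saturation stays near zero; any hue-based matcher must therefore discount low-saturation pixels.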
[Plot panels omitted: HSV traces (H, S, V vs. frame number) for all eight color samples.]
Figure 48: HSV color space for the Outdoor Automatic Change Light Sequence
[Plot panels omitted: HLS traces (H, L, S vs. frame number) for all eight color samples.]
Figure 49: HLS color space for the Outdoor Automatic Change Light Sequence
For the CIELAB color space, shown in Figure 50, the intensity in the L* component (red)
does not change as much as the intensity components in the previous color spaces due to the
cube root. The a* and b* components vary with the light change, most obviously in the yellow
sample. Once again, there is not much difference in a* and b* between the brown, beige, and
black samples.
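This muted intensity response comes from the cube root in the L* formula; a short sketch (standard CIE L* from relative luminance; the luminance values are hypothetical):

```python
# Sketch: doubling the linear intensity raises CIELAB L* by far less than
# a factor of two, because L* applies a cube root to relative luminance.

def lab_lightness(y_rel):
    """CIE L* from relative luminance y_rel in [0, 1]."""
    eps = (6 / 29) ** 3
    f = y_rel ** (1 / 3) if y_rel > eps else y_rel / (3 * (6 / 29) ** 2) + 4 / 29
    return 116 * f - 16

l_dim = lab_lightness(0.1)      # a dim sample
l_bright = lab_lightness(0.2)   # same sample at twice the light
print(l_dim, l_bright, l_bright / l_dim)
```

Doubling the luminance raises L* by only about a third here, which is why the L* traces vary less than the intensity components of the earlier color spaces.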
[Plot panels omitted: CIELAB traces (L*, a*, b* vs. frame number) for all eight color samples.]
Figure 50: CIELAB color space for the Outdoor Automatic Change Light Sequence
Once again, the chromaticity space in Figure 51 and the c1c2c3 color space in Figure 52 appear identical. The variation is not as severe as in RGB space, but most of the colors are significantly affected by the illumination change as well as the shift in white balance.
[Plot panels omitted: chromaticity traces (r, g, b vs. frame number) for all eight color samples.]
Figure 51: Chromaticity color space for the Outdoor Automatic Change Light Sequence
[Plot panels omitted: c1, c2, c3 traces vs. frame number for all eight color samples.]
Figure 52: The c1c2c3 color space for the Outdoor Automatic Change Light Sequence
The l1l2l3 color space shows the white balance change dramatically in the green sample in
Figure 53. The orange, red, and yellow samples remain fairly constant, but the black sample is
extremely noisy, due to the equal RGB components.
[Plot panels omitted: l1, l2, l3 traces vs. frame number for all eight color samples.]
Figure 53: The l1l2l3 color space for the Outdoor Automatic Change Light Sequence
The derivative color space in Figure 54 shows that the hue is affected by the light
intensity, but the other two components react much less. The white balance change can be seen
most dramatically in the hue component of the black sample.
[Plot panels omitted: derivative-space traces (H, Cλ, Cλλ vs. frame number) for all eight color samples.]
Figure 54: Derivative color space for the Outdoor Automatic Change Light Sequence
Lastly, the Log Hue in Figure 55 remains fairly stable until the white balance change, except for the black sample, which is very noisy and also shifts in value. Once again, the price of stability is a limited ability to distinguish between different colors.
[Plot panels omitted: log hue traces vs. frame number for all eight color samples.]
Figure 55: Log Hue for the Outdoor Automatic Change Light Sequence
3.3.5 Outdoor Manual Change Light Sequence
The last trial used indirect sunlight from a window, varied by closing and opening blinds,
as the light source, with fixed manual camera settings. The RGB plots in Figure 56 show once more that all the components change with the light intensity. No saturation occurred in this
trial. The 3D plots in Figure 57 look close to linear, but curvature can be seen, especially in the
orange and red samples. The black sample exhibits both equal R, G, and B components and
values close to zero.
[Plot panels omitted: RGB traces (red, green, blue vs. frame number) for all eight color samples.]
Figure 56: RGB color space for the Outdoor Manual Change Light Sequence
[Plot panels omitted: 3D RGB scatter plots (red, green, blue axes) for all eight color samples.]
Figure 57: RGB color space in 3D for the Outdoor Manual Change Light Sequence
The YIQ plots in Figure 58 show Y (red) proportional to the light intensity, while I and Q are not constant.
[Plot panels omitted: YIQ traces (Y, I, Q vs. frame number) for all eight color samples.]
Figure 58: YIQ color space for the Outdoor Manual Change Light Sequence
The HSV plots in Figure 59 and the HLS plots in Figure 60 once again show value (blue in HSV) and lightness (green in HLS) proportional to the light intensity. Hue (red in both HSV and HLS) remains mostly constant, with limited discrimination ability and considerable noise for black, since the RGB components were close to zero. Saturation (green in HSV and blue in HLS) does not remain constant.
[Plot panels omitted: HSV traces (H, S, V vs. frame number) for all eight color samples.]
Figure 59: HSV color space for the Outdoor Manual Change Light Sequence
[Plot panels omitted: HLS traces (H, L, S vs. frame number) for all eight color samples.]
Figure 60: HLS color space for the Outdoor Manual Change Light Sequence
Once again, the CIELAB color space in Figure 61 shows the intensity variation in the L*
component (red), small changes in the a* and b* components, and very little difference between
the brown, beige, and black samples.
[Plot panels omitted: CIELAB traces (L*, a*, b* vs. frame number) for all eight color samples.]
Figure 61: CIELAB color space for the Outdoor Manual Change Light Sequence
Chromaticity in Figure 62 and c1c2c3 in Figure 63 show no significant differences. As before, the variation with light intensity is much less than in RGB space, but there is still nontrivial deviation.
[Plot panels omitted: chromaticity traces (r, g, b vs. frame number) for all eight color samples.]
Figure 62: Chromaticity color space for the Outdoor Manual Change Light Sequence
[Plot panels omitted: c1, c2, c3 traces vs. frame number for all eight color samples.]
Figure 63: The c1c2c3 color space for the Outdoor Manual Change Light Sequence
The l1l2l3 plots in Figure 64 have less variation with light intensity, but get very noisy in dim light and for the black sample due to the equal RGB components.
[Plot panels omitted: l1, l2, l3 traces vs. frame number for all eight color samples.]
Figure 64: The l1l2l3 color space for the Outdoor Manual Change Light Sequence
The derivative space plotted in Figure 65 shows small changes in hue when the light is at
its lowest level, with even smaller changes in the other two components.
[Plot panels omitted: derivative-space components (H, Cλ, and Cλλ) vs. frame number for each color sample]
Figure 65: Derivative color space for the Outdoor Manual Change Light Sequence
Finally, the Log Hue plots in Figure 66 remain fairly constant through the light intensity
variation except for the black sample, where R = G = B. Noise increases as the light intensity
decreases, and yellow, beige, and brown look almost the same.
[Plot panels omitted: log hue vs. frame number for each color sample]
Figure 66: Log Hue for the Outdoor Manual Change Light Sequence
3.4 Analysis
To quantify the performance in each of the color spaces, we computed the standard
deviation of each component for each trial. The magnitude of the standard deviation for RGB
space shows the magnitude of the illumination change. A larger change in lighting causes a
larger change in RGB levels, which will more robustly test another color space’s ability to
remain constant. The most invariant color space will have the smallest standard deviation.
The discriminative power was also computed for each color space for each trial. The
discriminative power is large when the inter-sample differences are large and the intra-sample
differences are small. The formula we used is
96
( )( ) ( )( )( )( ) ( )( )tctc
tctcDP
ji
jiij stdevstdev
meanmean
+
−= , (19)
where ( )tci is the vector function representing one of the color spaces as it varied over time and
x is the magnitude of vector x. The numerator is the distance between two colors and the
denominator is the cumulative noise. Since we are seeking a color space invariant to
illumination, the luminance component was omitted from YIQ, HSV, and HLS. The hue
component was also removed from the derivative color space since it was observed to vary much
more than the other two components. With 8 color samples, there are 28 pairs (1 + 2 + … + 7).
Each graph of discriminative power shows the value computed for all 28 pairs for one color
space in one trial. Values less than one indicate that the noise is greater than the difference
between colors, and bigger numbers signify better discriminatory capability.
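The two measures above can be sketched as follows (an illustrative reimplementation, not the code used in the dissertation): `stdev_pct_range` expresses per-component standard deviation as a percentage of the possible range, `discriminative_power` follows Equation (19), and `all_pair_dp` enumerates the 28 unordered pairs of the 8 samples.

```python
import numpy as np

def stdev_pct_range(ci, ranges):
    """Per-component standard deviation as a percentage of each
    component's possible range (as plotted in the Stdev figures)."""
    return 100.0 * ci.std(axis=0) / np.asarray(ranges, dtype=float)

def discriminative_power(ci, cj):
    """Eq. (19): DP = ||mean(ci) - mean(cj)|| / (||std(ci)|| + ||std(cj)||).

    ci, cj: arrays of shape (frames, components) holding one color
    sample's components over time, with the luminance (and, for the
    derivative space, hue) component already dropped."""
    num = np.linalg.norm(ci.mean(axis=0) - cj.mean(axis=0))
    den = np.linalg.norm(ci.std(axis=0)) + np.linalg.norm(cj.std(axis=0))
    return num / den

def all_pair_dp(samples):
    """DP for every unordered pair of samples; 8 samples -> 28 values."""
    names = list(samples)
    return {(a, b): discriminative_power(samples[a], samples[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
```

A DP value below one means the combined noise exceeds the distance between the two color means, so the pair is not reliably distinguishable.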
3.4.1 Indoors Manual Change Light Sequence
The standard deviations for the Indoors Manual Change Light sequence are in Figure 67.
This was the sequence with the smallest change in RGB values. In general, the standard
deviations of the luminance components (Y in YIQ, V in HSV, L in HLS, L* in CIELAB) are the
same magnitude as RGB. Components I and Q of YIQ have less deviation than RGB for six of
the samples, but not for orange and red. The hue in HSV and HLS has a small standard
deviation in most of the samples, but varies more than RGB in blue and black. The standard
deviation of the saturation in HSV and HLS is about the same magnitude as RGB in most of the
samples, but much worse in black. The a* and b* components of CIELAB are low in all the
samples. Chromaticity and c1c2c3 are fairly consistent across the samples, although they have
higher standard deviations than RGB in the black sample. The l1l2l3 space has standard
deviations similar to chromaticity most of the time, but in green, blue, and black, it is the worst
of the color spaces. In the derivative space, the hue component varies more than RGB in the
blue and black samples. The other two components always vary less than RGB, even in the
black sample that has the smallest RGB deviation. Log hue has smaller standard deviation than
RGB in five of the eight samples, but is much worse than RGB in blue and black.
The bottom line is that only the Cλ and Cλλ components of the derivative color space and
the a* and b* components of the CIELAB color space always produced components with
smaller standard deviation than RGB. The YIQ space was the only other one that did not get
noisy for the black sample.
[Bar-chart panels omitted: standard deviation as % of possible range for RGB, YIQ, HSV, HLS, Lab, chromaticity, c1c2c3, l1l2l3, derivative, and log hue, one panel per color sample]
Figure 67: Standard Deviation for the Indoor Manual Change Light Sequence
The discriminative power for the color spaces in the Indoors Manual Change Light
sequence is shown in Figure 68. For RGB space, 13 of the 28 pairs have DP < 1.0, and only 3
pairs exceed 2.0, confirming that RGB is seriously affected by illumination changes. Of the
remaining color spaces, l1l2l3 has the fewest distinguishable pairs with 19, due to the noisy blue
and black samples. The rest of the color spaces fail to distinguish only 4 or fewer color
pairs. Chromaticity, c1c2c3, and derivative color spaces have the most pairs with DP above any
given threshold.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 68: Discriminative power for the Indoor Manual Change Light Sequence
3.4.2 Indoor Manual Change Iris Sequence
The standard deviations for the various color spaces for the Indoor Manual Change Iris
sequence are shown in Figure 69. Since the RGB values saturated in some of the samples, we
also analyzed only the frames without saturation, and present the results in Figure 70. Note also
that the RGB values in this sequence started at values similar to the previous sequence, but got
brighter here versus darker in the previous sequence. Thus, this sequence measures the color
space responses to brighter colors. The only color component that had standard deviations worse
than RGB was luminance, which was expected to vary with illumination. The standard deviation
of saturation was generally at least half as large as RGB. The other color spaces all performed
much better, without the large amounts of noise in the previous darker sequence. The results are
significantly better without the frames where at least one of the RGB components was saturated.
Tracking applications should treat values that have been clipped differently, since the ratios
between the components are no longer valid in this case, but this is frequently overlooked. The
difference between the results with and without the data from the clipped frames shows the
importance of the treatment of clipped values.
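The exclusion of clipped frames described above can be sketched as follows (an illustrative helper, assuming 8-bit data so that a component counts as clipped at 255; the array layout is also an assumption):

```python
import numpy as np

SAT_MAX = 255  # 8-bit clipping level; adjust for other bit depths

def unsaturated_frames(rgb):
    """Keep only frames in which no R, G, or B component is clipped.

    rgb: array of shape (frames, 3) holding the per-frame mean RGB of a
    color sample. Frames with any clipped component are dropped, since
    the ratios between components are no longer valid once a channel
    saturates."""
    keep = (rgb < SAT_MAX).all(axis=1)
    return rgb[keep]
```

Running the analysis on `unsaturated_frames(rgb)` rather than `rgb` corresponds to the "no sat." variants of the figures.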
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, one panel per color sample]
Figure 69: Standard Deviation for all frames of the Indoor Manual Change Iris Sequence
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, saturated frames excluded, one panel per color sample]
Figure 70: Standard deviation for frames with no saturation for the Indoor Manual Change Iris Sequence
The discriminative power of the various color spaces for the Indoor Manual Change Iris
sequence is shown in Figure 71, and Figure 72 shows the results when saturated frames are
excluded from the analysis. The RGB color space shows DP < 1 for all color pairs. When
saturated frames are included, the results are not encouraging, with only the Log Hue having DP
> 2 for more than half the color pairs. When saturated frames are removed, the situation is very
different, with all but RGB having at least 19 of the 28 color pairs with DP > 1. Log Hue and
l1l2l3 did the best here, with at least 7 color pairs having DP > 10, and 23 color pairs with DP >
2. Most of the color pairs with low DP had black as one member. Recall that black was
extremely noisy for both the Log Hue and l1l2l3 color spaces. Chromaticity had 23 color pairs
with DP > 1 and 15 with DP > 2. CIELAB had 26 color pairs with DP > 1 and 22 with DP > 2.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 71: Indoor Manual Change Iris Sequence Discriminative Power for all frames
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space, saturated frames excluded]
Figure 72: Indoor Manual Change Iris Sequence Discriminative Power excluding frames with saturation
3.4.3 Indoor Manual Flashlight Sequence
The Indoor Manual Flashlight sequence measures local changes in illumination. The
standard deviations for all frames are shown in Figure 73, and the graphs with saturated frames
excluded are in Figure 74. As expected, the luminance component standard deviations have
magnitudes similar to the RGB components. The saturation standard deviation exceeds RGB for
yellow, green, and beige. Standard deviations for log hue, chromaticity, c1c2c3, a* and b*
components of CIELAB, and Cλ and Cλλ components from the derivative space are low for all
the samples, while l1l2l3 gets noisy for blue and black. The large deviations in l1l2l3 for the beige
sample only occur when there is saturation.
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, one panel per color sample]
Figure 73: Standard Deviation for all frames of the Indoor Manual Flashlight Sequence.
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, saturated frames excluded, one panel per color sample]
Figure 74: Standard Deviation for frames with no saturation for the Indoor Manual Flashlight Sequence
The discriminative power for the Indoor Manual Flashlight sequence is shown in Figure
75, and the results when saturated frames are excluded are shown in Figure 76. When the
saturated frames are included, CIELAB, chromaticity, c1c2c3, and derivative spaces do the best,
with DP > 5 for at least 21 of the 28 pairs. Log Hue is not far behind, with DP > 5 for 19 pairs.
Without the saturated frames, all the color spaces have at least 26 pairs with DP > 1. RGB and
HLS are the worst, with 10 or fewer pairs having DP > 4. CIELAB, chromaticity, c1c2c3,
derivative, and log hue are the best, with more than half the color pairs having DP > 10.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 75: Discriminative Power for the Indoor Manual Flashlight Sequence with all frames
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space, saturated frames excluded]
Figure 76: Discriminative power for the Indoor Manual Flashlight Sequence with no saturated frames
3.4.4 Outdoor Automatic Change Light Sequence
The standard deviations from the Outdoor Automatic Change Light sequence are shown
in Figure 77. This sequence had no saturation, but the white balance changed during the
sequence. The black sample shows very noisy hue, l1l2l3, and log hue components. The
chromaticity, c1c2c3, a* and b* components of CIELAB, Cλ and Cλλ components from the
derivative space, and I and Q components of YIQ all have deviations smaller than RGB.
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, one panel per color sample]
Figure 77: Standard deviations for all frames of the Outdoor Automatic Change Light Sequence
The discriminative power for the Outdoor Automatic Change Light sequence is shown in
Figure 78. RGB shows 16 pairs with DP > 1 and only 5 pairs with DP > 2. All the rest have at
least 21 pairs with DP > 1. Chromaticity, c1c2c3, and derivative spaces have at least 22 pairs
with DP > 2, CIELAB has 20, and log hue has 18. However, if the threshold is changed to DP >
3, log hue is the best, discriminating between 14 pairs.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 78: Discriminative power for the Outdoor Automatic Change Light Sequence
3.4.5 Outdoor Manual Change Light Sequence
In the final sequence, Outdoor Manual Change Light, with standard deviations shown in
Figure 79, saturation occurred only for a few frames at the end of the green sample, and the
scene got quite dark in the middle. The luminance components had standard deviations
comparable to RGB, and saturation components got as high as half the RGB magnitude. The I
and Q components of YIQ had small deviations except for orange, while chromaticity, c1c2c3,
and derivative spaces and the a* and b* components of CIELAB were low for all the samples.
The l1l2l3 color space had small deviations except for brown and black.
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, one panel per color sample]
Figure 79: Standard Deviation for all frames of the Outdoor Manual Change Light Sequence
Discriminative power for the Outdoor Manual Change Light sequence is shown in Figure
80. RGB only had DP > 1 for 3 pairs. The rest were all better than 23 pairs with DP > 1.
Chromaticity, c1c2c3, and derivative spaces do the best up to DP > 4, but log hue is better for DP
> 5 and above.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 80: Discriminative power for the Outdoor Manual Change Light Sequence
3.5 Conclusions
We explored the responses of nine different color spaces with eight different color
samples in five different situations. We learned several things from these experiments.
• Automatic camera settings can cause more than brightness levels to change. The
Outdoor Automatic Change Light sequence showed the problems when the
camera automatically changes the white balance during the sequence. This can be
avoided by using manual camera settings. The need for keeping camera settings
constant was noted by Reinhard et al. [75].
• Saturation (clipping) causes shifts in all the color models. The Indoor Manual
Change Iris and Indoor Manual Flashlight sequences showed great improvement
when the saturated frames were excluded.
• The color spaces with some degree of intensity invariance can be partitioned into
those that become unstable at black (R = G = B = 0), and those that become
unstable at gray (R = G = B). Hue (in HSV, HLS, Log Hue, and Derivative) and
l1l2l3 get noisy at gray, due to the denominator approaching zero. When the color
approaches black (a subset of gray), saturation (in HSV and HLS), chromaticity,
and c1c2c3 become noisy as well.
• CIELAB never showed the extreme noise apparent in the other color spaces. The
discriminative power tended to be slightly less than that of chromaticity and
c1c2c3, especially for higher thresholds of DP.
• The chromaticity and c1c2c3 color spaces have comparable performance in all the
experiments. The only case in which their standard deviations were larger than RGB
was the black sample in the darkest sequence, and even that sample did not
exhibit the extreme noise that some of the components did. Between chromaticity
and c1c2c3, chromaticity requires fewer operations to compute (2 adds and 3
divides, vs. 3 maxes, 3 divides, and 3 arctans; or if a 256x256 table is used for
each, 2 adds and 3 lookups vs. 3 maxes and 3 lookups).
• The Cλ and Cλλ components from the derivative space worked well in all cases,
even when saturation was present, and avoided the noisy results of other color
spaces for dark colors. The discriminatory power analysis showed that there is
enough information to distinguish between different colors.
• Even though there appears to be little difference in log hue for different colors,
the lack of noise except in very dark scenes still gives it good discrimination
ability. Log hue is also easier to manage than the hue used in HSV and HLS
because it doesn’t wrap around like the angular measures do.
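The operation counts quoted for chromaticity and c1c2c3 can be illustrated with per-pixel conversions. This is a hedged sketch: the `eps` guard against division by zero at black is our addition, not part of the original definitions, and the chromaticity value returned for pure black is an arbitrary convention.

```python
import math

def chromaticity(r, g, b):
    """Normalized chromaticity: 2 adds and 3 divides per pixel."""
    s = r + g + b                      # 2 adds
    if s == 0:                         # the unstable black case noted above
        return (1 / 3, 1 / 3, 1 / 3)   # arbitrary convention for black
    return (r / s, g / s, b / s)       # 3 divides

def c1c2c3(r, g, b):
    """c1c2c3: 3 maxes, 3 divides, and 3 arctans per pixel."""
    eps = 1e-9  # illustrative guard so black does not divide by zero
    return (math.atan(r / (max(g, b) + eps)),
            math.atan(g / (max(r, b) + eps)),
            math.atan(b / (max(r, g) + eps)))
```

Counting the operations in each body makes the cost difference concrete; with per-channel lookup tables, the divides and arctans collapse into table lookups as the text describes.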
We also experimented with variations on these color spaces that avoided the instability by
keeping the denominator of the various ratios away from zero. We found that there appears to be
a tradeoff between stability and invariance to illumination – more stability yields less invariance.
We also tried applying the idea of the logs from Log Hue to other expressions, but they did not
yield any significant improvement. Given a choice between noise for gray and noise for black,
problems with black are preferred, since black is a subset of the gray shades.
Most algorithms that use color for tasks like tracking or motion detection update the color
model over time to account for illumination changes. This requires choices about how quickly to
change the model. Updating too slowly may result in missing sudden changes, while changing
the model too quickly may lead to missing changes that are not caused by illumination. This
dilemma can be avoided if there is a color space that does not require updating, but can still
distinguish between colors.
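A common way to implement such an online model update is an exponentially weighted running Gaussian, where a single rate parameter embodies the dilemma just described. The function and its `alpha` value below are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np

def update_model(mean, var, pixel, alpha=0.05):
    """Exponentially weighted update of a running Gaussian color model.

    alpha captures the update-rate dilemma: large values adapt quickly
    to illumination changes but may also absorb changes that are not
    caused by illumination; small values lag behind sudden changes.
    (alpha = 0.05 is an illustrative choice.)"""
    pixel = np.asarray(pixel, dtype=float)
    new_mean = (1 - alpha) * mean + alpha * pixel
    new_var = (1 - alpha) * var + alpha * (pixel - new_mean) ** 2
    return new_mean, new_var
```

An illumination-invariant color space removes the need for this update loop entirely, which is the motivation for the comparison in this chapter.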
The assumption of white light that varies only in intensity (which could be caused by
changes in the light itself, the light direction, or the surface normal) is a reasonable one for many
cases. It holds for outdoor daylight scenes, and most indoor scenes use fixed white lights as
well. Our experiments have sampled how a real camera reacts to a real (if simple) scene under
these conditions.
Color spaces such as YIQ, HLS and HSV that are commonly used to avoid illumination
effects were found to not work very well in real conditions. Even though l1l2l3 was designed to
work better, the noise present in gray scenes makes it unusable. The c1c2c3 space takes longer to
compute than chromaticity, but doesn’t work any better. The remaining spaces (CIELAB,
chromaticity, derivative, and log hue), while not demonstrating ideal constancy, have results that
are good enough to pursue further.
4 AUGMENTING A SOLID RECTANGLE
Our first augmented reality system tracks a rectangle of a solid color in the video and
replaces it with a prestored image [92]. A person standing in front of the camera, holding and
moving the solid rectangle while watching the display, sees themselves holding a picture instead. As
discussed in Section 2.5.3, tracking simple objects requires a different approach than tracking
complex objects. Figure 81 shows a frame of both the raw and augmented video. In keeping
with the low tech nature of the problem domain, the rectangle we use is a piece of construction
paper mounted on a piece of cardboard, so it is not absolutely solid, nor completely rigid. The
camera is a consumer digital video camera.
Figure 81: Original and augmented video frame.
4.1 Color Representations
Since the object we wish to track is a solid color, it makes sense to use color as one of the
cues to detect the object. However, the RGB colors measured by a camera are dependent on the
illumination. Based on the observations from the experiment described in the previous chapter,
we chose chromaticity space for detecting our solid colored rectangle object.
4.2 Proposed Method
The components of our method are an object color model, an object pixel mask, and a
quadrilateral shape. The object color is modeled by a single Gaussian in chromaticity space, and
is updated online. The object pixel mask keeps track of the pixels that are part of the tracked
object in the current frame. The output of the process is the vertices of a quadrilateral bounding
the object.
The steps of the algorithm are as follows:
1. Train the system with a single frame in which the object fills the camera field of view.
Initialize the color model with the mean and standard deviation of all the pixels, and
label all the pixels as “object”.
2. In the next frame, use the color model to mark the pixels whose probability of belonging
to the object exceeds a threshold.
3. For the pixels that have changed since the previous frame, update the object mask with
the result of the color test in step 2.
4. Find the minimum bounding quadrilateral around the largest group of connected pixels
that passed the color test in step 2.
5. Refine the quadrilateral using edge information.
6. Clear the object mask for all pixels outside the quadrilateral.
7. Update the color model using the pixels marked as “object”.
8. If there are more video frames, go to step 2.
4.2.1 Color Model
We use a single Gaussian color model to represent the object color. The probability that
a given color vector x matches the object color is
p(x) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²)) (20)
where μ and σ are the mean and standard deviation of the object color distribution.
Testing whether this probability is above a threshold is equivalent to evaluating |x − μ| < tσ,
where t controls the threshold.
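The equivalence between the probability test and the distance test can be checked numerically. A minimal sketch follows; the function and variable names are ours, not the dissertation's.

```cpp
#include <cmath>

const double kPi = 3.141592653589793;

// One-dimensional Gaussian density with mean mu and standard deviation sigma.
double gaussian(double x, double mu, double sigma) {
    double d = (x - mu) / sigma;
    return std::exp(-0.5 * d * d) / (std::sqrt(2.0 * kPi) * sigma);
}

// Test p(x) > pThresh directly...
bool matchesByProbability(double x, double mu, double sigma, double pThresh) {
    return gaussian(x, mu, sigma) > pThresh;
}

// ...or the equivalent |x - mu| < t*sigma, where pThresh and t are related by
// pThresh = exp(-t*t/2) / (sqrt(2*pi)*sigma).
bool matchesByDistance(double x, double mu, double sigma, double t) {
    return std::fabs(x - mu) < t * sigma;
}
```

Because the density decreases monotonically with |x − μ|, the two tests select exactly the same pixels; the distance form avoids computing an exponential per pixel.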
In an ideal world, there would be a color space where no adaptation was needed.
Shadows and lighting changes could be eliminated and only the true material colors would be
evaluated. We tried several color spaces including RGB and HSV, but obtained the best results
with chromaticity [31]. Chromaticity factors out the light intensity by dividing each component
by the sum of the components:
(r, g, b) = ( R/(R+G+B), G/(R+G+B), B/(R+G+B) ). (21)
Images (a) and (b) in Figure 82 show a video frame before and after conversion to chromaticity
space. Image (c) shows the pixels that match the color model highlighted in red.
The model is updated online to adjust for changing lighting conditions using Equation 22.
μ_updated = α μ_training + β μ_previous + (1 − α − β) μ_current (22)
where μ_training is the mean of the colors sampled during the training frame, μ_previous is the mean
resulting from the previous frame’s calculations, and μ_current is the mean of the colors in the
object mask in the current frame. The result, μ_updated, will be used for object color detection in
the following frame. The standard deviation, σ, is updated similarly. The parameters α and β,
which are between 0 and 1, control how quickly the model adapts. Retaining a portion of the
training model with α > 0 helps prevent drift. Increasing β keeps the model more stable, but less
able to keep up with rapid lighting changes, such as the transition from pointing the rectangle
directly at the primary light source to tilting it away.
Updating the model can cause it to drift if non-object pixels are included. We reduced
this problem by tracking object pixels separately. If the rectangle is occluded, for instance by a hand
that is holding it, the skin pixels will not get included as object pixels because they do not match
the color model. Since they are not tagged as object pixels, they will not affect the color model
update. The logic to only update the object mask in locations where the image has changed
serves to minimize the chance that background pixels that match the object color will get
included in the object. Of course, if the whole background is dynamic this will not help, but this
is not the usual case. More complex background models can be used, such as a mixture of
Gaussians, e.g. [87], if enough computation time is available.
Figure 82: Processing stages: original frame, converted to chromaticity space, detected object color pixels, bounding quadrilateral, saturation, horizontal gradient, vertical gradient, and difference between subsequent frames.
4.2.2 Minimum Bounding Quadrilateral
The model fitting step converts the object description from a general contour (the contour
of the group of connected pixels that passed the color test with the greatest area) to a
quadrilateral. We used Low and Ilie’s greedy method for finding a tight bounding quadrilateral
from a convex hull [64]. The process begins with a convex hull containing the pixels that
matched the object color model. For each vertex vi in the perimeter, the algorithm calculates the
area added if vertices vi and vi+1 were replaced by a single vertex at the intersection of the lines
defined by (vi-1, vi) and (vi+1, vi+2), as shown in Figure 83. This calculation is required initially
for all n vertices. Each of the n-4 times that two vertices are replaced with one, only a few
vertices in the vicinity of the deleted vertex must be updated, making this process O(n). Since
the convex hull requires O(n log n), the whole algorithm is therefore O(n log n).
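The per-vertex area computation can be sketched as follows. This is a simplified geometric helper under our own naming, not the authors' implementation.

```cpp
#include <cmath>

struct Pt { double x, y; };

// Intersection of the infinite lines through (p1, p2) and (p3, p4).
// Adjacent hull edges are never parallel, so the denominator is nonzero.
Pt lineIntersect(Pt p1, Pt p2, Pt p3, Pt p4) {
    double d1x = p2.x - p1.x, d1y = p2.y - p1.y;
    double d2x = p4.x - p3.x, d2y = p4.y - p3.y;
    double denom = d1x * d2y - d1y * d2x;
    double t = ((p3.x - p1.x) * d2y - (p3.y - p1.y) * d2x) / denom;
    return Pt{p1.x + t * d1x, p1.y + t * d1y};
}

// Area added when vi and vi+1 are replaced by the intersection v of the
// lines (vi-1, vi) and (vi+1, vi+2): the area of the triangle (vi, v, vi+1).
double addedArea(Pt vPrev, Pt vi, Pt viNext, Pt viNext2) {
    Pt v = lineIntersect(vPrev, vi, viNext, viNext2);
    return 0.5 * std::fabs((v.x - vi.x) * (viNext.y - vi.y)
                         - (v.y - vi.y) * (viNext.x - vi.x));
}
```

Recomputing this quantity only for the handful of vertices adjacent to each deletion is what keeps the removal phase linear in the number of hull vertices.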
Figure 83: Area calculation for finding the bounding quadrilateral
When the color detection step is accurate, this produces an accurate quadrilateral.
However, if the pixels near the edges are not detected, the result will be too small, and a single
connected pixel outside the actual rectangle will skew the whole edge. Thus, the bounding
quadrilateral determined this way is only a rough approximation of the desired result.
4.2.3 Quadrilateral Refinement
Particle filters, such as Isard and Blake’s CONDENSATION algorithm [56], track curves in
clutter using a multiple hypothesis framework. Their method for evaluating a hypothesis
involves measuring the distance between the hypothesized contour and high contrast features
along a sparse set of normals to the contour. Since these measurements are repeated for each of
a hundred or more hypotheses, this is too slow for our frame rate target. Instead, we search
along a sparse set of normals to each edge of the bounding quadrilateral approximation for edge
points, and then fit a line through the points found. This is illustrated in Figure 84. The corners
of the bounding quadrilateral are then the intersections of the lines.
Figure 84: Edge refinement process
Various methods were tested for finding the best edge pixel along the normal. Using the
pixel with the highest color gradient magnitude didn’t work when the rectangle was resting on a
shelf because the shelf had stronger gradients than the rectangle border. The distance between
the pixel color and the mean of the object color model did not show a consistent pattern at the
rectangle edge.
The best results so far were achieved by using the saturation channel, where
S = (max – min) / max, (23)
and min and max are the minimum and maximum respectively among the R, G, and B pixel
values. The normal is traversed from a finite distance inside the edge towards the outside, and
the edge pixel is the first pixel found in which the gradient of the saturation channel exceeds a
threshold. These results can be further improved using RANSAC to discard outliers.
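A simplified version of this edge search, operating on a one-dimensional saturation profile sampled along one normal, is sketched below; the helper names are ours.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Saturation from Equation (23): S = (max - min) / max, with S = 0 for black.
double saturation(double r, double g, double b) {
    double mx = std::max({r, g, b});
    double mn = std::min({r, g, b});
    return mx > 0.0 ? (mx - mn) / mx : 0.0;
}

// Walk a profile of saturation values sampled along a normal, from inside
// the quadrilateral outward, and return the index of the first step whose
// gradient magnitude exceeds the threshold, or -1 if none does.
int firstEdge(const std::vector<double>& profile, double thresh) {
    for (std::size_t i = 1; i < profile.size(); ++i)
        if (std::fabs(profile[i] - profile[i - 1]) > thresh)
            return static_cast<int>(i);
    return -1;
}
```

A line is then fit through the edge points found on all the normals of one quadrilateral side, optionally after RANSAC has discarded outliers.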
The resulting bounding quadrilateral is shown in Figure 82d. The last four images show
intermediate calculation stages for refining the quadrilateral. Figure 82e is the saturation
channel. The edges tend to be better defined in this space than the others tried. Compression
done by the camera interferes with some of the others. Images (f) and (g) in Figure 82 are the
gradients in the horizontal and vertical direction. To avoid the overhead of computing and
comparing with a magnitude and direction, the horizontal gradient is used for edges that are
closer to vertical, and the vertical gradient is used for horizontal edges. Lastly, the difference
between subsequent frames is shown in Figure 82h. The brighter pixels in this image are the
only places where the object mask is updated in step 3. More results showing the improvement
achieved by the refinement step are in Figure 85.
Figure 85: Refining the quadrilateral. The rough bounding quadrilateral is computed from the color detection result, which misses the bottom edge. The refinement process results in a more accurate outline.
Limiting the marked object pixels to those inside the quadrilateral in step 6 eliminates
pixels with the right color from further consideration. The color model is updated similarly to
[87], but using only those pixels inside the quadrilateral that are marked as “object”, so the new
sample should have minimal contamination from pixels that are not actually part of the object.
4.2.4 Deinterlacing
When interlaced cameras are used at their full resolution, rapid motion results in
“tearing” of the image because the even and odd lines are recorded at different time instants.
When viewed on an interlaced display, such as a standard TV set, this is not visible, but when
projected on a higher resolution display or when the full frame is used for processing, this
tearing becomes a problem. Figure 86 shows an example of this effect. The area in the white
square in the left image is expanded in the right image. The location of the red rectangle is
different for the even lines than it is for the odd lines. This dual position makes it difficult to
identify the edges of the object and creates many extra sharp gradients. But in the area without
motion, the extra resolution adds important detail.
Figure 86: Tearing caused by interlacing
Many deinterlacing algorithms exist with both hardware and software implementations.
Line doubling just repeats the even or odd lines for the whole image. This keeps the horizontal
resolution, but loses vertical resolution where there is no motion. This may also cause smooth
diagonal lines to become jagged. A median filter will retain the edges, but will destroy detail in
the static regions. Recreating the even lines by interpolating between odd lines also keeps
smooth edges, but destroys detail in stationary areas. Motion compensation algorithms are more
sophisticated, tracking moving objects and adjusting their position in one frame to match the
other, but are slow. We propose a fast algorithm that cleans up the moving regions while
retaining the detail present in the static regions.
Our algorithm is as follows:
For each pixel p(i, j) on an even line,
If |p(i+2, j) − p(i, j)| < |p(i+1, j) − p(i, j)|
Then p(i+1, j) = 0.5 (p(i, j) + p(i+2, j))
Basically, if the pixels two lines apart are more similar than the pixels one line apart, replace
the pixel one line below with the interpolated value. While not catching every possible case, this
quickly compensates for the interlacing in areas where there is motion while leaving most of the
rest of the frame untouched. Figure 87 shows the results from this approach. The boundary of
the red rectangle is much better defined than in Figure 86, yet the fabric detail remains.
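The rule can be sketched for a grayscale frame as follows; this is a simplified single-channel version under our own naming, and the dissertation's implementation details may differ.

```cpp
#include <cstdlib>
#include <vector>

// Grayscale frame stored row-major: h rows of w pixels.
// For each even row i with rows i+1 and i+2 available: if the pixels two
// lines apart are more similar than the pixels one line apart, replace the
// odd-line pixel with the average of its even-line neighbors.
void deinterlace(std::vector<int>& img, int w, int h) {
    for (int i = 0; i + 2 < h; i += 2)
        for (int j = 0; j < w; ++j) {
            int a = img[i * w + j];        // even line
            int b = img[(i + 1) * w + j];  // odd line
            int c = img[(i + 2) * w + j];  // next even line
            if (std::abs(c - a) < std::abs(b - a))
                img[(i + 1) * w + j] = (a + c) / 2;
        }
}
```

Static regions fail the similarity test and are left untouched, which is what preserves the full vertical resolution there.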
Figure 87: Fast deinterlacing algorithm results
4.2.5 Displaying the Result
Displaying the combined result with OpenGL requires drawing three polygons. The first
is the original frame, drawn as a texture map on a rectangle that fills the viewport and sets the depth buffer
to a near value. The second also fills the viewport and updates only the depth buffer with a
distant value for pixels that are part of the object mask. The final polygon is formed by the
vertices of the tracked rectangle with the inserted image as a texture map. Drawing this
quadrilateral with an intermediate depth causes it to only be visible where the object color was
detected.
4.3 Optimization Methods
In order to achieve real-time update rates, considerable code optimization was necessary.
The use of Intel’s OpenCV library [55] for low level image processing functions provided a good
basis, but the initial implementation ran much too slowly. The profiler in Microsoft Visual
Studio 6.0 was used to find the slow parts of the algorithm and measure improvements. A test
version of the application built for benchmarking the timing runs the complete algorithm
described in the previous sections 500 times on the same image. Several runs were averaged to
obtain the reported results. While some of the individual changes may be specific to this
compiler or processor, the general method for optimizing and the concepts applied can be used
for any system. The techniques most useful when optimizing the code were:
• Use integer arithmetic instead of floating point. On most processors, integer
operations are several times faster than the corresponding floating point operations.
However, care must be taken to ensure that integer values do not overflow, and that
the truncation that occurs as a byproduct of integer division does not cause incorrect
results.
• Use lookup tables instead of arithmetic. If the required calculation is more complex
than computing an array address, a lookup table may run faster. This is easiest if the
inputs are discrete (integer). Memory constraints must also be considered, even
though available memory is increasing rapidly. For example, a 256x256 table with
one byte per entry takes 64 KB of memory, which is probably acceptable; whereas a
256x256x256 table takes 16 MB, which can get expensive if there are many of these.
• Make the lookup tables global. While it is cleaner to keep everything inside a class,
accessing a table that is a member of a class requires a memory access to get the base
address for the class, then an offset to get the base address of the table. The base
address of a global table can be resolved at compile time, resulting in fewer memory
accesses and faster code.
• Avoid conditional branches. Most modern processors are pipelined to increase
speed. Conditional branches disrupt the pipeline, slowing down execution. If-then-
else constructs should only be used when the instructions skipped are more
expensive than the overhead of the branch. Since loops require conditional
branches, executing more instructions in each loop results in fewer branches. For
example, to operate on every pixel in an image, it is more efficient to process two
pixels inside the loop and halve the number of times through the loop than to process
a single pixel inside the loop.
• Use zero as the ending loop index. If we wanted to execute some section of code
1000 times, it is easy to create a loop where an index goes from 0 to 999. But the
loop termination test must then load the ending index (999) each time or store it in a
dedicated register in order to compare it to the counter. The same loop that goes
from 999 to 0 will run faster because the native instruction that compares the index
to zero can be used.
• Use bit shift instead of divide. Divide is usually the most expensive of the basic
arithmetic operators. An arithmetic shift typically runs significantly faster, and
accomplishes the same purpose when the divisor is an integer power of two. It is
often worthwhile to approximate a division by a bit shift. Multiplication can also be
replaced with arithmetic shifts for a lesser gain.
• Operate on the minimum number of pixels. For image processing applications, the
results of some operations may only be needed in localized areas. For example, in
the rectangle tracking application described previously, the gradient is only needed
in the vicinity of the estimated quadrilateral, so it is more efficient to compute a
mask indicating where the gradient is needed and then only calculate the gradient for
those pixels, than it is to calculate the gradient for the whole image.
• Use the most efficient precision. Some operations may run faster if the arguments
are 8-bit integers, while others execute faster with 32-bit integers. For example, to
add a pair of 8-bit integers, the compiler generated instructions to clear two 32-bit
registers, then load the arguments in the least significant byte, add the two registers,
then store the result. If the arguments had been 32-bit integers, the clear instructions
would not have been needed, resulting in faster execution (assuming the 8-bit load
takes the same number of cycles as the 32-bit load). On the other hand, indexing
into a 32-bit array requires the index to be multiplied by 4 (or shifted left by 2)
before adding the index and the array base address, making 8-bit arrays more
efficient.
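Several of these techniques can be seen together in a toy sketch (hypothetical code, not from the dissertation): a count-down loop, a bit shift in place of division, and a body unrolled four times.

```cpp
// Divide every pixel value by 4 using x >> 2, which equals x / 4 for
// unsigned values. The loop counts down to zero and processes four pixels
// per iteration; n is assumed to be a multiple of 4, as with common video
// widths.
void quarter(const unsigned char* src, unsigned char* dst, int n) {
    for (int i = n / 4; i > 0; --i, src += 4, dst += 4) {
        dst[0] = src[0] >> 2;
        dst[1] = src[1] >> 2;
        dst[2] = src[2] >> 2;
        dst[3] = src[3] >> 2;
    }
}
```
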
As an example, the original C++ code to calculate the chromaticity components from the
RGB image is shown in Table 1. The functions that are prefixed with “cv” are calls to the
OpenCV library. Recall that the chromaticity conversion is
(r, g, b) = 255 · ( R/(R+G+B), G/(R+G+B), B/(R+G+B) ). (24)
The multiplication by 255 is necessary for the output to be a triple of 8-bit integers. The
“cvSplit” function separates the three channel RGB image (m_inputImage) into three single
channel images (m_planes[0], m_planes[1], and m_planes[2]) containing the red,
green, and blue components. This is necessary because OpenCV can’t add channels within an
image; it can only add separate images. The “cvConvertScale” function multiplies the first
argument by the scalar third argument and puts the result in the second argument. Dividing by
three was necessary to avoid overflows, and only affects the result by a slight loss of precision.
A better solution would be to use a higher precision result, but the “cvAdd” function requires
that the images have the same resolution. In order for the sum to be 16 bits, each of the
components would have to first be converted to 16 bits. The “cvAdd” function puts the
pixelwise sum of the first two arguments in the third argument. The final step, “cvDiv”, divides
each pixel in the first argument by the corresponding pixel in the second argument, multiplies the
result by the scalar fourth argument, and stores the result in the third argument.
Table 1: Original code to calculate chromaticity. Functions prefixed by "cv" reference the OpenCV library.

// split into R, G, and B
cvSplit(m_inputImage, m_planes[2], m_planes[1], m_planes[0], 0);
// divide by 3 so sum will still fit in 8 bits
cvConvertScale(m_planes[0], m_planes[0], 0.3333);
cvConvertScale(m_planes[1], m_planes[1], 0.3333);
cvConvertScale(m_planes[2], m_planes[2], 0.3333);
// find R+G+B
cvAdd(m_planes[0], m_planes[1], sumImage);
cvAdd(m_planes[2], sumImage, sumImage);
// divide each plane by sum
cvDiv(m_planes[0], sumImage, m_planes[0], 255.0);
cvDiv(m_planes[1], sumImage, m_planes[1], 255.0);
cvDiv(m_planes[2], sumImage, m_planes[2], 255.0);
This code produces the right result, is easy to read and debug, and uses integer math
wherever possible. The divide can’t be done as integers because the result would always be
zero. However, the function takes 45 ms. to execute for a 720x480 resolution image. Running
only this function (no video input, no rendering, and no further processing) would result in a
frame rate of 22.2 Hz. Of course, there is much more to be done each frame.
The floating point division is a good candidate for optimization. With two integer inputs
in the range of 0 to 255, this can be implemented as a lookup table. The code replacing the three
cvDiv lines is shown in Table 2. The initialization code builds the table as a class member when
the program is started. This execution time is not counted in the frame rate calculations, since
the time per frame will approach zero when the number of frames is large. The inner loop only
needs to go from 0 to 255/3, but it’s still more efficient to keep the full 256 values so the array
element can be accessed by shifting and adding the input indices, instead of requiring a
multiplication. The new loop basically performs a two dimensional table lookup for each of the
three components for each pixel. The modified function takes 15 ms. to execute, for a speedup
of 3.
Table 2: Code for table lookup to replace division (three cvDiv lines) in Table 1.

// at initialization
for (i = 0; i < 256; i++) {
    for (j = 0; j < 256; j++) {
        if (i == 0)
            m_chromaticityTable[j] = 0;
        else
            m_chromaticityTable[(i << 8) + j] =
                (unsigned char)cvRound(255.0f * (float)j / (float)i);
    }
}

// every frame
unsigned char *p0 = (unsigned char *)m_planes[0]->imageData;
unsigned char *p1 = (unsigned char *)m_planes[1]->imageData;
unsigned char *p2 = (unsigned char *)m_planes[2]->imageData;
unsigned char *sumPtr = (unsigned char *)sumImage->imageData;
int i;
for (i = 0; i < m_imageSize.width * m_imageSize.height; i++) {
    *p0 = m_chromaticityTable[(*sumPtr << 8) + *p0]; p0++;
    *p1 = m_chromaticityTable[(*sumPtr << 8) + *p1]; p1++;
    *p2 = m_chromaticityTable[(*sumPtr << 8) + *p2]; p2++;
    sumPtr++;
}
Eliminating the floating point multiply by 0.3333 should increase efficiency as well as
precision. To do this, we integrate the sum operation into the loop, where it can be stored in a
32-bit integer to avoid overflow. The lookup table must now be larger, since the denominator
now ranges from 0 to 3 × 255 and the numerator may use the full 0 to 255 range. The resulting
code is shown in Table 3. This version executes in 10 ms.; a speedup of 1.5 from the previous
iteration.
Table 3: Chromaticity calculation after integrating scale and sum into loop.

// at initialization
for (i = 0; i < 256 * 3; i++) {
    for (j = 0; j < 256; j++) {
        if (i == 0)
            m_chromaticityTable[j] = 0;
        else
            m_chromaticityTable[(i << 8) + j] =
                (unsigned char)cvRound(255.0f * (float)j / (float)i);
    }
}

// every frame
cvSplit(m_inputImage, m_planes[2], m_planes[1], m_planes[0], 0);
unsigned char *p0 = (unsigned char *)m_planes[0]->imageData;
unsigned char *p1 = (unsigned char *)m_planes[1]->imageData;
unsigned char *p2 = (unsigned char *)m_planes[2]->imageData;
int i;
int sum;
for (i = 0; i < m_imageSize.width * m_imageSize.height; i++) {
    sum = *p0 + *p1 + *p2;
    *p0 = m_chromaticityTable[(sum << 8) + *p0]; p0++;
    *p1 = m_chromaticityTable[(sum << 8) + *p1]; p1++;
    *p2 = m_chromaticityTable[(sum << 8) + *p2]; p2++;
}
There is one remaining OpenCV function: cvSplit. We used this in the first place
because subsequent OpenCV functions required single channel images. Since we have replaced
all of these functions, we no longer need cvSplit. The modified code is shown in Table 4. This
function takes 7.1 ms. to execute, for a speedup of 1.4 from the previous version.
Table 4: Chromaticity calculation after all OpenCV functions have been replaced.

unsigned char *iPtr = (unsigned char *)m_inputImage->imageData;
unsigned char *p0 = (unsigned char *)m_planes[0]->imageData;
unsigned char *p1 = (unsigned char *)m_planes[1]->imageData;
unsigned char *p2 = (unsigned char *)m_planes[2]->imageData;
int i;
int sum;
for (i = 0; i < m_imageSize.width * m_imageSize.height; i++, iPtr += 3) {
    sum = iPtr[0] + iPtr[1] + iPtr[2];
    *p0++ = m_chromaticityTable[(sum << 8) + iPtr[0]];
    *p1++ = m_chromaticityTable[(sum << 8) + iPtr[1]];
    *p2++ = m_chromaticityTable[(sum << 8) + iPtr[2]];
}
There are still some smaller improvements that can be made. Since the variable i only
counts the number of times through the loop, decrementing the counter will functionally work as
well as incrementing it. Decrementing from some value to zero is faster than incrementing from
0 to some value because testing against zero is a native instruction, whereas testing against a
non-zero value requires loading that value each time through the loop.
The three values extracted from the table each time through the loop are in the same row.
While the compiler may recognize the duplicate operation and consolidate the instructions, we
can modify the code to compute a pointer to the row needed from the table to ensure that this
calculation is only done once.
When the table base address is a class member, the processor must perform a memory
access (“this” pointer plus offset) to get the table address. When the table base address is global,
the address can be resolved by the compiler, resulting in a faster lookup.
Conditional branches, including checking for the terminal condition in a for-loop, disrupt
the instruction pipeline, and tend to slow things down. Fewer times through the loop will result
in fewer conditional branches, and should run faster. This is known as loop unrolling. Pointers
used in the loop can be incremented fewer times as well. The disadvantage is that the resulting
code is more difficult to maintain, since changes inside the loop must be made multiple times
instead of just once. To balance the speed requirements with readable code (and because the
video sizes have widths that are multiples of four) we repeat our code four times within a loop.
The final code implementation that integrates all these efficiencies is shown in Table 5.
This function executes in 5.0 ms. This is a speedup of 1.4 from the previous version, and a
speedup of 9 from the original. The biggest gain in the last set of improvements was from using
a global variable for the lookup table, and the next was from loop unrolling. The contribution of
the others was measurable, but much smaller.
Table 5: Final optimized code for chromaticity calculation.

unsigned char *iPtr = (unsigned char *)m_inputImage->imageData;
unsigned char *p0 = (unsigned char *)m_planes[0]->imageData;
unsigned char *p1 = (unsigned char *)m_planes[1]->imageData;
unsigned char *p2 = (unsigned char *)m_planes[2]->imageData;
unsigned char *tablePtr;
int i;
int sum;
for (i = (m_imageSize.width * m_imageSize.height) / 4; i > 0;
     i--, iPtr += 12, p0 += 4, p1 += 4, p2 += 4) {
    sum = iPtr[0] + iPtr[1] + iPtr[2];
    tablePtr = g_chromaticityTable + (sum << 8);
    p0[0] = tablePtr[iPtr[0]];
    p1[0] = tablePtr[iPtr[1]];
    p2[0] = tablePtr[iPtr[2]];
    sum = iPtr[3] + iPtr[4] + iPtr[5];
    tablePtr = g_chromaticityTable + (sum << 8);
    p0[1] = tablePtr[iPtr[3]];
    p1[1] = tablePtr[iPtr[4]];
    p2[1] = tablePtr[iPtr[5]];
    sum = iPtr[6] + iPtr[7] + iPtr[8];
    tablePtr = g_chromaticityTable + (sum << 8);
    p0[2] = tablePtr[iPtr[6]];
    p1[2] = tablePtr[iPtr[7]];
    p2[2] = tablePtr[iPtr[8]];
    sum = iPtr[9] + iPtr[10] + iPtr[11];
    tablePtr = g_chromaticityTable + (sum << 8);
    p0[3] = tablePtr[iPtr[9]];
    p1[3] = tablePtr[iPtr[10]];
    p2[3] = tablePtr[iPtr[11]];
}
The original and final implementations are both O(n), where n is the number of pixels.
The multiplier hidden by the order notation is the source of the improvement for this function.
Since we have to visit every pixel, O(n) is the best that can be done for this situation. The time
spent coding the algorithm to reduce the multiplier is well worth it for this real time application,
since a process that takes 45 ms. per frame cannot be part of a system that runs near 30 Hz. (33.3
ms. per frame). Since the process only took 5 ms. after optimization, it can be used as part of a
30 Hz. system. Processors will eventually be fast enough to run the original implementation in 5
ms., but by using the optimized version with the faster processor, even more complex
calculations can be packed into the available processing time. Thus, optimization is likely to be
a critical part of real time programming for the foreseeable future.
4.4 Discussion
The system described was implemented using Microsoft DirectShow for interfacing with
the video camera, and Intel’s OpenCV library for low level image processing functions. The
code was optimized using techniques including lookup tables, avoiding floating point
calculations, and operating on only the necessary pixels. Frame rates of 28 Hz. were achieved
on a 2 GHz. Pentium 4 laptop, for 720x480 resolution DV video.
This method is able to track the rectangle successfully in a wide variety of lighting
conditions, including glare and shadows that change quickly. It uses no assumptions about
motion models or camera parameters and runs at interactive rates. Figure 88 shows some sample
frames from a sequence that was accurately tracked. The detected bounding quadrilateral is
shown in red. The frames show that the method can handle occlusions on edges and corners and
diverse lighting conditions.
Figure 88: Sample frames from the video sequence showing occlusion, a corner off the screen, and challenging lighting conditions. The tracked boundary is drawn in red.
4.5 Conclusions
We have described and demonstrated a tracking method for augmented reality
applications, in which real-time rates and accurate registration are required. Our setup uses low-
tech equipment, and no markers or elaborate training, yet can accomplish this task for simple
objects (which are sometimes more difficult than complex objects). The method draws on
principles of more sophisticated algorithms that run too slowly for this application, yet runs on
full resolution video at interactive rates.
The system tracks a simple, solid colored, rigid object (a rectangle) and augments it by
replacing the rectangle with a prestored image in the real time video stream. To the user, it looks
like the object he is moving around has a picture on it. This system uses color detection to locate
the rectangle in the image, and edge detection to refine the outline of the rectangle so that it
accurately overlays the target. The system works with shadows, glare, global lighting changes,
and occlusion.
5 AUGMENTING HUMAN FACES AND HEADS
The lessons learned in implementing the solid rectangle augmented reality system
described in the previous chapter were applied to a system that augments human faces and heads.
The complexity of the problem is increased because human heads vary in shape, size, color, and
features. Heads are not as rigid as the rectangle, the shape is more complex, and there is great
variety in skin color as well as hair color, texture, and length. Items such as glasses, earrings,
and facial hair may or may not be present. However, augmentation of one’s face or head is much
more personal and compelling than a separate object, like the rectangle. A mirror is most
commonly used to see ourselves, so augmenting faces is a natural extension.
We have made three contributions in this area. The first is a method for integrating face
detection and tracking to filter out the false positives from the detection algorithm while tracking
the valid faces. The second is a novel method for estimating the initial pose of a newly detected
face. The third is an extension of the Active Appearance Model (AAM) to use a parameterized
generic 3D head model to aid in both generating the AAM shape model more easily and
extracting meaningful quantities from the tracked face. This uses the results of face detection,
point-based tracking, and initial pose estimation to improve the initial guess for the iterative
solution.
5.1 Integration of Detection and Tracking
Detection and tracking of faces are often done independently. We discussed methods for
each in Section 2.5. If each were perfect, they could remain independent, but integrating them
provides ways to deal with false positives in the detection algorithms (which leads to tracking
objects that aren’t really faces) and drift in the tracking algorithm (where the tracking latches on
to background features). We can avoid both problems by combining the two steps. We will
describe the face detection method in Section 5.1.1, our tracking method in Section 5.1.2, the
integration of the two in Section 5.1.3, and analysis in Section 5.1.4.
5.1.1 Face Detection and Localization
The AdaBoost face detection method of Viola and Jones [100] described earlier was used
for locating face regions in each frame. Specifically, the OpenCV implementation [14] and
trained classifier were used as a starting point. On a 2.2 GHz Opteron processor, at a resolution
of 360x240, this ran at 11 Hz with the parameters we used. The main reason that this algorithm
is computationally expensive is that it must search each frame multiple times, looking for face
candidates at each location and at a range of scales.
To decrease the search area, we added a motion detection module, using the algorithm
from Stauffer and Grimson [87]. In their method, the background is modeled as a mixture of
Gaussians for each pixel. The mean, standard deviation, and weight of each distribution are
updated online as new frames are observed. The multimodal model can handle repetitive
motion, such as a tree blowing in the wind or specular reflections from water where a
background pixel might switch back and forth between two or more colors. A new pixel is
checked against each of the existing distributions. If it matches, the mean and standard deviation
for the matched distribution are updated and the weight is increased. If it doesn’t match, a new
distribution is created, replacing the lowest weighted distribution. After sorting the distributions
from highest weight to lowest weight, the background model is the first B distributions, where
B = \arg\min_b \left( \sum_{k=1}^{b} \omega_k > T \right) ,    (25)
where T is the amount of data considered “background” and \omega_k is the kth largest weight. Thus,
if the weights at a pixel are 0.85, 0.10, and 0.05 (summing to 1.0), the first distribution represents
the background if T = 0.8 and the first two distributions qualify as background if T = 0.9. A
foreground object that is stationary will eventually become part of the background, since the
weight is increased each time the color is observed. Pixels matching distributions beyond the
first B are designated foreground.
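The background/foreground split in Equation 25 can be sketched as follows; this is an illustrative NumPy snippet, not the dissertation's implementation, and `num_background` is a hypothetical helper name:

```python
import numpy as np

def num_background(weights, T):
    """Return B, the number of highest-weight distributions whose
    cumulative weight first exceeds the background fraction T (Eq. 25)."""
    w = np.sort(np.asarray(weights, dtype=float))[::-1]  # highest weight first
    cum = np.cumsum(w)
    return int(np.argmax(cum > T) + 1)

# the example from the text: per-pixel weights 0.85, 0.10, 0.05
print(num_background([0.85, 0.10, 0.05], T=0.8))  # -> 1
print(num_background([0.85, 0.10, 0.05], T=0.9))  # -> 2
```

With T = 0.8 only the first distribution is background; raising T to 0.9 admits the first two, matching the worked example in the text.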
If no face was previously detected in an area and there was no motion, there is no need to
search that area again. Therefore, a mask was created with non-zero values where motion was
present or faces were previously detected. The face detection algorithm was modified to only
test rectangles where more than a threshold percentage of pixels are contained in the mask. The
motion detection algorithm alone runs at 25 Hz, and using this motion mask in conjunction with
the face detection algorithm increases its typical performance from 11 Hz to 17 Hz. Of course,
if the entire background were dynamic the motion detection would only slow things down, but
for more static situations, the benefits of the reduced search area outweigh the cost of the motion
detection step.
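The masked-search modification amounts to a cheap per-window test before the classifier runs. A minimal sketch, assuming the mask is a binary image and the 50% threshold is a stand-in value (`rect_passes_mask` is a hypothetical helper, not the modified OpenCV routine):

```python
import numpy as np

def rect_passes_mask(mask, x, y, w, h, min_frac=0.5):
    """Test a candidate detection window against the motion/previous-face
    mask: the expensive face classifier runs only when at least `min_frac`
    of the window's pixels are flagged."""
    window = mask[y:y + h, x:x + w]
    return float(window.mean()) >= min_frac
```

A window fully inside a static, face-free region fails this test and is skipped, which is where the 11 Hz to 17 Hz speedup comes from.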
5.1.2 Face Tracking
Once a face has been detected, we need to track it in subsequent frames, updating
position and orientation. Of the tracking categories listed in Section 2.5.6, we chose feature-
based tracking, using generic corners found in the detected face region.
By “corner”, we don’t necessarily mean a right angle, but a point with a high gradient in
both the horizontal and vertical directions. A point on a vertical edge would have a large
gradient in the horizontal direction but not in the vertical direction, so there would not be a
unique match for that point in another similar image. On the other hand, a point on a corner
would have large horizontal and vertical gradients, allowing a precise match in another image.
Methods for identifying these corners include Harris [48] and Shi and Tomasi [88]. We used the
latter, as implemented in OpenCV. In addition to finding corners that meet a quality threshold,
this function performs non-maxima suppression, and discards weaker corners that are closer than
a distance threshold from stronger corners.
These corners are located in each following frame using a pyramidal implementation of
the Lucas Kanade optical flow method [13]. Optical flow is based on Equation 26, which
basically says whatever image intensity (color) is visible at location (x, y) at time t will also
appear at a slightly different location (x + dx, y + dy) at a slightly later time t + dt.
f(x, y, t) = f(x + dx, y + dy, t + dt)    (26)
A Taylor series expansion yields Equations 27 and 28, where Equation 28 is the classic optical
flow equation.
f(x + dx, y + dy, t + dt) = f(x, y, t) + \frac{\partial f}{\partial x} dx + \frac{\partial f}{\partial y} dy + \frac{\partial f}{\partial t} dt    (27)

f_x \, dx + f_y \, dy + f_t \, dt = 0    (28)
The basic premise for Lucas Kanade optical flow for a small region [66] is that the
motion in a small window is constant. For a window size of 3x3 and frames that are 1 time unit
apart, this gives a system of 9 equations that can be solved by least squares, as in Equation 29.
Offsetting the position in the second image by (dx, dy) and iterating allows the estimate to be
refined.
\begin{bmatrix} f_{x_{11}} & f_{y_{11}} \\ f_{x_{12}} & f_{y_{12}} \\ \vdots & \vdots \\ f_{x_{33}} & f_{y_{33}} \end{bmatrix} \begin{bmatrix} dx \\ dy \end{bmatrix} = - \begin{bmatrix} f_{t_{11}} \\ f_{t_{12}} \\ \vdots \\ f_{t_{33}} \end{bmatrix}    (29)
This only works if the motion is less than one pixel. To increase both speed and the scale
of the motion that can be tracked, image pyramids are used. The pyramid is built by
successively downsizing the image by a factor of 2 in each direction each time, creating a
hierarchy. The tracking is done in the coarsest level first, and the offset calculated there is scaled
appropriately and used as the initial estimate at the next finest level. Thus a pyramid with 4
levels allows motion of up to 8 pixels without searching the whole 16x16 pixel area.
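The single-window solve of Equation 29 is a two-unknown least-squares problem. A minimal NumPy sketch, assuming the spatial and temporal derivatives over the 3x3 window are already computed (`lk_window_flow` is a hypothetical helper, not the pyramidal OpenCV routine the text uses):

```python
import numpy as np

def lk_window_flow(fx, fy, ft):
    """Solve the 3x3-window Lucas-Kanade system (Eq. 29) by least squares.
    fx, fy, ft are the derivative images over the window."""
    A = np.stack([fx.ravel(), fy.ravel()], axis=1)  # 9x2 system matrix
    b = -ft.ravel()                                 # right-hand side
    (dx, dy), *_ = np.linalg.lstsq(A, b, rcond=None)
    return dx, dy
```

In the pyramidal version described above, this solve runs at the coarsest level first, and the scaled-up result seeds the next finer level.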
Once the point locations in the new frame have been found, the two sets of points
(original points (xi, yi) and next frame points (xi’, yi’)) can be fit to a model to find the change in
object (face) position and orientation. In the simplest model, the points are assumed to all have
the same translation, and the displacement ( )21,bb can be found using Equation 30.
\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} , \qquad \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} x_1' - x_1 \\ y_1' - y_1 \\ \vdots \\ x_n' - x_n \\ y_n' - y_n \end{bmatrix}    (30)
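For the pure-translation model, the least-squares solution of Equation 30 reduces to the mean displacement of the tracked corners. A one-line sketch (illustrative; `fit_translation` is a hypothetical helper name):

```python
import numpy as np

def fit_translation(pts, pts_next):
    """Least-squares solution of Eq. 30: the mean corner displacement.
    pts and pts_next are Nx2 arrays of (x, y) and (x', y') coordinates."""
    return (pts_next - pts).mean(axis=0)
```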
Expanding the model to incorporate a scale factor s results in Equation 31. In the
translation model, the origin of the image coordinates didn’t matter, but here the face needs to be
scaled relative to the center of the face. The transformation includes shifting the coordinates so
that the face center at ( )cc yx , becomes the origin, and shifting back after the scale operation.
The three parameters can be obtained by solving the system of equations in Equation 32.
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
= \underbrace{\begin{bmatrix} 1 & 0 & b_1 \\ 0 & 1 & b_2 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{translate by } b_1, b_2}
\underbrace{\begin{bmatrix} 1 & 0 & x_c \\ 0 & 1 & y_c \\ 0 & 0 & 1 \end{bmatrix}}_{\text{translate by } x_c, y_c}
\underbrace{\begin{bmatrix} s & 0 & 0 \\ 0 & s & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{scale by } s}
\underbrace{\begin{bmatrix} 1 & 0 & -x_c \\ 0 & 1 & -y_c \\ 0 & 0 & 1 \end{bmatrix}}_{\text{translate by } -x_c, -y_c}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \begin{bmatrix} s(x - x_c) + x_c + b_1 \\ s(y - y_c) + y_c + b_2 \\ 1 \end{bmatrix}    (31)
\begin{bmatrix} x_1 - x_c & 1 & 0 \\ y_1 - y_c & 0 & 1 \\ \vdots & \vdots & \vdots \\ x_n - x_c & 1 & 0 \\ y_n - y_c & 0 & 1 \end{bmatrix} \begin{bmatrix} s \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} x_1' - x_c \\ y_1' - y_c \\ \vdots \\ x_n' - x_c \\ y_n' - y_c \end{bmatrix}    (32)
If we continue to build the model by adding an in-plane rotation, we get Equation 33.
Once again, the scale and rotation must be relative to the face center, so the coordinates are
translated first, and then translated back after the scale and rotation. Since the scale is the same
in the x and y directions, the order of the scale and the rotation does not matter.
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & b_1 \\ 0 & 1 & b_2 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & x_c \\ 0 & 1 & y_c \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} s & 0 & 0 \\ 0 & s & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & -x_c \\ 0 & 1 & -y_c \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \begin{bmatrix} s\cos\theta \,(x - x_c) - s\sin\theta \,(y - y_c) + x_c + b_1 \\ s\sin\theta \,(x - x_c) + s\cos\theta \,(y - y_c) + y_c + b_2 \\ 1 \end{bmatrix}    (33)
In order to obtain a linear system of equations, we substitute a_1 = s\cos\theta and a_2 = s\sin\theta. The original parameters can be extracted as s = \sqrt{a_1^2 + a_2^2} and \theta = \tan^{-1}(a_2 / a_1). The four parameters in this model can be found using Equation 34.
\begin{bmatrix} x_1 - x_c & -(y_1 - y_c) & 1 & 0 \\ y_1 - y_c & x_1 - x_c & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ x_n - x_c & -(y_n - y_c) & 1 & 0 \\ y_n - y_c & x_n - x_c & 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} x_1' - x_c \\ y_1' - y_c \\ \vdots \\ x_n' - x_c \\ y_n' - y_c \end{bmatrix}    (34)
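The four-parameter fit of Equation 34, including the recovery of s and \theta from a_1 and a_2, can be sketched in NumPy as follows (an illustrative sketch, not the dissertation's implementation; `fit_similarity` is a hypothetical helper name):

```python
import numpy as np

def fit_similarity(pts, pts_next, center):
    """Least-squares fit of the 4-parameter model (Eq. 34), with
    a1 = s*cos(theta) and a2 = s*sin(theta) about the face center."""
    xc, yc = center
    x = pts[:, 0] - xc
    y = pts[:, 1] - yc
    n = len(pts)
    A = np.zeros((2 * n, 4))
    A[0::2] = np.stack([x, -y, np.ones(n), np.zeros(n)], axis=1)
    A[1::2] = np.stack([y, x, np.zeros(n), np.ones(n)], axis=1)
    rhs = np.empty(2 * n)
    rhs[0::2] = pts_next[:, 0] - xc
    rhs[1::2] = pts_next[:, 1] - yc
    (a1, a2, b1, b2), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    # undo the substitution to recover the original parameters
    s = np.hypot(a1, a2)
    theta = np.arctan2(a2, a1)
    return s, theta, b1, b2
```

With exact correspondences the fit recovers the scale, roll, and translation; with noisy corner tracks the least-squares solution averages the errors out.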
The models up to this point have parameters that directly correspond to the rigid motion
in the video, although these parameters don’t describe all possible motion, such as yaw and pitch
(a.k.a. pan and tilt). The affine model, in Equation 35, combines scaling and rotation and allows shear, which is unlikely to actually occur in the video but could be falsely introduced by errors in the tracking of the point features.
\begin{bmatrix} x' - x_c \\ y' - y_c \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & b_1 \\ a_3 & a_4 & b_2 \end{bmatrix} \begin{bmatrix} x - x_c \\ y - y_c \\ 1 \end{bmatrix}    (35)
Other models, such as biquadratic in Equation 36,
x' = a_1 + a_2 x + a_3 y + a_4 x^2 + a_5 y^2 + a_6 xy
y' = a_7 + a_8 x + a_9 y + a_{10} x^2 + a_{11} y^2 + a_{12} xy    (36)
bilinear in Equation 37,
x' = a_1 + a_2 x + a_3 y + a_4 xy
y' = a_5 + a_6 x + a_7 y + a_8 xy    (37)
and pseudo perspective in Equation 38
x' = a_1 + a_2 x + a_3 y + a_4 x^2 + a_5 xy
y' = a_6 + a_7 x + a_8 y + a_4 xy + a_5 y^2    (38)
provide a linear solution, but don’t have parameters that relate directly back to 3D orientation of
the head. While more complex models may describe the change in the scene more accurately,
they also run the risk of overfitting, i.e., introducing changes in the global model that do not
match the data in an effort to decrease the error at selected points.
5.1.3 Integration of Detection and Tracking
Combining the detection and tracking modules is not as straightforward as it would first
appear. If we simply create a new object for every face detected, several problems are quickly
apparent. A scene with one stationary face will produce a pile of tracked objects, as a new object
is initialized on top of the old one when a face is detected in each new frame. If a piece of the
background is falsely detected as a face for a single frame, it will be tracked as a face (and
remain stationary), with no means to terminate it. Another possibility is that the tracked points
will drift away from the face and latch on to parts of the background.
Many face tracking applications avoid these problems by assuming that a scene has
exactly one face in it. This is reasonable for a video conversation using a webcam, but we wish
to allow multiple people to interact with our augmented reality application. To solve this
problem we propose the following method for integrating detection and tracking data. This
process is illustrated in Figure 89.
Figure 89: Integrating detection and tracking data
First, motion detection is done, and the positions of the faces present in the previous
frame are updated. Next, face detection is done in areas where motion occurred and a currently
tracked face does not exist. Since we already updated the face locations, these positions should
be more accurate than using the positions from the previous frame. We mentioned previously
that reducing the search area for the face detector yields faster execution. We exclude tracked
faces from the search area because we have a faster method for handling those areas.
The idea behind the next part is that we want to stop tracking areas that fail the face
detector. Previously tracked faces that overlap with a detected face are retained. Those that
don’t are tagged for further testing. Any detected faces that don’t overlap with a previously
tracked face are added to the list of tracked faces. The tagged faces may have failed the detector
for two main reasons: the region isn’t really a face, or the face may be at an orientation that our
frontal face detector can’t handle. In order to keep tracking faces that are no longer frontal, we
warp the face image using the tracked orientation and render a frontal view. If the tracked
orientation was correct and it really was a face, the frontal face detector should yield a positive
result when applied to the warped image. Since this test for a face is pass/fail (i.e., we don’t
need the exact location, just whether or not the region contains a face), we can use a faster
version of the face detector that returns when it finds the first face.
By integrating the tracking data and the face detector, we eliminate patterns that were
only detected as a face for a single frame. Eliminating static regions from the face detection
search area not only speeds up the execution, but also prevents the possibility of static regions of
the background being falsely detected as faces. This idea can be expanded by requiring that a
face be detected for n frames before augmentation is displayed, or allowing it to fail the face
detector for n frames before the track is discarded.
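The keep/re-test/add logic above can be sketched as follows. This is an illustrative skeleton, not the original system: `verify_frontal` stands in for the warp-to-frontal re-detection step, and a simple axis-aligned rectangle intersection is assumed as the overlap criterion.

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap test; rects are (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def update_tracks(tracked, detected, verify_frontal):
    """One frame of the integration step: keep tracked faces confirmed by
    the detector, re-test the rest after warping to frontal, and start
    tracking any detection that matches no existing face."""
    kept = []
    for face in tracked:
        if any(overlaps(face, d) for d in detected):
            kept.append(face)        # confirmed by the frontal detector
        elif verify_frontal(face):
            kept.append(face)        # passed the pass/fail warped re-test
        # otherwise: discard the track
    for d in detected:
        if not any(overlaps(d, f) for f in tracked):
            kept.append(d)           # newly detected face
    return kept
```

The n-frame hysteresis mentioned above would wrap this with per-track counters rather than dropping a track on its first failure.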
5.1.4 Results
We processed a 491 frame sequence with various configurations. We will use precision
and recall to quantify the results. Precision is the percentage of detections that were correct: true positives / (true positives + false positives). Recall is the percentage of actual faces that the
detector found: true positives / (true positives + false negatives). Ideally, both should be as close
to 100% as possible. The results are summarized in Table 6.
Using only the face detector (no motion detection, no tracking), 298 faces were detected.
Of these, 241 were actually faces, and 57 were not. In addition, 71 actual faces were not
detected because the roll angle of the face exceeded the limits of the frontal face detector. This
yields a precision of 81% and recall of 77%.
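The arithmetic behind these percentages is just the two ratios defined above; a small sketch using the detection-only counts from the text (`precision_recall` is a hypothetical helper name):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# detection only: 241 true positives, 57 false positives, 71 false negatives
p, r = precision_recall(241, 57, 71)
print(round(p * 100), round(r * 100))  # -> 81 77
```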
Next, we added motion detection. This eliminated false positive faces in areas where
there was no motion. The same 241 actual faces were found, with just 18 false positives. The
same 71 false negatives remained, since we did not do anything that would compensate for the
roll angle yet. This increased the precision to 93% and the recall remained at 77%.
Tracking was added using a simple translation model. The detection criterion was
changed to require detection in two consecutive frames. True positives were reduced by 6 to 235
because the first frames of each of 6 continuous segments where a face was present were
rejected. In batch mode, we could go back and accept those 6 detections, but with real-time
processing, we can’t see into the future, and don’t want the delay associated with lagging the
output video by a frame. False positives were reduced to just 5. There were two cases where
false detections occurred for consecutive frames; one for 2 frames resulting in one false positive,
and the other for 5 frames resulting in 4 false positives. The same 71 false negatives were still
present since the motion model doesn’t deal with roll angle yet, plus the 6 frames at the
beginning of tracked segments, for a total of 77. The precision is 98% and the recall is 75%.
The last step increased the complexity of the tracking model to handle the roll angle. If
the detector failed to detect a tracked face, the image was rotated by the tracked roll angle and
the face detector was invoked on this warped image. The only remaining false negatives were
the first frames of each of two tracked segments, leaving 310 true positives. The same 5 false
positives were still present. The precision is 98% and the recall is 99%.
Table 6: Results from integrating detection and tracking

Description                                        True Pos.  False Pos.  False Neg.  Precision  Recall
Detection only                                        241         57          71         81%       77%
With motion detection                                 241         18          71         93%       77%
Translation tracking model                            235          5          77         98%       75%
Translation, rotation, and scale tracking model       310          5           2         98%       99%
5.2 Initial Pose Estimation
It is tempting to assume that faces found by a system that detects frontal vertical faces are
perfectly aligned, but this is not valid in practice. We observed the faces detected to be rotated
by as much as 20° from straight ahead. Detectors for features like eyes exist, but involve
searching the face at a range of scales, which is computationally expensive. Glasses, beards,
scarves, and other occlusions further complicate the process. Eyes may not be visible with dark
glasses. A mouth may be hidden under a beard.
One method that has been used to determine the initial face pose is based on finding the
pixels that match a skin color model. The distribution of these pixels can be used to find the
orientation of the face when it is first detected. We implemented two variations of this method
with disappointing results.
To improve on this bridge between face detection and pose tracking, we propose a novel
method for quickly determining this initial face orientation based on symmetry. The vast
majority of faces have left-right symmetry. Even the aforementioned occlusions tend to be
symmetric. By operating on a relatively small number of high contrast points (which can later be
used to track the face), the algorithm is much faster than one that uses all the image pixels, and
more accurate than using the distribution of skin color pixels.
5.2.1 Pose from Skin Color Detection
In the first method for finding the skin color pixels for each face, we compute a color
histogram of the region returned by the face detector, and assume that the most commonly
occurring colors correspond to the skin of the face. We used 16 bins for each of the three
channels, for a total of 4096 bins. Since the skin color in even a single face image covers
multiple bins, we designated the bins with at least half as many pixels as the bin with the most
pixels as skin color. For example, if the most populous bin had 100 pixels, all bins with at least
50 pixels were considered skin color. Once pixels that matched this skin color model were
identified, moments were calculated to find the orientation of the skin color pixels using
\phi = \frac{1}{2} \tan^{-1}\!\left( \frac{2\mu_{11}}{\mu_{20} - \mu_{02}} \right) ,    (39)

where \mu_{ij} = \sum_k (x_k - \bar{x})^i (y_k - \bar{y})^j.
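The two steps just described, marking the dominant histogram bins as skin color and orienting the resulting pixel mass with Equation 39, can be sketched in NumPy as follows. This is an illustrative sketch, not the original implementation; `skin_bins` and `orientation` are hypothetical helper names, and the 16-bin and 50% figures follow the text.

```python
import numpy as np

def skin_bins(face_pixels, bins=16):
    """Histogram an Nx3 array of color pixels into 16 bins per channel
    (4096 total) and mark every bin holding at least half as many pixels
    as the fullest bin as skin color (the 50% rule described above)."""
    idx = np.asarray(face_pixels) // (256 // bins)             # per-channel bin index
    flat = (idx[:, 0] * bins + idx[:, 1]) * bins + idx[:, 2]   # 0..4095
    hist = np.bincount(flat, minlength=bins ** 3)
    return hist >= hist.max() / 2.0                            # boolean mask over bins

def orientation(points):
    """Eq. 39: orientation from the second central moments of a point set
    (here computed with arctan2 to resolve the quadrant)."""
    x = points[:, 0] - points[:, 0].mean()
    y = points[:, 1] - points[:, 1].mean()
    mu11, mu20, mu02 = (x * y).sum(), (x * x).sum(), (y * y).sum()
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
```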
Figure 90 shows the face detection results for a color image with 24 detected faces. The
skin color pixels detected are shown in magenta in Figure 91. The threshold of 50% for the bins
designated as skin color was used because it provides a balance between too many and too few
skin pixels. This also shows the orientation calculated using the moment of the skin pixels. The
vertical axis is shown in green, and the horizontal axis in red. By visual inspection, only two of
the orientations look correct, and two more are 90 degrees off.
Figure 90: Face detection results.
Figure 91: Skin detection results using local histogram. Pixels detected as skin for each face are shown in magenta. The orientation calculated for each face is shown with the horizontal axis in red and the vertical axis in green.
We also calculated a global skin color model, which could be used to aid the search for
faces. We collected samples of skin images cropped from images showing a wide variety of skin
colors and lighting conditions. Based on our findings from Chapter 3, we converted the images
to chromaticity space and found the histogram of all the images, ignoring very dark pixels
(because chromaticity is unstable at black) as well as those with at least one RGB component at
255 (since those colors may be saturated). The bins with fewer than 10 samples were discarded
as noise. The non-zero bins remaining were considered to represent skin color. The result of
backprojecting this histogram on our test image is shown in blue in Figure 92. Most of the
actual skin was detected, but hair and white shirts were also found, since these are valid skin
colors. The trees (except for the darkest parts), grass and less neutral clothing were not detected
as skin color.
Figure 92: Global skin detection results.
Since the faces were not limited to clearly defined blobs, we masked out the pixels in a
circular region determined by the face detector for our analysis. The same computations as
before were performed using this different set of skin pixels. The results are shown in Figure 93
with the same color scheme as before. Skin pixels are magenta, and the orientation is shown
with a red horizontal axis and a green vertical axis. By visual inspection, five look correct.
Figure 93: Face orientation using global skin detection.
5.2.2 Proposed Point Symmetry Method
Our method [91] is based on the bilateral symmetry of faces, but does not recover
specific facial features. The assumption is that if high contrast points are found on a face image
(i.e., the region returned by a face detection algorithm), they will be distributed equally on the
left and right sides of the face. There are likely to be points on the corners of the eyes,
eyebrows, and mouth, since these are the highest contrast locations on most faces. Since
occlusions like glasses and beards are usually symmetric, the algorithm will handle them without
modification, and may even benefit from them.
The points may be found by a method such as the Harris corner detector [48], which
detects points with high gradients in both the horizontal and vertical directions. An additional
constraint discards corners that are closer together than a threshold. Making this threshold
proportional to the face region size allows the algorithm to work on a wide range of resolutions.
The first approach we tried was to find the orientation θ of the point set from the second central
moments using Equation 39. Likewise, we also tried using least squares to fit a line to the data.
Both of these produced results worse than assuming that the face was vertical. Instead, we
evaluate the symmetry of the point set at various rotations.
Our measurement of symmetry extends the work of Zabrodsky et al. [114]. They
formalize the concept of symmetry as a continuous measure, i.e., expanding from the binary
choice of “symmetric” or “not symmetric” to the continuous idea of “more symmetric” for a
shape. They begin by defining the symmetry transform of a contour that finds the closest
symmetric shape to a given shape, representing a shape as a set of points on its contour. The
symmetry distance is then the distance between the points representing the original shape and the
corresponding points of the symmetry transform. However, the selection of the sets of
corresponding points is left open. In a related paper [113], the symmetry of molecules is
measured in a similar manner. The two sets of corresponding points are limited to isomorphic
pairs of the graphs formed by the atomic bonds.
In our work, we extend the concept of continuous symmetry to general point sets. This
is illustrated in Figure 94. An example point set Pi is shown in (a). For every point Pi in the set,
we flip it about a potential symmetry axis to get Pi’. This is shown in (b) with solid points
representing Pi and hollow points representing Pi’. Then, for every flipped point Pi’, we find the
closest original point Pj. For points on the axis, i = j. A one-to-one correspondence is not
enforced, so if two points were found on the left eye and three were found on the right eye, the
third point can still match an eye point. This matching is shown in (c). The distances between
each closest pair Pi’ and Pj are averaged to get the symmetry distance,
SD = \frac{1}{n} \sum_{i=0}^{n-1} \left\| P_i' - P_j \right\| .    (40)
Thus, a symmetry distance of zero indicates that a point set is perfectly symmetric, and larger
distances mean the shape is less symmetric. The process of finding the axis of symmetry for a
set of points then consists of finding the symmetry distance for each potential axis. The axis
with the smallest symmetry distance is taken to be axis of symmetry for the shape.
Figure 94: Symmetry distance. (a) is the original point set, (b) shows the original filled points and the reflected hollow points, (c) shows the closest filled point to each hollow point. The average of these line lengths is the symmetry distance.
To apply this idea to initial pose estimation for detected faces, we begin by assuming that
the center of the detected face region falls on the symmetry axis. We experimented with using
the mean of the points, but got better results using the region center, due to points detected at the
edge of one side of the face that were not visible on the other side. Once we have a point on the
axis, we can rotate the points about this point and measure the symmetry at each potential angle.
The range of angles we need to test is practically limited by the face detector. Ours only
detected faces within 20° of vertical, so searching from -30° to +30° is reasonable. We used
increments of one degree in this range. The N² distance computations for each axis appear at first to be inefficient, but for low-resolution faces, N = 20 is reasonable, and 400 operations per rotation angle is not overwhelming. Computational geometry methods that improve on the exhaustive O(N²) search, such as building a kd-tree [26], can be applied to speed up this process if needed. A correlation or feature detection algorithm would quickly become more expensive.
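The symmetry distance of Equation 40 and the exhaustive angle search can be sketched as follows; this is an illustrative NumPy version (O(N²) nearest-point search, no kd-tree), and `symmetry_distance` and `best_roll` are hypothetical helper names:

```python
import numpy as np

def symmetry_distance(points, center, angle):
    """Eq. 40: rotate the points about `center` by `angle`, reflect them
    across the vertical axis, and average each reflected point's distance
    to its nearest original (rotated) point."""
    c, s = np.cos(angle), np.sin(angle)
    p = (points - center) @ np.array([[c, -s], [s, c]]).T
    flipped = p * np.array([-1.0, 1.0])       # mirror about the vertical axis
    d = np.linalg.norm(flipped[:, None, :] - p[None, :, :], axis=2)
    return d.min(axis=1).mean()               # no one-to-one matching enforced

def best_roll(points, center, angles=np.deg2rad(np.arange(-30, 31))):
    """Search candidate roll angles in 1-degree steps; the rotation that
    makes the point set most symmetric wins."""
    return min(angles, key=lambda a: symmetry_distance(points, center, a))
```

A perfectly symmetric point set scores zero, and a face rolled by 10 degrees is recovered by the angle that rotates it back to symmetry.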
An example is shown in Figure 95. Image (a) shows a face region with the high contrast corners displayed as black circles. Images (b) and (c) show two different rotations. The rotation of the
actual image is not necessary, and is displayed here for illustration purposes. The black points
show the rotated corners, and correspond to the same facial features on all three images. The
white points have been reflected around the vertical axis, which is the rotation axis being
evaluated. If the rotation being tested makes the face vertical, then a reflected point should land
on the same feature on the other side of the face. The white lines connect each white point with
the closest black point, and the symmetry distance is the sum (or average) of the lengths of all
the white lines. In image (c), not only does the face look vertical, but most of the white lines are
very short, and the sum of the line lengths listed at the bottom of the image is the minimum.
Figure 95: Measuring symmetry. (a) shows the corners in black, (b) and (c) show two different rotations, with the reflected points in white, and lines between matched points.
5.2.3 Point Symmetry Results
To test our algorithm, we used the rotated test set of images from [78]. This test set contains grayscale images that each have one or more frontal faces at many different rotations
and resolutions. A list of the image coordinates of the left and right eye, nose, and left, center,
and right points on the mouth is also provided. From these points, we computed the ground truth
rotation angle as the angle between the vertical axis and the best line through the nose, center
mouth point, and the midpoint between the eyes for each face. Figure 96 shows some sample
input images. A black triangle is shown that connects the ground truth coordinates for the eyes
and mouth center. The symmetry axis is assumed to be on a line through the mouth point and
the midpoint of the line connecting the eyes. There is also a white number showing the index for
that face (“1” for a single face).
Figure 96: Test images. The original image is shown with a black triangle that connects the ground truth coordinates for the eyes and mouth center.
The face detector used in our experiments is the OpenCV [14] implementation and
trained classifiers that are based on Viola and Jones [100]. As previously mentioned, the
algorithm detects frontal faces up to about 20 degrees rotation from vertical. The orientation
algorithm was applied only to the face regions detected.
Some sample results are shown in Figure 97, corresponding to the input images from
Figure 96. The box shows the region detected as a face, small green circles show the high
contrast points found in that region, and an axis shows the orientation found, with red pointing in
the positive x direction and green in the positive y direction. In each of these examples, the
algorithm result was within one degree of the ground truth.
Figure 97: Algorithm results. The box shows the region detected as a face, small green circles show the high contrast points found in that region, and an axis shows the orientation found, with red pointing in the positive x direction and green in the positive y.
The quantitative results are summarized in Table 7. The average rotation in the original
data set was 14.3 degrees, with only 4 of the 30 faces rotated less than 5 degrees. A least squares
line fit (fit the best line, and then offset the slope by a multiple of 90° so that the result is
between ±45°) yielded a larger average error of 26.0 degrees, with only 3 of the 30 within 5
degrees of the correct rotation.
For our proposed method, the average rotation error in the 30 faces detected was 6
degrees. The error was less than 1 degree for 23% of the detected faces (<1 degree is the best
error possible when using 1 degree increments), and within 5 degrees for 73% of the detected
faces.
Table 7: Roll angle calculation results

Description               Average Error (degrees)   % within 1 degree   % within 5 degrees
Assume face is vertical            14.3                     0%                  13%
Least squares line fit             26.0                     0%                  10%
Proposed method                     6.0                    23%                  73%
We also tested our method on the color image used for the methods based on skin color
detection. The results are shown in Figure 98. The high contrast points are shown in blue, the
red line indicates the horizontal axis of the face, and the green line shows the vertical axis. By
visual inspection, six appear to be incorrect.
Figure 98: Face orientation results using our point symmetry method. The red line indicates the horizontal axis, the green line indicates the vertical axis, and the high contrast points are shown in blue.
5.2.4 Conclusion
We have presented a novel method for determining the initial face orientation using the
symmetry of high contrast points. This method is fast, since the computations are based on the
point coordinates, not image pixels, and is more accurate than other fast methods, such as finding
the orientation of skin color pixels. While not as precise as more complex methods, such as
Active Appearance Models presented later, this is still useful as a way to quickly estimate a
starting pose for face tracking.
5.3 Extension of the Active Appearance Model for Face Tracking
An overview of Active Appearance Models (AAMs) was presented in Section 2.5.6.
Tracking with AAMs requires first building a model offline, then determining the model
parameters that match each input frame. The standard 2D AAM provides 2D position, size, in-
plane rotation, and shape parameters, but obtaining the full 3D orientation that we need for
augmentation requires a combined 2D and 3D AAM. In this section, we will present the existing
methods and our extensions for each task.
5.3.1 Building an Active Appearance Model
The active appearance model was introduced by Cootes et al. [20] as the next step from
their earlier work with Active Shape Models (ASMs) [21]. The approach creates linear models
of shape variation and appearance (gray level) variation to model the object.
The input to the process is a set of images with feature points marked (usually by hand)
on each image. Cootes et al. [20] used 122 points for their face images, while Matthews and
Baker [67] used 68 vertices. First the variation of the 2D point locations in the images is
analyzed to create a mean shape and orthogonal shape modes. The shape s is defined as the
coordinates of the v vertices of the mesh:

\mathbf{s} = (x_1\ y_1\ x_2\ y_2\ \ldots\ x_v\ y_v)^T .   (41)

Any shape in the input set can then be represented as

\mathbf{s} = \mathbf{s}_0 + \sum_{i=1}^{n} p_i \mathbf{s}_i ,   (42)

where \mathbf{s}_0 is the mean shape, and the n shape vectors \mathbf{s}_i and coefficients p_i represent the shape
variation. This is a 2D model, and will record 3D variations such as turning the head to the side
as changes in the projected 2D shape.
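As a small illustration of Equation 42 (a numpy sketch with made-up mode vectors, not data from our model):

```python
import numpy as np

# Equation 42: s = s0 + sum_i p_i * s_i, for a toy 3-vertex mesh.
# Shapes are stored as (x1, y1, x2, y2, x3, y3); the mode vectors are made up.
s0 = np.array([0.0, 0.0, 1.0, 0.0, 0.5, 1.0])          # mean shape
S = np.array([
    [0.1, 0.0, -0.1, 0.0, 0.0, 0.0],   # mode 1: spread the base horizontally
    [0.0, 0.1, 0.0, 0.1, 0.0, -0.2],   # mode 2: move the apex vertically
])

def synthesize_shape(p):
    """Linear shape model: mean shape plus weighted shape modes."""
    return s0 + p @ S

s = synthesize_shape(np.array([2.0, -1.0]))
```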
Each image is then warped to match the mean shape, and the intensity variations within
the mean face shape are processed to create a mean appearance and orthogonal appearance
modes. The appearance A(x) is the image defined over the pixels x inside the mean shape s_0. Any
appearance in the set can be represented as

A(\mathbf{x}) = A_0(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i A_i(\mathbf{x}), \quad \forall \mathbf{x} \in \mathbf{s}_0 ,   (43)

where A_0(\mathbf{x}) is the mean appearance, and the m appearance modes A_i(\mathbf{x}) and coefficients \lambda_i
model the appearance variation.
Cootes et al. created a model that combined shape and appearance modes, dubbed a
combined AAM, but both Matthews and Baker and our method treat shape and appearance
separately, and call this an independent AAM.
The input needed for creating the shape model is a set of shapes with corresponding
points identified. Our first contribution to AAMs is in collecting these shapes. Instead of
collecting images or video that show different people in a wide range of face poses and expressions,
and hand marking on the order of 100 points on each, we require only about 30 points marked on
a single frontal image for each different person. The 30 points are used to determine the global
translation and rotation as well as the shape and animation parameters to fit a modified version
of the Candide generic 3D face model [12] to the image. From these parameters, we can then
synthesize a wide variety of poses and expressions for each person from the model.
The Candide model consists of a set of base vertices, a list of triangles that connect these
vertices, and shape and animation units. Shape and animation units provide linear variation of
the base vertices. Mathematically, shape and animation units are equivalent, but shape units
represent rigid shape variations between individuals and should remain constant for a given
person. Animation units represent non-rigid expression changes, like raising the eyebrows or
opening the mouth.
We made some minor modifications to the Candide model for this application. To avoid
problems with hair or hats at the top of the head, we removed the vertices and associated
triangles above the eyebrows. This matches the face area that was used by Cootes et al. and by
Matthews and Baker, although both used denser sets of points along the edges of the face. We
also limited the animation units to match [25], leaving six animation parameters that control the
mouth and eyebrows. For shape parameters, we eliminated four of fourteen where insufficient
information was available in the 2D image.
In our implementation, for each input face image, the user marks points in a sequence.
The location of the next point to be marked is prompted by highlighting the point on a rendering
of the wireframe model. Figure 99(a) shows the image when all but the last point have been
marked. The prompt for the last point is shown in (b), and the resulting mesh fit on the face is
shown in (c). When the process is complete, a text file containing the shape and animation
parameters is saved, along with a 256x256 image file containing the texture map for the face, that is,
the result of warping each triangle in (c) to match the base shape in (b).
(a) (b) (c)
Figure 99: Offline generic face model fitting process. (a) shows the points identified so far, (b) shows the prompt for the last point, and (c) shows the mesh that has been fit overlaid on the image.
The fitting was done by solving for the translation (tx, ty), scale k and in-plane rotation φ,
parameterized as (a, b) = (k cos φ, k sin φ), shape parameters σi, and animation parameters τi,
with the linear system of equations:
\begin{bmatrix}
\tilde{v}_{1x} & -\tilde{v}_{1y} & 1 & 0 & \tilde{s}^{\sigma_1}_{1x} & \cdots & \tilde{s}^{\sigma_{10}}_{1x} & \tilde{s}^{\tau_1}_{1x} & \cdots & \tilde{s}^{\tau_6}_{1x} \\
\tilde{v}_{1y} & \tilde{v}_{1x} & 0 & 1 & \tilde{s}^{\sigma_1}_{1y} & \cdots & \tilde{s}^{\sigma_{10}}_{1y} & \tilde{s}^{\tau_1}_{1y} & \cdots & \tilde{s}^{\tau_6}_{1y} \\
\vdots & \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & & \vdots \\
\tilde{v}_{nx} & -\tilde{v}_{ny} & 1 & 0 & \tilde{s}^{\sigma_1}_{nx} & \cdots & \tilde{s}^{\sigma_{10}}_{nx} & \tilde{s}^{\tau_1}_{nx} & \cdots & \tilde{s}^{\tau_6}_{nx} \\
\tilde{v}_{ny} & \tilde{v}_{nx} & 0 & 1 & \tilde{s}^{\sigma_1}_{ny} & \cdots & \tilde{s}^{\sigma_{10}}_{ny} & \tilde{s}^{\tau_1}_{ny} & \cdots & \tilde{s}^{\tau_6}_{ny}
\end{bmatrix}
\begin{bmatrix} a \\ b \\ t_x \\ t_y \\ \sigma_1 \\ \vdots \\ \sigma_{10} \\ \tau_1 \\ \vdots \\ \tau_6 \end{bmatrix}
=
\begin{bmatrix} m_{1x} \\ m_{1y} \\ \vdots \\ m_{nx} \\ m_{ny} \end{bmatrix}   (44)

where (m_{ix}, m_{iy}) are the image coordinates of the ith marked point, (\tilde{v}_{ix}, \tilde{v}_{iy}) are the coordinates of
the base vertex in the Candide model (ignoring z) corresponding to the ith marked point,
(\tilde{s}^{\sigma_j}_{ix}, \tilde{s}^{\sigma_j}_{iy}) is the displacement for the ith vertex in the jth shape unit, and (\tilde{s}^{\tau_j}_{ix}, \tilde{s}^{\tau_j}_{iy}) is the
displacement for the ith vertex in the jth animation unit. In this discussion, 3D vectors will be
denoted with a tilde, to distinguish them from 2D vectors.
The input to the second step of the model generation process is a list of the images
processed in the first step, and the 3D shape and animation parameters and texture obtained from
each. For each input image, 100 2D shapes are generated by randomly changing the azimuth,
elevation, and animation parameters of the 3D model and projecting the vertex coordinates back
to 2D. The roll angle is not varied because it is removed when the shapes are aligned. Not only
does this provide many shapes that cover the whole envelope of possible parameter variation
after hand marking only one image, but the projected shape model contains all the points from
the 3D model, not just the few that were marked to establish the shape and animation parameters.
Following the procedure from Cootes et al. [21], the shapes must then be aligned and
normalized, so the 2D shape modes that are created capture only true shape change, not scale,
translation, or in-plane rotation. They use a modification of the Procrustes method to minimize a
weighted sum of squares of distances between equivalent points on different shapes.
First, weights for each vertex are calculated to give more importance to those that vary
the least among the set. The weight, wk, for the kth vertex is given by
w_k = \left( \sum_{j=0}^{n-1} V_{R_{kj}} \right)^{-1}   (45)

where R_{kj} is the distance between points k and j, and V_{R_{kj}} is the variance in this distance over the
set of shapes. A point that varies greatly will have a large sum of variances, resulting in a small
weight. A point that remains relatively stable will have a smaller sum of variances, giving a
greater weight, and so will have a greater effect on the shape normalization. The weights for each
point are collected in a diagonal matrix W.
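This weighting can be sketched in numpy as follows (an illustrative sketch with toy shapes; the names are ours, not the dissertation's):

```python
import numpy as np

def vertex_weights(shapes):
    """Equation 45: weight each vertex by the inverse of the summed
    variances of its distances to every other vertex across the shape set.

    shapes: array of shape (num_shapes, num_vertices, 2).
    Returns one weight per vertex; stable vertices get larger weights.
    """
    # Pairwise distances for every shape: (num_shapes, v, v).
    diffs = shapes[:, :, None, :] - shapes[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Variance of each distance R_kj over the set of shapes: (v, v).
    var = dists.var(axis=0)
    return 1.0 / var.sum(axis=1)

# Toy data: vertex 2 jitters between shapes while vertices 0 and 1 are fixed.
shapes = np.array([
    [[0, 0], [1, 0], [0.5, 1.0]],
    [[0, 0], [1, 0], [0.5, 1.4]],
    [[0, 0], [1, 0], [0.5, 0.6]],
])
w = vertex_weights(shapes)
```

The stable vertices receive larger weights than the jittering one, so they dominate the alignment.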
Aligning shape 1 with shape 2 consists of finding the rotation φ, scale s, and translation t
= (tx, ty) that minimizes
\left( \mathbf{x}_1 - \begin{bmatrix} s\cos\varphi & -s\sin\varphi \\ s\sin\varphi & s\cos\varphi \end{bmatrix} \mathbf{x}_2 - \mathbf{t} \right)^T \mathbf{W} \left( \mathbf{x}_1 - \begin{bmatrix} s\cos\varphi & -s\sin\varphi \\ s\sin\varphi & s\cos\varphi \end{bmatrix} \mathbf{x}_2 - \mathbf{t} \right) .   (46)
Solving this equation using least-squares results in the following linear system:
\begin{bmatrix} X_2 & -Y_2 & W & 0 \\ Y_2 & X_2 & 0 & W \\ Z & 0 & X_2 & Y_2 \\ 0 & Z & -Y_2 & X_2 \end{bmatrix}
\begin{bmatrix} s\cos\varphi \\ s\sin\varphi \\ t_x \\ t_y \end{bmatrix}
=
\begin{bmatrix} X_1 \\ Y_1 \\ C_1 \\ C_2 \end{bmatrix} ,   (47)

where

X_i = \sum_{k=0}^{n-1} w_k x_{ik}   (48)

Y_i = \sum_{k=0}^{n-1} w_k y_{ik}   (49)

Z = \sum_{k=0}^{n-1} w_k \left( x_{2k}^2 + y_{2k}^2 \right)   (50)

W = \sum_{k=0}^{n-1} w_k   (51)

C_1 = \sum_{k=0}^{n-1} w_k \left( x_{1k} x_{2k} + y_{1k} y_{2k} \right)   (52)

C_2 = \sum_{k=0}^{n-1} w_k \left( y_{1k} x_{2k} - x_{1k} y_{2k} \right) .   (53)
The process to align the set of shapes is iterative. First, each shape in the set is aligned
with the first. For each iteration, the mean shape is calculated from the aligned shapes, the new
mean shape is aligned with the first shape, and then all shapes are aligned to the mean. This
process is repeated until it converges.
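One pairwise alignment step can be sketched as follows (an illustrative numpy sketch of the Equation 47 solve; the toy shapes and transform are made up):

```python
import numpy as np

def align_pair(shape1, shape2, w=None):
    """Solve the 4x4 system of Equation 47 for a = s*cos(phi), b = s*sin(phi)
    and t mapping shape2 onto shape1. Shapes are (n, 2) vertex arrays."""
    w = np.ones(len(shape1)) if w is None else w
    X1, Y1 = (w * shape1[:, 0]).sum(), (w * shape1[:, 1]).sum()
    X2, Y2 = (w * shape2[:, 0]).sum(), (w * shape2[:, 1]).sum()
    Z = (w * (shape2 ** 2).sum(axis=1)).sum()
    W = w.sum()
    C1 = (w * (shape1 * shape2).sum(axis=1)).sum()
    C2 = (w * (shape1[:, 1] * shape2[:, 0] - shape1[:, 0] * shape2[:, 1])).sum()
    A = np.array([[X2, -Y2, W, 0.0],
                  [Y2, X2, 0.0, W],
                  [Z, 0.0, X2, Y2],
                  [0.0, Z, -Y2, X2]])
    a, b, tx, ty = np.linalg.solve(A, np.array([X1, Y1, C1, C2]))
    return a, b, np.array([tx, ty])

def apply_similarity(shape, a, b, t):
    R = np.array([[a, -b], [b, a]])
    return shape @ R.T + t

# Toy check: shape2 is shape1 under the inverse of a known similarity.
shape1 = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 2.0], [0.5, 1.5]])
M = np.array([[0.8, -0.6], [0.6, 0.8]])
t_true = np.array([2.0, -1.0])
shape2 = (shape1 - t_true) @ np.linalg.inv(M).T
a, b, t = align_pair(shape1, shape2)
```

Because the toy pair is related by an exact similarity, the least-squares solve recovers it exactly.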
Armed with a set of aligned shapes and a mean, we are ready to perform principal
component analysis (PCA) to find the shape variation modes. First the mean shape is subtracted
from each shape, and then a 2n by 2n covariance matrix is formed from the outer product of the
delta shape vectors. Singular value decomposition (SVD) on the covariance matrix yields
orthogonal eigenvectors, which describe the 2D shape mode variation. Figure 100 shows the
first three shape modes for our model.
Figure 100: The first three shape modes. The circles show the mean shape, and the lines show the magnitude and direction of the positive (green) and negative (red) displacements.
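The PCA step can be sketched in a few lines of numpy (an illustrative sketch on random stand-in shapes, not our face data):

```python
import numpy as np

def shape_pca(aligned_shapes, num_modes):
    """PCA on a set of aligned shapes, each flattened to a 2v-vector.
    Returns the mean shape and the top orthogonal shape modes."""
    X = aligned_shapes.reshape(len(aligned_shapes), -1)   # (num_shapes, 2v)
    mean = X.mean(axis=0)
    D = X - mean                        # delta shape vectors
    C = D.T @ D / len(D)                # 2v x 2v covariance matrix
    # SVD of the symmetric covariance matrix yields orthogonal eigenvectors
    # sorted by decreasing eigenvalue.
    U, S, Vt = np.linalg.svd(C)
    return mean, U[:, :num_modes].T

rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 6, 2))    # 50 random 6-vertex shapes (toy data)
mean, modes = shape_pca(shapes, num_modes=3)
```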
In order to create the mean appearance and the appearance modes, we first warp each
input image to the mean 2D shape determined in the previous step. This is done by using an
affine warp to map pixels in each triangle from the old image into the new image. Once all the
faces are the same shape, the brightness and contrast must be matched. To match image A with
image B, we want to find α and β to minimize
\sum \left( \alpha A + \beta \mathbf{1} - B \right)^2 .   (54)

This yields the linear system:

\begin{bmatrix} A_1 & 1 \\ A_2 & 1 \\ \vdots & \vdots \\ A_n & 1 \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
=
\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix} .   (55)
Premultiplying both sides by the transpose of the first matrix simplifies to:
\begin{bmatrix} A \cdot A & A \cdot \mathbf{1} \\ \mathbf{1} \cdot A & \mathbf{1} \cdot \mathbf{1} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} A \cdot B \\ \mathbf{1} \cdot B \end{bmatrix} , \quad \text{i.e.} \quad \begin{bmatrix} A \cdot A & n\bar{A} \\ n\bar{A} & n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} A \cdot B \\ n\bar{B} \end{bmatrix} ,   (56)

where \bar{A} and \bar{B} are the means of A and B respectively, and n is the number of pixels. Solving
for \alpha and \beta gives:

\alpha = \frac{A \cdot B - n\bar{A}\bar{B}}{A \cdot A - n\bar{A}^2} , \qquad \beta = \bar{B} - \alpha\bar{A} .   (57)
If \bar{B} is zero, this simplifies to

\alpha = \frac{A \cdot B}{A \cdot A - n\bar{A}^2} , \qquad \beta = -\alpha\bar{A} .   (58)
The matched image is therefore

A' = \alpha A + \beta \mathbf{1} = \alpha \left( A - \bar{A}\mathbf{1} \right) = \frac{A \cdot B}{A \cdot A - n\bar{A}^2} \left( A - \bar{A}\mathbf{1} \right) .   (59)
Following the procedure of Cootes et al. [20], we start by using the first image as an
estimate of the mean, setting its mean to zero and variance to one. During each iteration, each
image is aligned to the mean using Equation 59, and then the mean is recalculated from all the
images. This process is repeated until it converges.
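The closed-form fit of Equation 57 can be sketched as follows (illustrative numpy, with a toy image pair):

```python
import numpy as np

def match_brightness_contrast(A, B):
    """Find alpha, beta minimizing sum((alpha*A + beta - B)^2), per the
    closed form of Equation 57."""
    A, B = A.ravel().astype(float), B.ravel().astype(float)
    n = A.size
    Abar, Bbar = A.mean(), B.mean()
    alpha = (A @ B - n * Abar * Bbar) / (A @ A - n * Abar ** 2)
    beta = Bbar - alpha * Abar
    return alpha, beta

# Toy check: B is a known brightness/contrast change of A.
A = np.array([[10.0, 20.0], [30.0, 40.0]])
B = 1.5 * A + 7.0
alpha, beta = match_brightness_contrast(A, B)
```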
Once the mean is calculated, it is subtracted from all the images, and PCA is applied to
create the appearance modes. Figure 101 shows the mean appearance on the left and two of the
appearance modes.
Figure 101: The mean appearance and two appearance modes.
5.3.2 Tracking with an Active Appearance Model
Tracking consists of recovering the model parameters, generating a synthetic image that
resembles the input image as closely as possible. To guide the optimization process, Cootes et
al. [20] assumed a linear relation between the error image (the difference between the input
image and the synthesized image) and the incremental update to the model parameters. This
171
fitting process was reported to take an average of 4.1 seconds on a Sun Ultra. Matthews and
Baker [67] showed a counterexample where the error image is the same for two examples, yet
different increments to the parameters are needed. They then detail a gradient descent approach
to the nonlinear least squares optimization problem that is reported to run at 4.3 ms. per iteration
on a dual 2.4GHz P4 machine. The number of iterations needed depends on the accuracy of the
initial guess.
In the next sections, we will discuss the basic 2D AAM shape fitting process, then the
full system, including appearance variation, brightness matching, 2D shape normalization, and
3D model constraints.
The basic idea of AAM shape fitting is to minimize the square of the difference between
the mean appearance and the input image warped by shape parameters p. Following Matthews
and Baker [67], we use W(x; p) to denote the piecewise affine transform that maps each pixel x
inside the mean 2D shape s0 to a pixel in shape s, using the 2D shape modes obtained from the
method described previously, where
\mathbf{s} = \mathbf{s}_0 + \sum_{i=1}^{n} p_i \mathbf{s}_i .   (60)
That is, to warp a single pixel in x, determine which triangle it is in, and then use
Equation 60 to map each vertex of that triangle to a new location. Find the affine transform that
maps the three old vertices to the three new vertices, and apply that affine transform to the pixel
coordinates. The warp is assumed to be parameterized such that W(x; 0) = x.
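Finding the per-triangle affine transform is a small linear solve; an illustrative numpy sketch (not the dissertation's implementation):

```python
import numpy as np

def triangle_affine(src_tri, dst_tri):
    """Return the 2x3 affine transform mapping the three src vertices
    onto the three dst vertices. Triangles are (3, 2) arrays."""
    # Solve [x y 1] @ X = [x' y'] for the six affine coefficients.
    src_h = np.hstack([src_tri, np.ones((3, 1))])   # homogeneous (3, 3)
    return np.linalg.solve(src_h, dst_tri).T        # (2, 3)

def warp_point(M, p):
    """Apply the 2x3 affine transform to a 2D point."""
    return M @ np.array([p[0], p[1], 1.0])

src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = np.array([[2.0, 1.0], [4.0, 1.0], [2.0, 3.0]])
M = triangle_affine(src, dst)
```

Any point inside the source triangle, not just the vertices, is carried along by the same transform.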
The image intensity as a function of the pixel location is denoted I(x). Thus, we want to
minimize
\sum_{\mathbf{x} \in \mathbf{s}_0} \left[ I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) \right]^2 .   (61)
This is therefore a nonlinear least squares minimization problem, since it has the form
[70]
f(x) = \frac{1}{2} \sum_{j=1}^{m} r_j(x)^2 .   (62)

Defining the Jacobian of r as

J(x) = \left[ \frac{\partial r_j}{\partial x_i} \right]_{j=1,\ldots,m;\; i=1,\ldots,n} ,   (63)

for such problems,

\nabla^2 f(x) = J(x)^T J(x) + \sum_{j=1}^{m} r_j(x) \nabla^2 r_j(x) .   (64)

The Gauss-Newton method drops the second order term, approximating the Hessian as

\nabla^2 f(x) = J(x)^T J(x) .   (65)

For the standard Newton's method line search, the kth parameter update p_k^N is

p_k^N = -\left( \nabla^2 f_k \right)^{-1} \nabla f_k .   (66)

With this approximation for the Hessian, the parameter update p_k^{GN} is generated by

p_k^{GN} = -\left( J_k^T J_k \right)^{-1} J_k^T r_k .   (67)
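A generic Gauss-Newton step per Equation 67, demonstrated on a toy line-fitting residual (our own illustrative sketch, unrelated to the face model itself):

```python
import numpy as np

def gauss_newton_step(r, J):
    """Equation 67: delta = -(J^T J)^{-1} J^T r."""
    return -np.linalg.solve(J.T @ J, J.T @ r)

def fit_line(xs, ys, iters=3):
    """Toy use: fit y = a*x + b by Gauss-Newton. The residual is linear
    in the parameters, so a single step converges exactly."""
    p = np.zeros(2)
    for _ in range(iters):
        r = p[0] * xs + p[1] - ys                      # residuals
        J = np.stack([xs, np.ones_like(xs)], axis=1)   # Jacobian of r
        p = p + gauss_newton_step(r, J)
    return p

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0
p = fit_line(xs, ys)
```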
Obtaining the parameter update requires calculating the residual, rk, and the Jacobian, Jk.
To take advantage of the efficiency of the inverse compositional image alignment
method, we expand Equation 61 as follows:
\sum_{\mathbf{x} \in \mathbf{s}_0} \left[ I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{W}(\mathbf{x}; \Delta\mathbf{p})) \right]^2 .   (68)
Since the incremental parameters are applied to the second term while p is applied to the first
term, p must be updated by composing W(x; p) with W(x; Δp)^{-1}. Details of this operation are in
[67]. A Taylor series expansion around Δp = 0 gives

\sum_{\mathbf{x} \in \mathbf{s}_0} \left[ I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) - \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \Delta\mathbf{p} \right]^2   (69)

since W(x; 0) = x. The Jacobian with respect to Δp is -\nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}}, which does not depend on p, so it
only needs to be computed at initialization. We can obtain the parameter update by either setting the
derivative of Equation 69 to zero or substituting the Jacobian and residual in Equation 67:
\Delta\mathbf{p} = \left( \sum_{\mathbf{x}} \left[ \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \left[ \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right] \right)^{-1} \sum_{\mathbf{x}} \left[ \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \left[ I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) \right]   (70)
The first term is the Gauss-Newton approximation for the Hessian. The only term that depends
on p or the input image is the residual, so the gradient of the mean appearance \nabla A_0, the warp
Jacobian \partial\mathbf{W}/\partial\mathbf{p}, the resulting steepest descent images, and the inverse of the Hessian only need
to be computed once, at initialization. At each iteration, we only need to warp the input image with
the current estimate of p, subtract the mean image, find the dot product with the Jacobian, multiply
by the inverse Hessian, and compose Δp with p.
The warp Jacobian \partial\mathbf{W}/\partial\mathbf{p} is a pair of images (x and y) for each component of p. The value
at each pixel is found by determining the triangle that contains the pixel and forming a linear
combination of the displacements for the triangle vertices from shape vector s. Figure 102
shows the x and y gradients of the mean appearance, \nabla A_0, and Figure 103 shows the warp
Jacobians \partial\mathbf{W}/\partial\mathbf{p} for the first three 2D shape modes. Multiplying these two sets to get \nabla A_0\, \partial\mathbf{W}/\partial\mathbf{p}
results in Figure 104.
Figure 102: X and Y gradients of the mean appearance.
Figure 103: Warp Jacobians for the first three 2D shape modes. The top row is the x component, and the bottom row is the y component.
Figure 104: The steepest descent images obtained by multiplying the gradients in Figure 102 by the warp Jacobians in Figure 103.
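The precompute-once, iterate-cheaply structure described above can be illustrated with a toy one-dimensional translation warp (a minimal numpy sketch of inverse compositional alignment, not the face tracker itself):

```python
import numpy as np

# Template: a smooth 1D bump; the "input frame" is the template shifted by 1.5.
x = np.arange(100, dtype=float)
A0 = np.exp(-((x - 50.0) ** 2) / (2.0 * 10.0 ** 2))

def image(xq):
    return np.interp(xq - 1.5, x, A0)   # I(x) = A0(x - 1.5)

# Precomputed once: steepest-descent image and (scalar) Hessian.
sd = np.gradient(A0)    # grad(A0) * dW/dp, with dW/dp = 1 for translation
H = sd @ sd

p = 0.0
for _ in range(10):
    err = image(x + p) - A0   # warp the input by the current p, subtract template
    dp = (sd @ err) / H       # Gauss-Newton update (as in Equation 70)
    p -= dp                   # compose the warp with the inverse increment
```

Only the warp, subtraction, dot product, and scalar divide happen per iteration; everything else was computed up front.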
Several more sets of parameters are needed to complete our model. The first is
appearance variation. To model both shape and appearance variation, Matthews and Baker [67]
use the technique from [47], which rewrites the sum of squares as the L2 norm:
\left\| I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i A_i(\mathbf{x}) \right\|^2   (71)
This can be decomposed into the vectors projected into the linear subspace spanned by Ai and its
orthogonal complement:
\left\| I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i A_i(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)} + \left\| I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i A_i(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp}   (72)
Due to orthogonality, the appearance variation drops out of the first term, so it no longer depends
on λi. The expression can be minimized by finding p from the first term, then λ from the second
term, but we don’t need λ for our application. Solving for λ, then comparing with the
appearance and shape parameters needed to reconstruct the input images from the appearance
modes would allow different individuals to be recognized. Without appearance parameters, we
just need to minimize
\left\| I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp}   (73)
This can be done as before, with the additional step at initialization modifying the
Jacobian to map it into this subspace:
\mathbf{J}_j(\mathbf{x}) = \nabla A_0 \frac{\partial \mathbf{W}}{\partial p_j} - \sum_{i=1}^{m} \left[ \sum_{\mathbf{x} \in \mathbf{s}_0} A_i(\mathbf{x}) \cdot \nabla A_0 \frac{\partial \mathbf{W}}{\partial p_j} \right] A_i(\mathbf{x})   (74)
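The projection of Equation 74 is a subtraction of the components along each appearance mode; an illustrative numpy sketch with tiny stand-in images:

```python
import numpy as np

def project_out(sd_images, appearance_modes):
    """Equation 74: remove the component of each steepest-descent image
    that lies in the span of the (orthonormal) appearance modes.

    sd_images: (num_params, num_pixels); appearance_modes: (m, num_pixels)."""
    coeffs = sd_images @ appearance_modes.T         # dot with each A_i
    return sd_images - coeffs @ appearance_modes    # subtract the projection

# Toy data: two orthonormal appearance modes over 4 "pixels".
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
sd = np.array([[1.0, 2.0, 3.0, 4.0]])
sd_perp = project_out(sd, A)
```

The result is orthogonal to every appearance mode, so appearance variation drops out of the fitting term.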
The next set of parameters is a global 2D shape similarity transform N(x; q). Since
building the shape model involved removing translation, scale, and in-plane rotation from the
shape data, our model needs to handle such transformations separately. Following Matthews and
Baker [67], we define
\mathbf{N}(\mathbf{x}; \mathbf{q}) = \begin{bmatrix} 1+a & -b \\ b & 1+a \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix} ,   (75)
where 1 + a = k cos φ, b = k sin φ, k is the scale, φ is the in-plane rotation, and tx and ty are the
translations. We can put this in the same form as the shape model by defining
\mathbf{s}_1^* = \mathbf{s}_0 = (x_1\ y_1\ \ldots\ x_v\ y_v)^T
\mathbf{s}_2^* = (-y_1\ x_1\ \ldots\ -y_v\ x_v)^T
\mathbf{s}_3^* = (1\ 0\ 1\ 0\ \ldots\ 1\ 0)^T
\mathbf{s}_4^* = (0\ 1\ 0\ 1\ \ldots\ 0\ 1)^T   (76)

and normalizing. Therefore

\mathbf{N}(\mathbf{x}; \mathbf{q}) = \mathbf{s}_0 + \sum_{i=1}^{4} q_i \mathbf{s}_i^* .   (77)
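Building the four similarity basis vectors of Equation 76 can be sketched as (an illustrative numpy sketch):

```python
import numpy as np

def similarity_basis(s0):
    """Equation 76: four basis shapes spanning the 2D similarity
    transforms of the mean shape s0, given as (x1, y1, ..., xv, yv)."""
    x, y = s0[0::2], s0[1::2]
    s1 = s0.copy()                                 # scale component
    s2 = np.empty_like(s0)                         # in-plane rotation
    s2[0::2], s2[1::2] = -y, x
    s3 = np.tile([1.0, 0.0], len(x))               # x translation
    s4 = np.tile([0.0, 1.0], len(x))               # y translation
    basis = np.stack([s1, s2, s3, s4])
    return basis / np.linalg.norm(basis, axis=1, keepdims=True)  # normalize

s0 = np.array([0.0, 0.0, 2.0, 0.0, 1.0, 2.0])
B = similarity_basis(s0)
```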
Note that N(x; 0) = x. Our function to minimize is now
\left\| I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) - A_0(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp}   (78)
The composition of N and W is achieved by first warping the mean shape by N, finding
the affine transformation for each triangle, then warping the mean shape by W, and applying the
affine transformations to the result. If a vertex is a member of more than one triangle, the
transformation for each triangle is applied, and the results are averaged.
To avoid precision problems possible when subtracting images of different ranges, we
added brightness and contrast parameters b = (α β)T. The new function to minimize is
\left\| (1+\alpha)\, I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) + \beta - A_0(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp}   (79)
Since the parameters so far only describe the 2D shape, and we wish to recover the 3D
rotation of the head, we consider the approach by Xiao et al. [106], who combined the 2D AAM
with a constraint that the 2D shape matches a projection of a non-rigid 3D model, but provided
few details. Since the 3D model projection parameters include the 3D model orientation, this
will allow us to recover the 3D head rotation. However, Xiao et al. used a 3D model created
from 900 frames of a video of an individual for tracking. Our augmented reality application
cannot handle a 30 second delay between first observing an individual and starting the
augmentation. Therefore, we used the Candide model instead. In addition to the benefits we
derived from using this model for creating the 2D AAMs, using it during tracking also allows us
to recover meaningful parameters. The 3D model built by Xiao et al.
doesn’t have useful labels for the non-rigid parameters, but the Candide model was built to use
MPEG-4 action units, so can be used for video compression as well as other applications such as
recognizing facial expressions.
The 3D model constraints are added as a second term, so our final function to minimize
is:
\left\| (1+\alpha)\, I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) + \beta - A_0(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp} + K \sum_i \left\| \mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q})_i - \left[ (1+f)\, \mathbf{R}(\varphi, \theta, \psi) \left( \tilde{\mathbf{s}}_0 + \sum_j \sigma_j \tilde{\mathbf{s}}_j + \sum_j \tau_j \tilde{\mathbf{a}}_j \right)_i + \mathbf{t} \right] \right\|^2 .   (80)
The first term describes the 2D AAM fitting and is evaluated for each pixel inside the mean
shape. The second term penalizes differences between the 2D shape and the projection of the 3D
shape, and is evaluated at each vertex. The parameter K makes this a soft constraint. The new
parameters are a vector r = (f, φ, θ, ψ, tx, ty)T, and σ and τ, which are the coefficients for the 3D
shape and animation parameters respectively. The scale factor is (1+f), R(φ, θ, ψ) is the 3x3
rotation matrix formed by roll angle φ, pitch angle θ, and yaw angle ψ:
\mathbf{R}(\varphi, \theta, \psi) =
\begin{bmatrix} \cos\varphi & -\sin\varphi & 0 \\ \sin\varphi & \cos\varphi & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} \cos\psi & 0 & \sin\psi \\ 0 & 1 & 0 \\ -\sin\psi & 0 & \cos\psi \end{bmatrix}
=
\begin{bmatrix}
\cos\varphi\cos\psi - \sin\varphi\sin\theta\sin\psi & -\sin\varphi\cos\theta & \cos\varphi\sin\psi + \sin\varphi\sin\theta\cos\psi \\
\sin\varphi\cos\psi + \cos\varphi\sin\theta\sin\psi & \cos\varphi\cos\theta & \sin\varphi\sin\psi - \cos\varphi\sin\theta\cos\psi \\
-\cos\theta\sin\psi & \sin\theta & \cos\theta\cos\psi
\end{bmatrix}   (81)
and the translation is t = (tx, ty)T. It is not strictly necessary to use q in the 3D term, but using it
allows the vertex warps from the 2D term to be reused. Note that r is parameterized so that r = 0
results in the identity transform.
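For concreteness, composing R from the roll, pitch, and yaw angles can be sketched as (illustrative numpy, using the axis convention above):

```python
import numpy as np

def rotation_matrix(roll, pitch, yaw):
    """R(phi, theta, psi): roll about z, pitch about x, yaw about y."""
    c, s = np.cos(roll), np.sin(roll)
    Rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    c, s = np.cos(pitch), np.sin(pitch)
    Rx = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    c, s = np.cos(yaw), np.sin(yaw)
    Ry = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return Rz @ Rx @ Ry

R = rotation_matrix(0.1, -0.2, 0.3)
```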
Expanding to show the incremental update yields:
\left\| (1+\alpha)\, I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) + \beta - (1+\Delta\alpha)\, A_0(\mathbf{N}(\mathbf{W}(\mathbf{x}; \Delta\mathbf{p}); \Delta\mathbf{q})) - \Delta\beta \right\|^2_{\mathrm{span}(A_i)^\perp}
+ K \sum_i \left\| \mathbf{N}(\mathbf{W}(\mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q}); \Delta\mathbf{p}); \Delta\mathbf{q})_i - \left[ (1+f+\Delta f) \begin{bmatrix} 1 & -\Delta\varphi & \Delta\psi \\ \Delta\varphi & 1 & -\Delta\theta \\ -\Delta\psi & \Delta\theta & 1 \end{bmatrix} \mathbf{R}(\varphi, \theta, \psi) \left( \tilde{\mathbf{s}}_0 + \sum_j (\sigma_j + \Delta\sigma_j)\, \tilde{\mathbf{s}}_j + \sum_j (\tau_j + \Delta\tau_j)\, \tilde{\mathbf{a}}_j \right)_i + \mathbf{t} + \Delta\mathbf{t} \right] \right\|^2 .   (82)
Taylor series expansion around [Δb Δq Δp Δr Δσ Δτ] = 0 results in:
\left\| (1+\alpha)\, I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) + \beta - A_0(\mathbf{x}) - A_0(\mathbf{x})\,\Delta\alpha - \Delta\beta - \nabla A_0 \frac{\partial \mathbf{N}}{\partial \mathbf{q}} \Delta\mathbf{q} - \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \Delta\mathbf{p} \right\|^2_{\mathrm{span}(A_i)^\perp}
+ K \sum_i \left\| \mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q})_i + \frac{\partial \mathbf{N}}{\partial \mathbf{q}} \Delta\mathbf{q} + \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \Delta\mathbf{p} - \tilde{\mathbf{v}}_i^{xy} - \mathbf{t} - \begin{pmatrix} -\tilde{v}_{iy} \\ \tilde{v}_{ix} \end{pmatrix} \Delta\varphi - \begin{pmatrix} 0 \\ -\tilde{v}_{iz} \end{pmatrix} \Delta\theta - \begin{pmatrix} \tilde{v}_{iz} \\ 0 \end{pmatrix} \Delta\psi - \frac{\tilde{\mathbf{v}}_i^{xy}}{1+f}\, \Delta f - (1+f)\, \mathbf{R}(\varphi, \theta, \psi) \sum_j \tilde{\mathbf{s}}_j\, \Delta\sigma_j - (1+f)\, \mathbf{R}(\varphi, \theta, \psi) \sum_j \tilde{\mathbf{a}}_j\, \Delta\tau_j - \Delta\mathbf{t} \right\|^2 ,   (83)

where

\tilde{\mathbf{v}} = (1+f)\, \mathbf{R}(\varphi, \theta, \psi) \left[ \tilde{\mathbf{s}}_0 + \sum_i \sigma_i \tilde{\mathbf{s}}_i + \sum_i \tau_i \tilde{\mathbf{a}}_i \right] .   (84)
The Jacobian for the 2D portion (evaluated per pixel) is
\mathbf{J}_\alpha(\mathbf{x}) = -A_0(\mathbf{x})
\mathbf{J}_\beta(\mathbf{x}) = -1
\mathbf{J}_\mathbf{q}(\mathbf{x}) = -\nabla A_0 \frac{\partial \mathbf{N}}{\partial \mathbf{q}}
\mathbf{J}_\mathbf{p}(\mathbf{x}) = -\nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}}
\mathbf{J}_\mathbf{r}(\mathbf{x}) = \mathbf{J}_\sigma(\mathbf{x}) = \mathbf{J}_\tau(\mathbf{x}) = \mathbf{0}   (85)
Note that all these quantities can be computed at initialization. The Jacobian for the 3D portion
(evaluated per vertex) is
\mathbf{J}_\mathbf{b}(\mathbf{v}) = \mathbf{0}
\mathbf{J}_\mathbf{q}(\mathbf{v}) = K\, \frac{\partial}{\partial \mathbf{q}} \mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q})
\mathbf{J}_\mathbf{p}(\mathbf{v}) = K\, \frac{\partial}{\partial \mathbf{p}} \mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q})
\mathbf{J}_\varphi(\mathbf{v}) = -K \left( -\tilde{v}_y,\ \tilde{v}_x \right)^T
\mathbf{J}_\theta(\mathbf{v}) = -K \left( 0,\ -\tilde{v}_z \right)^T
\mathbf{J}_\psi(\mathbf{v}) = -K \left( \tilde{v}_z,\ 0 \right)^T
\mathbf{J}_f(\mathbf{v}) = -K\, \frac{\tilde{\mathbf{v}}^{xy}}{1+f}
\mathbf{J}_\mathbf{t}(\mathbf{v}) = -K
\mathbf{J}_\sigma(\mathbf{v}) = -K (1+f)\, \mathbf{R}(\varphi, \theta, \psi)\, \tilde{\mathbf{s}}
\mathbf{J}_\tau(\mathbf{v}) = -K (1+f)\, \mathbf{R}(\varphi, \theta, \psi)\, \tilde{\mathbf{a}}   (86)
The Jacobians for the 3D term generally do depend on the model parameters, so they must be
recomputed each time, but since there are fewer vertices than pixels, this is not nearly as
expensive as recomputing the 2D Jacobians each iteration.
The Hessian must be recomputed each iteration as well, since it depends on the
Jacobians. As described in [4], the Hessian for the sum of the 2D and 3D terms is the sum of the
2D Hessian (which is constant) and the 3D Hessian (which is not). This combined Hessian must
then be inverted each iteration to solve for the incremental parameters.
5.3.3 AAM Experimental Results
The solution of any gradient descent algorithm is dependent on the initial guess. For
now, we initialize the translation and size using the center and width of the rectangle returned by
the face detection module, and initialize the orientation angles to zero. Further refinements will
be discussed later on. We apply progressive transformation complexity, as suggested by
Matthews and Baker [67], solving for the shape normalization parameters q, before the shape
variation parameters p. This gets the mesh in the correct location and orientation before
adapting to the non-rigid changes. Convergence is declared when the change in the parameters
between iterations is less than a threshold.
Figure 105 shows several iterations of the fitting process. Of these images, only the last
two without the wireframe overlay are computed as part of the process. The others are for
visualization and debug only. In each set, (a) shows the 2D mesh warped by the current
estimates for p and q overlaid on the original image. In addition, the result of the face detection
module is shown with a rectangle. The 3D mesh warped by the current estimates for r, σ, and τ
is overlaid on the original image and shown in (b). The input image warped by p and q is in (c),
with the mean shape mesh added for visualization. The error image, I(N(W(x; p); q)) − A_0(x), is
displayed in (d).
The first row shows the initial guess. Note that the mouth is open in (a) and closed in (b).
This is due to the mean 2D shape having the mouth part way open (because it averaged shapes
with the mouth closed and the mouth fully open) and the base 3D shape model having the mouth
closed. The animation parameters τ are adjusted during the first iteration so that the 2D and 3D
meshes look alike in subsequent frames. The second row shows the intermediate result after 5
iterations, when the q parameters (the 2D shape normalization) have converged. It takes 8 more
iterations for the p parameters to converge, as shown in the third row. Since we used the 3D
model to constrain the 2D shape to valid projections of the 3D shape, we can recover the 3D
position, size, and orientation of the resulting fit, providing the parameters needed for
augmentation.
(a) (b) (c) (d)
Figure 105: Iterations 0, 5, and 12 of fitting the AAM. Column (a) is the 2D shape on the original image with the face detection result shown by a rectangle. Column (b) is the 3D shape on the original image. The warped input image I(N(W(x; p); q)) with the mesh overlaid is shown in column (c), and the error image I(N(W(x; p); q)) − A_0(x) is shown in column (d).
The example above shows the fitting process for a single image. We can refine the initial
guess, and therefore reduce the number of iterations required by employing the initial pose
estimation technique presented in Section 5.2. In order to use AAMs to track a face efficiently
in video, we need to provide the best possible guess for starting each subsequent frame. Using
the results from frame i to initialize frame i + 1 resulted in an average of 9 iterations per frame.
By applying the translation obtained by point tracking (from Section 5.1.2), that dropped to 3
iterations per frame. Using the translation, rotation, and scale obtained from point tracking
further improved the initial guess so that only 2 iterations were required on average. The cost of
point tracking is negligible compared to a single iteration of AAM. We further reduced the
model complexity by making the 3D shape constant after the first frame. The 3D animation
units are still updated to reflect expression changes, but the 3D shape units that reflect the rigid
head shape are fixed after the initial fit for a given head instance.
5.3.4 Optimization
Extensive optimization was done with the aid of the profiler in Microsoft Visual Studio
6.0. This optimization achieved a speedup of nearly 50, and made the difference between an
interactive application and one that runs too slowly for interaction. In this section, we will
discuss the kinds of improvements that were made to achieve this performance boost. The
profiler results that follow were measured on a single-processor 1.7 GHz Pentium 4. For
repeatable testing of the algorithms without the overhead of capturing the live video and
displaying it, we used a timing test application that processed frames from a movie file, calling
the same functions as the live application. The use of a profiler is crucial to optimization in
order to identify the operations that take the most time and to verify the speed improvement
achieved at each step. For example, a 10% improvement in a function that takes 50% of the
processing time yields a 5% improvement in overall speed, but making a function that takes 1%
of the time 10 or even 100 times faster will not yield more than a 1% improvement in
performance of the whole application.
In the first profiler run, using the first 50 frames of the 720x480 resolution video, motion
detection was reported to take 538 ms. per frame, face detection took 92 ms. per frame, and the
AAM took 263 ms. per iteration (not including initialization). The first round of optimization
replaced the most frequently used simple functions with inline code, thus eliminating the
function call overhead. Good candidates for this optimization were class methods that simply
returned class member variables, because these functions were called frequently, yet were only
one or two lines of code. This optimization makes the code run faster while still maintaining the
object oriented control of the data within the class. One example is
CActiveAppearanceModel::GetTriangle(), which returns the indices of a triangle within the
AAM mesh. This function was called 79 million times, and took 11.4% of the total time.
Data was converted to more computationally efficient quantities, like 32-bit fixed point
integer images instead of floating point, and storing the variance for the motion model instead of
the standard deviation. The integer images allowed operations on the images (like dot products)
to be done with integer arithmetic, which is typically significantly faster than floating point
math. Care was taken to be sure that the precision was sufficient for the algorithm to still work,
and that the integer values did not overflow.
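The flavor of this conversion can be sketched as follows (a hypothetical Python illustration; the dissertation's code operated on 32-bit integer images in C++, and a wide integer type stands in for them here):

```python
import numpy as np

FRAC_BITS = 8  # fractional bits; chosen so precision suffices and sums fit

def to_fixed(img):
    """Convert a float image to fixed point (a wide integer type stands
    in for the 32-bit integer images of the real implementation)."""
    return np.round(img * (1 << FRAC_BITS)).astype(np.int64)

def fixed_dot(a_fixed, b_fixed):
    """Integer dot product; the raw sum carries 2*FRAC_BITS fractional
    bits, so scale back once at the end."""
    return int((a_fixed * b_fixed).sum()) / float(1 << (2 * FRAC_BITS))

a = np.array([0.5, 1.25, -2.0])
b = np.array([4.0, 0.25, 1.0])
approx = fixed_dot(to_fixed(a), to_fixed(b))
```

All of the per-pixel work stays in integer arithmetic; the single float division at the end restores the scale.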
The full resolution frame was needed for tracking, but not for motion detection or face
detection, so these two operations were done on 360x240 images. The additional step of
subsampling the high resolution image to generate the smaller image was less computationally
expensive than the time to perform these two operations on four times as many pixels.
Another type of improvement sacrificed accuracy for speed, after verifying that the
simpler operation still had sufficient fidelity for the algorithm to converge correctly. In
particular, after making some of the aforementioned optimizations, the OpenCV library function
cvGetRectSubPix() was taking the longest. This function uses interpolation to find the pixel
value at non-integer coordinates, and was used while warping the input image by the current
parameter estimate. Using the nearest neighbor pixel instead of performing the interpolation
resulted in a 12% speed improvement, with the algorithm still converging at the correct answer.
Not all attempts were this successful. We tried to reduce the size of the images used for the
gradient descent process, starting with the mean appearance, from 256x256 to 128x128, but
found that the algorithm tended to not converge at that resolution.
After the code was fast enough to run 100 frames for the timing test, motion detection
took 91 ms. per frame (6 times speedup), face detection took 33 ms. (2.8 times speedup), and the
AAM took 20 ms. per iteration (13 times speedup). The most expensive component of each
AAM iteration was the dot product between the error image and each of the steepest descent
images for the 2D shape fitting. Optimization was done on this function, ensuring all math was
integer and skipping the borders outside the envelope of the valid error image. This was an
example where there were tradeoffs between reducing and simplifying operations. Image
operations like finding the difference between the warped input image and the mean image or
finding the dot product between two images are only valid for the pixels inside the mean shape.
There are pixels around the edges that are always zero, and do not affect the result of the
operation. Operating on all the pixels results in simpler code, since no conditional test is
required to identify which pixels are inside the mean shape. On the other hand, testing a mask
image to determine which pixels are potentially non-zero disrupts the pipelined operation of the
processor in addition to the extra instruction required for each pixel, so each loop iteration takes
longer. In our case, we found that the code ran slower using the mask. On the other hand, it
doesn’t cost anything to start at the first non-zero pixel and end at the last non-zero pixel, instead
of the corners of the image, since that only changes the loop start and end values. This is what
we used to achieve the best performance of the image processing steps like dot products.
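The loop-bound idea can be sketched as (an illustrative Python sketch; the real code operated on C++ image buffers):

```python
import numpy as np

def row_bounds(mask):
    """Per row: first and one-past-last nonzero columns, computed once,
    so inner loops skip the always-zero borders with no per-pixel test."""
    bounds = []
    for row in mask:
        nz = np.flatnonzero(row)
        bounds.append((nz[0], nz[-1] + 1) if nz.size else (0, 0))
    return bounds

def masked_dot(a, b, bounds):
    """Dot product over the precomputed bounds only; just tighter loop
    limits, no mask lookup inside the loop."""
    total = 0.0
    for y, (x0, x1) in enumerate(bounds):
        total += float(np.dot(a[y, x0:x1], b[y, x0:x1]))
    return total

mask = np.array([[0, 1, 1, 0],
                 [0, 0, 0, 0],
                 [1, 1, 1, 1]])
a = np.arange(12, dtype=float).reshape(3, 4) * mask  # zero outside the mask
b = np.ones((3, 4))
result = masked_dot(a, b, row_bounds(mask))
```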
Memory allocation has traditionally been a time consuming process. Although modern
techniques have decreased this cost, for optimal performance it is better to allocate temporary
arrays needed each frame or each iteration of the algorithm at initialization. This saves both the
allocation and the deallocation time.
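A minimal illustration of this preallocation pattern (the class and buffer names are hypothetical, with NumPy standing in for the image library):

```python
import numpy as np

class AAMFitter:
    """Sketch of the preallocation pattern: scratch images needed every
    frame are allocated once here, not inside the per-frame loop."""

    def __init__(self, height, width):
        self.warped = np.empty((height, width), dtype=np.int32)
        self.error = np.empty((height, width), dtype=np.int32)

    def iterate(self, frame_region, mean_image):
        # Reuse the preallocated buffers (out=...) instead of letting
        # each call allocate and free new temporaries.
        np.copyto(self.warped, frame_region)
        np.subtract(self.warped, mean_image, out=self.error)
        return self.error
```

Each call writes into the same buffers, so neither allocation nor deallocation appears in the per-frame cost.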
The order of the operations was examined to avoid recalculating the same quantities. For
example, the vertex positions after warping by the current parameter estimate were needed for
both the 2D and 3D portions of the calculation. These vertex positions were calculated once and
used for both. A related optimization was to factor the equations efficiently. Each of the terms of the 3D Jacobian contains the factor K. In the process of computing the dot product between the
Jacobian and the warped vertices, the terms are summed. This can be implemented more
efficiently by computing the Jacobians without the factor K, then multiplying the dot product
result by K.
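The factoring rests on the identity sum(K·J_i · v_i) = K·sum(J_i · v_i), which can be checked numerically; J stands for the Jacobian terms computed without K and v for the warped vertex positions (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2.5                    # shared constant factor in every Jacobian term
J = rng.random((10, 3))    # Jacobian terms computed *without* the factor K
v = rng.random((10, 3))    # warped vertex positions

# Naive: multiply every Jacobian term by K, then take the dot product.
naive = np.sum((K * J) * v)

# Factored: a single multiplication by K after summing gives the same
# result, replacing one multiply per term with one multiply at the end.
factored = K * np.sum(J * v)

assert np.isclose(naive, factored)
```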
By the time the test was increased to 200 frames, motion detection took 16 ms. (33 times
speedup) and the AAM took 10.7 ms. per iteration (24 times speedup). A general equation
solving function was replaced with the explicit solution for finding the affine transformations for
the triangle warps. The function was called for every triangle in each iteration of the AAM
fitting process. This avoided validity checks on the input arguments, tests to determine the type
of solving algorithm needed, and a couple of levels of function call overhead. This particular
solution required inverting two 3x3 matrices, which can be done with closed form equations.
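The core of the closed-form triangle solve can be written out directly from the cofactors of a 3x3 matrix. This NumPy sketch shows the idea for one matrix (the dissertation's version inverted two such matrices in optimized C; the function name is illustrative):

```python
import numpy as np

def triangle_affine(src, dst):
    """Affine transform mapping triangle `src` onto triangle `dst`.

    With M = [[x0,y0,1],[x1,y1,1],[x2,y2,1]] built from the source
    vertices, the affine matrix is A = inv(M) @ D. Writing inv(M) out
    from its cofactors avoids a general equation solver: no argument
    validity checks, no algorithm-selection tests, no call overhead.
    """
    (x0, y0), (x1, y1), (x2, y2) = src
    det = x0 * (y1 - y2) - y0 * (x1 - x2) + (x1 * y2 - x2 * y1)
    inv_m = np.array([
        [y1 - y2, y2 - y0, y0 - y1],
        [x2 - x1, x0 - x2, x1 - x0],
        [x1 * y2 - x2 * y1, x2 * y0 - x0 * y2, x0 * y1 - x1 * y0],
    ]) / det
    return inv_m @ np.asarray(dst, dtype=float)  # 3x2 affine matrix A
```

Applying [x, y, 1] @ A to each source vertex reproduces the corresponding destination vertex.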
Use of appropriate data structures also contributed to this phase of optimization.
Computing the parameter update for the inverse compositional algorithm requires averaging the
warp for each vertex over all the triangles that contain the vertex. Since the connectivity of the
mesh does not change, a data structure listing the triangles that use each vertex allows for more
efficient execution than traversing the whole triangle list for each vertex.
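A sketch of that data structure (names are illustrative): the vertex-to-triangle lists are built once at initialization and reused every iteration.

```python
def build_vertex_to_triangles(triangles, n_vertices):
    """Precompute which triangles touch each vertex.

    The mesh connectivity never changes, so this list is built once at
    initialization; the per-iteration parameter update can then average
    the warp over exactly the relevant triangles instead of scanning the
    whole triangle list for every vertex.
    """
    vertex_to_tris = [[] for _ in range(n_vertices)]
    for tri_idx, (a, b, c) in enumerate(triangles):
        for v in (a, b, c):
            vertex_to_tris[v].append(tri_idx)
    return vertex_to_tris
```

Lookup per vertex then costs only the handful of incident triangles rather than a pass over the whole mesh.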
For further streamlining, we created a version of the face detection module that
terminates after finding the first face. This aids the track verification process, which confirms
that the object being tracked is actually a face. In this case, the location and size of the face are not needed. Because tracked regions are verified by this separate process, we remove the areas currently being tracked from the region where new faces may be detected.
In the last run, motion detection took 11.5 ms. per frame (47 times speedup), face
detection took 24.7 ms. per frame (3.7 times speedup), and the AAM took 5.7 ms. per iteration
(46 times speedup). As a final improvement, we decreased the update frequency of the major
components. On even frames, motion detection and face detection are done and the high contrast
points are used to update the tracking parameters for existing faces. On odd frames, the AAM
model is updated. The resulting track is smooth, and enables real time operation.
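The alternating schedule can be sketched as below; `tracker` is a hypothetical object bundling the components described above, with illustrative method names:

```python
def process_frame(frame_idx, frame, tracker):
    """Alternate the expensive components across frames: detection work
    on even frames, AAM refinement on odd frames."""
    if frame_idx % 2 == 0:
        motion = tracker.detect_motion(frame)
        tracker.detect_faces(frame, search_area=motion)
        tracker.update_tracks_from_contrast_points(frame)
    else:
        tracker.fit_aam(frame)
```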
We experimented with adding skin detection as an additional filter to reduce the search
area for new faces, but in a room with white walls, the expense of performing color-based skin
detection exceeds the savings.
On a 2.2 GHz AMD Opteron processor, the application runs at 10 to 15 Hz, tracking one
face in live 720x480 video. Since multiple factors affect the execution rate, including the
amount of motion in the scene and the rate of change of the tracked faces, it is difficult to give a
more precise execution rate.
5.3.5 Conclusion
As with any minimization process, it is possible to end up with a local minimum instead
of the global minimum, converging to an incorrect solution. Our integration of detection and
tracking, discussed in Section 5.1, handles this situation. If the AAM converges to the wrong
solution, the resulting region is likely to fail the face detector, so the face is removed from the
list. A new face will be detected in the next frame, reinitializing the process.
Although AAMs were initially too slow to be useful for real-time face tracking
applications, the evolution of the inverse compositional image alignment method as well as the
ever-increasing processing power in consumer PCs has made it practical. The original 2D-only
AAM also did not provide 3D head orientation, but adding 3D constraints to the original 2D
model has overcome that hurdle as well. We have implemented the existing AAM method and
extended it by simplifying the 2D AAM shape generation process so it requires orders of
magnitude less input data and hand processing. Our modification using an existing non-rigid 3D
head model allows meaningful facial expressions to be extracted from the tracking process,
which was not possible with the prior 2D+3D AAM, making this a useful tool for augmented
reality applications that track human faces.
5.4 Augmentation
Augmentation of a human face adds something that is attached to the head. It can be
external, such as a hat or glasses; on the surface, like a tattoo; or it can modify the face, for
example by substituting a different nose. The pose determined in the previous step must be
stable and accurate to position the virtual piece properly as the head moves.
We have used a simple eyeglass model and a wireframe graduation cap to demonstrate
the concept, but these could easily be replaced with any three dimensional model. The size,
position, and orientation of the face are applied as the virtual model is rendered on top of the
video image so it looks like it is attached to each face in the video. A sample frame is shown in
Figure 106.
Figure 106: Video frame augmented with cap and glasses.
5.5 Conclusions
Augmented reality is a rapidly expanding field that uses both computer vision and
computer graphics. Its interactive nature draws people in as they react to virtual objects added to
the real environment, but also creates challenges in achieving real-time update rates and precise
registration between virtual and real objects.
In our work, we have expanded the capabilities of augmented reality tracking systems
that use minimal hardware with no specialized environment. This simplified setup is suitable for
consumer applications targeted to people’s homes. It also places more of a burden on the
computer vision portion of the system to detect the applicable characteristics of the surroundings
solely from a single video input in an uncontrolled environment, where the background is
unknown and the lighting may vary.
This second augmented reality application detects and tracks human faces. Faces have a
variety of colors, shapes, and additions such as glasses and facial hair. They are also non-rigid, with local changes including jaw movement, eyes opening and closing, and lip motion. The faces
in the image are augmented in real time with virtual glasses, but this can easily be any virtual
object.
We have combined existing methods for face detection, motion detection, and tracking in
an integrated system that detects faces with higher precision and recall, handles a variable
number of faces, and maintains tracks of only those objects that are actually faces. Most face
tracking applications assume exactly one face is present that fills the majority of the screen, but
ours handles a much wider range of conditions.
We have presented a novel algorithm for computing the initial rotation of a detected face
quickly and accurately. We extended earlier symmetry measures to handle unstructured point
sets without the initial step of dividing the points into corresponding pairs. The use of high
contrast points in the detected face region speeds up the algorithm since image coordinates are
used instead of costly image correlations. The points used by the algorithm can be used later to
track the face.
The accuracy achieved by the method is sufficient for visual applications, i.e., those that
are judged by human eyes, including mixed reality and game control from video input.
Additional accuracy typically requires more complex computation, which is not feasible in real-
time applications.
We have implemented Active Appearance Models to refine the face pose estimation and
have presented extensions to the method that simplify model generation as well as provide
meaningful parameters as a result of the tracking to describe the 3D head orientation as well as
facial expressions. These parameters are suitable for video compression, such as MPEG-4.
This provides a critical step in the development of an augmented reality application that
automatically detects faces and determines their pose without prior calibration or manual
intervention.
6 CONCLUSIONS
6.1 Contributions
We have made several contributions while creating these virtual looking glass
applications, including
• Color space selection experiments. Theory and practice for color constancy in the
presence of varying illumination in different color spaces do not always agree. We have
performed experiments measuring the color values recorded by consumer cameras in
varying illumination. From this, a number of color spaces were compared to see how
well they remained constant as the light changed, yet still discriminated between
different colors. We found that some less well-known color spaces perform better than
the traditional ones.
• A real-time algorithm for tracking simple objects in monocular video. Many common
algorithms that track object silhouettes only work for objects with complex feature sets
and run too slowly for interactive use. Our tracking method uses color, edge, and
motion information to achieve real-time rates.
• Optimization methods for real time video processing. Careful optimization can make
the difference between an implementation being “fast enough” or not. We have
achieved speedups of two to ten on various functions involved in tracking a rectangle
and speedups close to 50 for motion detection and face tracking, even starting with an
optimized image processing library.
• Fast video deinterlacing. Consumer level video cameras are interlaced, with the even
and odd lines captured at different times. Moving objects appear to tear when displayed
on a noninterlaced computer display. Our deinterlacing algorithm is fast, maintains the
detail in stationary parts of the frame, and removes the tearing in the moving parts of the
frame.
• An extended continuous symmetry measure that handles not only shapes and graphs but general 2D point sets as well.
• Real-time initial face pose estimation in video. After a face is detected, its orientation
must be determined. This is often done manually, which is not suitable for a general
augmented reality application. Other methods perform costly correlations to find facial
features. We present a novel method for recovering face orientation using generic high
contrast points. Using the coordinates of these points results in a faster algorithm than
using appearance comparisons, and sufficient accuracy, outperforming methods based
on skin detection.
• Enhanced Active Appearance Models (AAMs) for face tracking. Existing methods for
AAMs require more than a hundred manually marked points on each of several hundred
images to create the 2D and 3D models used for tracking, and the parameters extracted
during tracking do not have meaningful labels. By incorporating a parameterized 3D
nonrigid head model, we reduced the manual labeling effort to 30 points on tens of
images, and derive meaningful parameters from the result, such as “jaw drop” or
“eyebrows raised”, while still maintaining the robustness afforded by AAMs.
• Robust face tracking for augmented reality. We integrate motion detection, face
detection, and tracking to create a system that can track multiple faces, filter out
spurious detections in dynamic regions, eliminate detections in static regions, and verify
the track of faces outside the native range of the face detector.
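The motion-adaptive idea behind the deinterlacing contribution above can be illustrated with a generic sketch (not necessarily the exact algorithm of this dissertation): keep the full-resolution pixel where the two fields agree, and interpolate vertically where they differ.

```python
import numpy as np

def deinterlace(frame, threshold=20):
    """Generic motion-adaptive deinterlacing sketch: where the two fields
    agree, the full-resolution pixel is kept (preserving detail in
    stationary areas); where they differ (motion), the second-field line
    is replaced by the average of the lines above and below it, removing
    the tearing on moving objects.
    """
    out = frame.astype(np.int32)
    for y in range(1, out.shape[0] - 1, 2):           # second-field lines
        interp = (out[y - 1] + out[y + 1]) // 2       # vertical average
        moving = np.abs(out[y] - interp) > threshold  # fields disagree
        out[y] = np.where(moving, interp, out[y])
    return out.astype(frame.dtype)
```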
6.2 Future Directions
The augmented reality application that tracks a rectangle described in Chapter 4 is well
suited to use in a mixed reality game. For example, the rectangle could be used as a virtual
paddle in a mixed reality tennis game. In order for this to work well, several improvements are
needed. First, the system needs to recognize when tracking is lost and initiate a recovery
process. If a part of the scene is mistakenly labeled as part of the rectangle, the color model is
updated to include the new colors, which can quickly lead to the whole frame being labeled as
part of the rectangle. In addition to efforts to avoid this situation in the first place, there needs to be logic to detect the failure and recover, using the shape model and the original trained color.
The augmented reality application that detects and tracks faces is also useful in a mixed
reality game situation, as well as human-computer interface and surveillance tasks. Face
detection still takes a large proportion of the available time. We have implemented several
methods to limit the area where detection is needed, including motion detection and skin
detection, but faster detection methods are needed. While the rectangular features used by the
cascaded AdaBoost system are simple to compute, the requirement for scanning all possible
locations at a range of scales still makes face detection in a high resolution image a slow process.
The active appearance model has been shown to be effective for rigid and non-rigid
tracking of objects such as faces. With our use of a parameterized head model, the non-rigid
motion parameters have semantic labels, so operations such as controlling a cursor with jaw and
eyebrow motions are possible with AAMs. However, the speed of the iterative fitting process is still
an issue on consumer hardware. The most computationally expensive operations of each
iteration are computing the error image and calculating the dot product of the warp Jacobians
and the error image. The error image is created by warping the input image and subtracting the
mean image. All of these steps can be done efficiently on a graphics processing unit (GPU).
The bottleneck in many non-graphics uses for a GPU is transferring the results back to the CPU,
since the graphics pipeline is optimized for transfers to the GPU. In this case, the images are
small enough (the mean images we used were 256 by 256) that porting to the GPU will likely
result in faster computation.
Another area in which AAMs can be improved is increasing the range of environments in
which the algorithm will work. The CMU group has been working to use AAMs to track faces
where parts of the face are occluded, whether self-occlusion from rotation or occlusion by another object [46][95]. There is also a need for the algorithm to work on lower resolution
images. Current work in that area includes [76] and [103].
Our current work has included face detection and tracking, but not face recognition.
However, many of the elements needed for face recognition are extracted during the process of
fitting the AAM model. The shape parameters recovered from fitting the parameterized head
model to the image provide anthropometric measurements like the height and separation of the eyes
and width of the mouth. These measurements are likely to remain constant for an individual in spite of environmental factors like lighting, normal appearance changes like glasses, and even intentional disguises. Pose changes usually make measuring these quantities challenging, but
our AAM fitting process accounts for the 3D orientation when fitting the shape parameters. The
other common technique for face recognition is to find the best match between the observed
image and a database of frontal faces normalized in size and brightness. By solving for the λ
(appearance) parameters in the active appearance model, we should be able to do even better
than the standard methods because the AAM appearances have been normalized for pose, face
shape, and expression as well as just size and brightness.
Other more general improvements to the system would be to track the whole body
instead of just the face, to handle a moving camera, and to recognize actions based on the
tracking, to name a few.
Speed improvements are a never-ending cycle, and robustness to occlusion, lighting, and
low resolution are ongoing challenges as well. There is much more that can be done in the area
of augmentation, including more complex graphics, shadows on the virtual objects, and objects
that conform to the face as it changes shape.
The areas explored here have wide application, both in augmented reality and in more
general computer vision fields. Face detection and tracking are widely used in many domains,
including surveillance and security applications, video conferencing, video coding (MPEG-4 is
object-based, not block-based), video indexing (a factor used in classifying scenes is the number
of people present), and human-computer interfaces. In a live teleconference, the speaker’s
identity could be shielded by replacing the texture map or shape with a different one. This
would allow the pose and facial expressions to be communicated using a different face. The
speed and accuracy requirements for these uses vary, but all benefit from the high standard of
interactive rates and precision needed for augmented reality.
The augmented mirror applications have the potential to bring augmented reality into consumers' homes. They have appeal both for pure entertainment and for subtly
conveying information, like the weather forecast (sun hat vs. rain hat) or whether your stock
portfolio is up or down (top hat vs. dunce cap). If extended to recognize individuals, it could put gender-appropriate hats on guests and different hats on different family members, with special hats for special days, like birthdays.
Going beyond entertainment, virtual hairstyles have been applied to images, but what if
the virtual hairstyle was 3D and interactive? Plastic surgeons could use an augmented mirror to
show what a patient would see in the mirror after surgery. Extending from the head to a full
body would open the door to a virtual dressing room, where the mirror shows interactively what
an outfit would look like. The possibilities are truly endless.
REFERENCES
[1] http://www.ai.mit.edu/projects/medical-vision/surgery/surgical_navigation.html
[2] http://artoolkit.sourceforge.net/
[3] M.F. Augusteijn and T.L. Skufca, “Identification of Human Faces through Texture-Based Feature Recognition and Neural Network Technology,” Proc. IEEE Conf. Neural Networks, pp. 392-398, 1993.
[4] S. Baker, R. Gross, and I. Matthews, “Lucas-Kanade 20 Years On: A Unifying Framework: Part 4,” Technical Report CMU-RI-TR-04-14, Robotics Institute, Carnegie Mellon University, 2004.
[5] S. Baker and I. Matthews, “Lucas-Kanade 20 Years On: A Unifying Framework: Part 1: The Quantity Approximated, the Warp Update Rule, and the Gradient Descent Approximation,” International Journal of Computer Vision, 56(3), pp. 221-255, March 2004.
[6] V. Bakic and G. Stockman, “Real-time Tracking of Face Features and Gaze Direction Determination,” Proc. IEEE Workshop on Applications of Computer Vision, pp. 256-257, 1998.
[7] K. Barnard, V. Cardei, and B. Funt, “A Comparison of Computational Color Constancy Algorithms – Part I: Methodology and Experiments with Synthesized Data,” IEEE Trans. on Image Processing, 11(9), pp. 972-983, Sept. 2002.
[8] K. Barnard, F. Ciurea and B. Funt, “Sensor sharpening for computational color constancy,” J. Opt. Soc. Am. A, 18(11), pp. 2728-2743, Nov. 2001.
[9] K. Barnard, G. Finlayson, and B. Funt, “Color Constancy for Scenes with Varying Illumination,” Computer Vision and Image Understanding, 65(2), pp. 311-321, 1997.
[10] K. Barnard, L. Martin, A. Coath, and B. Funt, “A Comparison of Computational Color Constancy Algorithms – Part II: Experiments With Image Data,” IEEE Trans. on Image Processing, 11(9), pp. 985-996, Sept. 2002.
[11] S. Basu, I. Essa and A. Pentland, “Motion Regularization for Model-Based Head Tracking,” Proc. International Conf. Pattern Recognition, Vol. 3, pp. 611-616, Aug. 1996.
[12] http://www.bk.isy.liu.se/candide/
[13] J.-Y. Bouguet, “Pyramidal Implementation of the Lucas Kanade Feature Tracker,” Intel Technical Report, included with OpenCV distribution, 2000.
[14] G. Bradski, A. Kaehler, and V. Pisarevsky, “Learning-Based Computer Vision with Intel’s Open Source Computer Vision Library,” Intel Technology Journal, 9(2), pp. 119-130, 2005.
[15] G. Buchsbaum, “A Spatial Processor Model for Object Colour Perception,” J. of the Franklin Institute, vol. 310, pp. 1-26, 1980.
[16] J.B. Burns, A.R. Hanson and E.M. Riseman, “Extracting Straight Lines,” IEEE Trans. Pattern Analysis and Machine Intelligence, 8(4), pp. 425-455, July 1986.
[17] J. Canny, “Finding Edges and Lines in Images,” Tech Report TR-720, Artificial Intelligence Lab, MIT, Cambridge, MA, 1983.
[18] F. Ciurea and B. Funt, “Tuning Retinex Parameters,” J. Electronic Imaging, 13(1), pp. 58-64, Jan. 2004.
[19] E.M. Coelho, B. MacIntyre, and S.J. Julier, “Supporting Interaction in Augmented Reality in the Presence of Uncertain Spatial Knowledge,” Proc. ACM Symposium on User Interface Software and Technology, pp. 111-114, 2005.
[20] T.F. Cootes, G.J. Edwards, and C.J. Taylor, “Active Appearance Models,” Proc. European Conference on Computer Vision, vol. 2, pp. 484-498, 1998.
[21] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, “Active shape models – their training and application,” Computer Vision and Image Understanding, 61(1), pp. 38-59, Jan. 1995.
[22] M. Corbalán, M.S. Millán, and M.J. Yzuel, “Color measurement in standard CIELAB coordinates using a 3CCD camera: correction for the influence of the light source,” Optical Engineering, 39(6), pp. 1470-1476, June 2000.
[23] T. Darrell, G. Gordon, J. Woodfill, and M. Harville, “A Virtual Mirror Interface using Real-time Robust Face Tracking,” Proc. 3rd Int’l Conf. on Face and Gesture Recognition, pp. 616-621, 1998.
[24] http://www.etc.cmu.edu/projects/magicmirror/index.html
[25] F. Dornaika and J. Ahlberg, “Fast and Reliable Active Appearance Model Search for 3-D Face Tracking,” IEEE Trans. Systems, Man, and Cybernetics – Part B: Cybernetics, 34(4), pp. 1838-1853, 2004.
[26] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry – Algorithms and Applications, Springer-Verlag, Berlin, 2000.
[27] R.O. Duda and P.E. Hart, “Use of the Hough Transform to Detect Lines and Curves in Pictures,” Communications of the ACM, 15(1), pp. 11-15, Jan. 1972.
[28] G.D. Finlayson, “Coefficient Color Constancy,” Ph.D. thesis, Simon Fraser University, School of Computing, 1995.
[29] G.D. Finlayson, M.S. Drew, and B.V. Funt, “Spectral sharpening: sensor transformations for improved color constancy,” J. Opt. Soc. Am. A, 11(5), pp. 1553-1563, 1994.
[30] G.D. Finlayson, S.D. Hordley, and P.M. Hubel, “Color by Correlation: A Simple, Unifying Framework for Color Constancy,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(11), pp. 1209-1221, Nov. 2001.
[31] G. Finlayson and G. Schaefer, “Colour indexing across devices and viewing conditions,” Proc. 2nd Int’l Workshop on Content Based Multimedia and Indexing, pp. 215-221, 2001.
[32] G. Finlayson and G. Schaefer, “Hue that is invariant to brightness and gamma,” Proc. British Machine Vision Conf., pp. 303-312, 2001.
[33] F. Fleuret and G. Geman, “Fast Face Detection with Precise Pose Estimation,” Proc. IEEE International Conf. Pattern Recognition, Vol. 1, pp. 235-238, 2002.
[34] J.D. Foley, A. van Dam, S.K. Feiner and J.F. Hughes, Computer Graphics Principles and Practice, Addison-Wesley, Reading, MA, 1997.
[35] G.L. Foresti, C. Micheloni and L. Snidaro, “A Robust Face Detection System for Real Environments,” Proc. IEEE Int’l Conf. on Image Processing, pp. 897-900, 2003.
[36] D. Forsyth, “A novel algorithm for color constancy,” Int’l Journal of Computer Vision, vol. 5, pp. 5-36, 1990.
[37] Y. Freund and R.E. Schapire, “A Decision-theoretic Generalization of On-line Learning and an Application to Boosting,” J. Computer and System Sciences, 55(1), pp. 119-139, 1997.
[38] B. Funt, F. Ciurea and J. McCann, “Retinex in MATLAB,” J. Electronic Imaging, 13(1), pp. 48-57, Jan. 2004.
[39] B.V. Funt and G.D. Finlayson, “Color Constant Color Indexing,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(5), pp. 522-529, May 1995.
[40] J. Gausemeier, M. Grafe, C. Matysczok, and R. Radkowski, “Optical Tracking Stabilization using Low-Pass Filters,” Proc. IEEE Augmented Reality Toolkit Workshop, pp. 16-17, 2003.
[41] R. Gershon, A.D. Jepson and J.K. Tsotsos, “From [R,G,B] to Surface Reflectance: Computing Color Constant Descriptors in Images,” Perception, pp. 755-758, 1988.
[42] J.-M. Geusebroek, R. van den Boomgaard, A.W.M. Smeulders, and H. Geerts, “Color Invariance,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(12), pp. 1338-1350, Dec. 2001.
[43] J.-M. Geusebroek, R. van den Boomgaard, A.W.M. Smeulders, and A. Dev, “Color and Scale: The Spatial Structure of Color Images,” Proc. Sixth European Conf. Computer Vision, vol. 1, pp. 331-341, 2000.
[44] T. Gevers and A.W.M. Smeulders, “Color Based Object Recognition,” Pattern Recognition, vol. 32, pp. 453-464, 1999.
[45] V. Govindaraju, “Locating Human Faces in Photographs,” Int’l J. Computer Vision, 19(2), pp. 129-146, 1996.
[46] R. Gross, I. Matthews, and S. Baker, “Constructing and Fitting Active Appearance Models with Occlusion,” Proc. IEEE Workshop on Face Processing in Video, June 2004.
[47] G. Hager and P. Belhumeur, “Efficient region tracking with parametric models of geometry and illumination,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(10), pp. 1025-1039, 1998.
[48] C. Harris and M.A. Stephens, “A Combined Corner and Edge Detector,” Proc. 4th Alvey Vision Conference, pp. 147-151, 1988.
[49] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[50] G. Healey and D. Slater, “Global Color Constancy: Recognition of Objects by Use of Illumination-invariant Properties of Color Distributions,” J. Opt. Soc. Am. A, 11(11), Nov. 1994, pp. 3003-3010.
[51] Y. Hu, L. Chen, Y. Zhou, and H. Zhang, “Estimating Face Pose by Facial Asymmetry and Geometry,” Proc. IEEE International Conf. on Automatic Face and Gesture Recognition, 2004.
[52] C.-J. Huang, M.-C. Ho, and C.-C. Chiang, “Feature-based Detection of Faces in Color Images,” IEEE Int’l Conf. on Image Processing, pp. 2027-2030, 2004.
[53] S.-H. Huang and S.-H. Lai, “Real-time Face Detection in Color Video,” Proc. IEEE International Multimedia Modeling Conference, pp. 338-345, 2004.
[54] C. Huang, B. Wu, H. Ai, and S. Lao, “Omni-directional Face Detection Based on Real Adaboost,” IEEE International Conf. on Image Processing, pp. 593-596, 2004.
[55] http://www.intel.com/research/mrl/research/opencv/
[56] M. Isard and A. Blake, “CONDENSATION – conditional density propagation for visual tracking,” Int. J. Computer Vision, 29(1), pp. 5-28, 1998.
[57] R. Kjeldsen, J. Kender, “Finding Skin in Color Images,” IEEE 2nd Intl. Conf. on Automatic Face and Gesture Recognition, pp. 312-317, 1996.
[58] M. La Cascia, S. Sclaroff, and V. Athitsos, “Fast, Reliable Head Tracking under Varying Illumination: An Approach Based on Registration of Texture-Mapped 3D Models,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(4), pp. 322-336, 2000.
[59] E.H. Land, “Recent Advances in Retinex Theory and Some Implications for Cortical Computations: Color vision and the Natural Image,” Proc. Nat. Acad. Sci., vol. 80, pp. 5163-5169, Aug. 1983.
[60] E.H. Land, “The Retinex Theory of Color Vision,” Scientific American, 237(6), pp. 108-129, Dec. 1977.
[61] E.H. Land and J.J. McCann, “Lightness and Retinex Theory,” Journal of the Optical Society of America, vol. 61, pp. 1-11, 1971.
[62] V. Lepetit, L. Vacchetti, D. Thalmann, and P. Fua, “Fully Automated and Stable Registration for Augmented Reality Applications,” Proc. Int’l Symposium on Mixed and Augmented Reality, pp. 93-102, 2003.
[63] V. Lepetit, L. Vacchetti, D. Thalmann, P. Fua, “Real-Time Augmented Face,” Int’l Symposium on Mixed and Augmented Reality, pp. 346-347, 2003.
[64] K.-L. Low and A. Ilie, “View Frustum Optimization To Maximize Object’s Image Area,” Technical Report TR02-024, Department of Computer Science, UNC at Chapel Hill, 2002.
[65] D.G. Lowe, “Object Recognition from Local Scale-Invariant Features,” Proc. Int’l Conf. Computer Vision, pp. 1150-1157, 1999.
[66] B.D. Lucas and T. Kanade, “An Iterative Image Registration Technique with an Application to Stereo Vision,” Proc. International Joint Conference on Artificial Intelligence, pp. 674-679, 1981.
[67] I. Matthews and S. Baker, “Active Appearance Models Revisited,” International Journal of Computer Vision, 60(2), pp. 135-164, Nov. 2004.
[68] P. Milgram, F. Kishino, “A Taxonomy of Mixed Reality Visual Displays”, IEICE Transactions on Information Systems, Vol. E77-D, No. 12, December 1994.
[69] H. Moravec, “Visual Mapping by a Robot Rover,” Proc. Int. Joint Conf. on Artificial Intell., pp. 598-600, 1979.
[70] J. Nocedal and S.J. Wright, Numerical Optimization, Springer, New York, 1999.
[71] E. Osuna, R. Freund and F. Girosi, “Training Support Vector Machines: An Application to Face Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 130-136, 1997.
[72] T.S. Perry, “All in the Game”, IEEE Spectrum, 40(11), pp. 31-35, November 2003.
[73] http://www.pvieurope.com/
[74] A.N. Rajagopalan, K.S. Kumar, J. Karlekar, R. Manivasakan, M.M. Patil, U.B. Desai, P.G. Poonacha and S. Chaudhuri, “Finding Faces in Photographs,” Proc. IEEE Int’l Conf. Computer Vision, pp. 640-645, 1998.
[75] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging: Acquisition, Display, and Image-Based Lighting, Morgan Kaufmann Publishers, San Francisco, CA, 2006.
[76] http://www.ri.cmu.edu/projects/project_448.html
[77] S. Romdhani and T. Vetter, “Efficient, Robust and Accurate Fitting of a 3D Morphable Model,” Proc. IEEE International Conference on Computer Vision, vol. 1, pp. 59-66, 2003.
[78] H.A. Rowley, S. Baluja and T. Kanade, “Neural Network-Based Face Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, 20(1), pp. 23-28, 1998.
[79] T. Sakai, M. Nagao and S. Fujibayashi, “Line Extraction and Pattern Detection in a Photograph,” Pattern Recognition, vol. 1, pp. 233-248, 1969.
[80] A. Samal and P. A. Iyengar, “Human Face Detection Using Silhouettes,” Int’l J. Pattern Recognition and Artificial Intelligence, 9(6), pp. 845-867, 1995.
[81] H. Schneiderman and T. Kanade, “Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 45-51, 1998.
[82] H. Schneiderman and T. Kanade, “A Statistical Method for 3D Object Detection Applied to Faces and Cars,” IEEE Conf. on Computer Vision and Pattern Recognition, pp. 746-751, 2000.
[83] P. Sinha, “Object Recognition via Image Invariants: A Case Study,” Investigative Opthalmology and Visual Science, 35(4), pp. 1735-1740, 1994.
[84] D. Slater and G. Healey, “The Illumination-Invariant Recognition of 3D Objects Using Local Color Invariants,” IEEE Trans. Pattern Analysis and Machine Intell., 18(2), Feb. 1996, pp. 206-210.
[85] S.M. Smith and J.M. Brady, “SUSAN – a new approach to low level image processing,” Int. J. Computer Vision, 23(1), pp. 45-78, 1997.
[86] http://www.sportvision.com/index.cfm?section=tv&cont_id=player&roster_id=34&personnel_id=992
[87] C. Stauffer and W.E.L. Grimson, “Learning Patterns of Activity using Real Time Tracking,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8), pp. 747-757, Aug. 2000.
[88] J. Shi and C. Tomasi, “Good features to track,” Proc. IEEE Conf. Computer Vision and Patt. Recognition, pp. 593-600, 1994.
[89] I. Skrypnyk, and D.G. Lowe, “Scene Modelling, Recognition and Tracking with Invariant Image Features,” Proc. IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 110-119, 2004.
[90] L. Spencer, “Writing Video Applications,” http://www.cs.ucf.edu/~lspencer/vid_app.pdf
[91] L. Spencer and R. Guha, “Determining Initial Head Orientation for Real-Time Video Input,” Proc. 7th International Conference on Computer Games: AI, Animation, Mobile, Educational & Serious Games (CGAMES), Angoulême, France, Nov. 2005.
[92] L. Spencer and R. Guha, “Augmented Reality in Video Games,” Proc. 6th International Conference on Computer Games: AI and Mobile Systems (CGAIMS), Louisville, KY, July 2005.
[93] H. Tamura, “Steady Steps and Giant Leap Toward Practical Mixed Reality Systems and Applications,” Proc. International Status Conf. on Virtual and Augmented Reality, pp. 3-12, 2002.
[94] G. Taubin and D. Cooper, “Object recognition based on moment (or algebraic) invariants,” in J. Mundy and A. Zisserman, eds., Geometric Invariance in Computer Vision (MIT Press, Cambridge, Mass., 1992), pp. 375-397.
[95] B. Theobald, I. Matthews, and S. Baker, “Evaluating Error Functions for Robust Active Appearance Models,” Proc. International Conference on Automatic Face and Gesture Recognition, April 2006.
[96] K. Toyama, “Prolegomena for Robust Face Tracking,” Microsoft Research Technical Report MSR-TR-98-65, 1998.
[97] M. Turk and A. Pentland, “Eigenfaces for Recognition,” J. Cognitive Neuroscience, 3(1), pp. 71-86, 1991.
[98] S. Uchiyama, K. Takemoto, K. Satoh, H. Yamamoto, and H. Tamura, “MR Platform: A Basic Body on Which Mixed Reality Applications Are Built,” Proc. IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 246-253, 2002.
[99] L. Vacchetti, V. Lepetit, and P. Fua, “Combining Edge and Texture Information for Real-time Accurate 3D Camera Tracking,” Proc. IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 48-56, 2004.
[100] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001.
[101] J. von Kries, “Beitrag zur Physiologie der Gesichtsempfindung,” Arch. Anat. Physiol., 2, pp. 505-524, 1878.
[102] http://www.webopedia.com
[103] Z. Wen and T.S. Huang, “Enhanced 3D Geometric-Model-Based Face Tracking in Low Resolution with Appearance Model,” Proc. IEEE International Conference on Image Processing (ICIP), vol. 2, pp. 350-353, Sept. 2005.
[104] M. Woo, J. Neider, and T. Davis, OpenGL Programming Guide, version 1.1, 2nd edition, Addison-Wesley, 1997.
[105] H. Wuest, F. Vial, and D. Stricker, “Adaptive Line Tracking with Multiple Hypotheses for Augmented Reality,” Proc. IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 62-69, 2005.
[106] J. Xiao, S. Baker, I. Matthews, and T. Kanade, “Real-Time Combined 2D+3D Active Appearance Models,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 535-542, June 2004.
[107] J. Xiao, J. Chai, and T. Kanade, “A closed-form solution to non-rigid shape and motion recovery,” Proc. European Conference on Computer Vision, Vol. 4, pp. 573-587, 2004.
[108] G. Yang and T.S. Huang, “Human Face Detection in Complex Background,” Pattern Recognition, 27(1), pp. 53-63, 1994.
[109] M.-H. Yang, D.J. Kriegman, and N. Ahuja, “Detecting faces in images: a survey,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(1), pp. 34-58, 2002.
[110] M.-H. Yang, D. Roth and N. Ahuja, “A SNoW-Based Face Detector,” Advances in Neural Information Processing Systems 12, pp. 855-861, MIT Press, 2000.
[111] A. Yilmaz and M.A. Shah, “Automatic Feature Detection and Pose Recovery for Faces,” Proc. Asian Conference on Computer Vision, 2002.
[112] K.C. Yow and R. Cipolla, “Feature-Based Human Face Detection,” Image and Vision Computing, 15(9), pp. 713-735, 1997.
[113] H. Zabrodsky, S. Peleg, and D. Avnir, “Continuous Symmetry Measures, IV: Chirality,” J. American Chem. Soc., vol. 117, pp. 462-473, 1995.
[114] H. Zabrodsky, S. Peleg, and D. Avnir, “Symmetry as a Continuous Feature,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(12), pp. 1154-1166, Dec. 1995.