University of Central Florida
STARS
Electronic Theses and Dissertations, 2004-2019
2006
Real-time Monocular Vision-based Tracking For Interactive Augmented Reality
Lisa Spencer, University of Central Florida
Part of the Computer Sciences Commons, and the Engineering Commons
Find similar works at: https://stars.library.ucf.edu/etd
University of Central Florida Libraries http://library.ucf.edu
This Doctoral Dissertation (Open Access) is brought to you for free and open access by STARS. It has been accepted
for inclusion in Electronic Theses and Dissertations, 2004-2019 by an authorized administrator of STARS. For more
information, please contact STARS@ucf.edu.
STARS Citation
Spencer, Lisa, "Real-time Monocular Vision-based Tracking For Interactive Augmented Reality" (2006). Electronic Theses and Dissertations, 2004-2019. 975. https://stars.library.ucf.edu/etd/975
REAL-TIME MONOCULAR VISION-BASED TRACKING FOR INTERACTIVE AUGMENTED REALITY
by
LISA G. SPENCER
B.S. University of Arizona, 1984
M.S. University of Central Florida, 2002
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
in the School of Electrical Engineering and Computer Science in the College of Engineering and Computer Science
at the University of Central Florida
Orlando, Florida
Spring Term 2006
Major Professor: Ratan K. Guha
© 2006 Lisa G. Spencer
ABSTRACT
The need for real-time video analysis is rapidly increasing in today’s world. The
decreasing cost of powerful processors and the proliferation of affordable cameras, combined
with needs for security, methods for searching the growing collection of video data, and an
appetite for high-tech entertainment, have produced an environment where video processing is
utilized for a wide variety of applications. Tracking is an element in many of these applications,
for purposes like detecting anomalous behavior, classifying video clips, and measuring athletic
performance. In this dissertation we focus on augmented reality, but the methods and
conclusions are applicable to a wide variety of other areas. In particular, our work deals with
achieving real-time performance while tracking with augmented reality systems using a
minimum set of commercial hardware. We have built prototypes that use both existing
technologies and new algorithms we have developed. While performance improvements would
be possible with additional hardware, such as multiple cameras or parallel processors, we have
concentrated on getting the most performance with the least equipment.
Tracking is a broad research area, but an essential component of an augmented reality
system. Tracking of some sort is needed to determine the location of scene augmentation. First,
we investigated the effects of illumination on the pixel values recorded by a color video camera.
We used the results to track a simple solid-colored object in our first augmented reality
application. Our second augmented reality application tracks complex non-rigid objects, namely
human faces.
In the color experiment, we studied the effects of illumination on the color values
recorded by a real camera. Human perception is important for many applications, but our focus
is on the RGB values available to tracking algorithms. Since the lighting in most environments
where video monitoring is done is close to white (e.g., fluorescent lights in an office,
incandescent lights in a home, or direct and indirect sunlight outside), we looked at the response
to “white” light sources as the intensity varied. The red, green, and blue values recorded by the
camera can be converted to a number of other color spaces that have been shown, using models
of the physical properties of reflection, to be invariant to various lighting conditions, such as
view angle, light angle, light intensity, and light color. Our experiments show how well
these derived quantities actually remained constant with real materials, real lights, and real
cameras, while still retaining the ability to discriminate between different colors. This color
experiment enabled us to find color spaces that were more invariant to changes in illumination
intensity than the ones traditionally used.
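For instance, the chromaticity (normalized RGB) space referenced above divides each channel by the total brightness, so a uniform change in illumination intensity cancels out. A minimal sketch of this standard conversion (the helper name is ours, not from the dissertation):

```python
def to_chromaticity(r, g, b):
    """Convert RGB to chromaticity (normalized RGB).

    Each channel is divided by R+G+B, so scaling all three channels
    by the same factor (a uniform intensity change) leaves the
    result unchanged.
    """
    total = r + g + b
    if total == 0:
        return (0.0, 0.0, 0.0)  # undefined for black; return zeros by convention
    return (r / total, g / total, b / total)

# The same surface under full and half illumination maps to one point:
bright = to_chromaticity(200, 100, 50)
dim = to_chromaticity(100, 50, 25)
print(bright == dim)  # -> True
```

This intensity invariance is exactly what the experiments above quantify with real cameras and lights, where sensor noise and saturation keep the ideal cancellation from being perfect.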
The first augmented reality application tracks a solid colored rectangle and replaces the
rectangle with an image, so it appears that the subject is holding a picture instead. Tracking this
simple shape is both easy and hard; easy because of the single color and the shape that can be
represented by four points or four lines, and hard because there are fewer features available and
the color is affected by illumination changes. Many algorithms for tracking fixed shapes do not
run in real time or require rich feature sets. We have created a tracking method for simple solid
colored objects that uses color and edge information and is fast enough for real-time operation.
We also demonstrate a fast deinterlacing method to avoid "tearing" of fast-moving edges when
recorded by an interlaced camera, and optimization techniques that usually achieved a speedup
of about 10x over an implementation that already used optimized image processing library
routines.
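One of the optimization techniques alluded to, replacing per-pixel division with a precomputed reciprocal table, works because for 8-bit pixels the sum R+G+B can only take 766 distinct values. The following is an illustrative reconstruction under our own naming, not the dissertation's exact code:

```python
# Precompute 255/s for every possible 8-bit channel sum s = R+G+B (0..765).
# The table is built once; each pixel then costs one lookup and three
# multiplies instead of three divisions.
RECIP = [0.0] + [255.0 / s for s in range(1, 766)]

def chromaticity_lut(r, g, b):
    """Chromaticity scaled to 0..255, using the reciprocal table."""
    k = RECIP[r + g + b]
    return (r * k, g * k, b * k)

def chromaticity_div(r, g, b):
    """Reference version with explicit divisions, for comparison."""
    s = r + g + b
    return (255.0 * r / s, 255.0 * g / s, 255.0 * b / s) if s else (0.0, 0.0, 0.0)
```

In scalar Python the lookup buys little; the point is that in a tight C loop over every pixel of a video frame, a table fetch plus multiply is far cheaper than a division.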
Human faces are complex objects that differ between individuals and undergo non-rigid
transformations. Our second augmented reality application detects faces, determines their initial
pose, and then tracks changes in real time. The results are displayed as virtual objects overlaid
on the real video image. We used existing algorithms for motion detection and face detection.
We present a novel method for determining the initial face pose in real time using symmetry.
Our face tracking uses existing point tracking methods as well as extensions to Active
Appearance Models (AAMs). We also give a new method for integrating detection and tracking
data and leveraging the temporal coherence in video data to mitigate the false positive
detections. While many face tracking applications assume exactly one face is in the image, our
techniques can handle any number of faces.
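The symmetry measure used for initial pose estimation can be illustrated with a simplified sketch: reflect the point set about a candidate axis, match each reflected point to its nearest original point, and average the match distances. (The full method also searches over candidate axis angles; this version fixes a vertical axis for brevity, and the function name is ours.)

```python
import math

def symmetry_distance(points, axis_x):
    """Score how mirror-symmetric a 2D point set is about the vertical
    line x = axis_x. Each point is reflected across the axis and matched
    to the nearest original point; the mean match distance is returned,
    so 0.0 means perfectly symmetric."""
    total = 0.0
    for (x, y) in points:
        rx, ry = 2 * axis_x - x, y  # reflect across the axis
        total += min(math.hypot(rx - px, ry - py) for (px, py) in points)
    return total / len(points)

# Four corners of a rectangle centered on x = 0 are perfectly symmetric:
corners = [(-1.0, 0.0), (1.0, 0.0), (-1.0, 2.0), (1.0, 2.0)]
print(symmetry_distance(corners, 0.0))        # -> 0.0
print(symmetry_distance(corners, 0.5) > 0.0)  # -> True
```

Minimizing this score over candidate axes yields the axis of bilateral symmetry, which for a face constrains the in-plane rotation of the head.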
The color experiment, together with the two augmented reality applications, improves our
understanding of how illumination intensity changes affect recorded colors, and provides better
real-time methods for detecting and tracking solid shapes and human faces for augmented
reality. These techniques can also be applied to other real-time video analysis tasks, such as
surveillance and video classification.
TABLE OF CONTENTS
LIST OF TABLES.............................................................................................................. x
LIST OF FIGURES ........................................................................................................... xi
LIST OF ACRONYMS/ABBREVIATIONS ................................................................ xviii
1 INTRODUCTION ...................................................................................................... 1
1.1 Augmented Reality Overview............................................................................. 1
1.2 Virtual Looking Glass......................................................................................... 2
1.3 Contributions ...................................................................................................... 4
2 BACKGROUND ........................................................................................................ 7
2.1 Current AR Applications .................................................................................... 7
2.2 Augmented Reality Challenges ........................................................................ 13
2.3 Augmented Reality Implementation ................................................................. 14
2.3.1 Sensing the Real Environment...................................................................... 14
2.3.2 Adding Virtual Elements .............................................................................. 16
2.3.3 Presenting the Combined Result................................................................... 17
2.3.4 Tools ............................................................................................................. 18
2.4 Augmented Looking Glass System Description ............................................... 19
2.5 Related Work .................................................................................................... 23
2.5.1 Color Constancy ........................................................................................... 23
2.5.2 Virtual Mirror Applications .......................................................................... 33
2.5.3 Rectangle Tracking ....................................................................................... 36
2.5.4 Face Detection in Images.............................................................................. 38
2.5.5 Face Detection in Video ............................................................................... 41
2.5.6 Real-time Face Tracking............................................................................... 43
2.5.7 Face Pose Estimation .................................................................................... 49
3 COLOR ANALYSIS WITH VARIABLE LIGHT INTENSITY............................. 51
3.1 Color Models .................................................................................................... 52
3.1.1 YIQ ............................................................................................................... 53
3.1.2 HSV............................................................................................................... 53
3.1.3 HLS............................................................................................................... 54
3.1.4 CIELAB ........................................................................................................ 55
3.1.5 Chromaticity (Normalized RGB).................................................................. 56
3.1.6 c1c2c3 ............................................................................................................. 57
3.1.7 l1l2l3 ............................................................................................................... 57
3.1.8 Derivative...................................................................................................... 57
3.1.9 Log Hue ........................................................................................................ 59
3.2 Experimental Setup........................................................................................... 59
3.3 Results............................................................................................................... 61
3.3.1 Indoors Manual Change Light Sequence...................................................... 62
3.3.2 Indoor Manual Change Iris Sequence........................................................... 69
3.3.3 Indoor Manual Flashlight Sequence ............................................................. 76
3.3.4 Outdoor Automatic Change Light Sequence ................................................ 82
3.3.5 Outdoor Manual Change Light Sequence..................................................... 89
3.4 Analysis ............................................................................................................ 95
3.4.1 Indoors Manual Change Light Sequence...................................................... 96
3.4.2 Indoor Manual Change Iris Sequence........................................................... 99
3.4.3 Indoor Manual Flashlight Sequence ........................................................... 102
3.4.4 Outdoor Automatic Change Light Sequence .............................................. 105
3.4.5 Outdoor Manual Change Light Sequence................................................... 107
3.5 Conclusions..................................................................................................... 109
4 AUGMENTING A SOLID RECTANGLE ............................................................ 113
4.1 Color Representations..................................................................................... 113
4.2 Proposed Method ............................................................................................ 114
4.2.1 Color Model ................................................................................................ 115
4.2.2 Minimum Bounding Quadrilateral.............................................................. 118
4.2.3 Quadrilateral Refinement............................................................................ 119
4.2.4 Deinterlacing............................................................................................... 122
4.2.5 Displaying the Result.................................................................................. 124
4.3 Optimization Methods .................................................................................... 125
4.4 Discussion....................................................................................................... 135
4.5 Conclusions..................................................................................................... 136
5 AUGMENTING HUMAN FACES AND HEADS................................................ 137
5.1 Integration of Detection and Tracking............................................................ 137
5.1.1 Face Detection and Localization................................................................. 138
5.1.2 Face Tracking ............................................................................................. 140
5.1.3 Integration of Detection and Tracking........................................................ 144
5.1.4 Results......................................................................................................... 148
5.2 Initial Pose Estimation .................................................................................... 149
5.2.1 Pose from Skin Color Detection ................................................................. 150
5.2.2 Proposed Point Symmetry Method ............................................................. 154
5.2.3 Point Symmetry Results.............................................................................. 158
5.2.4 Conclusion .................................................................................................. 161
5.3 Extension of the Active Appearance Model for Face Tracking ..................... 162
5.3.1 Building an Active Appearance Model ...................................................... 162
5.3.2 Tracking with an Active Appearance Model .............................................. 170
5.3.3 AAM Experimental Results........................................................................ 181
5.3.4 Optimization ............................................................................................... 184
5.3.5 Conclusion .................................................................................................. 188
5.4 Augmentation.................................................................................................. 189
5.5 Conclusions..................................................................................................... 190
6 CONCLUSIONS..................................................................................................... 193
6.1 Contributions .................................................................................................. 193
6.2 Future Directions ............................................................................................ 195
REFERENCES ............................................................................................................... 199
LIST OF TABLES
Table 1: Original code to calculate chromaticity. Functions prefixed by "cv" reference the
OpenCV library................................................................................................................... 129
Table 2: Code for table lookup to replace division (three cvDiv lines) in Table 1 .................... 130
Table 3: Chromaticity calculation after integrating scale and sum into loop. ............................ 131
Table 4: Chromaticity calculation after all OpenCV functions have been replaced. ................. 132
Table 5: Final optimized code for chromaticity calculation. ...................................................... 134
Table 6: Results from integrating detection and tracking........................................................... 149
Table 7: Roll angle calculation results........................................................................................ 160
LIST OF FIGURES
Figure 1: Milgram's virtuality continuum....................................................................................... 1
Figure 2: Pepper's Ghost ................................................................................................................. 8
Figure 3: Augmented first down line in football. ........................................................................... 9
Figure 4: Virtual advertising......................................................................................................... 10
Figure 5: Example of an aircraft HUD. ........................................................................................ 11
Figure 6: Image guided surgery. ................................................................................................... 12
Figure 7: Overview of the AR application.................................................................................... 21
Figure 8: Screen capture from "Magic Mirror" project. Green highlights show areas where
motion is detected. ................................................................................................................ 33
Figure 9: Sample display from the virtual mirror of Darrell et al. which distorts detected faces. 34
Figure 10: Screenshots from the face augmentation demo of Lepetit et al. ................................. 36
Figure 11: A sample frame from the experiment. The labels were added later to accommodate
printing in black and white. .................................................................................................. 60
Figure 12: RGB color space for the Indoor Manual Change Light Sequence.............................. 62
Figure 13: RGB color space in 3D plot for the Indoors Manual Change Light Sequence ........... 63
Figure 14: YIQ color space for the Indoors Manual Change Light Sequence ............................. 63
Figure 15: HSV color space for the Indoors Manual Change Light Sequence............................. 64
Figure 16: HLS color space for the Indoors Manual Change Light Sequence ............................. 65
Figure 17: CIELAB color space for the Indoors Manual Change Light Sequence ...................... 65
Figure 18: Chromaticity space for the Indoors Manual Change Light Sequence......................... 66
Figure 19: c1c2c3 color space for the Indoors Manual Change Light Sequence ........................... 67
Figure 20: l1l2l3 color space for the Indoors Manual Change Light Sequence ............................. 67
Figure 21: Derivative color space for the Indoors Manual Change Light Sequence.................... 68
Figure 22: Log Hue for the Indoors Manual Change Light Sequence.......................................... 69
Figure 23: RGB color space for the Indoor Manual Change Iris Sequence ................................. 70
Figure 24: RGB color space in 3D for the Indoors Manual Change Iris Sequence...................... 70
Figure 25: YIQ color space for the Indoor Manual Change Iris Sequence .................................. 71
Figure 26: HSV color space for the Indoor Manual Change Iris Sequence ................................. 71
Figure 27: HLS color space for the Indoor Manual Change Iris Sequence.................................. 72
Figure 28: CIELAB color space for the Indoor Manual Change Iris Sequence ........................... 72
Figure 29: Chromaticity color space for the Indoor Manual Change Iris Sequence .................... 73
Figure 30: c1c2c3 color space for the Indoor Manual Change Iris Sequence................................. 73
Figure 31: l1l2l3 color space for the Indoor Manual Change Iris Sequence ................................... 74
Figure 32: Derivative color space for the Indoors Manual Change Iris Sequence....................... 75
Figure 33: Log Hue for the Indoor Manual Change Iris Sequence .............................................. 75
Figure 34: RGB color space for the Indoor Manual Flashlight Sequence.................................... 76
Figure 35: RGB color space in 3D for the Indoor Manual Flashlight Sequence.......................... 77
Figure 36: YIQ color space for the Indoor Manual Flashlight Sequence..................................... 77
Figure 37: HSV color space for the Indoor Manual Flashlight Sequence .................................... 78
Figure 38: HLS color space for the Indoor Manual Flashlight Sequence .................................... 78
Figure 39: CIELAB color space for the Indoor Manual Flashlight Sequence.............................. 79
Figure 40: Chromaticity color space for the Indoor Manual Flashlight Sequence....................... 80
Figure 41: c1c2c3 color space for the Indoor Manual Flashlight Sequence................................... 80
Figure 42: l1l2l3 color space for the Indoor Manual Flashlight Sequence..................................... 81
Figure 43: Derivative color space for the Indoor Manual Flashlight Sequence ........................... 81
Figure 44: Log Hue for the Indoor Manual Flashlight Sequence ................................................. 82
Figure 45: RGB color space for the Outdoor Automatic Change Light Sequence....................... 83
Figure 46: RGB color space in 3D for the Outdoor Automatic Change Light Sequence............. 83
Figure 47: YIQ color space for the Outdoor Automatic Change Light Sequence........................ 84
Figure 48: HSV color space for the Outdoor Automatic Change Light Sequence....................... 84
Figure 49: HLS color space for the Outdoor Automatic Change Light Sequence ....................... 85
Figure 50: CIELAB color space for the Outdoor Automatic Change Light Sequence ................ 85
Figure 51: Chromaticity color space for the Outdoor Automatic Change Light Sequence.......... 86
Figure 52: The c1c2c3 color space for the Outdoor Automatic Change Light Sequence.............. 86
Figure 53: The l1l2l3 color space for the Outdoor Automatic Change Light Sequence ................ 87
Figure 54: Derivative color space for the Outdoor Automatic Change Light Sequence.............. 88
Figure 55: Log Hue for the Outdoor Automatic Change Light Sequence.................................... 88
Figure 56: RGB color space for the Outdoor Manual Change Light Sequence ........................... 89
Figure 57: RGB color space in 3D for the Outdoor Manual Change Light Sequence ................. 90
Figure 58: YIQ color space for the Outdoor Manual Change Light Sequence ............................ 90
Figure 59: HSV color space for the Outdoor Manual Change Light Sequence ........................... 91
Figure 60: HLS color space for the Outdoor Manual Change Light Sequence............................ 91
Figure 61: CIELAB color space for the Outdoor Manual Change Light Sequence ..................... 92
Figure 62: Chromaticity color space for the Outdoor Manual Change Light Sequence .............. 93
Figure 63: The c1c2c3 color space for the Outdoor Manual Change Light Sequence................... 93
Figure 64: The l1l2l3 color space for the Outdoor Manual Change Light Sequence..................... 94
Figure 65: Derivative color space for the Outdoor Manual Change Light Sequence .................. 94
Figure 66: Log Hue for the Outdoor Manual Change Light Sequence ........................................ 95
Figure 67: Standard Deviation for the Indoor Manual Change Light Sequence .......................... 97
Figure 68: Discriminative power for the Indoor Manual Change Light Sequence ...................... 98
Figure 69: Standard Deviation for all frames of the Indoor Manual Change Iris Sequence ...... 100
Figure 70: Standard deviation for frames with no saturation for the Indoor Manual Change Iris
Sequence ............................................................................................................................. 100
Figure 71: Indoor Manual Change Iris Sequence Discriminative Power for all frames............. 101
Figure 72: Indoor Manual Change Iris Sequence Discriminative Power excluding frames with
saturation............................................................................................................................. 102
Figure 73: Standard Deviation for all frames of the Indoor Manual Flashlight Sequence......... 103
Figure 74: Standard Deviation for frames with no saturation for the Indoor Manual Flashlight
Sequence ............................................................................................................................. 103
Figure 75: Discriminative Power for the Indoor Manual Flashlight Sequence with all frames . 104
Figure 76: Discriminative power for the Indoor Manual Flashlight Sequence with no saturated
frames.................................................................................................................................. 105
Figure 77: Standard deviations for all frames of the Outdoor Automatic Change Light Sequence
............................................................................................................................................. 106
Figure 78: Discriminative power for the Outdoor Automatic Change Light Sequence ............. 107
Figure 79: Standard Deviation for all frames of the Outdoor Manual Change Light Sequence 108
Figure 80: Discriminative power for the Outdoor Manual Change Light Sequence.................. 109
Figure 81: Original and augmented video frame. ....................................................................... 113
Figure 82: Processing stages: original frame, converted to chromaticity space, detected object
color pixels, bounding quadrilateral, saturation, horizontal gradient, vertical gradient, and
difference between subsequent frames. .............................................................................. 117
Figure 83: Area calculation for finding the bounding quadrilateral ........................................... 118
Figure 84: Edge refinement process ........................................................................................... 120
Figure 85: Refining the quadrilateral. The rough bounding quadrilateral is computed from the
color detection result, which misses the bottom edge. The refinement process results in a
more accurate outline.......................................................................................................... 121
Figure 86: Tearing caused by interlacing ................................................................................... 123
Figure 87: Fast deinterlacing algorithm results .......................................................................... 124
Figure 88: Sample frames from the video sequence showing occlusion, a corner off the screen,
and challenging lighting conditions. The tracked boundary is drawn in red..................... 136
Figure 89: Integrating detection and tracking data ..................................................................... 146
Figure 90: Face detection results. ............................................................................................... 151
Figure 91: Skin detection results using local histogram. Pixels detected as skin for each face are
shown in magenta. The orientation calculated for each face is shown with the horizontal
axis in red and the vertical axis in green............................................................................. 152
Figure 92: Global skin detection results. .................................................................................... 153
Figure 93: Face orientation using global skin detection. ............................................................ 154
Figure 94: Symmetry distance. (a) is the original point set, (b) shows the original filled points
and the reflected hollow points, (c) shows the closest filled point to each hollow point. The
average of these line lengths is the symmetry distance. ..................................................... 156
Figure 95: Measuring symmetry. (a) shows the corners in black, (b) and (c) show two different
rotations, with the reflected points in white, and lines between matched points................ 158
Figure 96: Test images. The original image is shown with a black triangle that connects the
ground truth coordinates for the eyes and mouth center..................................................... 159
Figure 97: Algorithm results. The box shows the region detected as a face, small green circles
show the high contrast points found in that region, and an axis shows the orientation found,
with red pointing in the positive x direction and green in the positive y. ........................... 160
Figure 98: Face orientation results using our point symmetry method. The red line indicates the
horizontal axis, the green line indicates the vertical axis, and the high contrast points are
shown in blue. ..................................................................................................................... 161
Figure 99: Offline generic face model fitting process. (a) shows the points identified so far, (b)
shows the prompt for the last point, and (c) shows the mesh that has been fit overlaid on the
image................................................................................................................................... 165
Figure 100: The first three shape modes. The circles show the mean shape, and the lines show
the magnitude and direction of the positive (green) and negative (red) displacements. .... 168
Figure 101: The mean appearance and two appearance modes.................................................. 170
Figure 102: X and Y gradients of the mean appearance............................................................. 174
Figure 103: Warp Jacobians for the first three 2D shape modes. The top row is the x component,
and the bottom row is the y component. ............................................................................. 175
Figure 104: The steepest descent images obtained by multiplying the gradients in Figure 102 by
the warp Jacobians in Figure 103. ...................................................................................... 175
Figure 105: Iterations 0, 5, and 12 of fitting the AAM. Column (a) is the 2D shape on the
original image with the face detection result shown by a rectangle. Column (b) is the 3D
shape on the original image. The warped input image I(N(W(x; p); q)) with the mesh
overlaid is shown in Column (c), and the error image I(N(W(x; p); q)) − A0(x) is shown in
column (d). .......................................................................................................................... 183
Figure 106: Video frame augmented with cap and glasses. ....................................................... 190
LIST OF ACRONYMS/ABBREVIATIONS
2D Two Dimensional
3D Three Dimensional
AAM Active Appearance Model
API Application Program Interface
AR Augmented Reality
CCD Charge-Coupled Device
DOF Degree of Freedom
DP Discriminative Power
HLS Hue Lightness and Saturation
HMD Head Mounted Display
HSV Hue Saturation and Value
Hz. Hertz
ICIA Inverse Compositional Image Alignment
MR Mixed Reality
NTSC National Television System Committee
PC Personal Computer
PCA Principal Component Analysis
RANSAC RANdom SAmple Consensus
RGB Red, Green and Blue
SDK Software Development Kit
SVD Singular Value Decomposition
VR Virtual Reality
1 INTRODUCTION
1.1 Augmented Reality Overview
Augmented Reality (AR) is a situation where real objects are augmented by adding
virtual elements. It is related to the more commonly known Virtual Reality (VR), in which
everything is synthesized. The Webopedia [102] defines virtual reality as “An artificial
environment created with computer hardware and software and presented to the user in such a
way that it appears and feels like a real environment.” Milgram and Kishino [68] defined a
continuum relating augmented reality and virtual reality under the umbrella of Mixed Reality
(MR), which is shown in Figure 1. This structure categorizes applications by their degree of
virtual and real elements, with no synthesized elements at one extreme and no real elements at
the other. Our research focuses on augmented reality, which is near the “real” end of the
spectrum, introducing a few virtual items attached to objects tracked in a real environment. This
is a rapidly expanding sector that uses both computer vision and computer graphics to achieve an
interactive result that seamlessly mixes real and virtual elements.
Figure 1: Milgram's virtuality continuum.
The biggest challenges for augmented reality are speed and accuracy. Speed is necessary
for an interactive environment, requiring processing frame rates of at least 10 Hz., and preferably
closer to video input rates of 25 to 30 Hz. Many video tracking algorithms either run in “batch”
mode, which uses information about all the frames when processing each frame, or they require
seconds or minutes to process each frame. While good results are achieved, these algorithms are
unsuitable for augmented reality. Fast algorithms that do not require future frames are needed
for interactive applications such as augmented reality.
Tracking accuracy is necessary to maintain the illusion that virtual objects are attached to
real objects. The human eye is sensitive to relative motion between objects, and errors or noise
in the position of virtual objects will be apparent to the observer. The high standards imposed by
these two challenges make augmented reality applications difficult to do well.
To meet these challenges, many augmented reality systems use specialized environments
or additional hardware to create realistic results. Carefully lit green screens are often used to
identify where to substitute virtual elements. Tracking devices may be used to pinpoint the
location and orientation of real objects. Markers may be applied to real objects to make them
easier to track. Knowledge about the environment may be collected beforehand, in the form of
3D models or locations of static objects. Multiple cameras may be used to get depth information
or to deal with occlusions in a single camera view.
1.2 Virtual Looking Glass
Our work lies at the minimalist end of the equipment spectrum, with no specialized
environment. We use a standard PC, a consumer video camera,
and a display to create a virtual looking glass. Real objects visible to the camera are mirrored on
the display, and virtual objects are added that augment some of the real objects.
This type of application is worthwhile on its own, both for the technical challenges and
the entertainment value of the result. Beyond entertainment, the augmentation could be used to
convey information, e.g., showing a hat appropriate for the day’s weather. In addition, the
modules needed are useful in other situations. For example, real time face detection and tracking
are needed in automated video surveillance systems.
In order to better understand the relationship between the tracked object color and the
color recorded by the camera under different lighting conditions, we performed an experiment to
see if the theoretical changes predicted for the color components in various color spaces matched
the results measured with real objects, lights, and cameras. We evaluated the constancy of a
number of color spaces as the intensity of a nearly white light changed, as well as their ability to
discriminate between different colors. The results showed that the traditional color spaces used
for illumination-invariant analysis did not work as well as some others, but all fell short of the
ideal.
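The theoretical prediction that the experiment tests can be sketched with a toy check (Python for illustration only; the sample RGB values are hypothetical). Under an ideal pure intensity change, a hue-based space such as HSV predicts that hue and saturation stay fixed while value scales with the light:

```python
import colorsys
import math

# The same surface seen at full and at half illumination (hypothetical values).
full = colorsys.rgb_to_hsv(0.8, 0.4, 0.2)   # sample at full brightness
dim = colorsys.rgb_to_hsv(0.4, 0.2, 0.1)    # same sample at half brightness

print(math.isclose(full[0], dim[0]))        # hue unchanged: True
print(math.isclose(full[1], dim[1]))        # saturation unchanged: True
print(math.isclose(dim[2], full[2] / 2))    # value halved: True
```

Real cameras deviate from this ideal, which is precisely what the experiment measures.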
We created two applications as part of our research. The first one augments a solid
colored rectangle. The object is detected and tracked in real time, and replaced with a prestored
image. While the simplicity of the object makes it easier to handle in some aspects, the limited
number of features available and the change in the recorded color due to illumination variations
create their own challenges. The findings of the color experiment were used to make the rectangle
tracking more robust.
The second application tracks and augments human faces. While face detection and
tracking have both been widely researched, robust and fast algorithms are still rare, especially
under our equipment restrictions, which require allowing new faces to appear while another is
being tracked and handling faces of various sizes and orientations. Since we are accustomed to
viewing our own faces in a
mirror, the virtual mirror setup is a natural platform for creating a real-time interactive
augmented face system.
1.3 Contributions
We have made several contributions while creating these virtual looking glass
applications, including
• Color space selection experiments. Theory and practice for color constancy in the
presence of varying illumination in different color spaces do not always agree. We have
performed experiments measuring the color values recorded by consumer cameras in
varying illumination. From this, a number of color spaces were compared to see how
well they remained constant as the light changed, yet still discriminated between
different colors. We found that some less well-known color spaces perform better than
the traditional ones.
• A real-time algorithm for tracking simple objects in monocular video. Many common
algorithms that track object silhouettes only work for objects with complex feature sets
and run too slowly for interactive use. Our tracking method uses color, edge, and
motion information to achieve real-time rates.
• Optimization methods for real time video processing. Careful optimization can make
the difference between an implementation being “fast enough” or not. We have
achieved speedups of two to ten on various functions involved in tracking a rectangle
and speedups close to 50 for motion detection and face tracking, even starting with an
optimized image processing library.
• Fast video deinterlacing. Consumer level video cameras are interlaced, with the even
and odd lines captured at different times. Moving objects appear to tear when displayed
on a noninterlaced computer display. Our deinterlacing algorithm is fast, maintains the
detail in stationary parts of the frame, and removes the tearing in the moving parts of the
frame.
• An extended continuous symmetry measure that handles not only shapes and graphs but
general 2D point sets as well.
• Real-time initial face pose estimation in video. After a face is detected, its orientation
must be determined. This is often done manually, which is not suitable for a general
augmented reality application. Other methods perform costly correlations to find facial
features. We present a novel method for recovering face orientation using generic high
contrast points. Using the coordinates of these points results in a faster algorithm than
using appearance comparisons, with sufficient accuracy, outperforming methods based
on skin detection.
• Enhanced Active Appearance Models (AAMs) for face tracking. Existing methods for
AAMs require more than a hundred manually marked points on each of several hundred
images to create the 2D and 3D models used for tracking, and the parameters extracted
during tracking do not have meaningful labels. By incorporating a parameterized 3D
nonrigid head model, we reduced the manual labeling effort to 30 points on tens of
images, and derive meaningful parameters from the result, such as “jaw drop” or
“eyebrows raised”, while still maintaining the robustness afforded by AAMs.
• Robust face tracking for augmented reality. We integrate motion detection, face
detection, and tracking to create a system that can track multiple faces, filter out
spurious detections in dynamic regions, eliminate detections in static regions, and verify
the track of faces outside the native range of the face detector.
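The motion-adaptive idea behind the deinterlacing contribution can be sketched as follows (a toy illustration, not the dissertation's implementation; the threshold and frame values are hypothetical). Pixels that agree with their vertical neighbors are assumed stationary and kept, preserving detail; pixels that disagree strongly are assumed to be tearing and are replaced by a vertical interpolation:

```python
THRESH = 20  # hypothetical motion threshold

def deinterlace(frame):
    # frame is a list of rows of grayscale pixel values
    out = [row[:] for row in frame]
    for y in range(1, len(frame) - 1):
        for x in range(len(frame[0])):
            interp = (frame[y - 1][x] + frame[y + 1][x]) // 2
            # a large disagreement with the neighboring field lines
            # suggests motion tearing, so interpolate vertically
            if abs(frame[y][x] - interp) > THRESH:
                out[y][x] = interp
    return out

moving = [[10, 10], [200, 200], [10, 10]]
print(deinterlace(moving))  # tear in the middle row is smoothed away
```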
2 BACKGROUND
2.1 Current AR Applications
Many examples of augmented reality exist today. In some cases, they are so well
accepted that many people don’t realize that they are artificial. The ideal augmented reality
system should be one in which the virtual elements are indistinguishable from the real elements,
and cannot be identified without outside knowledge of the situation. A sample of these
applications will be explored here.
One of the earliest examples of augmented reality is known as “Pepper’s Ghost”, named
after John Henry Pepper, a chemistry professor at London Polytechnic Institute. In the 1860s,
audiences saw the startling effect of a transparent, three-dimensional ghost interacting freely
with a live actor on a seemingly ordinary stage. The effect is achieved by a piece of glass
mounted at an angle. The audience sees both the real scene through the glass and a transparent
image reflected by the glass. The arrangement is illustrated in Figure 2. The effect was used for
ghosts in Shakespeare’s plays, and is still used today in places like Disney’s Haunted Mansion
ride.
Figure 2: Pepper's Ghost
Local weather broadcasts have been using augmented reality for years, with the help of a
blue (or green) screen. The person appears to be standing in front of graphics showing
temperatures or radar, when he is actually in front of a solid color screen. A chroma-keying
system replaces the pixels that are in a certain color range with an alternate image. The person
must look at a monitor to see what he is pointing at, and adjust accordingly. While
straightforward, this system has stringent lighting requirements for the screen, since any shadows
or changes in lighting will change the color viewed by the camera, which will result in holes in
the blended image.
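The chroma-keying substitution amounts to a per-pixel color test (a toy sketch; the "green" rule and threshold below are hypothetical, not a broadcast system's actual calibration):

```python
def chroma_key(fg_pixel, bg_pixel):
    """Return the graphic pixel where the foreground is screen-colored."""
    r, g, b = fg_pixel
    # strongly green pixels are treated as "screen" and keyed out
    if g > 120 and g > 1.5 * r and g > 1.5 * b:
        return bg_pixel
    return fg_pixel

print(chroma_key((20, 200, 30), (90, 60, 40)))    # green screen -> graphic
print(chroma_key((180, 160, 140), (90, 60, 40)))  # skin tone -> unchanged
```

A shadow on the screen shifts pixels out of the keyed range, which is exactly why this approach produces holes under uneven lighting.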
During football broadcasts, augmented reality is used to draw the first down line on a
football field for each play. With a system called “1st & Ten” from SportVision [86], the pan,
tilt, zoom and focus for each camera are tracked and registered with a model of the field to
determine the mapping of each pixel to a point on the field. The desired location of the line is
input manually. With this information, the line can be positioned at a fixed location on the field.
A color model is calibrated prior to each game to learn the grass, dirt, and uniform colors, so that
the line covers up the ground, and is in turn covered by the players, as shown in Figure 3.
Figure 3: Augmented first down line in football.
Another augmented reality application used in sports broadcasting is the insertion of
virtual advertisements that appear to be attached to the stadium wall. One such system is LVIS
(pronounced “Elvis”) from Princeton Video Image (PVI) [73]. It uses various methods,
including instrumented cameras, a vision-based tracking system and pattern recognition, to
position the advertisement, then modifies the video as it is broadcast. An example is shown in
Figure 4.
Figure 4: Virtual advertising.
The Head-Up Display (HUD) in many military aircraft overlays graphics showing the
horizon, the instantaneous flight path, and the locations of tracked targets over the real world, in
addition to information such as altitude and airspeed. Avionics on board the aircraft calculate
the information to display. The symbols are projected on a see-through combiner glass, giving
the appearance that they are superimposed on the world outside. Figure 5 shows an example.
The Apache helicopter goes one step further with its Integrated Helmet and Display Sight
System (IHADSS). Video showing infrared imagery is reflected off a small lens in front of the
pilot’s right eye. The infrared sensor is mounted on a turret which moves in synchronization
with the pilot’s head, so the pilot merely has to look in a particular direction to see the infrared
view in that direction. This allows the pilot to fly with one eye seeing the real world and the
other eye seeing a correlated sensor view of the world.
Figure 5: Example of an aircraft HUD.
In the medical world, images such as X-Ray, CT or MRI scans must be registered with
the patient’s anatomy before and during surgery. One such application is MIT’s project on
Image Guided Surgery [1]. Three-dimensional reconstructions of the internal anatomy are
obtained from scans and projected on live video views of the patient. This eliminates positioning
errors and allows the surgeon to concentrate on the task without constantly reorienting the scan
data mentally. One example is shown in Figure 6.
Figure 6: Image guided surgery.
Virtual sets are a step up from basic green screen chroma-keying. Instead of
broadcasting from an expensive studio, the anchor is filmed in front of a backdrop, which may
contain calibrated markers. By tracking the location of these markers, the camera locations and
parameters can be determined. With this knowledge, the background can be replaced with a
three dimensional virtual studio, with accurate perspective changes as the camera moves.
Less well-known applications involve overlaying additional information on a scene, such
as labels for buildings, names of engine parts, or pipes and ducts hidden behind walls. These
provide a more intuitive presentation of information than the traditional means of maps and
blueprints.
This sampling of augmented reality applications shows the diversity of situations in
which it can be applied. A complete list would require a book on its own, and is bounded only
by the imagination. The possibilities range from mere entertainment, to a more useful display of
existing information, to the ability to accomplish tasks that weren’t possible without the added
understanding provided by the augmented reality.
2.2 Augmented Reality Challenges
The diverse sampling of augmented reality applications presented in the previous section
may appear to have little in common. However, all augmented reality applications have two
common requirements: real-time execution and accurate registration between the real and virtual
worlds.
The real-time requirement is derived from the interactivity inherent to augmented reality.
The virtual objects must react to the users immediately. Movies use many of the same
techniques as augmented reality for creating special effects, like filming actors safely in front of
a green screen and then adding an explosion in the background, but they have the luxury of
unlimited time to mix the two. Movie effects can also be tweaked by hand or tried several
different ways to see which achieves the best effect for each individual situation. In augmented
reality, the reaction must be immediate, or the illusion is lost. NTSC cameras update at 30
frames per second, so ideally, the augmented reality display should run at 30 Hz. as well.
Anything less than about 10 Hz. will have unacceptable delay.
The registration between the virtual and real objects in the scene is critical to maintain
the illusion that both sets of objects exist in the same world. The virtual first down line could not
be mistaken for a real line if it didn’t stay firmly stuck to the ground. Virtual objects must be
stable, without jitter or drift, and they must be accurate – a stable object won’t look right if it is
in the wrong place. Registration error and uncertainty are also factors in augmented reality
interfaces, where the user must be able to unambiguously select a single virtual object [19].
2.3 Augmented Reality Implementation
Most of the work in augmented reality is focused on the visual presentation. Other
aspects needed for a totally immersive experience include sound and touch (like tactile feedback,
wind, and heat). Since our work is in the visual arena, these other elements will not be discussed
further here.
The process for generating the augmented view can be described in three basic steps:
1. Sense the real environment
2. Add virtual elements
3. Present the combined result
2.3.1 Sensing the Real Environment
There are many ways that the system can obtain the required knowledge about the real
world. Prior knowledge can be provided, such as the location and dimensions of objects and
cameras, light sources, 3D models of objects, colors and materials of objects, or motion
characteristics of objects. Using prior knowledge of the environment has the advantage that the
data given is stable and accurate, and processing time is not needed to determine this information
while running. The disadvantage is that this makes the application less flexible and limited to a
single location. At times, it is impossible to proceed without prior knowledge. For example, a
system to navigate around a city would require a city map as input.
Another method for determining the location and orientation of various objects is to use
specialized sensors such as magnetic or acoustic trackers. These sensors may require a
transmitter to be mounted on the object to be tracked, and a receiver elsewhere in the
environment. These sensors generally do a reasonable job of tracking the objects within a finite
range, whether or not the object is visible in the scenario. They also make it easy to get the
position and orientation of objects with a simple interface. Drawbacks may include the need to
add sensors and wires to the tracked objects, possibly hindering their motion, and a limited area
of operation. The accuracy of the measurements varies with the device, and may or may not
provide the precision necessary for the application.
Since augmentation of the observer’s view usually involves sampling that view with a
camera, that video can be used to learn about the environment. In fact, since registration with
this view is more important than physical accuracy for making the virtual objects look like they
belong in the real world, use of this view is essential for creating a convincing environment. The
other sources may be used to supplement the visual information, but fine tuning based on the
visual data is needed for a seamless blend. Unfortunately, extracting useful information from the
visual input is not trivial, and tends to involve significant processing resources. Combined with
the real-time requirement, this makes the use of vision techniques for augmented reality a challenge.
Computer vision techniques have been developed to sense the real environment by
performing tasks including finding the camera pose in a static environment, tracking rigid and
nonrigid moving objects, determining the depth of objects in the scene, and object recognition,
but many of the existing algorithms run too slowly to be used in a real-time environment.
Moore’s Law has increased processing speed to the point that many algorithms are possible
today that were too complex to run a few years ago, but speed is still lacking in most algorithm
implementations. One of the focuses of our research in augmented reality is finding algorithms
that run fast enough to meet the real time requirement.
2.3.2 Adding Virtual Elements
In order to make the virtual elements added to a real environment appear like they belong
there, care must be taken to ensure that they are properly registered, occluded, and lit. Lack of
any of these elements will result in a mismatch between the real and virtual parts of the scene.
As discussed earlier with augmented reality challenges, virtual objects must be registered
with the real scene so that they are stable and properly positioned. If a virtual object is attached
to a real object, its position relative to that object must remain consistent, regardless of object
and camera motion. If it jumps around or drifts, the observer will be distracted. If the camera
moves past a static scene made up of both real and virtual objects, both types of objects must
appear to remain stationary. The accuracy of the registration required depends on the resolution
of the display presented to the observer. The error in the registration should be less than the
smallest perceivable difference on the display device.
Occlusion is necessary when a virtual object appears to be behind a real object. One
example is when a football player passes in front of the virtual first down line. If the line did not
disappear at that point, it would appear to be floating in space instead of sticking to the ground.
In the case of the first down line, the occlusion is detected when the pixel that should be part of
the line does not match the grass or dirt colors. In other cases, occlusion determination must
compare the depth (distance from the camera) of the real and virtual objects so that the closer
object occludes the further object.
The other major factor in making virtual objects appear real is lighting. In some cases,
like labels for scene objects, constant lighting may be acceptable, but in most cases the virtual and
real lighting should match. A virtual billboard must change when shadows fall on it, or when the
sun goes behind a cloud. More elaborate lighting involves casting shadows from virtual objects,
and using virtual lights to illuminate real objects.
2.3.3 Presenting the Combined Result
There are a wide variety of methods for presenting the augmented reality to the observer.
We will focus on visual methods, although a complete system can include aural, tactile and
olfactory cues as well. The displays vary in complexity, cost, mobility, and degree of
immersion.
Monitors and projectors are the simplest display devices. The immersion can range from
limited, with a single monitor, to complete, with a dome or displays on all sides. Multiple
people can usually view this type of display at once. The screen or monitor is typically
stationary, but can be tracked in orientation, providing a window into the augmented reality
world.
A head mounted display (HMD) is more complex, since the view must change with the
viewer’s head position, but more flexible as well. There are two main techniques for displaying
the virtual elements combined with the real scene. The first is optical see-through HMDs, where
the visor is transparent. Additional graphics can be projected on the visor to supply the virtual
elements. The second method is video see-through HMDs, where the viewer sees only an
opaque screen in front of his eyes. The real world view is supplied by one (for monocular view)
or two (for stereo view) cameras mounted on the helmet. Graphics can be added to this video
stream for the virtual objects. It is easier to register the real and virtual objects with the video
see-through HMD, since the actual observer’s view is available, but this process introduces a
time delay. The optical see-through HMD has no delay in viewing the real world, which makes
it a challenge to add graphics that are properly synchronized. In addition, the observer’s view is
not directly available with this system. Head trackers may be used with either type of HMD to
aid in determining the viewer’s position and orientation. HMDs tend to provide more immersion
than monitors or projection screens, but can only be viewed by one observer at a time.
2.3.4 Tools
There are several tools and APIs that can be useful for augmented reality applications.
AR Toolkit [2] is an open source software library that can be used to calculate the camera
position and orientation relative to physical markers. It requires that fiducial markers be placed on
the tracked object. If these markers are occluded, the lighting conditions are poor, or the
background has low contrast, stable tracking is not possible [40].
Canon Inc. and the Japanese government teamed to create the Mixed Reality System
Laboratory Inc. to conduct mixed reality research [93]. They developed a package called “MR
Platform” which includes a stereo video see-through HMD and an SDK for a Linux PC
environment [98]. A hardware tracking device determines the rough head position and
orientation, and this is refined using vision based methods to register the locations of markers.
Their AR2Hockey (Augmented Reality AiR Hockey) system uses color segmentation to locate
the markers, and infrared LEDs on the mallets are sensed with an infrared camera to get the
mallet positions on the plane of the table. Their MR Living Room uses infrared landmarks for
video registration. This requirement to modify the environment makes the tool incompatible
with our goals.
OpenCV is an open source library from Intel [14]. It seeks to “aid commercial uses of
computer vision in human-computer interface, robotics, monitoring, biometrics and security by
providing a free and open infrastructure where the distributed efforts of the vision community
can be consolidated and performance optimized.” We have made extensive use of this library in
our work. The routines are generally fast. Most of the functions perform low level image
processing, but there are some high level algorithms as well, including face detection code that
will be described later.
DirectShow is the subset of Microsoft's DirectX for handling multimedia for Windows. It is
based on the concept of a “filter graph”. The three basic filter types are source, transform and
render. Source filters have only output pins, and generate video or audio content, typically from
a file or from an external device such as a camera. Transform filters have both input and output
pins, and are used to modify the content, for example by changing the color depth, resolution,
compression, brightness or adding overlays. Render filters have only input pins and typically
render the result to devices like a screen or speakers or to a file. These filters can be connected
under software control to perform almost any operation on multimedia streams. Filters exist to
handle most cameras and file formats, so this API allows applications to deal with a wide variety
of devices with a minimum of development effort. We have created an introduction to both
DirectShow and OpenCV [90], which is widely used.
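The source → transform → render chain can be mimicked with ordinary generators (a Python stand-in for illustration only; DirectShow itself is a COM-based C++ API, and the functions below are hypothetical):

```python
def source():
    """Source filter: output pins only; generates frames."""
    yield from ([1, 2], [3, 4])

def transform(frames):
    """Transform filter: input and output pins; e.g. brightness adjust."""
    for f in frames:
        yield [p + 10 for p in f]

def render(frames):
    """Render filter: input pins only; consumes and 'displays' frames."""
    return [f for f in frames]

print(render(transform(source())))  # → [[11, 12], [13, 14]]
```

Connecting filters in software then reduces to composing these stages, which is the flexibility the filter-graph model provides.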
2.4 Augmented Looking Glass System Description
We have developed interactive augmented reality systems that simulate a mirror. At first
glance, users see themselves and their surroundings. A second look reveals that enhancements
have been added, such as a picture on a blank placard or hats on the people. The users can
manipulate the virtual objects and interact with them.
The goal is to use minimal hardware and setup, with no special environment. The
hardware consists of a single fixed camera for input, a standard single processor PC for
processing, and a stationary monitor for display. The system should work in all but the most
extreme lighting. (There is not much that can be done if the camera senses all black or all white
pixels.) No special room, fiducial markers or blue screen will be used, nor will the room setup
be known in advance. Any training or calibration is simple and fast, and the goal is to require
none.
The application receives the video stream from the camera, processes it, then renders the
original video plus the augmentation in real time. Microsoft’s DirectShow is used to get the
video frames from the camera. The processing is done with the aid of Intel’s OpenCV library.
The rendering is done either with OpenGL, to take advantage of hardware acceleration of
rendering tasks such as texture mapping, or simple bit-mapped graphics within the DirectShow
transform filter. This overview is shown in Figure 7. The processing requires three basic steps:
1. Detect object
2. Fit model (pose estimation)
3. Augment
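These three steps can be sketched as a per-frame loop (all function bodies below are toy stand-ins, not the dissertation's implementation):

```python
def detect(frame):
    # toy detector: the brightest pixel stands in for the tracked object
    return max(range(len(frame)), key=lambda i: frame[i])

def fit_model(frame, location):
    # toy pose estimate: report the location as a 1-D "pose"
    return {"x": location}

def augment(frame, pose):
    # toy augmentation: mark the augmentation site in a copy of the frame
    out = frame[:]
    out[pose["x"]] = -1
    return out

frame = [3, 9, 1]
print(augment(frame, fit_model(frame, detect(frame))))  # → [3, -1, 1]
```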
Figure 7: Overview of the AR application
Detecting the object to be augmented and localizing it in the video frame is the first step.
Depending on what object we are tracking, there are several techniques that can be used for this
step, which may include color detection and background subtraction. If the color of the object is
known, the object can be located by finding large concentrations of pixels with the given color.
Since the RGB values recorded by the camera vary with lighting, care must be taken to handle
various lighting conditions. Background subtraction can be used to detect moving objects.
Although there are many variations, the basic idea in background subtraction is to build a model
for the fixed background, then mark pixels that differ from this background as belonging to
moving objects. More specialized methods can be used for detecting certain objects, such as
faces, in video.
Once an object is located, its pose (position and orientation) must be determined in order
to augment it. The precision required for this step depends on the application. For example, if
the goal is to put a box around the object, the orientation is not needed and the position can be
approximate. On the other hand, tasks like adding a hat or glasses to a person’s head require that
both the position and orientation be precisely known, or else the augmented portions of the scene
will not appear to be attached to the person’s head. The method for determining the pose
depends on the type of object being tracked. For tracking a solid shape like a rectangle, the
locations of the corners (or equivalently, the edges) must be determined. This is sufficient and
necessary for overlaying a perspective corrected image, and with knowledge of the rectangle size
and camera focal length, the Euclidean position and orientation can be determined. For more
complex objects such as faces, pose can be determined by tracking features like eye and mouth
corners, or by comparing the input image with a database of faces with known poses. For any
method, the task becomes more challenging when parts of the object being tracked are occluded
by other objects or go outside the camera field of view.
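For the rectangle case, the depth recovery mentioned above follows the basic pinhole relation Z = fW/w for a fronto-parallel rectangle of known real width W imaged w pixels wide by a camera with focal length f pixels (a simplified sketch; the numbers are hypothetical):

```python
def distance_from_width(f_pixels, real_width_m, imaged_width_px):
    # pinhole projection: imaged_width = f * real_width / distance,
    # so distance = f * real_width / imaged_width
    return f_pixels * real_width_m / imaged_width_px

print(distance_from_width(800, 0.30, 120))  # → 2.0 (metres)
```

The full Euclidean pose additionally needs the corner locations to recover orientation, which this fronto-parallel sketch omits.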
With the pose known, augmentation is primarily a graphics rendering task. An accurate
pose estimate leads to proper registration of virtual items in the scene. Depending on the
application, the rendering may need to take occlusions and lighting into account. When
rendering with OpenGL, occlusions may be handled by initializing the depth buffer with
appropriate values so that occluded portions of the virtual objects are not rendered. Descriptions
of occluding objects may be supplied to the application, learned with training, or calculated from
the input video. Finding occlusions from the input video is generally done by either tracking all
objects in the scene and their relative depth, or finding parts of the object that don’t fit the model.
For example, we could track all the people in the scene and deal with occlusion when their
silhouettes overlap, or mark all the non-red pixels within the borders of a red object as
occlusions. Likewise, lighting can be given beforehand, learned from training, or determined
from the input video.
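The depth-based occlusion handling amounts to a per-pixel comparison, which the OpenGL depth buffer performs in hardware; a toy 1-D sketch with hypothetical values:

```python
def composite(real_rgb, real_depth, virt_rgb, virt_depth):
    # the virtual pixel wins only where it is closer to the camera
    # than the real scene; None marks "no virtual content here"
    out = []
    for rc, rd, vc, vd in zip(real_rgb, real_depth, virt_rgb, virt_depth):
        out.append(vc if vd is not None and vd < rd else rc)
    return out

# a virtual line at depth 5: visible over grass at depth 9, but hidden
# where a player at depth 2 passes in front of it
print(composite(["grass", "player"], [9, 2], ["line", "line"], [5, 5]))
```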
2.5 Related Work
In this section, we will present an overview of representative work in areas covered by
our work, including color constancy, virtual mirror applications, shape tracking, face detection,
face tracking, and face pose estimation.
2.5.1 Color Constancy
Variable illumination is a factor in most image processing applications using images
recorded by cameras. Two images of the same static scene will not be identical due to noise in
the camera and lighting changes. Any outdoor application must deal with light changes,
especially on cloudy days. Many indoor environments derive some of their light from windows,
resulting in lighting changes similar to outdoor environments. Even without windows (or at
night), reflections from the walls cause lighting changes when there is motion outside the
camera view, so two images of an identical scene can still differ.
Any application that compares successive images, including motion detection for surveillance,
motion compensation, color-based object detection and video indexing, must deal with changes
in the recorded pixels due to illumination changes.
The field of color has been widely studied from several perspectives, including human
perception, computer graphics, and computer vision. Human perception researchers seek to
understand and model how humans convert a spectrum of light input to an image and assign
color labels, as well as other aspects like the emotional response of humans to different colors.
Of particular relevance to the subject at hand is the topic of color constancy, or the
human ability to determine an object’s color correctly most of the time in widely differing
lighting conditions. A camera may record snow with a bluish tint if it erroneously uses indoor
light settings, but a human never sees blue snow. We can tell that a banana is yellow (and not an
unripe green) with various indoor and outdoor lights, in shadow or sunlight, during midday or
with the reddish rays of sunset. Getting a machine to do the same is not trivial. Efforts in this
area will be described in more detail below.
In the field of computer graphics researchers seek to answer the question, “Given the
color and material properties of an object and the lighting conditions, what is the proper color to
display?” This decision takes into account the physics of light reflection and refraction, the
response of rods and cones in the human eye, and the properties of the display device in order to
produce RGB values that will stimulate the eye to produce the same response as the real scene.
Computer vision approaches the problem from the other direction, trying to answer
questions like, “Given the color recorded by the camera, what is the likely object color?” and “Is
the color of the object imaged by this pixel the same or different than the one in the previous
video frame?” Understanding how our eyes interpret the visible spectrum provides insight into
the problem, but it is not necessary for computers to arrive at the answer the same way.
Our interest is in applications tracking a moving object. Even with a fixed light source
and fixed camera settings, the RGB values of the object recorded by a camera will change as the
object moves, due to changes in the surface normal of the object and shadows. Since color is an
important feature in object detection and tracking, we seek to find a color representation that will
remain constant as the object moves as well as when the illumination changes in intensity (such
as the sun going behind a cloud), or the camera adjusts to changing scene brightness.
For this reason, we are not interested in estimating the illuminant, unless that helps to
find an invariant color representation. We are also not very concerned about changes in
illuminant color, since for most tracking applications the light source is a uniform color that is
close to white (like daylight or incandescent lamps). We are, however, extremely interested in
illumination intensity changes, whether from shadows, changes in the surface normal, or varying
light levels.
Many color constancy algorithms assume that the illumination is uniform (or at least
continuous) across the scene, and the scene content does not vary between images. While this is
true in many of our experiments, we do not take advantage of this because we seek a color space
that remains constant for a moving object within the scene.
In the discussion that follows, we will present representative work from the large volume
of research that has been done in this area.
Although the uniform lighting and static scene assumptions do not apply to our
application, there has been much work that uses these premises. One longstanding model,
attributed to von Kries [101], is that illumination change can be modeled by scaling each
channel. That is,
(R’, G’, B’) = (αR, βG, γB), (1)
where α, β, and γ are constant across the image. This is also known as the coefficient rule.
Since this is equivalent to multiplying the original vector by a diagonal matrix, it is also called
the diagonal model.
The simplest category of color constancy methods is Gray World algorithms. They
compute a statistic for the entire image, like the mean or the max, and find α, β, and γ such that
Equation 1 scales each channel so that the statistic matches a preset value. There are many
variations on what statistic to measure and what preset value to use. For example, if the scene
contains a white object, scaling each channel so that the maximum is 255 (for 8-bit color values)
will force the object to be white in the corrected image and hopefully correct the remainder of
the scene appropriately. Buchsbaum [15] scaled each channel so that the mean value for each
channel was 50% of the maximum possible value.
This group of algorithms assumes that the lighting changes equally throughout a fairly
static scene. This may fit well with an outdoor camera monitoring a relatively quiet area, but
clearly would not work if a large white object entered or exited the scene, as this would be
misinterpreted as a change in illumination. Local changes, such as a moving shadow, would not
be corrected.
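A minimal Gray World sketch in Python, following the Buchsbaum-style variant above (the synthetic image and the simulated blue cast are invented for the demonstration):

```python
import numpy as np

# Gray World sketch: scale each channel (Equation 1's diagonal model) so
# that its mean equals 50% of the 8-bit maximum.
def gray_world(img, target=127.5):
    """img: H x W x 3 float array; returns the corrected image."""
    means = img.reshape(-1, 3).mean(axis=0)      # per-channel means
    scale = target / means                       # alpha, beta, gamma of Eq. 1
    return np.clip(img * scale, 0.0, 255.0)

img = np.random.default_rng(0).uniform(20.0, 180.0, (8, 8, 3))
img[..., 2] *= 1.25                              # simulate a bluish cast
out = gray_world(img)
# After correction, all three channel means agree at the target value.
```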
To handle localized changes, Gershon et al. [41] used this same idea on smaller regions,
segmenting the image into planar surfaces. A problem with this approach is that improper
segmentation may introduce errors. Edges in the segmentation can be caused by geometric
discontinuities, shadow edges, or specular highlights.
Finlayson et al. [28][29] added sensor sharpening, which maps the original sensors to a
new set by multiplying the original vector by a 3x3 matrix. The performance of coefficient-
based color constancy algorithms is improved by using the transformed sensors. Three methods
for finding the transformation matrix were proposed. Data-based sharpening uses knowledge of
the surface reflectances and illuminants in a particular scene to find the optimal transformation.
Sensor-based sharpening calculates the transformation that will produce the narrowest band
sensors. The third operates on a limited set of illuminants and reflectances to achieve perfect
color constancy.
Barnard et al. [8] evaluated the effect of sharpening on several color constancy
algorithms and found that sometimes it helped and other times it did not. Based on these
observations, they proposed multiple-illuminant-with-positivity sharpening. This method
minimizes the mapping error over a set of plausible illuminants, rather than between two
specific illuminants as in the original approach, since the second illuminant is unknown in
color constancy problems.
Land (of Polaroid fame) and his colleagues made a major contribution to color constancy
with the Retinex theory [59][60][61]. The name comes from a combination of retina and cortex.
Land was trying to model the human visual system, but didn’t know where the color constancy
operation took place. Land also coined the term “Mondrian”, used extensively in the color
constancy literature, because the abstract collections of colored rectangles he used for his tests
reminded him of a painting by Piet Mondrian. In this simplified world of uniform lighting and
sharp borders between colors, he performed experiments that changed our understanding of how
humans perceive color.
One experiment involved illuminating the Mondrian with three independently controlled
light sources, with low, medium and high frequencies. The lights were individually adjusted so
that the energy reflected from a white patch in the Mondrian for the three wavelengths were
equal (1, 1, 1). Human subjects reported the observed color of the patch to be white. The lights
were then changed so that the energy reflected from a yellow patch was (1, 1, 1) in the same
units as before. The human response was expected to be “white”, since the spectral energy from
the yellow patch was the same as the white patch before, but the observer reported “yellow”.
The process was repeated numerous times, with different colors and light settings, and the
observer consistently reported the correct intrinsic color with only minor deviations. The
experiment was also done with the subject demonstrating the ability to match a patch in the
Mondrian to a standardized set of color samples illuminated with a reference white light.
Further experiments showed that humans appear to process inputs from the cones that
sense red, green, and blue wavelengths independently. This correlates with the coefficient rule
that scales each channel separately. Land and his colleagues also demonstrated that human
perception of color is relative across the field of view. The borders between adjacent colors
affect our ability to compare them.
These findings were combined to create the retinex algorithm. In retinex methods, small
changes in image intensity are assumed to be caused by illumination variation, and large changes
are assumed to be from geometry changes. The initial algorithm [60][61] traced paths through
the image, computing ratios between regions, assuming that white would be encountered
somewhere in the image. Subsequent work [59] removed the requirement for a white patch by
calculating the average relative reflectance in a small area using paths from surrounding pixels.
Each path to the designated area is traversed, accumulating ratios of neighboring pixels that
differ by more than a threshold.
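A simplified one-dimensional sketch of this path-based ratio idea follows (the threshold and intensity values are illustrative, and real retinex implementations differ in detail):

```python
import numpy as np

# Step along a path of pixel intensities, keeping ratios only when neighbors
# differ by more than a threshold (a reflectance edge); small changes are
# attributed to illumination and ignored.
def retinex_path(intensities, threshold=0.1):
    """Return relative reflectance along the path, anchored at 1.0."""
    rel = [1.0]
    for prev, cur in zip(intensities, intensities[1:]):
        ratio = cur / prev
        if abs(np.log(ratio)) < threshold:   # small step: assume lighting
            ratio = 1.0
        rel.append(rel[-1] * ratio)
    return rel

# A smooth lighting ramp (small steps) is suppressed, while a sharp
# material edge (a factor of 2) survives in the recovered reflectance.
path = [1.0, 1.05, 1.1, 1.15, 2.3]
rel = retinex_path(path)
```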
The method is effective for Mondrians, with no lighting discontinuities, like shadows or
sudden depth changes, but has trouble when it misses an edge. Implementations of retinex in
MATLAB have recently been published [38][18].
Barnard et al. [9] improved the edge detection in retinex and recovered the illumination
variation across the image. Once the illumination is known, it can be removed. The image is
segmented into regions roughly corresponding to surfaces. A relative illumination map is
calculated by finding the ratio between each point in a patch and the centroid of the patch, then
solving for the relative strengths of the patch centers, assuming that illumination is smooth
across region boundaries.
Gamut mapping algorithms were introduced by Forsyth [36]. Assuming that the scene
contains a wide variety of reflectances, the illuminant is estimated by observing the gamut of
sensor responses. Constraints on the possible illuminants that could produce the observed image
are used. For example, if there are strong red sensor responses, the light is not blue. The
feasible set of illuminants is those diagonal transforms which map the observed gamut to a
subset of the possible gamut under a canonical white light. The results for Mondrians were
reasonable when a diverse set of colors was present. Ambiguities caused by multiple light
sources, varying illumination, shadows, specular highlights, or orientation effects were not
addressed.
Finlayson et al. [30] introduced color by correlation, which computes multiple possible
illuminants, then uses likelihood estimates to choose a single illuminant. Due to the ambiguity
in determining light intensity (is it darker because the light is dimmer or because the objects have
lower reflectivities?) they reduce the sensor responses and the illumination from three
components (R, G, B) to two (R/B, G/B). First, information is collected about which illuminants
produce which image colors. Second, this data is correlated with the colors present in a given
image to calculate likelihoods. Finally, these likelihoods are used to estimate the illuminant.
Unfortunately, since this idea ignores light intensity, it cannot be used for the most common
case, where only the light intensity changes over time.
An extensive evaluation of color constancy algorithms was done by Barnard et al.
[7][10]. The first part analyzes synthetic data, and the second installment uses real non-
Mondrian images. The evaluation includes a wide range of color constancy algorithms,
including gray world, retinex, gamut-mapping, and color by correlation. The experiments tested
the ability of the various algorithms to recover the illuminant color and intensity and the error in
the corrected color image. Due to the difficulties in changing only the light color and intensity
between images, the corrected image color was only evaluated for the synthetic scenes, and
automatic camera aperture was simulated even on the real images. Since our desire is to measure
how constant a color remains while the light intensity varies with real world scenes and cameras,
this study does not provide any useful data towards this end.
In color constancy, the image with unknown illumination is transformed to its equivalent
under a canonical light. Invariants take a different approach, finding quantities in which the
lighting terms cancel out, making them independent of illumination.
Funt and Finlayson [39] use ratios of adjacent colors as an invariant in the task of color
indexing (finding a matching image from a database based on color histograms). These ratios
are actually implemented as the derivative of the logarithm of the image. Assuming that the
unknown lighting is the same at two adjacent pixels, this operation should produce the same
result, independent of light color or intensity. The continuous lighting assumption was also used
in retinex. However, this method does not handle illumination discontinuities (such as those
caused by a geometric discontinuity), and does not discriminate well between colors. For
example, a solid red object and a solid blue object of the same shape would both have log
derivatives of zero (ratios of one) across the interior of the object, and the ratios around the edge
would depend on the background color as much as the object color.
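The invariance of this ratio-based representation is easy to see in a small sketch (the pixel values are invented):

```python
import numpy as np

# Sketch of the color-ratio invariant: the derivative of the log image.
# Scaling a channel by any constant (a diagonal-model lighting change) adds
# a constant in log space, which the derivative removes.
row = np.array([10.0, 20.0, 40.0, 40.0, 5.0])   # one channel of one image row
log_deriv = np.diff(np.log(row))

brighter = 3.7 * row                            # uniform intensity change
log_deriv_bright = np.diff(np.log(brighter))
# log_deriv_bright equals log_deriv: the lighting factor cancels.
```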
Healey and Slater [50] show that changes in illumination cause affine transformations of
a color histogram. They use Taubin and Cooper’s [94] affine invariant moments to describe the
color histograms. The results are claimed to improve on previous methods for color indexing
while using fewer parameters to describe the color pixel distribution. The ideas here are specific
to the object recognition task, and don’t apply to pixel-wise computations or localized
illumination changes. The work is extended in [84] by computing the invariants for smaller
areas instead of the whole image. This allows matches to be found in the presence of partial
occlusion, but still doesn’t help in our quest for pixel-wise invariants.
Gevers and Smeulders [44] analyze several color spaces for invariance to image
conditions, such as viewing orientation, illumination intensity, or highlights. These properties
were derived from a dichromatic reflectance model. Hue, saturation, and normalized RGB color,
as well as two newly proposed color models, c1c2c3 and l1l2l3, were shown to be invariant to a
change in viewing direction, object geometry, and illumination for white light. Hue and l1l2l3
should also be invariant to specular highlights. They also introduced a third new color space,
m1m2m3, which is based on ratios between pixels in two locations. The m1m2m3 space should be
invariant to changes in light color as well. The performance of the color spaces were then tested
in a color-based object recognition experiment. The c1c2c3 and l1l2l3 model improved on the
recognition success with white light, and the m1m2m3 space worked the best when the light color
changed. Since ratios do not distinguish between different colors (e.g., constant red and constant
green will both have ratios between adjacent pixels of 1.0), m1m2m3 is not useful for object
detection or tracking based on color, but may be used to find color edges. We used the c1c2c3
and l1l2l3 models in our experiments.
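A sketch of the c1c2c3 computation and its intensity invariance follows (the formulas are as commonly stated for this model; treat the code as an illustration rather than a reference implementation):

```python
import numpy as np

# Each c1c2c3 component depends only on ratios of channels, so a uniform
# intensity scale -- shading or a light-intensity change -- cancels out.
def c1c2c3(r, g, b):
    return (np.arctan2(r, max(g, b)),
            np.arctan2(g, max(r, b)),
            np.arctan2(b, max(r, g)))

original = c1c2c3(60.0, 120.0, 30.0)
dimmed = c1c2c3(0.4 * 60.0, 0.4 * 120.0, 0.4 * 30.0)   # 60% darker
# original and dimmed agree component-wise.
```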
Finlayson and Schaefer [31] evaluated color indexing using a variety of color spaces,
color indexing methods, and color constancy algorithms that are theoretically invariant to
various types of illumination changes and concluded that none of the techniques performs well
enough for the color indexing task.
Finlayson and Schaefer [32] point out that although hue is generally considered to be
invariant to brightness, many cameras apply gamma correction to adjust for varying scene
brightness. This nonlinear change modifies the hue, as it is usually computed. They propose a
new method for computing hue using logarithms that is invariant to gamma as well as
illumination changes. We used this Log Hue in our experiments.
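The gamma invariance can be sketched as follows (the formula is as commonly stated for this log-hue method, and the channel values are invented): a gamma exponent multiplies every log channel value, and a brightness scale adds the same constant to every log value; both cancel in the arctangent of the two differences.

```python
import numpy as np

# Hue computed from differences of log channel values, invariant to gamma
# and to uniform brightness scaling.
def log_hue(r, g, b):
    lr, lg, lb = np.log(r), np.log(g), np.log(b)
    return np.arctan2(lr - lg, lr + lg - 2.0 * lb)

h_linear = log_hue(100.0, 60.0, 20.0)
h_gamma = log_hue(100.0 ** 0.45, 60.0 ** 0.45, 20.0 ** 0.45)  # gamma = 0.45
# h_linear and h_gamma agree.
```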
Geusebroek et al. [42] derive a complete set of geometrical invariants for various color
invariant sets. Basically, they use these invariants to segment images by finding regions that
have the same material color, even though the RGB color recorded by the camera varies due to
changing surface normal, illumination intensity, or illumination direction across the object.
2.5.2 Virtual Mirror Applications
In this section, we review augmented reality applications that use the concept of a virtual
mirror.
A project called “Magic Mirror” [24] is similar to ours in that it uses a camera and a
screen to simulate a mirror interactively, but it is designed for a large audience, like a theater.
Computer generated effects are overlaid, but the image processing involved is solely motion
detection. The areas where there is motion are highlighted, and used to affect a virtual object,
such as a ball. Figure 8 shows an example from this application. The reported frame rate is 15
Hz.
Figure 8: Screen capture from "Magic Mirror" project. Green highlights show areas where motion is detected.
In another system, Darrell et al. [23] detect faces in real time using stereo cameras, and
distort them on a display. A sample of the output of this system is shown in Figure 9. A half-
silvered mirror is used so that two stereo cameras can be aligned with the optical axis of the
display. Stereo information obtained with the help of special purpose hardware is used to
segment the users, and then skin color detection is applied to identify likely body parts. Face
detection is used to eliminate hands and other non-face parts from further consideration. The
detected faces are then distorted on the output display. The system was reported to run at 12 Hz.
This system localizes faces, but does not need to find the orientation of the faces nor identify
features, such as the location of the eyes. It also uses two cameras, three computer systems and a
dedicated stereo computation board, which is significantly more equipment than our minimal
setup (although computational power has increased significantly since this was done in 1998).
Figure 9: Sample display from the virtual mirror of Darrell et al. which distorts detected faces.
Lepetit et al. [63] demonstrated a face augmentation system that does not require
specialized hardware or markers. It accurately recovers the full 3D pose of the face, allowing
virtual additions, such as glasses or a mustache at a rate of 25 Hz. The demo used the authors’
tracking method [62], which uses a calibrated camera and a 3D model and image library of the
object being tracked. The stability of the tracked objects comes from tracking many points on
the object, so will not work for simple objects, such as a solid rectangle, that only have a few
unique points. A single face fills most of the screen in all of the sample images, such as in
Figure 10. Example movies were shown that had other people in the background, changed the
lighting, and occluded parts of the face without losing track. Sudden, large motion appears to
cause tracking to fail, invoking a recovery scheme. Only one face at a time is tracked. It is not
clear whether nonrigid motion such as facial expressions will cause problems. More details of
the tracking method will be discussed with other face tracking methods.
Figure 10: Screenshots from the face augmentation demo of Lepetit et al.
2.5.3 Rectangle Tracking
Tracking simple objects means few features are available. Tracking methods based on
corners or interest points, such as [69], [48], [88] and [72] are common, but a rectangle only has
four corners, and all four must be accurately located to get the outline. If a single corner is
occluded, an interest point-based tracking method will fail to recover the rectangle from the
video. Tracking of individual corners also tends to be noisy. For tracking a rectangle, three
quarters of the neighborhood around the corner is background, which will likely change as the
rectangle is moved, further complicating the task of tracking the corners.
Edges are more promising features for this task, and have long been used in tracking.
More edge pixels are available than corner pixels, so tracking can still be successful if some are
occluded. Canny [17] is probably the best known edge detection method. The process for
finding the edges uses image gradients, which in turn typically require a convolution of the
image with at least a 3x3 filter for the horizontal and vertical directions, in each of three color
planes, to get color gradients. This operation alone may make edge detection in the whole image
too slow for real-time operation on high resolution images. Methods for tracking edges include
[105] and [99].
Once edges are available, the problem of finding a rectangle first involves finding
straight lines. Popular methods include the Hough Transform [27] and Burns line detector [16].
In the Hough Transform, every edge pixel votes for all possible lines that could go through it.
The line candidates with the most votes are detected as lines. Continuity and endpoints are not
checked. The Burns line detector finds line segments by grouping adjacent edge pixels on a line
with gradients perpendicular to the line.
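The Hough voting scheme can be sketched in a few lines (the resolution, accumulator size, and synthetic edge map are all illustrative):

```python
import numpy as np

# Minimal Hough transform sketch: every edge pixel votes for each
# (rho, theta) line through it, and the most-voted cell wins.
def hough_peak(edge_pixels, n_theta=180, rho_max=100):
    thetas = np.deg2rad(np.arange(n_theta))           # 1-degree steps
    acc = np.zeros((2 * rho_max + 1, n_theta), dtype=int)
    for x, y in edge_pixels:
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + rho_max, np.arange(n_theta)] += 1  # one vote per theta
    r_idx, t_idx = np.unravel_index(acc.argmax(), acc.shape)
    return r_idx - rho_max, t_idx                     # (rho, theta in degrees)

# Edge pixels of the vertical line x = 10; the winning cell corresponds to
# that line (up to the usual (rho, theta) sign ambiguity).
pixels = [(10, y) for y in range(20)]
rho, theta_deg = hough_peak(pixels)
```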
Even if line detection were fast enough for real time, the problem of weeding out the four
lines belonging to the rectangle from the background clutter still remains. For many
environments, there may be few straight lines, but a background like a bookshelf can yield an
abundance of lines.
Color is the other obvious feature, since the rectangle is specified to have a solid color.
Much of the work in color-based tracking and detection has been in the context of skin color
detection, which will be discussed with face detection.
Shape can also be used for tracking a rectangle. One popular technique for tracking a
general curve in dense visual clutter is Isard and Blake’s Condensation [56], an example of a
particle filter. A particle filter tracks multiple hypotheses simultaneously. Each “particle”
represents a possible object state, and has a weight corresponding to the probability of the
current observation (the current frame) given that object state. The hypotheses are stochastically
propagated to the next frame and the weights adjusted based on the new observation. Impressive
results are shown, including tracking the outline of a leaf on a bush in the wind. The algorithm
was reported to run in “near real-time” in 1998. The validity of many hypotheses must be
measured every frame, and each may require samples at numerous points on the curve. As an
example, they show tracking of a human head-and-shoulders silhouette. The probability of a
hypothesis is based on the distance from the curve to a high gradient pixel along the normal to
the curve at 20 to 30 locations. Invariant image features have also been used for tracking general
objects [89].
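The predict-weight-resample cycle of a particle filter can be sketched in one dimension (the motion model, noise levels, and observations below are all invented; Condensation itself uses curve measurements, not a scalar position):

```python
import numpy as np

# Minimal 1-D particle-filter sketch: each particle is a hypothesized object
# state; weights come from the likelihood of the current observation; the
# set is resampled and stochastically propagated to the next frame.
rng = np.random.default_rng(1)
n = 500
particles = rng.uniform(0.0, 100.0, n)       # initial position hypotheses

true_pos = 40.0
for frame in range(10):
    true_pos += 1.0                          # the object drifts right
    measurement = true_pos                   # noiseless observation, for brevity
    w = np.exp(-0.5 * ((particles - measurement) / 5.0) ** 2)
    w /= w.sum()                             # observation likelihoods
    particles = rng.choice(particles, size=n, p=w)   # resample by weight
    particles += rng.normal(0.0, 1.0, n)     # stochastic propagation

estimate = particles.mean()                  # should be near true_pos = 50
```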
2.5.4 Face Detection in Images
Face detection is the problem of determining whether or not an image contains at least
one face. Face localization finds the location and size of each face. However, the distinction
between these two problems is often blurred, with face detection covering both whether or not a
face exists and finding the bounds for each face. We will use the term face detection to include
face localization unless otherwise specified. There are several related research areas that we will
not be exploring. These include face recognition, which looks for matches between the current
face image and a database of faces; face authentication, which validates the claim that a face
image belongs to a given individual; and facial expression recognition, which identifies the
affective state of the given face.
Yang et al. [109] surveyed face detection techniques in images. They listed the
challenges to face detection as pose, presence or absence of structural components, facial
expression, occlusion, image orientation, and imaging conditions. The general categories they
used to characterize the methods are knowledge-based methods, invariant feature approaches,
template matching methods, and appearance-based methods. The detection rates were compared
for ten representative appearance-based methods, but no execution times were given.
Knowledge-based methods use rules defined by human knowledge of what constitutes a
face. Typical rules specify relationships between facial features, like relative position and
symmetry. These methods tend to work well to find frontal faces in uncluttered scenes, but it is
difficult to define rules that are neither so general that many non-faces are detected nor so
specific that faces are missed. Yang and Huang [108] used a multiresolution method using
knowledge-based rules. They found face candidates by identifying patterns of similar colored
pixels at coarse resolutions, when the face occupies only a few pixels. These candidates were
evaluated by looking for facial features in finer resolutions.
Invariant feature methods look for facial features, such as eyebrows, eyes, nostrils, or
mouth, and infer the presence of a face from the location of these features. Lighting, occlusion,
and background clutter can make it challenging to identify these features. Also in this category
are methods that use other low level image components like texture and skin color. A
representative example from this group is from Yow and Cipolla [112]. They detected interest
points using a filter. Edges and interest point characteristics were used to group the interest
points into regions. Each region was labeled as a feature based on a comparison between a
feature vector obtained from the region and a training database. A Bayesian network was used
to evaluate features and groupings for face detection. The method handles faces with different
orientations, but requires a relatively large (60x60 pixels) face image.
Skin color detection is commonly used as a component of face detection. Skin color by
itself is not sufficient for face detection since it also occurs in other body parts, such as hands,
and skin colors may occur in other places in the scene. Even though skin tones vary
considerably, many pixels can be eliminated from consideration because no skin is, for example,
purple or green. One example of color-based skin detection is Kjeldsen and Kender [57]. They
trained a color predicate to segment skin pixels belonging to a hand from background pixels
using both positive and negative training samples.
Texture is also a useful component in a face detection scheme. Augusteijn and Skufca
[2] classified 16x16 pixel subimages as skin, hair, or other. The presence of hair and skin
textures indicates the presence of a face. This technique can handle any face pose, but depends
on the uniqueness of these textures.
Template matching involves correlating a predefined face template with an input image.
In its simplest form, this method does not handle changes in size, shape and face orientation.
Numerous modifications have been proposed to deal with these variations. Templates are often
comprised of edges, including the subtemplates for eyes, nose, mouth and face contour used by
Sakai et al. in 1969 [79]. Govindaraju [45] used curves defined by the hairline and left and right
sides of a face as a template, and then linked contour segments as a basis for face detection.
Templates have also been built from face silhouettes [80] and relative brightness of facial
regions [83].
Appearance-based methods differ from templates primarily in that the patterns are
learned from training data instead of defined by an expert. One of the best known methods for
face recognition has been dubbed “Eigenfaces” because it decomposes the training set of aligned
and normalized images into eigenvectors. The eigenvector decomposition of the test image is
matched with the training database to find the best match. Turk and Pentland [97] applied this
idea to face detection by looking at the different clusters formed when face and nonface images
are projected into the subspace spanned by the eigenvectors. Various machine learning
techniques have been applied to face detection, including neural networks [78], Support Vector
Machines (SVMs) [71], Sparse Network of Winnows (SNoW) [110], Bayes classifier [81] [82]
[54], Hidden Markov Model (HMM) [74], and AdaBoost [100].
2.5.5 Face Detection in Video
Many of the existing face detection algorithms operate on a single image. While an
interactive system that uses video as an input has the disadvantage of requiring real time
operation, it also has the advantage of providing motion information and frame-to-frame
coherency. By limiting the tests for faces to foreground (moving) pixels, the search space is
greatly decreased. As a consequence, static faces like a portrait on the wall may be missed, but
this is acceptable for our augmented reality application. In fact, since the people in front of the
camera are not likely to be upside down, the tops of moving groups of pixels are good candidates
to check for faces. Once a face is found, it can be tracked in subsequent frames.
A similar approach was used by Foresti et al. [35]. They used change detection to find
moving objects, and then analyzed the silhouette of each blob to locate areas where there was a
high probability of finding a human head. Skin color and principal component analysis were
used to find human faces. The system was reported to localize the face successfully 98% of the
time and to work with multiple faces, but no frame rates were given.
Viola and Jones’ [100] AdaBoost cascade provides one of the few single image methods
fast enough for interactive video. Boosting is the general process of combining multiple weak
classifiers to create a single strong classifier. The weak classifiers must be better than random
chance, but don’t have to be much better. AdaBoost [37] was introduced by Freund and
Schapire as the first boosting method to achieve good results. Its name is due to the adaptive
weighting that takes place when building the classifier. The classifier is trained using a set of
labeled samples. The weights for each sample are equal at the beginning, but weights for
correctly classified samples are decreased, while weights for incorrectly classified samples are
increased. This makes later classifiers focus on the “harder” samples.
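The reweighting step can be made concrete with a tiny sketch (the data set and the decision stump are invented for illustration):

```python
import numpy as np

# AdaBoost reweighting: after a weak classifier is chosen, correctly
# classified samples are down-weighted and misclassified samples up-weighted,
# so later rounds focus on the "harder" samples.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-1, -1, 1, 1])                 # true labels
w = np.full(4, 0.25)                         # uniform initial weights

h = np.where(x > 3.5, 1, -1)                 # weak stump, wrong only on x = 3
err = w[h != y].sum()                        # weighted error = 0.25
alpha = 0.5 * np.log((1 - err) / err)        # weak classifier's vote weight
w = w * np.exp(-alpha * y * h)               # shrink hits, grow misses
w = w / w.sum()
# The one misclassified sample now carries half of the total weight.
```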
Viola and Jones used the presence or absence of simple rectangular features as their weak
classifiers. They also introduced a method for combining classifiers in a cascade for more
efficient processing. While standard AdaBoost decides on the classification of a sample based
on the majority vote of all the weak classifiers, cascaded AdaBoost uses multiple stages.
AdaBoost is performed at each stage, and samples that fail a stage are not processed further.
This allows many negative samples to be eliminated with little processing. For this to work, the
“false negative” rate of the earlier stages must be low. The detector is reported to run at 15 Hz on a
Pentium III processor for images that are 384x288 pixels, 15 times faster than the Rowley-
Baluja-Kanade detector [78] (considered the fastest detector at the time, in 2001), and 600 times
faster than the Schneiderman-Kanade detector [82]. Examples given included images with
single and multiple faces. The method is limited to frontal faces and does not use color
information.
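The early-rejection logic of the cascade can be sketched as follows. This is a hedged toy illustration: the stage structure follows the description above, but the scalar "window" and the lambda features are invented stand-ins for Viola and Jones' rectangular features.

```python
def cascade_classify(window, stages):
    """Run one image window through a classifier cascade.

    stages: list of (weak_classifiers, stage_threshold) pairs, where each
    weak classifier is a (weight, feature_fn) pair and feature_fn(window)
    returns +1 or -1.  A window must pass every stage to be accepted, so
    most negative windows are discarded cheaply by the early stages.
    """
    for weak, stage_threshold in stages:
        score = sum(weight * feature(window) for weight, feature in weak)
        if score < stage_threshold:
            return False          # rejected: later stages never run
    return True                   # survived every stage: report a detection

# Toy two-stage cascade over a scalar "window" value:
stages = [
    # stage 1: one cheap test that rejects most negatives
    ([(1.0, lambda w: 1 if w > 10 else -1)], 0.0),
    # stage 2: a weighted vote of two tests
    ([(0.7, lambda w: 1 if w > 50 else -1),
      (0.3, lambda w: 1 if w % 2 == 0 else -1)], 0.2),
]
```

Here `cascade_classify(5, stages)` returns False after evaluating only the first stage, which is exactly how the cascade saves work on the many negative windows in an image.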
Huang and Lai [53] combined these ideas with skin color detection and position
prediction to create a face detection system reported to run at 4 times the video rate at 320x240
pixel resolution on a Pentium IV.
2.5.6 Real-time Face Tracking
Once a face is localized, its position and orientation must be determined in order to
augment it. For example, if a hat is added, it must appear to move and rotate with the head. The
label “face tracking” is applied to everything from tracking the centroid of the face, to tracking
the bounding box of the face, to analyzing facial expressions. For our work, we adopt Toyama’s
definition [96]: “The 3D Face-Pose Tracking Task is the real-time recovery of the six-degree-of-
freedom (6-DOF) pose of a target structure that has a rigid geometric relationship to the skull of
a particular individual, given a stream of live black-and-white or color images.”
Toyama [96] also lists face tracking systems reported in the literature through May 1998.
These systems range from tracking only position to the full six degrees of freedom at various
speeds. They are classified by a number of characteristics, including the algorithm types used for
tracking and recovery, the execution speed, and robustness. While numerous papers on face
tracking have been published since 1998, the basic categories still suffice. These features are:
• Color – Color-based algorithms tend to be fast, but can only be used to find position, not
orientation. They are good for recovery (finding a face that is not being tracked), but are
easily confused by similar colored objects in the background.
• Motion – Pixels with similar motion are clustered.
• Depth – Pixels with similar depth are clustered.
• Edge – Uses the contour of the head.
• Feature – Uses specific facial features. Most of the methods in Toyama’s survey that
tracked all six degrees of freedom used some form of feature tracking.
• Template – Uses a template that covers the whole face. Methods using small templates
are classified with “Feature”. The larger number of pixels involved in tracking with this
method (vs. Feature) provides more robust tracking, but is computationally more
expensive.
• Optic flow – Uses optical flow for tracking.
A representative method that uses color and feature methods is from Bakic and Stockman
[6]. They extracted a region matching the skin color model as a possible face. In this region,
intensity variations were used to find the eyes and nose. These three points were then used to
estimate the head pose. Frame rates of 10 to 30 Hz were reported on an SGI Indy 2 for 320x240
images with accuracy sufficient to determine the region of a computer screen that the user is
looking at.
Optical flow was used by Basu et al. [11] to track rigid head motion in video. They used
a 3D ellipsoidal head model and found the head motion that best matched the observed optical
flow. The ellipsoid was chosen as a compromise between the too-simple plane model and the
complexity of a full head model. The method is claimed to be robust to large angular and
translational motion and to small variations in the initial fit, as well as to remain stable over
extended sequences.
To obtain not only the six parameters describing the position and orientation of the head,
but the 3D location of any point on the head, a 3D face model is often used. One commonly
used generic 3D face model is the CANDIDE model [12]. The Candide-3 model has about 100
vertices. There are shape units to modify the rigid structure to fit a specific face instance and
action units to animate facial expressions. If the wireframe face model can be matched to the
video such that it appears to be painted on the face, then augmentation is simply a matter of
adding more polygons to the model. The challenge is in matching the shape of the generic
model to the specific face, then updating the position and orientation in real time.
Toyama [96] described a face tracking system that integrates several of the available
features using an incremental focus of attention, that is, the tracking is performed at the highest
layer possible given the available information. Layer 1 detected skin color pixels, layer 2
looked for motion near the last location of the object, layer 3 detected the approximate size and
shape of clusters of skin colored pixels, layer 4 detected the principal axis of these clusters, layer
5 matched live images to a stored template, and layer 6 tracked five point features on the face.
When conditions were good, layer 6 determined the correct pose. When tracking failed, the
layers were traversed to recover the face location. The algorithm was reported to run at 30 Hz on
a Pentium II, with recovery in less than 200 ms when tracking was lost.
La Cascia et al. [58] modeled the head as a texture mapped cylinder. Tracking was a
matter of finding the pose that provides the best image registration with the video frame.
Illumination models were used to account for lighting variation. A 2D face detector initialized
the system, and the implementation was reported to run at 15 Hz.
Cootes et al. introduced the Active Appearance Model (AAM). AAM uses an analysis-
by-synthesis approach, evaluating differences between a synthesized model with the
hypothesized pose and the current image. Variations in shape and intensity are combined to get
a model for an object, such as a face. The shape is modeled by the 2D location of vertices on an
image. For faces, they used 122 points. After normalization, principal component analysis
(PCA) was done on the deviation of the shape of the training images from the mean, producing a
set of orthogonal modes of variation, sorted in importance order. Any face shape in the set could
then be reconstructed by applying shape vectors with appropriate weights. Likewise, the
appearance could be modeled by warping each input shape to the mean shape, one triangle at a
time, normalizing the brightness and contrast, then applying PCA to the pixels inside the shape
model to get a mean appearance and a set of appearance modes. PCA was then applied to the
combined shape and appearance data to capitalize on correlations between shape and appearance
for a final set of parameters. Finding the set of parameters to make the model match a test image
would appear to be a difficult high-dimensional optimization, but the authors proposed a way to
guide the optimization process. They perturbed 2D position, scale, orientation, and all of the
model parameters and noted how each affected the error image (the difference between the
model and the test image). For example, the parameter that controls “smile” caused changes in
the mouth region. Using a linear model, estimates for changes in all of the parameters were
computed from the error image, and the image with the minimum error was produced in
relatively few iterations. On faces that were about 200 pixels wide, the process was reported to
take an average of 4.1 seconds on a Sun Ultra. This method recovers facial shape and expression
in addition to 2D pose.
Dornaika and Ahlberg [25] used an Active Appearance Model (AAM) as their face
template, but used the 3D CANDIDE shape instead of the 2D shape. They determined 12 shape
parameters and 6 animation parameters in addition to the global 6-DOF pose. They presented an
alternate faster search method as well as adaptations from using the 3D shape instead of the 2D
shape. They claimed that each video frame can be processed in less than 10 ms. It is not clear
whether this process will work with small faces, since all of the examples showed the face filling
most of the frame.
Matthews and Baker [67] applied recent computational advances in image alignment to
AAMs. In traditional Lucas-Kanade image alignment [5], the goal is to find a set of parameters,
p, which minimizes the error between the original image warped by p and a template. At each
iteration, the remaining error is used to calculate increments for the parameters (Δp). The
parameter estimate used for the next iteration is p + Δp. This is referred to as the forwards
additive approach. Solving for the parameter increments requires the gradient and Jacobian,
which both depend on p, so they must be recomputed at each iteration, making the method
computationally expensive. The detailed cost analysis is in [5]. Instead of warping the image by
p + Δp, it can be warped by Δp, then by p. This is called the forwards compositional approach.
The advantage is that since the warp by Δp is only applied to the original image, the Jacobian
only needs to be computed once at initialization. The gradient must still be computed each
iteration. For the inverse compositional approach, the roles of the image and template are
swapped. The template is warped by Δp and the original image is warped by p. The gradient of
the template need only be computed at initialization, and the Jacobian of the warp is still only
computed once. When these ideas are applied to AAMs, the computational burden of the high
dimensional nonlinear optimization problem can be overcome. The result allows a 2D mesh to
be overlaid on a face in video, but does not provide 3D information about head orientation.
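The forwards additive update described above can be made concrete with a one-parameter example: pure 1D translation. This is an illustrative sketch (numpy and the function name are assumptions of the example, and a real AAM fit optimizes many parameters, not one), but it shows exactly which quantities must be recomputed inside the loop.

```python
import numpy as np

def lk_translation_1d(image, template, p=0.0, iters=25):
    """Forwards additive Lucas-Kanade for a single 1D translation parameter.

    Every iteration re-warps `image` by the current estimate p and
    recomputes the gradient of the warped signal; moving that gradient
    (and the Jacobian) out of the loop is what the compositional
    variants described above achieve.
    """
    x = np.arange(len(template), dtype=float)
    xs = np.arange(len(image), dtype=float)
    for _ in range(iters):
        warped = np.interp(x + p, xs, image)   # image warped by p
        error = template - warped
        grad = np.gradient(warped)             # recomputed each iteration
        dp = np.sum(grad * error) / np.sum(grad * grad)
        p += dp                                # forwards additive: p <- p + dp
        if abs(dp) < 1e-8:
            break
    return p
```

With a smooth signal and a template that is the signal shifted by a fraction of a sample, the loop converges to the true shift in a few iterations.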
Xiao et al. [106] built a non-rigid 3D shape model from the 2D AAM using [107], and
expanded the 2D AAM idea to include a constraint that the vertices of the 2D AAM match a
projection of the 3D model. Although each iteration takes longer, the algorithm is claimed to be
faster because fewer iterations are needed. One of the advantages of this extension is that the 3D
orientation is available as parameters of the 3D model projection. We used a similar idea in our
work and will present details in Chapter 5.
The tracking method used by Petit et al. [62] for their face augmentation demo is not
specifically for faces. It works with any rigid object that has a sufficient number of interest
points. It combines the use of keyframes, which prevents drift over time, with concatenated
transformations, which yield smooth tracking. The system requires a rough 3D model and a few
keyframes of the object to be tracked. In this method, a keyframe is an image of the object and
its corresponding pose. In each keyframe, the interest points that lie on the object are found in
the 2D image, and their 3D coordinates are calculated from the 3D model. A small image of the
neighborhood around each point along with the surface normal is stored for use during tracking.
Several variations of each image patch are created by rerendering the patch from different
angles, using a planar assumption. For initialization, interest points found in the video frame are
matched with the interest points from the keyframes using eigen images that combine the
variations as descriptors. Using eigen images to quickly compare image patches was developed
in [65]. From the matches, the pose can be calculated. The matching process can be repeated for
each keyframe, then the results from the keyframe with the most matches are used. Tracking is
similar to initialization. The keyframe that had the most object area visible in the previous
frame’s pose is used, and each patch in that keyframe is rerendered to match the previous
frame’s pose. Interest points are extracted from the current video frame and matched with the
keyframe patches. The rotation and translation are calculated using numerical minimization with
an M-estimator. The last step adjusts the pose and the 3D point location to minimize the
reprojection error in the current and previous frames. This compensates for initialization errors
and adds frame-to-frame stability.
2.5.7 Face Pose Estimation
Face pose estimation is the problem of finding the head position and orientation relative
to the camera, given a face image. This bridges the gap between face detection, which finds the
region occupied by a face, and face tracking, which requires an initial pose. Approaches include
appearance-based and feature-based methods. Appearance-based methods find a transformation
that maps the observed image to a model known to be in a frontal pose. This is made more
difficult with the variations between individuals, occlusions, lighting conditions, and facial
expressions.
Feature-based methods recover the positions of facial features, such as eye and mouth
corners. The geometry between these features can be used to determine the head position and
orientation. We focus here on methods that recover the in-plane rotation angle.
Yilmaz and Shah [111] automatically detected the eyes, eyebrows, and mouth, and then
used these points to determine the pose. Candidate locations were found by using training
templates for the eyes and eyebrows and an edge map for the mouth, then using likelihoods to
match features with candidates. The feature locations were then used to calculate the orientation
angles.
Fleuret and Geman [33] found the in-plane rotation angle and scale of a detected head by
creating a hierarchy of classifiers that successively restricted the positions of the eyes and mouth
within the detected face region, so that the finest level classifier isolated a particular pose. This
method requires training data for every possible face pose.
Romdhani and Vetter [77] fit a 3D morphable model to an image using an Inverse
Compositional Image Alignment (ICIA) algorithm. They obtained correspondences with errors
less than a pixel, but this took 30 seconds per image on a 2.8 GHz P4 and was not automated.
Hu et al. [51] detected facial components, such as eyes and mouth corners, and then used
the confidences of these detections to compute the face pose. The basic idea is that the
confidence increases as the difference between the actual and reference poses decreases.
3 COLOR ANALYSIS WITH VARIABLE LIGHT INTENSITY
In order to track objects viewed by a camera, it is important to know which changes in
the color of a pixel are due to changes in the reflectance (e.g., a different object or different part
of the object) and which are caused by lighting changes (e.g., shadows or changes in light
intensity). Large changes in light color are certainly possible, but infrequent. The most common
environments for tracking applications are daylight, or in a house or office. While the spectral
content of these light sources varies between different types as well as while varying the power
of a single light, these are all in the broad category of “white light”. In this chapter, we will
analyze the effect of light intensity changes on different diffuse reflectances in order to find a
color space in which color changes are only due to object color, not lighting, assuming that the
light is nearly white.
For this problem, we do not seek to recover the color of the illuminant or even the object
color. Instead, we only want to distinguish between different object colors. To further narrow
our focus, our interest is only in the camera’s perception, not what humans would observe.
This color knowledge is also required for the inverse problem – creating a graphics image
based on scene content that resembles one from a camera. Much theoretical research has been
done in this area: many assumptions are commonly made based on the models created, and a
number of color conversions have been invented to deal with illumination changes, but the
assumptions are rarely tested. In this chapter, we will explore how well those assumptions hold.
3.1 Color Models
The color spaces used in our experiments are defined in this section. Most images and
video originate in RGB, which can be viewed directly on most display systems. The remaining
color spaces are calculated from this RGB data. For clarity, the following equations assume that
R, G, and B are in the range [0, 1], although in the experiments the pixels were integers in the
range [0, 255] and the derived quantities were also scaled to be in the same range.
In computer graphics, OpenGL uses simple models for lighting [104]. For a single point light
with only diffuse reflection from the object, the formula is
$$\text{vertex color} = \left(\frac{1}{k_c + k_l d + k_q d^2}\right)(\vec{L} \cdot \vec{n})\;\mathit{diffuse}_{light} \times \mathit{diffuse}_{material} \qquad (2)$$

where $d$ is the distance between the object and the light; $k_c$, $k_l$, and $k_q$ are the constant, linear, and quadratic terms of an attenuation factor; $\vec{L}$ is the unit vector that points from the vertex to the light position; $\vec{n}$ is the unit normal vector; $\mathit{diffuse}_{light}$ is the RGB light intensity; and $\mathit{diffuse}_{material}$ is the RGB reflectivity of the object. If only $\mathit{diffuse}_{light}$ changes, this matches the coefficient rule from Equation 1. If the change is only intensity ($\mathit{diffuse}_{light}' = \text{scalar} \times \mathit{diffuse}_{light}$), α, β, and γ will be the same, so

$$(R', G', B') = (\alpha R, \alpha G, \alpha B) = \alpha\,(R, G, B) \qquad (3)$$
Specular reflection is included in most models as well, but is beyond the scope of the
current experiment.
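The diffuse model of Equation 2 and the scaling rule of Equation 3 can be checked numerically with a minimal sketch. The function name and argument layout are invented for illustration; the dissertation's own experiments used MATLAB.

```python
import math

def diffuse_color(light_rgb, material_rgb, light_pos, vertex, normal,
                  kc=1.0, kl=0.0, kq=0.0):
    """Diffuse term of the OpenGL point-light model (Equation 2)."""
    # vector from the vertex to the light, its length d, and the unit vector L
    Lvec = [lp - v for lp, v in zip(light_pos, vertex)]
    d = math.sqrt(sum(c * c for c in Lvec))
    Lhat = [c / d for c in Lvec]
    n_dot_l = max(0.0, sum(a * b for a, b in zip(Lhat, normal)))
    attenuation = 1.0 / (kc + kl * d + kq * d * d)
    # per-channel product of light intensity and material reflectivity
    return [attenuation * n_dot_l * li * mi
            for li, mi in zip(light_rgb, material_rgb)]
```

Halving every component of `light_rgb` halves every component of the returned RGB, which is exactly the uniform scaling that Equation 3 predicts.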
3.1.1 YIQ
YIQ is used for U.S. TV broadcasting. The Y component is luminance, while the
chromaticity is encoded in I and Q. For black-and-white televisions, only the Y component is
shown. The transformation is [34]
$$\begin{bmatrix} Y \\ I \\ Q \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.275 & -0.321 \\ 0.212 & -0.523 & 0.311 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \qquad (4)$$
Multiplying R, G, and B by a scalar will result in Y, I, and Q being similarly scaled; however,
adding a constant to R, G, and B will only produce a change in Y, since the coefficients in the I
and Q rows each sum to zero.
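Both properties can be verified directly from the matrix in Equation 4. This is a minimal sketch (the function name is invented for illustration):

```python
def rgb_to_yiq(r, g, b):
    """Equation 4: YIQ from RGB, with all components in [0, 1]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.275 * g - 0.321 * b
    q = 0.212 * r - 0.523 * g + 0.311 * b
    return y, i, q

y1, i1, q1 = rgb_to_yiq(0.8, 0.4, 0.2)
y2, i2, q2 = rgb_to_yiq(0.4, 0.2, 0.1)      # same color at half intensity
assert abs(y2 - 0.5 * y1) < 1e-9            # all three components scale
assert abs(i2 - 0.5 * i1) < 1e-9
assert abs(q2 - 0.5 * q1) < 1e-9
y3, i3, q3 = rgb_to_yiq(0.9, 0.5, 0.3)      # constant 0.1 added per channel
assert abs(i3 - i1) < 1e-9 and abs(q3 - q1) < 1e-9  # only Y moves
```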
3.1.2 HSV
HSV is a more intuitive color space. H is the hue, measured as an angle. S is saturation, which ranges
from 0 to 1 and measures the purity of a color. Adding white pigment to a color will lower its
saturation. The V component is value or intensity and measures the brightness of a color. Thus
shades of gray have S = 0, with black at V = 0 and white at V = 1. Hue is an angle, with
red at 0°, green at 120°, teal at 180°, and blue at 240°. The conversion from RGB to HSV is
[34]:

$$mx = \max(R, G, B), \qquad mn = \min(R, G, B)$$

$$H = 60^\circ \times \begin{cases} \dfrac{G - B}{mx - mn} & \text{if } mx = R \\[1ex] 2 + \dfrac{B - R}{mx - mn} & \text{if } mx = G \\[1ex] 4 + \dfrac{R - G}{mx - mn} & \text{if } mx = B \end{cases} \qquad \text{(clamped between 0° and 360°)}$$

$$S = \frac{mx - mn}{mx}, \qquad V = mx \qquad (5)$$
With the angle clamped in the range [0°, 360°) as listed above, colors near red may jump
from 0° to 359°. With hue angles between -180° and +180°, the wraparound will be colors near
teal instead of red. Since our experiments used red but not teal samples, we used hues between
±180°. Scaling R, G, and B by the same factor will change V by the same factor, but the scale
factor cancels out of H and S. The hue is expected to be unstable when R = G = B, since this
will cause division by zero. Likewise, for dark colors, mx will be near zero, so small changes in
RGB values will cause large changes in saturation.
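These invariance claims are easy to check with Python's standard `colorsys` module, which implements the same conversion with every component scaled to [0, 1] (so H below is the angle of Equation 5 divided by 360°). This is a check sketch, not part of the experimental pipeline:

```python
import colorsys

r, g, b = 0.8, 0.4, 0.2                      # an arbitrary orange-ish sample
h1, s1, v1 = colorsys.rgb_to_hsv(r, g, b)
h2, s2, v2 = colorsys.rgb_to_hsv(0.5 * r, 0.5 * g, 0.5 * b)

assert abs(h1 - h2) < 1e-9                   # hue survives the intensity change
assert abs(s1 - s2) < 1e-9                   # so does saturation
assert abs(v2 - 0.5 * v1) < 1e-9             # value scales with the light
```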
3.1.3 HLS
HLS is very similar to HSV. The three components are hue, lightness, and saturation.
The hue is identical to HSV (including clamping it between -180° and +180° for this experiment,
instead of 0° to 360°). The saturation and lightness capture the same concepts as HSV’s
saturation and value, but the calculation is slightly different. The conversion from RGB to HLS
is [34]:
$$mx = \max(R, G, B), \qquad mn = \min(R, G, B)$$

$$H = 60^\circ \times \begin{cases} \dfrac{G - B}{mx - mn} & \text{if } mx = R \\[1ex] 2 + \dfrac{B - R}{mx - mn} & \text{if } mx = G \\[1ex] 4 + \dfrac{R - G}{mx - mn} & \text{if } mx = B \end{cases} \qquad \text{(clamped between 0° and 360°)}$$

$$L = \frac{mx + mn}{2}$$

$$S = \begin{cases} \dfrac{mx - mn}{mx + mn} & \text{if } L \le 0.5 \\[1ex] \dfrac{mx - mn}{2 - (mx + mn)} & \text{otherwise} \end{cases} \qquad (6)$$
As in HSV, only the L component of HLS should change when the light intensity changes.
3.1.4 CIELAB
The CIELAB color space was created by the Commission Internationale de l'Eclairage
(CIE). The L* component measures lightness, and the hues are changed by varying a* and b*.
The nonlinear relationships are intended to mimic the response of the human eye. It is defined
as:
$$L^* = \begin{cases} 116\left(\dfrac{Y}{Y_n}\right)^{1/3} - 16 & \text{if } \dfrac{Y}{Y_n} > 0.008856 \\[1ex] 903.3\,\dfrac{Y}{Y_n} & \text{if } \dfrac{Y}{Y_n} \le 0.008856 \end{cases} \qquad (7)$$

$$a^* = 500\left[f\!\left(\frac{X}{X_n}\right) - f\!\left(\frac{Y}{Y_n}\right)\right], \qquad b^* = 200\left[f\!\left(\frac{Y}{Y_n}\right) - f\!\left(\frac{Z}{Z_n}\right)\right] \qquad (8)$$

where

$$f(t) = \begin{cases} t^{1/3} & \text{if } t > 0.008856 \\[1ex] 7.787\,t + \dfrac{16}{116} & \text{if } t \le 0.008856 \end{cases} \qquad (9)$$
Since L*, a*, and b* are defined in terms of the CIE XYZ tristimulus values of the measured
point (X, Y, Z) and the reference white point (Xn, Yn, Zn), we also need to convert the measured
RGB value into XYZ. For an unknown generic white light, we can use [22]
$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 1.24 & 0.22 & 0.30 \\ 0.75 & 0.91 & 0.03 \\ 0.00 & 0.02 & 1.76 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \qquad (10)$$
and assume the reference white value is (1, 1, 1).
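Equations 7–9 translate directly into a short sketch (function names invented for illustration). With the reference white assumed to be (1, 1, 1), the XYZ values from Equation 10 can be passed straight in:

```python
def f(t):
    """Equation 9."""
    return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0

def xyz_to_lab(x, y, z, xn=1.0, yn=1.0, zn=1.0):
    """Equations 7 and 8: CIELAB from XYZ tristimulus values."""
    yr = y / yn
    L = 116.0 * yr ** (1.0 / 3.0) - 16.0 if yr > 0.008856 else 903.3 * yr
    a = 500.0 * (f(x / xn) - f(yr))
    b = 200.0 * (f(yr) - f(z / zn))
    return L, a, b
```

At the white point itself, `xyz_to_lab(1.0, 1.0, 1.0)` gives L* = 100 with a* = b* = 0, as the definition requires.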
3.1.5 Chromaticity (Normalized RGB)
In chromaticity space, also called normalized RGB, the vector representing the color is
normalized so the resulting vector components sum to 1.
BGR
BbBGR
GgBGR
Rr++
=++
=++
= , , (11)
Since r + g + b = 1, this is really only a two dimensional color space. A change in light
intensity should not affect this model. Dark colors will result in unstable values as the
denominator approaches zero.
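A minimal sketch of Equation 11 (the function name and the convention chosen for pure black are invented for illustration):

```python
def rgb_to_chromaticity(r, g, b):
    """Equation 11: normalized RGB; the components sum to 1."""
    total = r + g + b
    if total == 0:
        return (1.0 / 3.0,) * 3        # arbitrary convention for pure black
    return r / total, g / total, b / total
```

Scaling R, G, and B by the same factor scales the numerator and denominator alike, so the chromaticity coordinates are unchanged.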
3.1.6 c1c2c3
The $c_1 c_2 c_3$ model was proposed by Gevers and Smeulders [44] to be invariant under
white illumination on matte, dull surfaces, and is defined

$$c_1 = \arctan\!\left(\frac{R}{\max(G, B)}\right), \quad c_2 = \arctan\!\left(\frac{G}{\max(R, B)}\right), \quad c_3 = \arctan\!\left(\frac{B}{\max(R, G)}\right) \qquad (12)$$
Scaling R, G, and B by a constant should not affect this model.
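A minimal sketch of Equation 12 (the function name is invented; `atan2` stands in for the arctangent of the ratio so that black, where both arguments are zero, yields 0 instead of a division by zero):

```python
import math

def rgb_to_c1c2c3(r, g, b):
    """Equation 12: the c1c2c3 invariant color space."""
    # atan2(a, b) == arctan(a / b) for b > 0, but is defined at (0, 0)
    return (math.atan2(r, max(g, b)),
            math.atan2(g, max(r, b)),
            math.atan2(b, max(r, g)))
```

Scaling all three channels by the same positive constant scales both arguments of each `atan2` equally, so the angles, and hence the model, are unchanged.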
3.1.7 l1l2l3
The $l_1 l_2 l_3$ color space was also proposed by Gevers and Smeulders [44] to be invariant
under white illumination on shiny as well as matte, dull surfaces. It is defined:

$$l_1 = \frac{(R - G)^2}{(R - G)^2 + (R - B)^2 + (G - B)^2}, \quad l_2 = \frac{(R - B)^2}{(R - G)^2 + (R - B)^2 + (G - B)^2}, \quad l_3 = \frac{(G - B)^2}{(R - G)^2 + (R - B)^2 + (G - B)^2} \qquad (13)$$

Scaling R, G, and B by a constant should not affect this model. For shades of gray, where
R = G = B, the denominator goes to zero, causing instability.
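A minimal sketch of Equation 13 (function name and the gray-pixel convention invented for illustration):

```python
def rgb_to_l1l2l3(r, g, b):
    """Equation 13; the three components sum to 1."""
    rg, rb, gb = (r - g) ** 2, (r - b) ** 2, (g - b) ** 2
    denom = rg + rb + gb
    if denom == 0:                     # shades of gray: the model is unstable
        return (1.0 / 3.0,) * 3        # arbitrary convention
    return rg / denom, rb / denom, gb / denom
```

Scaling R, G, and B by a constant k multiplies every squared difference by k², which cancels between numerator and denominator.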
3.1.8 Derivative
The invariants used by Geusebroek et al. [42] included three color derivatives in addition
to detectors for various types of edges. The first is H, which is related to the hue of the material:
$$H = \frac{E_\lambda}{E_{\lambda\lambda}} \qquad (14)$$

where $E$ is shorthand for $E(\lambda, \vec{x})$, the energy of a particular wavelength $\lambda$ received from a
scene location $\vec{x}$, and the subscript denotes the variable of differentiation. They show that,
using the dichromatic reflection model with uniform illumination and white light, H is independent of
viewpoint, illumination direction, illumination intensity, and the Fresnel reflectance coefficient.
The second property, Cλ, is interpreted as describing object color regardless of intensity:
$$C_\lambda = \frac{E_\lambda}{E} \qquad (15)$$
With white light on matte surfaces, they show Cλ to be invariant to viewpoint, surface
orientation, illumination direction and illumination intensity. Differentiating once more by
wavelength yields
$$C_{\lambda\lambda} = \frac{E_{\lambda\lambda}}{E} \qquad (16)$$
which shares the same invariant properties.
To get from RGB camera inputs to wavelength derivatives, they assume a Gaussian color
model [43]; that is, measurements are integrated over a range of wavelengths using a
Gaussian aperture function. With a Taylor expansion around λ0 = 520 nm and σ0 = 55 nm for
compatibility with the human vision system, the observed Gaussian spectral derivatives $\hat{E}$, $\hat{E}_\lambda$,
and $\hat{E}_{\lambda\lambda}$ are a good match with the CIE 1964 XYZ basis. The transformation from camera RGB
inputs to spectral derivatives is therefore
$$\begin{bmatrix} \hat{X} \\ \hat{Y} \\ \hat{Z} \end{bmatrix} = \begin{bmatrix} 0.62 & 0.11 & 0.19 \\ 0.30 & 0.56 & 0.05 \\ 0.01 & 0.03 & 1.11 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}, \qquad \begin{bmatrix} \hat{E} \\ \hat{E}_\lambda \\ \hat{E}_{\lambda\lambda} \end{bmatrix} = \begin{bmatrix} -0.48 & 1.2 & 0.28 \\ 0.48 & 0 & -0.4 \\ 1.18 & -1.3 & 0 \end{bmatrix} \begin{bmatrix} \hat{X} \\ \hat{Y} \\ \hat{Z} \end{bmatrix} = \begin{bmatrix} 0.06 & 0.63 & 0.27 \\ 0.30 & 0.04 & -0.35 \\ 0.34 & -0.60 & 0.17 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \qquad (17)$$
3.1.9 Log Hue
Finlayson and Schaefer [32] derived a formula for hue that uses logarithms to cancel the
effects of gamma. It is:
$$H = \arctan\!\left(\frac{\log R - \log G}{\log R + \log G - 2\log B}\right) \qquad (18)$$
If R, G, and B are multiplied by a constant, logarithms of these quantities will make the
gain additive. The differences in both numerator and denominator were designed to cancel out
the extra terms, even before the ratio, so this model should be invariant to changes in light
intensity. For shades of gray, when R = G = B, the denominator will become zero, resulting in
instability.
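Both cancellations can be checked with a minimal sketch of Equation 18 (the function name is invented; `atan2` is used so the quadrant is well defined for r, g, b > 0 with the channels not all equal):

```python
import math

def log_hue(r, g, b):
    """Equation 18, for r, g, b > 0."""
    return math.atan2(math.log(r) - math.log(g),
                      math.log(r) + math.log(g) - 2.0 * math.log(b))

h = log_hue(0.8, 0.4, 0.2)
# a gain k adds log(k) to each log term: it cancels in the numerator and
# contributes 2*log(k) - 2*log(k) = 0 to the denominator
assert abs(log_hue(0.4, 0.2, 0.1) - h) < 1e-9
# a gamma exponent multiplies numerator and denominator by the same factor,
# which leaves the angle unchanged
assert abs(log_hue(0.8 ** 2.2, 0.4 ** 2.2, 0.2 ** 2.2) - h) < 1e-9
```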
3.2 Experimental Setup
To learn how the various color spaces really react to changes in illumination intensity, we
assembled samples of eight different colors, and then videotaped them while the light intensity
varied. Figure 11 shows the first frame from one trial. The colors are orange, red, green, blue,
yellow, brown, beige, and black. The palette includes the three graphics primaries (red, green,
and blue), different “brightnesses” of the same color (brown and beige), and extremes in
intensity (beige and black). A Panasonic PV-DV951 digital video camera (which has 3 CCD
chips) was used. Several of the trials were simultaneously captured with another camera with
only one CCD chip. The results are not shown here, but were very similar.
Figure 11: A sample frame from the experiment. The labels were added later to accommodate printing in black and white.
The experiment was repeated several times, varying the light source (a halogen lamp or
daylight from a window), camera settings (automatic or manual), and the way the illumination
changed (change the light source or change the camera).
The video clip from each trial was captured as an AVI file with as little modification as
possible. This resulted in a 720x480 video file with DV compression for each trial.
An application was created to sample the RGB values at a specified pixel location at each
frame of a video file and write the results to a text file. This application was executed eight
times for each video, sampling a pixel at the center of each color region, creating eight text files
for each trial.
Each text file was used as input to a MATLAB script. The script converted the RGB
values in the file to each of the color spaces listed in Section 3.1 and plotted the results. These
plots will be shown in Sections 3.3 and 3.4.
3.3 Results
Each sequence name has three parts:
1) Indoors or outdoors – This refers to the type of light, not the location. For indoor
lighting, the test was done at night, so no sunlight was present. All the light
comes from an artificial white source. The light was changed using a dimmer
switch. For outdoor lighting, the only light source was indirect sunlight from a
window. The light intensity was changed by opening and closing blinds on the
window.
2) Manual or automatic – This references the camera settings. On automatic, the
camera is free to make adjustments in response to changes in lighting. In manual,
the settings are fixed for the duration of the trial unless otherwise stated.
3) Change light or change camera – The apparent scene illumination was changed
either by varying the light source or the camera. When the camera was changed
during the sequence, the camera was using manual settings and the iris was
opened and closed to change the amount of light at the sensor.
Qualitative results will be discussed in this section. Quantitative analysis will follow.
3.3.1 Indoors Manual Change Light Sequence
In this sequence, no outdoor light was present. The lamp started at a fairly dim level, got
dimmer, and then returned to the original level. Figure 12 shows the RGB levels for each of the
color samples on the vertical axis with time on the horizontal axis. On this and all the rest of the
plots, the eight plots are shown in the same relative position as the eight color tiles in Figure 11.
As expected, all three components at each sampled point decrease with the light, with the
amount of decrease proportional to the signal strength. The black sample in the lower right does
show a change, albeit a small one, since the RGB values were low to begin with.
[Eight panels, one per color sample, plotting the red, green, and blue components against frame number.]
Figure 12: RGB color space for the Indoor Manual Change Light Sequence
The same data is plotted in 3D in Figure 13, using the red, green, and blue axes instead of
time. Since the scene was generally dark, the data is near the origin. The plots are roughly
linear, but curves can be seen, especially with yellow and brown on the bottom row.
[Eight panels, one per color sample, plotting each sample's trajectory along the red, green, and blue axes.]
Figure 13: RGB color space in 3D plot for the Indoors Manual Change Light Sequence
The plots after these RGB values were converted to YIQ are shown in Figure 14. The
luminance (Y), shown in red, reflects the change in the light intensity as expected. Ideally, the I
and Q components should remain constant, since the color did not change, but some variation is
visible, especially in the green sample (top row, third from left). This set of plots also shows that
there is not much difference between the I and Q components of the different colors.
[Eight panels, one per color sample, plotting Y (luminance), I, and Q against frame number.]
Figure 14: YIQ color space for the Indoors Manual Change Light Sequence
The HSV color space is shown in Figure 15. The V component, shown in blue, is the
intensity measure in this color space and reacts to the illumination change as expected. The hue
(red) and saturation (green) should ideally remain constant as the illumination changes, but the
saturation changes, most noticeably for beige. Hue shows fewer changes, but several of the
colors appear to have the same hue, making it not very useful for distinguishing different colors.
For black, hue becomes noisy because the R, G, and B values are equal, and saturation becomes
noisy as the R, G, and B values near zero.
[Eight panels, one per color sample, plotting H (hue), S (saturation), and V (value) against frame number.]
Figure 15: HSV color space for the Indoors Manual Change Light Sequence
The HLS color space is shown in Figure 16. In this color space, the lightness (L)
component shown in green follows the illumination level. Theoretically, the hue and saturation
should remain unchanged, just like in HSV. In fact, the hue in HLS is identical to the hue in
HSV. Even though the formula is slightly different, the saturation (blue) in HLS is very similar
to the saturation (green) in HSV, with the most variation from illumination in the beige sample.
Hue does not distinguish well between six of the colors.
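That the two hues are identical while lightness and saturation differ can be checked directly with the standard-library conversions (variable names are mine):

```python
import colorsys

def hsv_and_hls(r, g, b):
    """Compare hexcone HSV with double-hexcone HLS for one 0-255 pixel."""
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
    h_hsv, s_hsv, v = colorsys.rgb_to_hsv(rn, gn, bn)
    h_hls, l, s_hls = colorsys.rgb_to_hls(rn, gn, bn)
    return h_hsv, h_hls, v, l, s_hsv, s_hls

h_hsv, h_hls, v, l, s_hsv, s_hls = hsv_and_hls(200, 100, 50)
# Hue is computed the same way in both spaces; V = max(R,G,B) while
# L = (max + min) / 2, and the two saturation formulas differ slightly.
```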
[Figure 16: HLS color space for the Indoors Manual Change Light Sequence. One panel per color sample plots H (hue), L (lightness), and S (saturation) against frame number.]
The CIELAB color space is shown in Figure 17. The brightness is in the L* component
(red). The other two components should ideally remain constant, but small changes can be seen.
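CIELAB is reached from RGB through CIE XYZ. The sketch below assumes linear sRGB primaries and a D65 reference white; the exact matrix and white point used in these experiments may differ:

```python
def rgb_to_lab(r, g, b):
    """RGB (0-255) -> CIELAB, assuming linear sRGB primaries and D65 white."""
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
    # RGB -> XYZ (sRGB matrix, D65)
    x = 0.4124 * rn + 0.3576 * gn + 0.1805 * bn
    y = 0.2126 * rn + 0.7152 * gn + 0.0722 * bn
    z = 0.0193 * rn + 0.1192 * gn + 0.9505 * bn
    # Normalize by the reference white, then apply the CIE cube-root curve.
    xn, yn, zn = 0.9505, 1.0, 1.089
    def f(t):
        return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0
    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    L = 116.0 * fy - 16.0
    a_star = 500.0 * (fx - fy)
    b_star = 200.0 * (fy - fz)
    return L, a_star, b_star

# A neutral gray maps to a* = b* = 0; only L* responds to intensity.
L, a_star, b_star = rgb_to_lab(128, 128, 128)
```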
[Figure 17: CIELAB color space for the Indoors Manual Change Light Sequence. One panel per color sample plots the L*, a*, and b* components against frame number.]
In the chromaticity color space shown in Figure 18, all three components are expected to
remain constant, since it is a matte surface under white illumination. We see much more
consistency than with RGB, but it is still obvious from the plots when the illumination changed,
especially in the beige sample (bottom row, third from left). The black sample (lower right) gets
noisy as the RGB values approach zero, and the difference between red and orange is subtle.
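The chromaticity (normalized rgb) computation is just a per-pixel normalization, which also explains the noise at black (the function name and the zero-denominator convention are mine):

```python
def chromaticity(r, g, b):
    """Normalized rgb: each component divided by R + G + B.
    Invariant to a uniform intensity scaling, but unstable near black."""
    s = r + g + b
    if s == 0:
        return 0.0, 0.0, 0.0  # undefined at black; pick a convention
    return r / s, g / s, b / s

# Doubling the illumination leaves the chromaticities unchanged...
c_bright = chromaticity(200, 100, 50)
c_dim = chromaticity(100, 50, 25)
# ...while near black, one-count sensor noise swings them wildly.
c_noise1 = chromaticity(2, 1, 1)
c_noise2 = chromaticity(1, 2, 1)
```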
[Figure 18: Chromaticity space for the Indoors Manual Change Light Sequence. One panel per color sample plots the normalized red, green, and blue components (0 to 1) against frame number.]
The c1c2c3 color space is plotted in Figure 19, and looks indistinguishable from
chromaticity.
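The resemblance is no surprise: in the usual definition (after Gevers and Smeulders), each c component is an arctangent of a channel ratio, so a common illumination factor cancels just as it does in chromaticity. A sketch, assuming that standard definition and degrees for the output:

```python
import math

def c1c2c3(r, g, b):
    """c1c2c3 color space (after Gevers and Smeulders): each component is
    the arctangent of one channel over the max of the other two, here in
    degrees. atan2 keeps the black pixel (0, 0, 0) finite."""
    return (math.degrees(math.atan2(r, max(g, b))),
            math.degrees(math.atan2(g, max(r, b))),
            math.degrees(math.atan2(b, max(r, g))))

# A common illumination scale factor cancels inside each ratio, so the
# c1c2c3 plots end up tracking the chromaticity plots almost exactly.
bright = c1c2c3(200, 100, 50)
dim = c1c2c3(100, 50, 25)
```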
[Figure 19: c1c2c3 color space for the Indoors Manual Change Light Sequence. One panel per color sample plots c1, c2, and c3 against frame number.]
The l1l2l3 color space shown in Figure 20 should also be invariant to illumination
changes, since this is white illumination. We see changes with lighting in the graphs, especially
for green and yellow. Black and blue get extremely noisy, corresponding to times when the R,
G, and B values are equal.
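The noise at R = G = B follows directly from the usual l1l2l3 definition (after Gevers and Smeulders), where the normalizing denominator vanishes for gray pixels; a sketch assuming that definition:

```python
def l1l2l3(r, g, b):
    """l1l2l3 color space (after Gevers and Smeulders): squared channel
    differences normalized by their sum. Undefined when R = G = B, which
    is exactly where the black and blue samples turned noisy."""
    d_rg, d_rb, d_gb = (r - g) ** 2, (r - b) ** 2, (g - b) ** 2
    denom = d_rg + d_rb + d_gb
    if denom == 0:
        return 0.0, 0.0, 0.0  # gray pixel: pick a convention
    return d_rg / denom, d_rb / denom, d_gb / denom

vals = l1l2l3(200, 100, 50)
# The three components always sum to 1, so they live on a 0-1 scale,
# and a common illumination scale factor cancels in the ratios.
```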
[Figure 20: l1l2l3 color space for the Indoors Manual Change Light Sequence. One panel per color sample plots l1, l2, and l3 (0 to 1) against frame number.]
The Derivative color space is plotted in Figure 21. The hue, shown in red, shows some
deviation, especially with the blue and black samples, but Cλ and Cλλ stay fairly constant during
the trial.
[Figure 21: Derivative color space for the Indoors Manual Change Light Sequence. One panel per color sample plots H, Cλ, and Cλλ against frame number.]
The last color space is Log Hue, shown in Figure 22. While this uses only one
parameter, it remains fairly constant for most of the samples. Blue is the notable exception, and
black gets noisy. These correspond to the times when the R, G, and B values were equal. The
discriminative capability of log hue is not exceptional, since brown and beige as well as red and
orange have very similar values.
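The exact log-hue formula used in these experiments is not reproduced here; one common log-opponent formulation (after Fleck and Forsyth) illustrates why a single hue angle can be nearly illumination-invariant yet degenerate at gray:

```python
import math

def log_hue(r, g, b):
    """Log-opponent hue in degrees (a Fleck-and-Forsyth-style sketch; the
    exact variant used in the experiments may differ). Logs turn a common
    illumination scale factor into an additive constant, which the
    opponent differences then largely cancel."""
    lr, lg, lb = (math.log(c + 1.0) for c in (r, g, b))
    rg = lr - lg                # red-green opponent
    by = lb - (lr + lg) / 2.0   # blue-yellow opponent
    # For gray (R = G = B) both opponents are 0 and the angle is degenerate.
    return math.degrees(math.atan2(rg, by))

h_bright = log_hue(200, 100, 50)
h_dim = log_hue(100, 50, 25)   # same color under half the light
```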
[Figure 22: Log Hue for the Indoors Manual Change Light Sequence. One panel per color sample plots log hue against frame number.]
3.3.2 Indoor Manual Change Iris Sequence
This sequence was also indoors, with no daylight present. The lighting was fixed for the
duration of the trial, with the iris on the camera opening and then closing. The RGB values
recorded during this clip are shown in Figure 23. For all the colors except black, one or more
components saturated and were clipped at 255. While in saturation, four of the samples (green,
blue, yellow and beige) were recorded as white (255, 255, 255). The same data is plotted in 3D
in Figure 24. The inflection points are caused by clipping. For example, the beige sample clips
in red first, then green, before saturating all three colors.
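The staged clipping is easy to reproduce: scaling a color whose channels differ and clipping each at 255 bends the trajectory toward white one channel at a time (the RGB values below are illustrative, not the measured sample):

```python
def scaled_and_clipped(rgb, gain):
    """Simulate opening the iris: scale the sensor response by a gain,
    then clip each 8-bit channel at 255 the way the camera does."""
    return tuple(min(255, round(c * gain)) for c in rgb)

beige = (180, 160, 120)  # illustrative values, not the measured sample
samples = [scaled_and_clipped(beige, g) for g in (1.0, 1.5, 2.0, 2.5)]
# Red clips first, then green, then blue, so the recorded color bends
# toward white in stages: the inflection points seen in the 3D plots.
```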
[Figure 23: RGB color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots the red, green, and blue components (0 to 255) against frame number.]
[Figure 24: RGB color space in 3D for the Indoors Manual Change Iris Sequence. One panel per color sample plots its trajectory along the red, green, and blue axes.]
In the YIQ space shown in Figure 25, the luminance in Y (red) shows the change in
illumination, and gets saturated in the same four samples. The worst of the fluctuations in I and
Q occur during saturation, but the signals are not constant outside the saturation time.
[Figure 25: YIQ color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots Y (luminance), I, and Q against frame number.]
Both the HSV color space in Figure 26 and HLS in Figure 27 show hue that is fairly
constant except when saturation occurs, although the hue does not discriminate well between the
samples. The value (blue) in HSV and the lightness (green) in HLS show the expected
illumination change. The saturation component (green in HSV and blue in HLS) changes
considerably even when the RGB values are not saturated.
[Figure 26: HSV color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots H (hue), S (saturation), and V (value) against frame number.]
[Figure 27: HLS color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots H (hue), L (lightness), and S (saturation) against frame number.]
The CIELAB color space in Figure 28 shows the change in light intensity in the L*
component (red). Change can be seen in the other two components as well, most notably in
the b* component (blue) for the yellow sample (lower left). Also, the a* and b* components are
close to zero for several of the samples, making it difficult to distinguish between colors.
[Figure 28: CIELAB color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots the L*, a*, and b* components against frame number.]
Once again, the chromaticity in Figure 29 and the c1c2c3 color space in Figure 30 look
nearly identical. The flat part in the middle of most of the plots corresponds to one or more of
the RGB components being in saturation, but the signals vary considerably outside of saturation.
[Figure 29: Chromaticity color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots the normalized red, green, and blue components (0 to 1) against frame number.]
[Figure 30: c1c2c3 color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots c1, c2, and c3 against frame number.]
The l1l2l3 space is displayed in Figure 31. The saturation effects are dramatic, but the
signals are remarkably constant when not saturated. The black sample is very noisy, due to the
R, G, and B values being equal.
[Figure 31: l1l2l3 color space for the Indoor Manual Change Iris Sequence. One panel per color sample plots l1, l2, and l3 (0 to 1) against frame number.]
The derivative color space is shown in Figure 32. Once again the hue component (red)
varies significantly, while the other two remain more constant, even during saturation.
[Figure 32: Derivative color space for the Indoors Manual Change Iris Sequence. One panel per color sample plots H, Cλ, and Cλλ against frame number.]
Finally, the Log Hue in Figure 33 is relatively constant when not in saturation, but there
is very little discriminating power on the bottom row of samples.
[Figure 33: Log Hue for the Indoor Manual Change Iris Sequence. One panel per color sample plots log hue against frame number.]
3.3.3 Indoor Manual Flashlight Sequence
Color detection algorithms need to deal with light changes that affect only part of the
scene as well as global illumination changes. There are schemes that adjust the entire frame to
maintain a constant maximum, average, or range, which may cancel out some of the variations seen
in the previous experiments, but there is still a need to recognize a color when the light changes
in a subset of the image. To address this case, we used fixed manual camera settings and fixed
indoor lighting with the exception of a flashlight, which traced a path from right to left along the
bottom row, then left to right along the top row. The RGB plots in Figure 34 show a spike as the
flashlight beam crossed each sample. Orange, green, yellow and beige saturated in at least one
component. There were no cases where the RGB values were zero, and the only time the RGB
values were all equal was during saturation. The same data is plotted in 3D in Figure 35.
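Why a whole-frame adjustment cannot fix a local light change can be sketched with a simple gray-world-style correction (illustrative only; the function name and target value are mine):

```python
def normalize_to_average(frame, target=128.0):
    """Global gray-world-style correction: scale every pixel so that the
    frame's mean channel value hits a fixed target (illustrative only)."""
    values = [c for pixel in frame for c in pixel]
    gain = target / (sum(values) / len(values))
    return [tuple(min(255.0, c * gain) for c in pixel) for pixel in frame]

# A global gain undoes a global light change, but when a flashlight
# brightens only one sample the same gain is applied everywhere, so the
# untouched pixels get darkened instead of the lit one being restored.
dim_out = normalize_to_average([(50, 50, 50), (50, 50, 50)])
spot_out = normalize_to_average([(200, 200, 200), (50, 50, 50)])
```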
[Figure 34: RGB color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots the red, green, and blue components (0 to 255) against frame number.]
77
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-orange.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-red.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-green.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-blue.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-yellow.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-brown.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-beige.txt
green
blue
0100
200
0
100
200
0
100
200
red
RGB cam1-in-man-flashlight-black.txt
green
blue
Figure 35: RGB color space in 3D for the Indoor Manual Flashlight Sequence
The YIQ color space in Figure 36 shows the expected spike in Y (red), but also shows
changes in I and Q, even when there was no saturation.
[Figure 36: YIQ color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots Y (luminance), I, and Q against frame number.]
HSV in Figure 37 and HLS in Figure 38 show the expected spike in value (blue in HSV)
and lightness (green in HLS). The hue is consistent in both except when saturated, but the
saturation changes even in samples that were not saturated.
[Figure 37: HSV color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots H (hue), S (saturation), and V (value) against frame number.]
[Figure 38: HLS color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots H (hue), L (lightness), and S (saturation) against frame number.]
The CIELAB color space is shown in Figure 39. The L* component (red) shows the
increase in brightness when the flashlight beam moves across each sample, and the a* and b*
components appear stable. There isn’t much difference in the a* and b* components between
the brown, beige, and black samples.
[Figure 39: CIELAB color space for the Indoor Manual Flashlight Sequence. One panel per color sample plots the L*, a*, and b* components against frame number.]
For both the chromaticity space in Figure 40 and the c1c2c3 space in Figure 41, there are only small bumps where the flashlight beam crossed each sample, except where the signal saturated.
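The insensitivity of both spaces to the beam follows from their normalization; a minimal sketch (assuming the c1c2c3 definition of Gevers and Smeulders, the arctangent of each channel over the maximum of the other two; the pixel values are hypothetical):

```python
import math

# Sketch: normalized chromaticity and c1c2c3 are (ideally) invariant to a
# uniform intensity scale, so the flashlight beam produces only small bumps
# until the sensor saturates.

def chromaticity(r, g, b):
    s = r + g + b
    return (r / s, g / s, b / s)

def c1c2c3(r, g, b):
    # atan2 equals arctan(x / y) for positive arguments and also
    # handles the max(...) == 0 case safely.
    return (math.degrees(math.atan2(r, max(g, b))),
            math.degrees(math.atan2(g, max(r, b))),
            math.degrees(math.atan2(b, max(r, g))))

base = (180, 90, 30)                          # a hypothetical orange pixel
lit = tuple(1.3 * v for v in base)            # flashlight adds 30%, no clipping
print(chromaticity(*base), chromaticity(*lit))  # identical up to rounding
print(c1c2c3(*base), c1c2c3(*lit))              # identical up to rounding
```

Once any channel clips at 255, the scaled values no longer share a common factor, which is why bumps appear exactly at saturation.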
[Plot panels omitted: chromaticity traces (r, g, b vs. frame number) for all eight color samples.]
Figure 40: Chromaticity color space for the Indoor Manual Flashlight Sequence
[Plot panels omitted: c1, c2, c3 traces vs. frame number for all eight color samples.]
Figure 41: The c1c2c3 color space for the Indoor Manual Flashlight Sequence
The l1l2l3 space in Figure 42 shows consistent values except when the RGB signal was
saturated, although the blue and black samples are quite noisy.
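The noise in the dark samples follows from the form of l1l2l3 (assuming the usual definition as normalized squared channel differences); a small sketch with hypothetical pixel values:

```python
import random

# Sketch: l1l2l3 divides squared channel differences by their sum, so for
# near-gray pixels (R ~ G ~ B) the denominator is tiny and sensor noise
# dominates the result.

def l1l2l3(r, g, b):
    d = (r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2
    if d == 0:
        return (1 / 3, 1 / 3, 1 / 3)   # undefined; an arbitrary convention
    return ((r - g) ** 2 / d, (r - b) ** 2 / d, (g - b) ** 2 / d)

random.seed(0)
def noisy(rgb, sigma=2.0):
    return tuple(v + random.gauss(0, sigma) for v in rgb)

strong_red = (200, 40, 40)
near_black = (12, 12, 12)
print([round(l1l2l3(*noisy(strong_red))[0], 3) for _ in range(3)])  # stable near 0.5
print([round(l1l2l3(*noisy(near_black))[0], 3) for _ in range(3)])  # erratic
```

For the saturated red the channel differences dwarf the noise, so l1 stays near 0.5; for the near-black sample the same noise swings l1 across its whole range, matching the noisy blue and black traces.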
[Plot panels omitted: l1, l2, l3 traces vs. frame number for all eight color samples.]
Figure 42: The l1l2l3 color space for the Indoor Manual Flashlight Sequence
All three components of the derivative space in Figure 43 remain fairly constant,
although the hue (red) shows some spikes during saturation.
[Plot panels omitted: derivative-space traces (H, Cλ, Cλλ vs. frame number) for all eight color samples.]
Figure 43: Derivative color space for the Indoor Manual Flashlight Sequence
Log Hue in Figure 44 shows flat plots even when the RGB signal was saturated. Again,
there is not much discrimination between the samples in the bottom row.
[Plot panels omitted: log hue traces vs. frame number for all eight color samples.]
Figure 44: Log Hue for the Indoor Manual Flashlight Sequence
3.3.4 Outdoor Automatic Change Light Sequence
The next trial used only outdoor light, which was indirect light from a window. The
amount of light was changed by closing then opening mini-blinds on the window, in addition to
any uncontrolled variation from clouds. This time, the camera was in “automatic” mode, free to
make adjustments for the changing light level. The RGB plots in Figure 45 show that all three
components of all the colors changed with the illumination level. No saturation was present in
this trial. The R, G, and B components were equal for the black sample, so we expect noisy results for hue and l1l2l3. The ending light level was approximately the same as the beginning light level, yet the RGB values at the end differed from those at the beginning, especially in the green and blue samples. The probable cause is that the camera's automatic white balance setting changed while the light was dim, resulting in different recorded colors at the beginning and end of the sequence. This can be avoided by using manual camera settings.
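The white balance effect can be sketched as per-channel gains applied inside the camera; the gain values below are hypothetical, chosen only to illustrate how re-adaptation changes the recorded color at an unchanged light level:

```python
# Sketch: automatic white balance applies per-channel gains. If the camera
# settles on different gains after a dim period, the same scene color is
# recorded differently even when the light returns to its original level.
# The gain values are hypothetical, for illustration only.

def record(scene_rgb, gains):
    return tuple(min(255, c * g) for c, g in zip(scene_rgb, gains))

scene_green = (60, 140, 70)
gains_start = (1.0, 1.0, 1.0)        # white balance before the dim period
gains_end = (1.25, 0.875, 1.125)     # white balance after re-adapting

print(record(scene_green, gains_start))  # (60.0, 140.0, 70.0)
print(record(scene_green, gains_end))    # (75.0, 122.5, 78.75)
```

The second recording is redder and bluer than the first for an identical scene color, producing the second line of points visible in the 3D plots.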
The white balance change can be seen more clearly in the 3D plots in Figure 46, where several
of the samples show two lines of points instead of one.
[Plot panels omitted: RGB traces (red, green, blue vs. frame number) for all eight color samples.]
Figure 45: RGB color space for the Outdoor Automatic Change Light Sequence
[Plot panels omitted: 3D RGB scatter plots (red, green, blue axes) for all eight color samples.]
Figure 46: RGB color space in 3D for the Outdoor Automatic Change Light Sequence
The YIQ plots in Figure 47 show that the luminance in Y (red) varies with light level, and
the I and Q components only changed slightly. The white balance change is not obvious here.
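Since YIQ is linear in RGB, an intensity change scales all three components by the same factor; I and Q appear steadier mainly because, as weighted channel differences, they are small to begin with. A sketch using Python's standard-library conversion (the sample color is hypothetical):

```python
import colorsys

# Sketch: YIQ is linear in RGB, so halving the intensity halves Y, I, and Q
# alike; but I and Q are channel differences, small for muted colors, so
# their absolute change under an illumination shift is small.

beige = (0.80, 0.72, 0.60)            # hypothetical sample, RGB in [0, 1]
dim = tuple(0.5 * c for c in beige)   # same sample at half the light

y1, i1, q1 = colorsys.rgb_to_yiq(*beige)
y2, i2, q2 = colorsys.rgb_to_yiq(*dim)
print(f"Y: {y1:.3f} -> {y2:.3f}")     # halves: a large absolute change
print(f"I: {i1:.3f} -> {i2:.3f}")     # also halves, but already near zero
print(f"Q: {q1:.3f} -> {q2:.3f}")
```

This matches the plots: Y swings with the light level while the small I and Q traces barely move on the same axis scale.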
[Plot panels omitted: YIQ traces (Y, I, Q vs. frame number) for all eight color samples.]
Figure 47: YIQ color space for the Outdoor Automatic Change Light Sequence
For HSV in Figure 48 and HLS in Figure 49, the value (blue in HSV) and lightness (green in HLS) components reacted as expected to the illumination change. The hue (red in both HSV and HLS) was fairly constant, although it was very noisy at times (when R = G = B) and drifted somewhat due to the camera's white balance change. The saturation was not very consistent.
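The hue noise at R = G = B can be demonstrated directly: hue is undefined for gray and ill-conditioned near it, so one-count sensor noise swings it across the whole color circle. A sketch using the standard-library HSV conversion (the gray level is hypothetical):

```python
import colorsys

# Sketch: for a gray pixel (R = G = B) hue is undefined (colorsys returns 0),
# and near-gray pixels flip hue wildly under tiny perturbations, matching
# the noisy hue traces for the black sample.

gray = (0.30, 0.30, 0.30)
print(colorsys.rgb_to_hsv(*gray))    # hue 0, saturation 0: hue carries no info

# One-count (1/255) bumps in different channels land on very different hues.
for bump in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
    r, g, b = (c + d / 255 for c, d in zip(gray, bump))
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    print(f"hue = {h * 360:.0f} deg, saturation = {s:.3f}")
```

A single-count bump lands the hue at 0, 120, or 240 degrees depending on which channel it hits, even though the saturation stays near zero; any hue-based matcher must therefore discount low-saturation pixels.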
[Plot panels omitted: HSV traces (H, S, V vs. frame number) for all eight color samples.]
Figure 48: HSV color space for the Outdoor Automatic Change Light Sequence
[Plot panels omitted: HLS traces (H, L, S vs. frame number) for all eight color samples.]
Figure 49: HLS color space for the Outdoor Automatic Change Light Sequence
For the CIELAB color space, shown in Figure 50, the intensity in the L* component (red)
does not change as much as the intensity components in the previous color spaces due to the
cube root. The a* and b* components vary with the light change, most obviously in the yellow
sample. Once again, there is not much difference in a* and b* between the brown, beige, and
black samples.
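This muted intensity response comes from the cube root in the L* formula; a short sketch (standard CIE L* from relative luminance; the luminance values are hypothetical):

```python
# Sketch: doubling the linear intensity raises CIELAB L* by far less than
# a factor of two, because L* applies a cube root to relative luminance.

def lab_lightness(y_rel):
    """CIE L* from relative luminance y_rel in [0, 1]."""
    eps = (6 / 29) ** 3
    f = y_rel ** (1 / 3) if y_rel > eps else y_rel / (3 * (6 / 29) ** 2) + 4 / 29
    return 116 * f - 16

l_dim = lab_lightness(0.1)      # a dim sample
l_bright = lab_lightness(0.2)   # same sample at twice the light
print(l_dim, l_bright, l_bright / l_dim)
```

Doubling the luminance raises L* by only about a third here, which is why the L* traces vary less than the intensity components of the earlier color spaces.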
[Plot panels omitted: CIELAB traces (L*, a*, b* vs. frame number) for all eight color samples.]
Figure 50: CIELAB color space for the Outdoor Automatic Change Light Sequence
Once again, the chromaticity space in Figure 51 and the c1c2c3 color space in Figure 52 appear identical. The variation is not as severe as in RGB space, but most of the colors are significantly affected by the illumination change as well as the shift in white balance.
[Plot panels omitted: chromaticity traces (r, g, b vs. frame number) for all eight color samples.]
Figure 51: Chromaticity color space for the Outdoor Automatic Change Light Sequence
[Plot panels omitted: c1, c2, c3 traces vs. frame number for all eight color samples.]
Figure 52: The c1c2c3 color space for the Outdoor Automatic Change Light Sequence
The l1l2l3 color space shows the white balance change dramatically in the green sample in
Figure 53. The orange, red, and yellow samples remain fairly constant, but the black sample is
extremely noisy, due to the equal RGB components.
[Plot panels omitted: l1, l2, l3 traces vs. frame number for all eight color samples.]
Figure 53: The l1l2l3 color space for the Outdoor Automatic Change Light Sequence
The derivative color space in Figure 54 shows that the hue is affected by the light
intensity, but the other two components react much less. The white balance change can be seen
most dramatically in the hue component of the black sample.
[Plot panels omitted: derivative-space traces (H, Cλ, Cλλ vs. frame number) for all eight color samples.]
Figure 54: Derivative color space for the Outdoor Automatic Change Light Sequence
Lastly, the Log Hue in Figure 55 remains fairly stable until the white balance change, except for the black sample, which is very noisy and also shifts in value. Once again, the price of stability is a limited ability to distinguish between different colors.
[Plot panels omitted: log hue traces vs. frame number for all eight color samples.]
Figure 55: Log Hue for the Outdoor Automatic Change Light Sequence
3.3.5 Outdoor Manual Change Light Sequence
The last trial used indirect sunlight from a window, varied by closing and opening blinds,
as the light source, with fixed manual camera settings. The RGB plots in Figure 56 show once more that all the components change with the light intensity. No saturation occurred in this
trial. The 3D plots in Figure 57 look close to linear, but curvature can be seen, especially in the
orange and red samples. The black sample exhibits both equal R, G, and B components and
values close to zero.
[Plot panels omitted: RGB traces (red, green, blue vs. frame number) for all eight color samples.]
Figure 56: RGB color space for the Outdoor Manual Change Light Sequence
[Plot panels omitted: 3D RGB scatter plots (red, green, blue axes) for all eight color samples.]
Figure 57: RGB color space in 3D for the Outdoor Manual Change Light Sequence
The YIQ plots in Figure 58 show Y (red) proportional to the light intensity, while I and Q are not constant.
[Plot panels omitted: YIQ traces (Y, I, Q vs. frame number) for all eight color samples.]
Figure 58: YIQ color space for the Outdoor Manual Change Light Sequence
The HSV plots in Figure 59 and the HLS plots in Figure 60 once again show value (blue in HSV) and lightness (green in HLS) proportional to the light intensity. Hue (red in both HSV and HLS) remains mostly constant, with limited discrimination ability and considerable noise for black, since the RGB components were close to zero. Saturation (green in HSV and blue in HLS) does not remain constant.
[Plot panels omitted: HSV traces (H, S, V vs. frame number) for all eight color samples.]
Figure 59: HSV color space for the Outdoor Manual Change Light Sequence
[Plot panels omitted: HLS traces (H, L, S vs. frame number) for all eight color samples.]
Figure 60: HLS color space for the Outdoor Manual Change Light Sequence
Once again, the CIELAB color space in Figure 61 shows the intensity variation in the L*
component (red), small changes in the a* and b* components, and very little difference between
the brown, beige, and black samples.
[Plot panels omitted: CIELAB traces (L*, a*, b* vs. frame number) for all eight color samples.]
Figure 61: CIELAB color space for the Outdoor Manual Change Light Sequence
Chromaticity in Figure 62 and c1c2c3 in Figure 63 show no significant differences. As before, the variation with light intensity is much less than in RGB space, but there is still nontrivial deviation.
[Plot panels omitted: chromaticity traces (r, g, b vs. frame number) for all eight color samples.]
Figure 62: Chromaticity color space for the Outdoor Manual Change Light Sequence
[Plot panels omitted: c1, c2, c3 traces vs. frame number for all eight color samples.]
Figure 63: The c1c2c3 color space for the Outdoor Manual Change Light Sequence
The l1l2l3 plots in Figure 64 have less variation with light intensity, but get very noisy in dim light and for the black sample due to the equal RGB components.
[Plot panels omitted: l1, l2, l3 traces vs. frame number for all eight color samples.]
Figure 64: The l1l2l3 color space for the Outdoor Manual Change Light Sequence
The derivative space plotted in Figure 65 shows small changes in hue when the light is at
its lowest level, with even smaller changes in the other two components.
[Plot panels omitted: derivative-space components (H, Cλ, and Cλλ) vs. frame number for each color sample]
Figure 65: Derivative color space for the Outdoor Manual Change Light Sequence
Finally, the Log Hue plots in Figure 66 remain fairly constant through the light intensity
variation except for the black sample, where R = G = B. Noise increases as the light intensity
decreases, and yellow, beige, and brown look almost the same.
[Plot panels omitted: log hue vs. frame number for each color sample]
Figure 66: Log Hue for the Outdoor Manual Change Light Sequence
3.4 Analysis
To quantify the performance in each of the color spaces, we computed the standard
deviation of each component for each trial. The magnitude of the standard deviation for RGB
space shows the magnitude of the illumination change. A larger change in lighting causes a
larger change in RGB levels, which will more robustly test another color space’s ability to
remain constant. The most invariant color space will have the smallest standard deviation.
The discriminative power was also computed for each color space for each trial. The
discriminative power is large when the inter-sample differences are large and the intra-sample
differences are small. The formula we used is
96
( )( ) ( )( )( )( ) ( )( )tctc
tctcDP
ji
jiij stdevstdev
meanmean
+
−= , (19)
where ( )tci is the vector function representing one of the color spaces as it varied over time and
x is the magnitude of vector x. The numerator is the distance between two colors and the
denominator is the cumulative noise. Since we are seeking a color space invariant to
illumination, the luminance component was omitted from YIQ, HSV, and HLS. The hue
component was also removed from the derivative color space since it was observed to vary much
more than the other two components. With 8 color samples, there are 28 pairs (1 + 2 + … + 7).
Each graph of discriminative power shows the value computed for all 28 pairs for one color
space in one trial. Values less than one indicate that the noise is greater than the difference
between colors, and bigger numbers signify better discriminatory capability.
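The two measures above can be sketched as follows (an illustrative reimplementation, not the code used in the dissertation): `stdev_pct_range` expresses per-component standard deviation as a percentage of the possible range, `discriminative_power` follows Equation (19), and `all_pair_dp` enumerates the 28 unordered pairs of the 8 samples.

```python
import numpy as np

def stdev_pct_range(ci, ranges):
    """Per-component standard deviation as a percentage of each
    component's possible range (as plotted in the Stdev figures)."""
    return 100.0 * ci.std(axis=0) / np.asarray(ranges, dtype=float)

def discriminative_power(ci, cj):
    """Eq. (19): DP = ||mean(ci) - mean(cj)|| / (||std(ci)|| + ||std(cj)||).

    ci, cj: arrays of shape (frames, components) holding one color
    sample's components over time, with the luminance (and, for the
    derivative space, hue) component already dropped."""
    num = np.linalg.norm(ci.mean(axis=0) - cj.mean(axis=0))
    den = np.linalg.norm(ci.std(axis=0)) + np.linalg.norm(cj.std(axis=0))
    return num / den

def all_pair_dp(samples):
    """DP for every unordered pair of samples; 8 samples -> 28 values."""
    names = list(samples)
    return {(a, b): discriminative_power(samples[a], samples[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
```

A DP value below one means the combined noise exceeds the distance between the two color means, so the pair is not reliably distinguishable.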
3.4.1 Indoors Manual Change Light Sequence
The standard deviations for the Indoors Manual Change Light sequence are in Figure 67.
This was the sequence with the smallest change in RGB values. In general, the standard
deviations of the luminance components (Y in YIQ, V in HSV, L in HLS, L* in CIELAB) are the
same magnitude as RGB. Components I and Q of YIQ have less deviation than RGB for six of
the samples, but not for orange and red. The hue in HSV and HLS has a small standard
deviation in most of the samples, but varies more than RGB in blue and black. The standard
deviation of the saturation in HSV and HLS is about the same magnitude as RGB in most of the
samples, but much worse in black. The a* and b* components of CIELAB are low in all the
samples. Chromaticity and c1c2c3 are fairly consistent across the samples, although they have
higher standard deviations than RGB in the black sample. The l1l2l3 space has standard
deviations similar to chromaticity most of the time, but in green, blue, and black, it is the worst
of the color spaces. In the derivative space, the hue component varies more than RGB in the
blue and black samples. The other two components always vary less than RGB, even in the
black sample that has the smallest RGB deviation. Log hue has smaller standard deviation than
RGB in five of the eight samples, but is much worse than RGB in blue and black.
The bottom line is that only the Cλ and Cλλ components of the derivative color space and
the a* and b* components of the CIELAB color space always produced components with
smaller standard deviation than RGB. The YIQ space was the only other one that did not get
noisy for the black sample.
[Bar-chart panels omitted: standard deviation as % of possible range for RGB, YIQ, HSV, HLS, Lab, chromaticity, c1c2c3, l1l2l3, derivative, and log hue, one panel per color sample]
Figure 67: Standard Deviation for the Indoor Manual Change Light Sequence
The discriminative power for the color spaces in the Indoors Manual Change Light
sequence is shown in Figure 68. For RGB space, 13 of the 28 pairs have DP < 1.0, and only 3
pairs exceed 2.0, confirming that RGB is seriously affected by illumination changes. Of the
remaining color spaces, l1l2l3 has the fewest distinguishable pairs with 19, due to the noisy blue
and black samples. The rest of the color spaces fail to distinguish only 4 or fewer color
pairs. Chromaticity, c1c2c3, and derivative color spaces have the most pairs with DP above any
given threshold.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 68: Discriminative power for the Indoor Manual Change Light Sequence
3.4.2 Indoor Manual Change Iris Sequence
The standard deviations for the various color spaces for the Indoor Manual Change Iris
sequence are shown in Figure 69. Since the RGB values saturated in some of the samples, we
also analyzed only the frames without saturation, and present the results in Figure 70. Note also
that the RGB values in this sequence started at values similar to the previous sequence, but got
brighter here versus darker in the previous sequence. Thus, this sequence measures the color
space responses to brighter colors. The only color component that had standard deviations worse
than RGB was luminance, which was expected to vary with illumination. The standard deviation
of saturation was generally at least half as large as RGB. The other color spaces all performed
much better, without the large amounts of noise in the previous darker sequence. The results are
significantly better without the frames where at least one of the RGB components was saturated.
Tracking applications should treat values that have been clipped differently, since the ratios
between the components are no longer valid in this case, but this is frequently overlooked. The
difference between the results with and without the data from the clipped frames shows the
importance of the treatment of clipped values.
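The exclusion of clipped frames described above can be sketched as follows (an illustrative helper, assuming 8-bit data so that a component counts as clipped at 255; the array layout is also an assumption):

```python
import numpy as np

SAT_MAX = 255  # 8-bit clipping level; adjust for other bit depths

def unsaturated_frames(rgb):
    """Keep only frames in which no R, G, or B component is clipped.

    rgb: array of shape (frames, 3) holding the per-frame mean RGB of a
    color sample. Frames with any clipped component are dropped, since
    the ratios between components are no longer valid once a channel
    saturates."""
    keep = (rgb < SAT_MAX).all(axis=1)
    return rgb[keep]
```

Running the analysis on `unsaturated_frames(rgb)` rather than `rgb` corresponds to the "no sat." variants of the figures.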
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, one panel per color sample]
Figure 69: Standard Deviation for all frames of the Indoor Manual Change Iris Sequence
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, saturated frames excluded, one panel per color sample]
Figure 70: Standard deviation for frames with no saturation for the Indoor Manual Change Iris Sequence
The discriminative power of the various color spaces for the Indoor Manual Change Iris
sequence is shown in Figure 71, and Figure 72 shows the results when saturated frames are
excluded from the analysis. The RGB color space shows DP < 1 for all color pairs. When
saturated frames are included, the results are not encouraging, with only the Log Hue having DP
> 2 for more than half the color pairs. When saturated frames are removed, the situation is very
different, with all but RGB having at least 19 of the 28 color pairs with DP > 1. Log Hue and
l1l2l3 did the best here, with at least 7 color pairs having DP > 10, and 23 color pairs with DP >
2. Most of the color pairs with low DP had black as one member. Recall that black was
extremely noisy for both the Log Hue and l1l2l3 color spaces. Chromaticity had 23 color pairs
with DP > 1 and 15 with DP > 2. CIELAB had 26 color pairs with DP > 1 and 22 with DP > 2.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 71: Indoor Manual Change Iris Sequence Discriminative Power for all frames
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space, saturated frames excluded]
Figure 72: Indoor Manual Change Iris Sequence Discriminative Power excluding frames with saturation
3.4.3 Indoor Manual Flashlight Sequence
The Indoor Manual Flashlight sequence measures local changes in illumination. The
standard deviations for all frames are shown in Figure 73, and the graphs with saturated frames
excluded are in Figure 74. As expected, the luminance component standard deviations have
magnitudes similar to the RGB components. The saturation standard deviation exceeds RGB for
yellow, green, and beige. Standard deviations for log hue, chromaticity, c1c2c3, a* and b*
components of CIELAB, and Cλ and Cλλ components from the derivative space are low for all
the samples, while l1l2l3 gets noisy for blue and black. The large deviations in l1l2l3 for the beige
sample only occur when there is saturation.
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, one panel per color sample]
Figure 73: Standard Deviation for all frames of the Indoor Manual Flashlight Sequence.
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, saturated frames excluded, one panel per color sample]
Figure 74: Standard Deviation for frames with no saturation for the Indoor Manual Flashlight Sequence
The discriminative power for the Indoor Manual Flashlight sequence is shown in Figure
75, and the results when saturated frames are excluded are shown in Figure 76. When the
saturated frames are included, CIELAB, chromaticity, c1c2c3, and derivative spaces do the best,
with DP > 5 for at least 21 of the 28 pairs. Log Hue is not far behind, with DP > 5 for 19 pairs.
Without the saturated frames, all the color spaces have at least 26 pairs with DP > 1. RGB and
HLS are the worst, with 10 or fewer pairs having DP > 4. CIELAB, chromaticity, c1c2c3,
derivative, and log hue are the best, with more than half the color pairs having DP > 10.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 75: Discriminative Power for the Indoor Manual Flashlight Sequence with all frames
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space, saturated frames excluded]
Figure 76: Discriminative power for the Indoor Manual Flashlight Sequence with no saturated frames
3.4.4 Outdoor Automatic Change Light Sequence
The standard deviations from the Outdoor Automatic Change Light sequence are shown
in Figure 77. This sequence had no saturation, but the white balance changed during the
sequence. The black sample shows very noisy hue, l1l2l3, and log hue components. The
chromaticity, c1c2c3, a* and b* components of CIELAB, Cλ and Cλλ components from the
derivative space, and I and Q components of YIQ all have deviations smaller than RGB.
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, one panel per color sample]
Figure 77: Standard deviations for all frames of the Outdoor Automatic Change Light Sequence
The discriminative power for the Outdoor Automatic Change Light sequence is shown in
Figure 78. RGB shows 16 pairs with DP > 1 and only 5 pairs with DP > 2. All the rest have at
least 21 pairs with DP > 1. Chromaticity, c1c2c3, and derivative spaces have at least 22 pairs
with DP > 2, CIELAB has 20, and log hue has 18. However, if the threshold is changed to DP >
3, log hue is the best, discriminating between 14 pairs.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 78: Discriminative power for the Outdoor Automatic Change Light Sequence
3.4.5 Outdoor Manual Change Light Sequence
In the final sequence, Outdoor Manual Change Light, with standard deviations shown in
Figure 79, saturation occurred only for a few frames at the end of the green sample, and the
scene got quite dark in the middle. The luminance components had standard deviations
comparable to RGB, and saturation components got as high as half the RGB magnitude. The I
and Q components of YIQ had small deviations except for orange, while chromaticity, c1c2c3,
and derivative spaces and the a* and b* components of CIELAB were low for all the samples.
The l1l2l3 color space had small deviations except for brown and black.
[Bar-chart panels omitted: standard deviation as % of possible range for each color space, one panel per color sample]
Figure 79: Standard Deviation for all frames of the Outdoor Manual Change Light Sequence
Discriminative power for the Outdoor Manual Change Light sequence is shown in Figure
80. RGB only had DP > 1 for 3 pairs. The rest were all better than 23 pairs with DP > 1.
Chromaticity, c1c2c3, and derivative spaces do the best up to DP > 4, but log hue is better for DP
> 5 and above.
[Scatter panels omitted: discriminative power vs. color pair (28 pairs) for each color space]
Figure 80: Discriminative power for the Outdoor Manual Change Light Sequence
3.5 Conclusions
We explored the responses of nine different color spaces with eight different color
samples in five different situations. We learned several things from these experiments.
• Automatic camera settings can cause more than brightness levels to change. The
Outdoor Automatic Change Light sequence showed the problems when the
camera automatically changes the white balance during the sequence. This can be
avoided by using manual camera settings. The need for keeping camera settings
constant was noted by Reinhard et al. [75].
• Saturation (clipping) causes shifts in all the color models. The Indoor Manual
Change Iris and Indoor Manual Flashlight sequences showed great improvement
when the saturated frames were excluded.
• The color spaces with some degree of intensity invariance can be partitioned into
those that become unstable at black (R = G = B = 0), and those that become
unstable at gray (R = G = B). Hue (in HSV, HLS, Log Hue, and Derivative) and
l1l2l3 get noisy at gray, due to the denominator approaching zero. When the color
approaches black (a subset of gray), saturation (in HSV and HLS), chromaticity,
and c1c2c3 become noisy as well.
• CIELAB never showed the extreme noise apparent in the other color spaces. The
discriminative power tended to be slightly less than that of chromaticity and
c1c2c3, especially for higher thresholds of DP.
• The chromaticity and c1c2c3 color spaces have comparable performance in all the
experiments. The only case in which their standard deviations were larger than RGB
was the black sample in the darkest sequence, and even that sample did not
exhibit the extreme noise that some of the components did. Between chromaticity
and c1c2c3, chromaticity requires fewer operations to compute (2 adds and 3
divides, vs. 3 maxes, 3 divides, and 3 arctans; or if a 256x256 table is used for
each, 2 adds and 3 lookups vs. 3 maxes and 3 lookups).
• The Cλ and Cλλ components from the derivative space worked well in all cases,
even when saturation was present, and avoided the noisy results of other color
spaces for dark colors. The discriminatory power analysis showed that there is
enough information to distinguish between different colors.
• Even though there appears to be little difference in log hue for different colors,
the lack of noise except in very dark scenes still gives it good discrimination
ability. Log hue is also easier to manage than the hue used in HSV and HLS
because it doesn’t wrap around like the angular measures do.
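The operation counts quoted for chromaticity and c1c2c3 can be illustrated with per-pixel conversions. This is a hedged sketch: the `eps` guard against division by zero at black is our addition, not part of the original definitions, and the chromaticity value returned for pure black is an arbitrary convention.

```python
import math

def chromaticity(r, g, b):
    """Normalized chromaticity: 2 adds and 3 divides per pixel."""
    s = r + g + b                      # 2 adds
    if s == 0:                         # the unstable black case noted above
        return (1 / 3, 1 / 3, 1 / 3)   # arbitrary convention for black
    return (r / s, g / s, b / s)       # 3 divides

def c1c2c3(r, g, b):
    """c1c2c3: 3 maxes, 3 divides, and 3 arctans per pixel."""
    eps = 1e-9  # illustrative guard so black does not divide by zero
    return (math.atan(r / (max(g, b) + eps)),
            math.atan(g / (max(r, b) + eps)),
            math.atan(b / (max(r, g) + eps)))
```

Counting the operations in each body makes the cost difference concrete; with per-channel lookup tables, the divides and arctans collapse into table lookups as the text describes.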
We also experimented with variations on these color spaces that avoided the instability by
keeping the denominator of the various ratios away from zero. We found that there appears to be
a tradeoff between stability and invariance to illumination – more stability yields less invariance.
We also tried applying the idea of the logs from Log Hue to other expressions, but they did not
yield any significant improvement. Given a choice between noise for gray and noise for black,
problems with black are preferred, since black is a subset of the gray shades.
Most algorithms that use color for tasks like tracking or motion detection update the color
model over time to account for illumination changes. This requires choices about how quickly to
change the model. Updating too slowly may result in missing sudden changes, while changing
the model too quickly may lead to missing changes that are not caused by illumination. This
dilemma can be avoided if there is a color space that does not require updating, but can still
distinguish between colors.
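A common way to implement such an online model update is an exponentially weighted running Gaussian, where a single rate parameter embodies the dilemma just described. The function and its `alpha` value below are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np

def update_model(mean, var, pixel, alpha=0.05):
    """Exponentially weighted update of a running Gaussian color model.

    alpha captures the update-rate dilemma: large values adapt quickly
    to illumination changes but may also absorb changes that are not
    caused by illumination; small values lag behind sudden changes.
    (alpha = 0.05 is an illustrative choice.)"""
    pixel = np.asarray(pixel, dtype=float)
    new_mean = (1 - alpha) * mean + alpha * pixel
    new_var = (1 - alpha) * var + alpha * (pixel - new_mean) ** 2
    return new_mean, new_var
```

An illumination-invariant color space removes the need for this update loop entirely, which is the motivation for the comparison in this chapter.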
The assumption of white light that varies only in intensity (which could be caused by
changes in the light itself, the light direction, or the surface normal) is a reasonable one for many
cases. It holds for outdoor daylight scenes, and most indoor scenes use fixed white lights as
well. Our experiments have sampled how a real camera reacts to a real (if simple) scene under
these conditions.
Color spaces such as YIQ, HLS and HSV that are commonly used to avoid illumination
effects were found to not work very well in real conditions. Even though l1l2l3 was designed to
work better, the noise present in gray scenes makes it unusable. The c1c2c3 space takes longer to
compute than chromaticity, but doesn’t work any better. The remaining spaces (CIELAB,
chromaticity, derivative, and log hue), while not demonstrating ideal constancy, have results that
are good enough to pursue further.
4 AUGMENTING A SOLID RECTANGLE
Our first augmented reality system tracks a rectangle of a solid color in the video and
replaces it with a prestored image [92]. A person standing in front of the camera, holding and
moving the solid rectangle while watching the display, sees themselves holding a picture instead. As
discussed in Section 2.5.3, tracking simple objects requires a different approach than tracking
complex objects. Figure 81 shows a frame of both the raw and augmented video. In keeping
with the low tech nature of the problem domain, the rectangle we use is a piece of construction
paper mounted on a piece of cardboard, so it is not absolutely solid, nor completely rigid. The
camera is a consumer digital video camera.
Figure 81: Original and augmented video frame.
4.1 Color Representations
Since the object we wish to track is a solid color, it makes sense to use color as one of the
cues to detect the object. However, the RGB colors measured by a camera are dependent on the
illumination. Based on the observations from the experiment described in the previous chapter,
we chose chromaticity space for detecting our solid colored rectangle object.
4.2 Proposed Method
The components of our method are an object color model, an object pixel mask, and a
quadrilateral shape. The object color is modeled by a single Gaussian in chromaticity space, and
is updated online. The object pixel mask keeps track of the pixels that are part of the tracked
object in the current frame. The output of the process is the vertices of a quadrilateral bounding
the object.
The steps of the algorithm are as follows:
1. Train the system with a single frame in which the object fills the camera field of view.
Initialize the color model with the mean and standard deviation of all the pixels, and
label all the pixels as “object”.
2. In the next frame, use the color model to mark the pixels whose probability of belonging
to the object exceeds a threshold.
3. For the pixels that have changed since the previous frame, update the object mask with
the result of the color test in step 2.
4. Find the minimum bounding quadrilateral around the largest group of connected pixels
that passed the color test in step 2.
5. Refine the quadrilateral using edge information.
6. Clear the object mask for all pixels outside the quadrilateral.
7. Update the color model using the pixels marked as “object”.
8. If there are more video frames, go to step 2.
4.2.1 Color Model
We use a single Gaussian color model to represent the object color. The probability that
a given color vector x matches the object color is
p(x) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²)) (20)
where μ and σ are the mean and standard deviation of the object color distribution.
Testing whether this probability is above a threshold is equivalent to evaluating |x − μ| < tσ,
where t controls the threshold.
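The equivalence between the probability test and the distance test can be checked numerically. A minimal sketch follows; the function and variable names are ours, not the dissertation's.

```cpp
#include <cmath>

const double kPi = 3.141592653589793;

// One-dimensional Gaussian density with mean mu and standard deviation sigma.
double gaussian(double x, double mu, double sigma) {
    double d = (x - mu) / sigma;
    return std::exp(-0.5 * d * d) / (std::sqrt(2.0 * kPi) * sigma);
}

// Test p(x) > pThresh directly...
bool matchesByProbability(double x, double mu, double sigma, double pThresh) {
    return gaussian(x, mu, sigma) > pThresh;
}

// ...or the equivalent |x - mu| < t*sigma, where pThresh and t are related by
// pThresh = exp(-t*t/2) / (sqrt(2*pi)*sigma).
bool matchesByDistance(double x, double mu, double sigma, double t) {
    return std::fabs(x - mu) < t * sigma;
}
```

Because the density decreases monotonically with |x − μ|, the two tests select exactly the same pixels; the distance form avoids computing an exponential per pixel.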
In an ideal world, there would be a color space where no adaptation was needed.
Shadows and lighting changes could be eliminated and only the true material colors would be
evaluated. We tried several color spaces including RGB and HSV, but obtained the best results
with chromaticity [31]. Chromaticity factors out the light intensity by dividing each component
by the sum of the components:
(r, g, b) = ( R/(R+G+B), G/(R+G+B), B/(R+G+B) ). (21)
Images (a) and (b) in Figure 82 show a video frame before and after conversion to chromaticity
space. Image (c) shows the pixels that match the color model highlighted in red.
The model is updated online to adjust for changing lighting conditions using Equation 22.
μ_updated = α μ_training + β μ_previous + (1 − α − β) μ_current (22)
where μ_training is the mean of the colors sampled during the training frame, μ_previous is the mean
resulting from the previous frame’s calculations, and μ_current is the mean of the colors in the
object mask in the current frame. The result, μ_updated, will be used for object color detection in
the following frame. The standard deviation, σ, is updated similarly. The parameters α and β,
which are between 0 and 1, control how quickly the model adapts. Retaining a portion of the
training model with α > 0 helps prevent drift. Increasing β keeps the model more stable, but less
able to keep up with rapid lighting changes, such as the transition from pointing the rectangle
directly at the primary light source to tilting it away.
Updating the model can cause it to drift if non-object pixels are included. We reduced
this problem by tracking object pixels separately. If the rectangle is occluded, for instance by a hand
that is holding it, the skin pixels will not get included as object pixels because they do not match
the color model. Since they are not tagged as object pixels, they will not affect the color model
update. The logic to only update the object mask in locations where the image has changed
serves to minimize the chance that background pixels that match the object color will get
included in the object. Of course, if the whole background is dynamic this will not help, but this
is not the usual case. More complex background models can be used, such as a mixture of
Gaussians, e.g. [87], if enough computation time is available.
Figure 82: Processing stages: original frame, converted to chromaticity space, detected object color pixels, bounding quadrilateral, saturation, horizontal gradient, vertical gradient, and difference between subsequent frames.
4.2.2 Minimum Bounding Quadrilateral
The model fitting step converts the object description from a general contour (the contour
of the group of connected pixels that passed the color test with the greatest area) to a
quadrilateral. We used Low and Ilie’s greedy method for finding a tight bounding quadrilateral
from a convex hull [64]. The process begins with a convex hull containing the pixels that
matched the object color model. For each vertex vi in the perimeter, the algorithm calculates the
area added if vertices vi and vi+1 were replaced by a single vertex at the intersection of the lines
defined by (vi-1, vi) and (vi+1, vi+2), as shown in Figure 83. This calculation is required initially
for all n vertices. Each of the n-4 times that two vertices are replaced with one, only a few
vertices in the vicinity of the deleted vertex must be updated, making this process O(n). Since
the convex hull requires O(n log n), the whole algorithm is therefore O(n log n).
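The per-vertex area computation can be sketched as follows. This is a simplified geometric helper under our own naming, not the authors' implementation.

```cpp
#include <cmath>

struct Pt { double x, y; };

// Intersection of the infinite lines through (p1, p2) and (p3, p4).
// Adjacent hull edges are never parallel, so the denominator is nonzero.
Pt lineIntersect(Pt p1, Pt p2, Pt p3, Pt p4) {
    double d1x = p2.x - p1.x, d1y = p2.y - p1.y;
    double d2x = p4.x - p3.x, d2y = p4.y - p3.y;
    double denom = d1x * d2y - d1y * d2x;
    double t = ((p3.x - p1.x) * d2y - (p3.y - p1.y) * d2x) / denom;
    return Pt{p1.x + t * d1x, p1.y + t * d1y};
}

// Area added when vi and vi+1 are replaced by the intersection v of the
// lines (vi-1, vi) and (vi+1, vi+2): the area of the triangle (vi, v, vi+1).
double addedArea(Pt vPrev, Pt vi, Pt viNext, Pt viNext2) {
    Pt v = lineIntersect(vPrev, vi, viNext, viNext2);
    return 0.5 * std::fabs((v.x - vi.x) * (viNext.y - vi.y)
                         - (v.y - vi.y) * (viNext.x - vi.x));
}
```

Recomputing this quantity only for the handful of vertices adjacent to each deletion is what keeps the removal phase linear in the number of hull vertices.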
Figure 83: Area calculation for finding the bounding quadrilateral
When the color detection step is accurate, this produces an accurate quadrilateral.
However, if the pixels near the edges are not detected, the result will be too small, and a single
connected pixel outside the actual rectangle will skew the whole edge. Thus, the bounding
quadrilateral determined this way is only a rough approximation of the desired result.
4.2.3 Quadrilateral Refinement
Particle filters, such as Isard and Blake’s CONDENSATION algorithm [56], track curves in
clutter using a multiple hypothesis framework. Their method for evaluating a hypothesis
involves measuring the distance between the hypothesized contour and high contrast features
along a sparse set of normals to the contour. Since these measurements are repeated for each of
a hundred or more hypotheses, this is too slow for our frame rate target. Instead, we search
along a sparse set of normals to each edge of the bounding quadrilateral approximation for edge
points, and then fit a line through the points found. This is illustrated in Figure 84. The corners
of the bounding quadrilateral are then the intersections of the lines.
Figure 84: Edge refinement process
Various methods were tested for finding the best edge pixel along the normal. Using the
pixel with the highest color gradient magnitude didn’t work when the rectangle was resting on a
shelf because the shelf had stronger gradients than the rectangle border. The distance between
the pixel color and the mean of the object color model did not show a consistent pattern at the
rectangle edge.
The best results so far were achieved by using the saturation channel, where
S = (max – min) / max, (23)
and min and max are the minimum and maximum respectively among the R, G, and B pixel
values. The normal is traversed from a finite distance inside the edge towards the outside, and
the edge pixel is the first pixel found in which the gradient of the saturation channel exceeds a
threshold. These results can be further improved using RANSAC to discard outliers.
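A simplified version of this edge search, operating on a one-dimensional saturation profile sampled along one normal, is sketched below; the helper names are ours.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Saturation from Equation (23): S = (max - min) / max, with S = 0 for black.
double saturation(double r, double g, double b) {
    double mx = std::max({r, g, b});
    double mn = std::min({r, g, b});
    return mx > 0.0 ? (mx - mn) / mx : 0.0;
}

// Walk a profile of saturation values sampled along a normal, from inside
// the quadrilateral outward, and return the index of the first step whose
// gradient magnitude exceeds the threshold, or -1 if none does.
int firstEdge(const std::vector<double>& profile, double thresh) {
    for (std::size_t i = 1; i < profile.size(); ++i)
        if (std::fabs(profile[i] - profile[i - 1]) > thresh)
            return static_cast<int>(i);
    return -1;
}
```

A line is then fit through the edge points found on all the normals of one quadrilateral side, optionally after RANSAC has discarded outliers.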
The resulting bounding quadrilateral is shown in Figure 82d. The last four images show
intermediate calculation stages for refining the quadrilateral. Figure 82e is the saturation
channel. The edges tend to be better defined in this space than the others tried. Compression
done by the camera interferes with some of the others. Images (f) and (g) in Figure 82 are the
gradients in the horizontal and vertical direction. To avoid the overhead of computing and
comparing with a magnitude and direction, the horizontal gradient is used for edges that are
closer to vertical, and the vertical gradient is used for horizontal edges. Lastly, the difference
between subsequent frames is shown in Figure 82h. The brighter pixels in this image are the
only places where the object mask is updated in step 3. More results showing the improvement
achieved by the refinement step are in Figure 85.
Figure 85: Refining the quadrilateral. The rough bounding quadrilateral is computed from the color detection result, which misses the bottom edge. The refinement process results in a more accurate outline.
Limiting the marked object pixels to those inside the quadrilateral in step 6 eliminates
pixels with the right color from further consideration. The color model is updated similarly to
[87], but using only those pixels inside the quadrilateral that are marked as “object”, so the new
sample should have minimal contamination from pixels that are not actually part of the object.
4.2.4 Deinterlacing
When interlaced cameras are used at their full resolution, rapid motion results in
“tearing” of the image because the even and odd lines are recorded at different time instants.
When viewed on an interlaced display, such as a standard TV set, this is not visible, but when
projected on a higher resolution display or when the full frame is used for processing, this
tearing becomes a problem. Figure 86 shows an example of this effect. The area in the white
square in the left image is expanded in the right image. The location of the red rectangle is
different for the even lines than it is for the odd lines. This dual position makes it difficult to
identify the edges of the object and creates many extra sharp gradients. But in the area without
motion, the extra resolution adds important detail.
Figure 86: Tearing caused by interlacing
Many deinterlacing algorithms exist with both hardware and software implementations.
Line doubling just repeats the even or odd lines for the whole image. This keeps the horizontal
resolution, but loses vertical resolution where there is no motion. This may also cause smooth
diagonal lines to become jagged. A median filter will retain the edges, but will destroy detail in
the static regions. Recreating the even lines by interpolating between odd lines also keeps
smooth edges, but destroys detail in stationary areas. Motion compensation algorithms are more
sophisticated, tracking moving objects and adjusting their position in one frame to match the
other, but are slow. We propose a fast algorithm that cleans up the moving regions while
retaining the detail present in the static regions.
Our algorithm is as follows:
For each pixel p(i, j) on an even line,
If |p(i+2, j) − p(i, j)| < |p(i+1, j) − p(i, j)|
Then p(i+1, j) = 0.5 (p(i, j) + p(i+2, j))
Basically, if the pixels two lines apart are more similar than the pixels one line apart, replace
the pixel one line below with the interpolated value. While not catching every possible case, this
quickly compensates for the interlacing in areas where there is motion while leaving most of the
rest of the frame untouched. Figure 87 shows the results from this approach. The boundary of
the red rectangle is much better defined than in Figure 86, yet the fabric detail remains.
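The rule can be sketched for a grayscale frame as follows; this is a simplified single-channel version under our own naming, and the dissertation's implementation details may differ.

```cpp
#include <cstdlib>
#include <vector>

// Grayscale frame stored row-major: h rows of w pixels.
// For each even row i with rows i+1 and i+2 available: if the pixels two
// lines apart are more similar than the pixels one line apart, replace the
// odd-line pixel with the average of its even-line neighbors.
void deinterlace(std::vector<int>& img, int w, int h) {
    for (int i = 0; i + 2 < h; i += 2)
        for (int j = 0; j < w; ++j) {
            int a = img[i * w + j];        // even line
            int b = img[(i + 1) * w + j];  // odd line
            int c = img[(i + 2) * w + j];  // next even line
            if (std::abs(c - a) < std::abs(b - a))
                img[(i + 1) * w + j] = (a + c) / 2;
        }
}
```

Static regions fail the similarity test and are left untouched, which is what preserves the full vertical resolution there.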
Figure 87: Fast deinterlacing algorithm results
4.2.5 Displaying the Result
Displaying the combined result with OpenGL requires drawing three polygons. The first
is the original frame, drawn as a texture map on a rectangle that fills the viewport and sets the depth buffer
to a near value. The second also fills the viewport and updates only the depth buffer with a
distant value for pixels that are part of the object mask. The final polygon is formed by the
vertices of the tracked rectangle with the inserted image as a texture map. Drawing this
quadrilateral with an intermediate depth causes it to only be visible where the object color was
detected.
4.3 Optimization Methods
In order to achieve real-time update rates, considerable code optimization was necessary.
The use of Intel’s OpenCV library [55] for low level image processing functions provided a good
basis, but the initial implementation ran much too slowly. The profiler in Microsoft Visual
Studio 6.0 was used to find the slow parts of the algorithm and measure improvements. A test
version of the application built for benchmarking the timing runs the complete algorithm
described in the previous sections 500 times on the same image. Several runs were averaged to
obtain the reported results. While some of the individual changes may be specific to this
compiler or processor, the general method for optimizing and the concepts applied can be used
for any system. The techniques most useful when optimizing the code were:
• Use integer arithmetic instead of floating point. On most processors, integer
operations are several times faster than the corresponding floating point operations.
However, care must be taken to ensure that integer values do not overflow, and that
the truncation that occurs as a byproduct of integer division does not cause incorrect
results.
• Use lookup tables instead of arithmetic. If the required calculation is more complex
than computing an array address, a lookup table may run faster. This is easiest if the
inputs are discrete (integer). Memory constraints must also be considered, even
though available memory is increasing rapidly. For example, a 256x256 table with
one byte per entry takes 64 KB of memory, which is probably acceptable; whereas a
256x256x256 table takes 16 MB, which can get expensive if there are many of these.
• Make the lookup tables global. While it is cleaner to keep everything inside a class,
accessing a table that is a member of a class requires a memory access to get the base
address for the class, then an offset to get the base address of the table. The base
address of a global table can be resolved at compile time, resulting in fewer memory
accesses and faster code.
• Avoid conditional branches. Most modern processors are pipelined to increase
speed. Conditional branches disrupt the pipeline, slowing down execution. If-then-
else constructs should only be used when the instructions skipped are more
expensive than the overhead of the branch. Since loops require conditional
branches, executing more instructions in each loop results in fewer branches. For
example, to operate on every pixel in an image, it is more efficient to process two
pixels inside the loop and halve the number of times through the loop than to process
a single pixel inside the loop.
• Use zero as the ending loop index. If we wanted to execute some section of code
1000 times, it is easy to create a loop where an index goes from 0 to 999. But the
loop termination test must then load the ending index (999) each time or store it in a
dedicated register in order to compare it to the counter. The same loop that goes
from 999 to 0 will run faster because the native instruction that compares the index
to zero can be used.
• Use bit shift instead of divide. Divide is usually the most expensive of the basic
arithmetic operators. An arithmetic shift typically runs significantly faster, and
accomplishes the same purpose when the divisor is an integer power of two. It is
often worthwhile to approximate a division by a bit shift. Multiplication can also be
replaced with arithmetic shifts for a lesser gain.
• Operate on the minimum number of pixels. For image processing applications, the
results of some operations may only be needed in localized areas. For example, in
the rectangle tracking application described previously, the gradient is only needed
in the vicinity of the estimated quadrilateral, so it is more efficient to compute a
mask indicating where the gradient is needed and then only calculate the gradient for
those pixels, than it is to calculate the gradient for the whole image.
• Use the most efficient precision. Some operations may run faster if the arguments
are 8-bit integers, while others execute faster with 32-bit integers. For example, to
add a pair of 8-bit integers, the compiler generated instructions to clear two 32-bit
registers, then load the arguments in the least significant byte, add the two registers,
then store the result. If the arguments had been 32-bit integers, the clear instructions
would not have been needed, resulting in faster execution (assuming the 8-bit load
takes the same number of cycles as the 32-bit load). On the other hand, indexing
into a 32-bit array requires the index to be multiplied by 4 (or shifted left by 2)
before adding the index and the array base address, making 8-bit arrays more
efficient.
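Several of these techniques can be seen together in a toy sketch (hypothetical code, not from the dissertation): a count-down loop, a bit shift in place of division, and a body unrolled four times.

```cpp
// Divide every pixel value by 4 using x >> 2, which equals x / 4 for
// unsigned values. The loop counts down to zero and processes four pixels
// per iteration; n is assumed to be a multiple of 4, as with common video
// widths.
void quarter(const unsigned char* src, unsigned char* dst, int n) {
    for (int i = n / 4; i > 0; --i, src += 4, dst += 4) {
        dst[0] = src[0] >> 2;
        dst[1] = src[1] >> 2;
        dst[2] = src[2] >> 2;
        dst[3] = src[3] >> 2;
    }
}
```
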
As an example, the original C++ code to calculate the chromaticity components from the
RGB image is shown in Table 1. The functions that are prefixed with “cv” are calls to the
OpenCV library. Recall that the chromaticity conversion is
(r, g, b) = 255 · ( R/(R+G+B), G/(R+G+B), B/(R+G+B) ). (24)
The multiplication by 255 is necessary for the output to be a triple of 8-bit integers. The
“cvSplit” function separates the three channel RGB image (m_inputImage) into three single
channel images (m_planes[0], m_planes[1], and m_planes[2]) containing the red,
green, and blue components. This is necessary because OpenCV can’t add channels within an
image; it can only add separate images. The “cvConvertScale” function multiplies the first
argument by the scalar third argument and puts the result in the second argument. Dividing by
three was necessary to avoid overflows, and only affects the result by a slight loss of precision.
A better solution would be to use a higher precision result, but the “cvAdd” function requires
that the images have the same resolution. In order for the sum to be 16 bits, each of the
components would have to first be converted to 16 bits. The “cvAdd” function puts the
pixelwise sum of the first two arguments in the third argument. The final step, “cvDiv”, divides
each pixel in the first argument by the corresponding pixel in the second argument, multiplies the
result by the scalar fourth argument, and stores the result in the third argument.
Table 1: Original code to calculate chromaticity. Functions prefixed by "cv" reference the OpenCV library.

// split into R, G, and B
cvSplit(m_inputImage, m_planes[2], m_planes[1], m_planes[0], 0);
// divide by 3 so sum will still fit in 8 bits
cvConvertScale(m_planes[0], m_planes[0], 0.3333);
cvConvertScale(m_planes[1], m_planes[1], 0.3333);
cvConvertScale(m_planes[2], m_planes[2], 0.3333);
// find R+G+B
cvAdd(m_planes[0], m_planes[1], sumImage);
cvAdd(m_planes[2], sumImage, sumImage);
// divide each plane by sum
cvDiv(m_planes[0], sumImage, m_planes[0], 255.0);
cvDiv(m_planes[1], sumImage, m_planes[1], 255.0);
cvDiv(m_planes[2], sumImage, m_planes[2], 255.0);
This code produces the right result, is easy to read and debug, and uses integer math
wherever possible. The divide can’t be done as integers because the result would always be
zero. However, the function takes 45 ms. to execute for a 720x480 resolution image. Running
only this function (no video input, no rendering, and no further processing) would result in a
frame rate of 22.2 Hz. Of course, there is much more to be done each frame.
The floating point division is a good candidate for optimization. With two integer inputs
in the range of 0 to 255, this can be implemented as a lookup table. The code replacing the three
cvDiv lines is shown in Table 2. The initialization code builds the table as a class member when
the program is started. This execution time is not counted in the frame rate calculations, since
the time per frame will approach zero when the number of frames is large. The inner loop only
needs to go from 0 to 255/3, but it’s still more efficient to keep the full 256 values so the array
element can be accessed by shifting and adding the input indices, instead of requiring a
multiplication. The new loop basically performs a two dimensional table lookup for each of the
three components for each pixel. The modified function takes 15 ms. to execute, for a speedup
of 3.
Table 2: Code for table lookup to replace division (three cvDiv lines) in Table 1.

// at initialization
for (i = 0; i < 256; i++) {
    for (j = 0; j < 256; j++) {
        if (i == 0)
            m_chromaticityTable[j] = 0;
        else
            m_chromaticityTable[(i << 8) + j] =
                (unsigned char)cvRound(255.0f * (float)j / (float)i);
    }
}

// every frame
unsigned char *p0 = (unsigned char *)m_planes[0]->imageData;
unsigned char *p1 = (unsigned char *)m_planes[1]->imageData;
unsigned char *p2 = (unsigned char *)m_planes[2]->imageData;
unsigned char *sumPtr = (unsigned char *)sumImage->imageData;
int i;
for (i = 0; i < m_imageSize.width * m_imageSize.height; i++) {
    *p0 = m_chromaticityTable[(*sumPtr << 8) + *p0]; p0++;
    *p1 = m_chromaticityTable[(*sumPtr << 8) + *p1]; p1++;
    *p2 = m_chromaticityTable[(*sumPtr << 8) + *p2]; p2++;
    sumPtr++;
}
Eliminating the floating point multiply by 0.3333 should increase efficiency as well as
precision. To do this, we integrate the sum operation into the loop, where it can be stored in a
32-bit integer to avoid overflow. The lookup table must now be larger, since the denominator
now ranges from 0 to 3 × 255 and the numerator may use the full 0 to 255 range. The resulting
code is shown in Table 3. This version executes in 10 ms.; a speedup of 1.5 from the previous
iteration.
Table 3: Chromaticity calculation after integrating scale and sum into loop.

// at initialization
for (i = 0; i < 256 * 3; i++) {
    for (j = 0; j < 256; j++) {
        if (i == 0)
            m_chromaticityTable[j] = 0;
        else
            m_chromaticityTable[(i << 8) + j] =
                (unsigned char)cvRound(255.0f * (float)j / (float)i);
    }
}

// every frame
cvSplit(m_inputImage, m_planes[2], m_planes[1], m_planes[0], 0);
unsigned char *p0 = (unsigned char *)m_planes[0]->imageData;
unsigned char *p1 = (unsigned char *)m_planes[1]->imageData;
unsigned char *p2 = (unsigned char *)m_planes[2]->imageData;
int i;
int sum;
for (i = 0; i < m_imageSize.width * m_imageSize.height; i++) {
    sum = *p0 + *p1 + *p2;
    *p0 = m_chromaticityTable[(sum << 8) + *p0]; p0++;
    *p1 = m_chromaticityTable[(sum << 8) + *p1]; p1++;
    *p2 = m_chromaticityTable[(sum << 8) + *p2]; p2++;
}
There is one remaining OpenCV function: cvSplit. We used this in the first place
because subsequent OpenCV functions required single channel images. Since we have replaced
all of these functions, we no longer need cvSplit. The modified code is shown in Table 4. This
function takes 7.1 ms. to execute, for a speedup of 1.4 from the previous version.
Table 4: Chromaticity calculation after all OpenCV functions have been replaced.

unsigned char *iPtr = (unsigned char *)m_inputImage->imageData;
unsigned char *p0 = (unsigned char *)m_planes[0]->imageData;
unsigned char *p1 = (unsigned char *)m_planes[1]->imageData;
unsigned char *p2 = (unsigned char *)m_planes[2]->imageData;
int i;
int sum;
for (i = 0; i < m_imageSize.width * m_imageSize.height; i++, iPtr += 3) {
    sum = iPtr[0] + iPtr[1] + iPtr[2];
    *p0++ = m_chromaticityTable[(sum << 8) + iPtr[0]];
    *p1++ = m_chromaticityTable[(sum << 8) + iPtr[1]];
    *p2++ = m_chromaticityTable[(sum << 8) + iPtr[2]];
}
There are still some smaller improvements that can be made. Since the variable i only
counts the number of times through the loop, decrementing the counter will functionally work as
well as incrementing it. Decrementing from some value to zero is faster than incrementing from
0 to some value because testing against zero is a native instruction, whereas testing against a
non-zero value requires loading that value each time through the loop.
The three values extracted from the table each time through the loop are in the same row.
While the compiler may recognize the duplicate operation and consolidate the instructions, we
can modify the code to compute a pointer to the row needed from the table to ensure that this
calculation is only done once.
When the table base address is a class member, the processor must perform a memory
access (“this” pointer plus offset) to get the table address. When the table base address is global,
the address can be resolved by the compiler, resulting in a faster lookup.
Conditional branches, including checking for the terminal condition in a for-loop, disrupt
the instruction pipeline, and tend to slow things down. Fewer times through the loop will result
in fewer conditional branches, and should run faster. This is known as loop unrolling. Pointers
used in the loop can be incremented fewer times as well. The disadvantage is that the resulting
code is more difficult to maintain, since changes inside the loop must be made multiple times
instead of just once. To balance the speed requirements with readable code (and because the
video sizes have widths that are multiples of four) we repeat our code four times within a loop.
The final code implementation that integrates all these efficiencies is shown in Table 5.
This function executes in 5.0 ms. This is a speedup of 1.4 from the previous version, and a
speedup of 9 from the original. The biggest gain in the last set of improvements was from using
a global variable for the lookup table, and the next was from loop unrolling. The contribution of
the others was measurable, but much smaller.
Table 5: Final optimized code for chromaticity calculation.

unsigned char *iPtr = (unsigned char *)m_inputImage->imageData;
unsigned char *p0 = (unsigned char *)m_planes[0]->imageData;
unsigned char *p1 = (unsigned char *)m_planes[1]->imageData;
unsigned char *p2 = (unsigned char *)m_planes[2]->imageData;
unsigned char *tablePtr;
int i;
int sum;
for (i = (m_imageSize.width * m_imageSize.height) / 4; i > 0;
     i--, iPtr += 12, p0 += 4, p1 += 4, p2 += 4) {
    sum = iPtr[0] + iPtr[1] + iPtr[2];
    tablePtr = g_chromaticityTable + (sum << 8);
    p0[0] = tablePtr[iPtr[0]];
    p1[0] = tablePtr[iPtr[1]];
    p2[0] = tablePtr[iPtr[2]];
    sum = iPtr[3] + iPtr[4] + iPtr[5];
    tablePtr = g_chromaticityTable + (sum << 8);
    p0[1] = tablePtr[iPtr[3]];
    p1[1] = tablePtr[iPtr[4]];
    p2[1] = tablePtr[iPtr[5]];
    sum = iPtr[6] + iPtr[7] + iPtr[8];
    tablePtr = g_chromaticityTable + (sum << 8);
    p0[2] = tablePtr[iPtr[6]];
    p1[2] = tablePtr[iPtr[7]];
    p2[2] = tablePtr[iPtr[8]];
    sum = iPtr[9] + iPtr[10] + iPtr[11];
    tablePtr = g_chromaticityTable + (sum << 8);
    p0[3] = tablePtr[iPtr[9]];
    p1[3] = tablePtr[iPtr[10]];
    p2[3] = tablePtr[iPtr[11]];
}
The original and final implementations are both O(n), where n is the number of pixels.
The multiplier hidden by the order notation is the source of the improvement for this function.
Since we have to visit every pixel, O(n) is the best that can be done for this situation. The time
spent coding the algorithm to reduce the multiplier is well worth it for this real time application,
since a process that takes 45 ms. per frame cannot be part of a system that runs near 30 Hz. (33.3
ms. per frame). Since the process only took 5 ms. after optimization, it can be used as part of a
30 Hz. system. Processors will eventually be fast enough to run the original implementation in 5
ms., but by using the optimized version with the faster processor, even more complex
calculations can be packed into the available processing time. Thus, optimization is likely to be
a critical part of real time programming for the foreseeable future.
4.4 Discussion
The system described was implemented using Microsoft DirectShow for interfacing with
the video camera, and Intel’s OpenCV library for low level image processing functions. The
code was optimized using techniques including lookup tables, avoiding floating point
calculations, and operating on only the necessary pixels. Frame rates of 28 Hz. were achieved
on a 2 GHz. Pentium 4 laptop, for 720x480 resolution DV video.
This method is able to track the rectangle successfully in a wide variety of lighting
conditions, including glare and shadows that change quickly. It uses no assumptions about
motion models or camera parameters and runs at interactive rates. Figure 88 shows some sample
frames from a sequence that was accurately tracked. The detected bounding quadrilateral is
shown in red. The frames show that the method can handle occlusions on edges and corners and
diverse lighting conditions.
Figure 88: Sample frames from the video sequence showing occlusion, a corner off the screen, and challenging lighting conditions. The tracked boundary is drawn in red.
4.5 Conclusions
We have described and demonstrated a tracking method for augmented reality
applications, in which real-time rates and accurate registration are required. Our setup uses low-
tech equipment, and no markers or elaborate training, yet can accomplish this task for simple
objects (which are sometimes more difficult than complex objects). The method draws on
principles of more sophisticated algorithms that run too slowly for this application, yet runs on
full resolution video at interactive rates.
The system tracks a simple, solid colored, rigid object (a rectangle) and augments it by
replacing the rectangle with a prestored image in the real time video stream. To the user, it looks
like the object he is moving around has a picture on it. This system uses color detection to locate
the rectangle in the image, and edge detection to refine the outline of the rectangle so that it
accurately overlays the target. The system works with shadows, glare, global lighting changes,
and occlusion.
5 AUGMENTING HUMAN FACES AND HEADS
The lessons learned in implementing the solid rectangle augmented reality system
described in the previous chapter were applied to a system that augments human faces and heads.
The complexity of the problem is increased because human heads vary in shape, size, color, and
features. Heads are not as rigid as the rectangle, the shape is more complex, and there is great
variety in skin color as well as hair color, texture, and length. Items such as glasses, earrings,
and facial hair may or may not be present. However, augmentation of one’s face or head is much
more personal and compelling than a separate object, like the rectangle. A mirror is most
commonly used to see ourselves, so augmenting faces is a natural extension.
We have made three contributions in this area. The first is a method for integrating face
detection and tracking to filter out the false positives from the detection algorithm while tracking
the valid faces. The second is a novel method for estimating the initial pose of a newly detected
face. The third is an extension of the Active Appearance Model (AAM) to use a parameterized
generic 3D head model to aid in both generating the AAM shape model more easily and
extracting meaningful quantities from the tracked face. This uses the results of face detection,
point-based tracking, and initial pose estimation to improve the initial guess for the iterative
solution.
5.1 Integration of Detection and Tracking
Detection and tracking of faces are often done independently. We discussed methods for
each in Section 2.5. If each were perfect, they could remain independent, but integrating them
provides ways to deal with false positives in the detection algorithms (which leads to tracking
objects that aren’t really faces) and drift in the tracking algorithm (where the tracking latches on
to background features). We can avoid both problems by combining the two steps. We will
describe the face detection method in Section 5.1.1, our tracking method in Section 5.1.2, the
integration of the two in Section 5.1.3, and analysis in Section 5.1.4.
5.1.1 Face Detection and Localization
The AdaBoost face detection method of Viola and Jones [100] described earlier was used
for locating face regions in each frame. Specifically, the OpenCV implementation [14] and
trained classifier were used as a starting point. On a 2.2 GHz Opteron processor, at a resolution
of 360x240, this ran at 11 Hz with the parameters we used. The main reason that this algorithm
is computationally expensive is that it must search each frame multiple times, looking for face
candidates at each location and at a range of scales.
To decrease the search area, we added a motion detection module, using the algorithm
from Stauffer and Grimson [87]. In their method, the background is modeled as a mixture of
Gaussians for each pixel. The mean, standard deviation, and weight of each distribution are
updated online as new frames are observed. The multimodal model can handle repetitive
motion, such as a tree blowing in the wind or specular reflections from water where a
background pixel might switch back and forth between two or more colors. A new pixel is
checked against each of the existing distributions. If it matches, the mean and standard deviation
for the matched distribution are updated and the weight is increased. If it doesn’t match, a new
distribution is created, replacing the lowest weighted distribution. After sorting the distributions
from highest weight to lowest weight, the background model is the first B distributions, where
B = \arg\min_b \left( \sum_{k=1}^{b} \omega_k > T \right) ,    (25)
where T is the amount of data considered “background” and \omega_k is the kth largest weight. Thus,
if the weights at a pixel are 0.85, 0.10, and 0.05 (summing to 1.0), the first distribution represents
the background if T = 0.8 and the first two distributions qualify as background if T = 0.9. A
foreground object that is stationary will eventually become part of the background, since the
weight is increased each time the color is observed. Pixels matching distributions beyond the
first B are designated foreground.
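The background/foreground split in Equation 25 can be sketched as follows; this is an illustrative NumPy snippet, not the dissertation's implementation, and `num_background` is a hypothetical helper name:

```python
import numpy as np

def num_background(weights, T):
    """Return B, the number of highest-weight distributions whose
    cumulative weight first exceeds the background fraction T (Eq. 25)."""
    w = np.sort(np.asarray(weights, dtype=float))[::-1]  # highest weight first
    cum = np.cumsum(w)
    return int(np.argmax(cum > T) + 1)

# the example from the text: per-pixel weights 0.85, 0.10, 0.05
print(num_background([0.85, 0.10, 0.05], T=0.8))  # -> 1
print(num_background([0.85, 0.10, 0.05], T=0.9))  # -> 2
```

With T = 0.8 only the first distribution is background; raising T to 0.9 admits the first two, matching the worked example in the text.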
If no face was previously detected in an area and there was no motion, there is no need to
search that area again. Therefore, a mask was created with non-zero values where motion was
present or faces were previously detected. The face detection algorithm was modified to only
test rectangles where more than a threshold percentage of pixels are contained in the mask. The
motion detection algorithm alone runs at 25 Hz, and using this motion mask in conjunction with
the face detection algorithm increases its typical performance from 11 Hz to 17 Hz. Of course,
if the entire background were dynamic the motion detection would only slow things down, but
for more static situations, the benefits of the reduced search area outweigh the cost of the motion
detection step.
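The masked-search modification amounts to a cheap per-window test before the classifier runs. A minimal sketch, assuming the mask is a binary image and the 50% threshold is a stand-in value (`rect_passes_mask` is a hypothetical helper, not the modified OpenCV routine):

```python
import numpy as np

def rect_passes_mask(mask, x, y, w, h, min_frac=0.5):
    """Test a candidate detection window against the motion/previous-face
    mask: the expensive face classifier runs only when at least `min_frac`
    of the window's pixels are flagged."""
    window = mask[y:y + h, x:x + w]
    return float(window.mean()) >= min_frac
```

A window fully inside a static, face-free region fails this test and is skipped, which is where the 11 Hz to 17 Hz speedup comes from.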
5.1.2 Face Tracking
Once a face has been detected, we need to track it in subsequent frames, updating
position and orientation. Of the tracking categories listed in Section 2.5.6, we chose feature-
based tracking, using generic corners found in the detected face region.
By “corner”, we don’t necessarily mean a right angle, but a point with a high gradient in
both the horizontal and vertical directions. A point on a vertical edge would have a large
gradient in the horizontal direction but not in the vertical direction, so there would not be a
unique match for that point in another similar image. On the other hand, a point on a corner
would have large horizontal and vertical gradients, allowing a precise match in another image.
Methods for identifying these corners include Harris [48] and Shi and Tomasi [88]. We used the
latter, as implemented in OpenCV. In addition to finding corners that meet a quality threshold,
this function performs non-maxima suppression, and discards weaker corners that are closer than
a distance threshold from stronger corners.
These corners are located in each following frame using a pyramidal implementation of
the Lucas Kanade optical flow method [13]. Optical flow is based on Equation 26, which
basically says whatever image intensity (color) is visible at location (x, y) at time t will also
appear at a slightly different location (x + dx, y + dy) at a slightly later time t + dt.
f(x, y, t) = f(x + dx, y + dy, t + dt)    (26)
A Taylor series expansion yields Equations 27 and 28, where Equation 28 is the classic optical
flow equation.
f(x + dx, y + dy, t + dt) = f(x, y, t) + \frac{\partial f}{\partial x} dx + \frac{\partial f}{\partial y} dy + \frac{\partial f}{\partial t} dt    (27)

f_x \, dx + f_y \, dy + f_t \, dt = 0    (28)
The basic premise for Lucas Kanade optical flow for a small region [66] is that the
motion in a small window is constant. For a window size of 3x3 and frames that are 1 time unit
apart, this gives a system of 9 equations that can be solved by least squares, as in Equation 29.
Offsetting the position in the second image by (dx, dy) and iterating allows the estimate to be
refined.
\begin{bmatrix} f_{x_{11}} & f_{y_{11}} \\ f_{x_{12}} & f_{y_{12}} \\ \vdots & \vdots \\ f_{x_{33}} & f_{y_{33}} \end{bmatrix} \begin{bmatrix} dx \\ dy \end{bmatrix} = - \begin{bmatrix} f_{t_{11}} \\ f_{t_{12}} \\ \vdots \\ f_{t_{33}} \end{bmatrix}    (29)
This only works if the motion is less than one pixel. To increase both speed and the scale
of the motion that can be tracked, image pyramids are used. The pyramid is built by
successively downsizing the image by a factor of 2 in each direction each time, creating a
hierarchy. The tracking is done in the coarsest level first, and the offset calculated there is scaled
appropriately and used as the initial estimate at the next finest level. Thus a pyramid with 4
levels allows motion of up to 8 pixels without searching the whole 16x16 pixel area.
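The single-window solve of Equation 29 is a two-unknown least-squares problem. A minimal NumPy sketch, assuming the spatial and temporal derivatives over the 3x3 window are already computed (`lk_window_flow` is a hypothetical helper, not the pyramidal OpenCV routine the text uses):

```python
import numpy as np

def lk_window_flow(fx, fy, ft):
    """Solve the 3x3-window Lucas-Kanade system (Eq. 29) by least squares.
    fx, fy, ft are the derivative images over the window."""
    A = np.stack([fx.ravel(), fy.ravel()], axis=1)  # 9x2 system matrix
    b = -ft.ravel()                                 # right-hand side
    (dx, dy), *_ = np.linalg.lstsq(A, b, rcond=None)
    return dx, dy
```

In the pyramidal version described above, this solve runs at the coarsest level first, and the scaled-up result seeds the next finer level.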
Once the point locations in the new frame have been found, the two sets of points
(original points (xi, yi) and next frame points (xi’, yi’)) can be fit to a model to find the change in
object (face) position and orientation. In the simplest model, the points are assumed to all have
the same translation, and the displacement ( )21,bb can be found using Equation 30.
\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} , \qquad \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} x_1' - x_1 \\ y_1' - y_1 \\ \vdots \\ x_n' - x_n \\ y_n' - y_n \end{bmatrix}    (30)
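For the pure-translation model, the least-squares solution of Equation 30 reduces to the mean displacement of the tracked corners. A one-line sketch (illustrative; `fit_translation` is a hypothetical helper name):

```python
import numpy as np

def fit_translation(pts, pts_next):
    """Least-squares solution of Eq. 30: the mean corner displacement.
    pts and pts_next are Nx2 arrays of (x, y) and (x', y') coordinates."""
    return (pts_next - pts).mean(axis=0)
```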
Expanding the model to incorporate a scale factor s results in Equation 31. In the
translation model, the origin of the image coordinates didn’t matter, but here the face needs to be
scaled relative to the center of the face. The transformation includes shifting the coordinates so
that the face center at ( )cc yx , becomes the origin, and shifting back after the scale operation.
The three parameters can be obtained by solving the system of equations in Equation 32.
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
= \underbrace{\begin{bmatrix} 1 & 0 & b_1 \\ 0 & 1 & b_2 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{translate by } b_1, b_2}
\underbrace{\begin{bmatrix} 1 & 0 & x_c \\ 0 & 1 & y_c \\ 0 & 0 & 1 \end{bmatrix}}_{\text{translate by } x_c, y_c}
\underbrace{\begin{bmatrix} s & 0 & 0 \\ 0 & s & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{scale by } s}
\underbrace{\begin{bmatrix} 1 & 0 & -x_c \\ 0 & 1 & -y_c \\ 0 & 0 & 1 \end{bmatrix}}_{\text{translate by } -x_c, -y_c}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \begin{bmatrix} s(x - x_c) + x_c + b_1 \\ s(y - y_c) + y_c + b_2 \\ 1 \end{bmatrix}    (31)
\begin{bmatrix} x_1 - x_c & 1 & 0 \\ y_1 - y_c & 0 & 1 \\ \vdots & \vdots & \vdots \\ x_n - x_c & 1 & 0 \\ y_n - y_c & 0 & 1 \end{bmatrix} \begin{bmatrix} s \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} x_1' - x_c \\ y_1' - y_c \\ \vdots \\ x_n' - x_c \\ y_n' - y_c \end{bmatrix}    (32)
If we continue to build the model by adding an in-plane rotation, we get Equation 33.
Once again, the scale and rotation must be relative to the face center, so the coordinates are
translated first, and then translated back after the scale and rotation. Since the scale is the same
in the x and y directions, the order of the scale and the rotation does not matter.
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & b_1 \\ 0 & 1 & b_2 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & x_c \\ 0 & 1 & y_c \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} s & 0 & 0 \\ 0 & s & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & -x_c \\ 0 & 1 & -y_c \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \begin{bmatrix} s\cos\theta \,(x - x_c) - s\sin\theta \,(y - y_c) + x_c + b_1 \\ s\sin\theta \,(x - x_c) + s\cos\theta \,(y - y_c) + y_c + b_2 \\ 1 \end{bmatrix}    (33)
In order to obtain a linear system of equations, we substitute a_1 = s\cos\theta and a_2 = s\sin\theta. The original parameters can be extracted as s = \sqrt{a_1^2 + a_2^2} and \theta = \tan^{-1}(a_2 / a_1). The four parameters in this model can be found using Equation 34.
\begin{bmatrix} x_1 - x_c & -(y_1 - y_c) & 1 & 0 \\ y_1 - y_c & x_1 - x_c & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ x_n - x_c & -(y_n - y_c) & 1 & 0 \\ y_n - y_c & x_n - x_c & 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} x_1' - x_c \\ y_1' - y_c \\ \vdots \\ x_n' - x_c \\ y_n' - y_c \end{bmatrix}    (34)
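The four-parameter fit of Equation 34, including the recovery of s and \theta from a_1 and a_2, can be sketched in NumPy as follows (an illustrative sketch, not the dissertation's implementation; `fit_similarity` is a hypothetical helper name):

```python
import numpy as np

def fit_similarity(pts, pts_next, center):
    """Least-squares fit of the 4-parameter model (Eq. 34), with
    a1 = s*cos(theta) and a2 = s*sin(theta) about the face center."""
    xc, yc = center
    x = pts[:, 0] - xc
    y = pts[:, 1] - yc
    n = len(pts)
    A = np.zeros((2 * n, 4))
    A[0::2] = np.stack([x, -y, np.ones(n), np.zeros(n)], axis=1)
    A[1::2] = np.stack([y, x, np.zeros(n), np.ones(n)], axis=1)
    rhs = np.empty(2 * n)
    rhs[0::2] = pts_next[:, 0] - xc
    rhs[1::2] = pts_next[:, 1] - yc
    (a1, a2, b1, b2), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    # undo the substitution to recover the original parameters
    s = np.hypot(a1, a2)
    theta = np.arctan2(a2, a1)
    return s, theta, b1, b2
```

With exact correspondences the fit recovers the scale, roll, and translation; with noisy corner tracks the least-squares solution averages the errors out.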
The models up to this point have parameters that directly correspond to the rigid motion
in the video, although these parameters don’t describe all possible motion, such as yaw and pitch
(a.k.a. pan and tilt). The affine model, in Equation 35, combines scaling and rotation and allows shear, which is unlikely to actually occur in the video but could be falsely introduced by errors in the tracking of the point features.
\begin{bmatrix} x' - x_c \\ y' - y_c \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & b_1 \\ a_3 & a_4 & b_2 \end{bmatrix} \begin{bmatrix} x - x_c \\ y - y_c \\ 1 \end{bmatrix}    (35)
Other models, such as biquadratic in Equation 36,
x' = a_1 + a_2 x + a_3 y + a_4 x^2 + a_5 y^2 + a_6 xy
y' = a_7 + a_8 x + a_9 y + a_{10} x^2 + a_{11} y^2 + a_{12} xy    (36)
bilinear in Equation 37,
x' = a_1 + a_2 x + a_3 y + a_4 xy
y' = a_5 + a_6 x + a_7 y + a_8 xy    (37)
and pseudo perspective in Equation 38
x' = a_1 + a_2 x + a_3 y + a_4 x^2 + a_5 xy
y' = a_6 + a_7 x + a_8 y + a_4 xy + a_5 y^2    (38)
provide a linear solution, but don’t have parameters that relate directly back to 3D orientation of
the head. While more complex models may describe the change in the scene more accurately,
they also run the risk of overfitting, i.e., introducing changes in the global model that do not
match the data in an effort to decrease the error at selected points.
5.1.3 Integration of Detection and Tracking
Combining the detection and tracking modules is not as straightforward as it would first
appear. If we simply create a new object for every face detected, several problems are quickly
apparent. A scene with one stationary face will produce a pile of tracked objects, as a new object
is initialized on top of the old one when a face is detected in each new frame. If a piece of the
background is falsely detected as a face for a single frame, it will be tracked as a face (and
remain stationary), with no means to terminate it. Another possibility is that the tracked points
will drift away from the face and latch on to parts of the background.
Many face tracking applications avoid these problems by assuming that a scene has
exactly one face in it. This is reasonable for a video conversation using a webcam, but we wish
to allow multiple people to interact with our augmented reality application. To solve this
problem we propose the following method for integrating detection and tracking data. This
process is illustrated in Figure 89.
Figure 89: Integrating detection and tracking data
First, motion detection is done, and the positions of the faces present in the previous
frame are updated. Next, face detection is done in areas where motion occurred and a currently
tracked face does not exist. Since we already updated the face locations, these positions should
be more accurate than using the positions from the previous frame. We mentioned previously
that reducing the search area for the face detector yields faster execution. We exclude tracked
faces from the search area because we have a faster method for handling those areas.
The idea behind the next part is that we want to stop tracking areas that fail the face
detector. Previously tracked faces that overlap with a detected face are retained. Those that
don’t are tagged for further testing. Any detected faces that don’t overlap with a previously
tracked face are added to the list of tracked faces. The tagged faces may have failed the detector
for two main reasons: the region isn’t really a face, or the face may be at an orientation that our
frontal face detector can’t handle. In order to keep tracking faces that are no longer frontal, we
warp the face image using the tracked orientation and render a frontal view. If the tracked
orientation was correct and it really was a face, the frontal face detector should yield a positive
result when applied to the warped image. Since this test for a face is pass/fail (i.e., we don’t
need the exact location, just whether or not the region contains a face), we can use a faster
version of the face detector that returns when it finds the first face.
By integrating the tracking data and the face detector, we eliminate patterns that were
only detected as a face for a single frame. Eliminating static regions from the face detection
search area not only speeds up the execution, but also prevents the possibility of static regions of
the background being falsely detected as faces. This idea can be expanded by requiring that a
face be detected for n frames before augmentation is displayed, or allowing it to fail the face
detector for n frames before the track is discarded.
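The keep/re-test/add logic above can be sketched as follows. This is an illustrative skeleton, not the original system: `verify_frontal` stands in for the warp-to-frontal re-detection step, and a simple axis-aligned rectangle intersection is assumed as the overlap criterion.

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap test; rects are (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def update_tracks(tracked, detected, verify_frontal):
    """One frame of the integration step: keep tracked faces confirmed by
    the detector, re-test the rest after warping to frontal, and start
    tracking any detection that matches no existing face."""
    kept = []
    for face in tracked:
        if any(overlaps(face, d) for d in detected):
            kept.append(face)        # confirmed by the frontal detector
        elif verify_frontal(face):
            kept.append(face)        # passed the pass/fail warped re-test
        # otherwise: discard the track
    for d in detected:
        if not any(overlaps(d, f) for f in tracked):
            kept.append(d)           # newly detected face
    return kept
```

The n-frame hysteresis mentioned above would wrap this with per-track counters rather than dropping a track on its first failure.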
5.1.4 Results
We processed a 491 frame sequence with various configurations. We will use precision
and recall to quantify the results. Precision is the percentage of detections that were correct: true positives / (true positives + false positives). Recall is the percentage of actual faces that the
detector found: true positives / (true positives + false negatives). Ideally, both should be as close
to 100% as possible. The results are summarized in Table 6.
Using only the face detector (no motion detection, no tracking), 298 faces were detected.
Of these, 241 were actually faces, and 57 were not. In addition, 71 actual faces were not
detected because the roll angle of the face exceeded the limits of the frontal face detector. This
yields a precision of 81% and recall of 77%.
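The arithmetic behind these percentages is just the two ratios defined above; a small sketch using the detection-only counts from the text (`precision_recall` is a hypothetical helper name):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# detection only: 241 true positives, 57 false positives, 71 false negatives
p, r = precision_recall(241, 57, 71)
print(round(p * 100), round(r * 100))  # -> 81 77
```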
Next, we added motion detection. This eliminated false positive faces in areas where
there was no motion. The same 241 actual faces were found, with just 18 false positives. The
same 71 false negatives remained, since we did not do anything that would compensate for the
roll angle yet. This increased the precision to 93% and the recall remained at 77%.
Tracking was added using a simple translation model. The detection criterion was
changed to require detection in two consecutive frames. True positives were reduced by 6 to 235
because the first frames of each of 6 continuous segments where a face was present were
rejected. In batch mode, we could go back and accept those 6 detections, but with real-time
processing, we can’t see into the future, and don’t want the delay associated with lagging the
output video by a frame. False positives were reduced to just 5. There were two cases where
false detections occurred for consecutive frames; one for 2 frames resulting in one false positive,
and the other for 5 frames resulting in 4 false positives. The same 71 false negatives were still
present since the motion model doesn’t deal with roll angle yet, plus the 6 frames at the
beginning of tracked segments, for a total of 77. The precision is 98% and the recall is 75%.
The last step increased the complexity of the tracking model to handle the roll angle. If
the detector failed to detect a tracked face, the image was rotated by the tracked roll angle and
the face detector was invoked on this warped image. The only remaining false negatives were
the first frames of each of two tracked segments, leaving 310 true positives. The same 5 false
positives were still present. The precision is 98% and the recall is 99%.
Table 6: Results from integrating detection and tracking

Description                                        True Pos.  False Pos.  False Neg.  Precision  Recall
Detection only                                        241         57          71         81%       77%
With motion detection                                 241         18          71         93%       77%
Translation tracking model                            235          5          77         98%       75%
Translation, rotation, and scale tracking model       310          5           2         98%       99%
5.2 Initial Pose Estimation
It is tempting to assume that faces found by a system that detects frontal vertical faces are
perfectly aligned, but this is not valid in practice. We observed the faces detected to be rotated
by as much as 20° from straight ahead. Detectors for features like eyes exist, but involve
searching the face at a range of scales, which is computationally expensive. Glasses, beards,
scarves, and other occlusions further complicate the process. Eyes may not be visible with dark
glasses. A mouth may be hidden under a beard.
One method that has been used to determine the initial face pose is based on finding the
pixels that match a skin color model. The distribution of these pixels can be used to find the
orientation of the face when it is first detected. We implemented two variations of this method
with disappointing results.
To improve on this bridge between face detection and pose tracking, we propose a novel
method for quickly determining this initial face orientation based on symmetry. The vast
majority of faces have left-right symmetry. Even the aforementioned occlusions tend to be
symmetric. By operating on a relatively small number of high contrast points (which can later be
used to track the face), the algorithm is much faster than one that uses all the image pixels, and
more accurate than using the distribution of skin color pixels.
5.2.1 Pose from Skin Color Detection
In the first method for finding the skin color pixels for each face, we compute a color
histogram of the region returned by the face detector, and assume that the most commonly
occurring colors correspond to the skin of the face. We used 16 bins for each of the three
channels, for a total of 4096 bins. Since the skin color in even a single face image covers
multiple bins, we designated the bins with at least half as many pixels as the bin with the most
pixels as skin color. For example, if the most populous bin had 100 pixels, all bins with at least
50 pixels were considered skin color. Once pixels that matched this skin color model were
identified, moments were calculated to find the orientation of the skin color pixels using
\phi = \frac{1}{2} \tan^{-1}\!\left( \frac{2\mu_{11}}{\mu_{20} - \mu_{02}} \right) ,    (39)

where \mu_{ij} = \sum_k (x_k - \bar{x})^i (y_k - \bar{y})^j.
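The two steps just described, marking the dominant histogram bins as skin color and orienting the resulting pixel mass with Equation 39, can be sketched in NumPy as follows. This is an illustrative sketch, not the original implementation; `skin_bins` and `orientation` are hypothetical helper names, and the 16-bin and 50% figures follow the text.

```python
import numpy as np

def skin_bins(face_pixels, bins=16):
    """Histogram an Nx3 array of color pixels into 16 bins per channel
    (4096 total) and mark every bin holding at least half as many pixels
    as the fullest bin as skin color (the 50% rule described above)."""
    idx = np.asarray(face_pixels) // (256 // bins)             # per-channel bin index
    flat = (idx[:, 0] * bins + idx[:, 1]) * bins + idx[:, 2]   # 0..4095
    hist = np.bincount(flat, minlength=bins ** 3)
    return hist >= hist.max() / 2.0                            # boolean mask over bins

def orientation(points):
    """Eq. 39: orientation from the second central moments of a point set
    (here computed with arctan2 to resolve the quadrant)."""
    x = points[:, 0] - points[:, 0].mean()
    y = points[:, 1] - points[:, 1].mean()
    mu11, mu20, mu02 = (x * y).sum(), (x * x).sum(), (y * y).sum()
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
```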
Figure 90 shows the face detection results for a color image with 24 detected faces. The
skin color pixels detected are shown in magenta in Figure 91. The threshold of 50% for the bins
designated as skin color was used because it provides a balance between too many and too few
skin pixels. This also shows the orientation calculated using the moment of the skin pixels. The
vertical axis is shown in green, and the horizontal axis in red. By visual inspection, only two of
the orientations look correct, and two more are 90 degrees off.
Figure 90: Face detection results.
Figure 91: Skin detection results using local histogram. Pixels detected as skin for each face are shown in magenta. The orientation calculated for each face is shown with the horizontal axis in red and the vertical axis in green.
We also calculated a global skin color model, which could be used to aid the search for
faces. We collected samples of skin images cropped from images showing a wide variety of skin
colors and lighting conditions. Based on our findings from Chapter 3, we converted the images
to chromaticity space and found the histogram of all the images, ignoring very dark pixels
(because chromaticity is unstable at black) as well as those with at least one RGB component at
255 (since those colors may be saturated). The bins with fewer than 10 samples were discarded
as noise. The non-zero bins remaining were considered to represent skin color. The result of
backprojecting this histogram on our test image is shown in blue in Figure 92. Most of the
actual skin was detected, but hair and white shirts were also found, since these are valid skin
colors. The trees (except for the darkest parts), grass and less neutral clothing were not detected
as skin color.
Figure 92: Global skin detection results.
Since the faces were not limited to clearly defined blobs, we masked out the pixels in a
circular region determined by the face detector for our analysis. The same computations as
before were performed using this different set of skin pixels. The results are shown in Figure 93
with the same color scheme as before. Skin pixels are magenta, and the orientation is shown
with a red horizontal axis and a green vertical axis. By visual inspection, five look correct.
Figure 93: Face orientation using global skin detection.
5.2.2 Proposed Point Symmetry Method
Our method [91] is based on the bilateral symmetry of faces, but does not recover
specific facial features. The assumption is that if high contrast points are found on a face image
(i.e., the region returned by a face detection algorithm), they will be distributed equally on the
left and right sides of the face. There are likely to be points on the corners of the eyes,
eyebrows, and mouth, since these are the highest contrast locations on most faces. Since
occlusions like glasses and beards are usually symmetric, the algorithm will handle them without
modification, and may even benefit from them.
The points may be found by a method such as the Harris corner detector [48], which
detects points with high gradients in both the horizontal and vertical directions. An additional
constraint discards corners that are closer together than a threshold. Making this threshold
proportional to the face region size allows the algorithm to work on a wide range of resolutions.
The first approach we tried was to find the orientation θ of the point set from the second central
moments using Equation 39. Likewise, we also tried using least squares to fit a line to the data.
Both of these produced results worse than assuming that the face was vertical. Instead, we
evaluate the symmetry of the point set at various rotations.
Our measurement of symmetry extends the work of Zabrodsky et al. [114]. They
formalize the concept of symmetry as a continuous measure, i.e., expanding from the binary
choice of “symmetric” or “not symmetric” to the continuous idea of “more symmetric” for a
shape. They begin by defining the symmetry transform of a contour that finds the closest
symmetric shape to a given shape, representing a shape as a set of points on its contour. The
symmetry distance is then the distance between the points representing the original shape and the
corresponding points of the symmetry transform. However, the selection of the sets of
corresponding points is left open. In a related paper [113], the symmetry of molecules is
measured in a similar manner. The two sets of corresponding points are limited to isomorphic
pairs of the graphs formed by the atomic bonds.
In our work, we extend the concept of continuous symmetry to general point sets. This
is illustrated in Figure 94. An example point set Pi is shown in (a). For every point Pi in the set,
we flip it about a potential symmetry axis to get Pi’. This is shown in (b) with solid points
representing Pi and hollow points representing Pi’. Then, for every flipped point Pi’, we find the
closest original point Pj. For points on the axis, i = j. A one-to-one correspondence is not
enforced, so if two points were found on the left eye and three were found on the right eye, the
third point can still match an eye point. This matching is shown in (c). The distances between
each closest pair Pi’ and Pj are averaged to get the symmetry distance,
SD = \frac{1}{n} \sum_{i=0}^{n-1} \left\| P_i' - P_j \right\| .    (40)
Thus, a symmetry distance of zero indicates that a point set is perfectly symmetric, and larger
distances mean the shape is less symmetric. The process of finding the axis of symmetry for a
set of points then consists of finding the symmetry distance for each potential axis. The axis
with the smallest symmetry distance is taken to be axis of symmetry for the shape.
Figure 94: Symmetry distance. (a) is the original point set, (b) shows the original filled points and the reflected hollow points, (c) shows the closest filled point to each hollow point. The average of these line lengths is the symmetry distance.
To apply this idea to initial pose estimation for detected faces, we begin by assuming that
the center of the detected face region falls on the symmetry axis. We experimented with using
the mean of the points, but got better results using the region center, due to points detected at the
edge of one side of the face that were not visible on the other side. Once we have a point on the
axis, we can rotate the points about this point and measure the symmetry at each potential angle.
The range of angles we need to test is practically limited by the face detector. Ours only
detected faces within 20° of vertical, so searching from -30° to +30° is reasonable. We used
increments of one degree in this range. The N² distance computations for each axis appear at first to be inefficient, but for low-resolution faces, N = 20 is reasonable, and 400 operations per rotation angle is not overwhelming. Computational geometry methods that improve on the exhaustive O(N²) search, such as building a kd-tree [26], can be applied to speed up this process if needed. A correlation or feature detection algorithm would quickly become more expensive.
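The symmetry distance of Equation 40 and the exhaustive angle search can be sketched as follows; this is an illustrative NumPy version (O(N²) nearest-point search, no kd-tree), and `symmetry_distance` and `best_roll` are hypothetical helper names:

```python
import numpy as np

def symmetry_distance(points, center, angle):
    """Eq. 40: rotate the points about `center` by `angle`, reflect them
    across the vertical axis, and average each reflected point's distance
    to its nearest original (rotated) point."""
    c, s = np.cos(angle), np.sin(angle)
    p = (points - center) @ np.array([[c, -s], [s, c]]).T
    flipped = p * np.array([-1.0, 1.0])       # mirror about the vertical axis
    d = np.linalg.norm(flipped[:, None, :] - p[None, :, :], axis=2)
    return d.min(axis=1).mean()               # no one-to-one matching enforced

def best_roll(points, center, angles=np.deg2rad(np.arange(-30, 31))):
    """Search candidate roll angles in 1-degree steps; the rotation that
    makes the point set most symmetric wins."""
    return min(angles, key=lambda a: symmetry_distance(points, center, a))
```

A perfectly symmetric point set scores zero, and a face rolled by 10 degrees is recovered by the angle that rotates it back to symmetry.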
An example is shown in Figure 95. Image (a) shows a face region with the high contrast corners displayed as black circles. Images (b) and (c) show two different rotations. The rotation of the
actual image is not necessary, and is displayed here for illustration purposes. The black points
show the rotated corners, and correspond to the same facial features on all three images. The
white points have been reflected around the vertical axis, which is the rotation axis being
evaluated. If the rotation being tested makes the face vertical, then a reflected point should land
on the same feature on the other side of the face. The white lines connect each white point with
the closest black point, and the symmetry distance is the sum (or average) of the lengths of all
the white lines. In image (c), not only does the face look vertical, but most of the white lines are
very short, and the sum of the line lengths listed at the bottom of the image is the minimum.
Figure 95: Measuring symmetry. (a) shows the corners in black, (b) and (c) show two different rotations, with the reflected points in white, and lines between matched points.
5.2.3 Point Symmetry Results
To test our algorithm, we used the rotated test set of images from [78]. This test set contains grayscale images that each have one or more frontal faces at many different rotations
and resolutions. A list of the image coordinates of the left and right eye, nose, and left, center,
and right points on the mouth is also provided. From these points, we computed the ground truth
rotation angle as the angle between the vertical axis and the best line through the nose, center
mouth point, and the midpoint between the eyes for each face. Figure 96 shows some sample
input images. A black triangle is shown that connects the ground truth coordinates for the eyes
and mouth center. The symmetry axis is assumed to be on a line through the mouth point and
the midpoint of the line connecting the eyes. There is also a white number showing the index for
that face (“1” for a single face).
Figure 96: Test images. The original image is shown with a black triangle that connects the ground truth coordinates for the eyes and mouth center.
The face detector used in our experiments is the OpenCV [14] implementation and
trained classifiers that are based on Viola and Jones [100]. As previously mentioned, the
algorithm detects frontal faces up to about 20 degrees rotation from vertical. The orientation
algorithm was applied only to the face regions detected.
Some sample results are shown in Figure 97, corresponding to the input images from
Figure 96. The box shows the region detected as a face, small green circles show the high
contrast points found in that region, and an axis shows the orientation found, with red pointing in
the positive x direction and green in the positive y direction. In each of these examples, the
algorithm result was within one degree of the ground truth.
Figure 97: Algorithm results. The box shows the region detected as a face, small green circles show the high contrast points found in that region, and an axis shows the orientation found, with red pointing in the positive x direction and green in the positive y.
The quantitative results are summarized in Table 7. The average rotation in the original
data set was 14.3 degrees, with only 4 of the 30 faces rotated less than 5 degrees. A least squares
line fit (fit the best line, and then offset the slope by a multiple of 90° so that the result is
between ±45°) yielded a larger average error of 26.0 degrees, with only 3 of the 30 within 5
degrees of the correct rotation.
For our proposed method, the average rotation error in the 30 faces detected was 6
degrees. The error was less than 1 degree for 23% of the detected faces (<1 degree is the best
error possible when using 1 degree increments), and within 5 degrees for 73% of the detected
faces.
Table 7: Roll angle calculation results

Description               Average Error (degrees)   % within 1 degree   % within 5 degrees
Assume face is vertical            14.3                     0%                  13%
Least squares line fit             26.0                     0%                  10%
Proposed method                     6.0                    23%                  73%
We also tested our method on the color image used for the methods based on skin color
detection. The results are shown in Figure 98. The high contrast points are shown in blue, the
red line indicates the horizontal axis of the face, and the green line shows the vertical axis. By
visual inspection, six appear to be incorrect.
Figure 98: Face orientation results using our point symmetry method. The red line indicates the horizontal axis, the green line indicates the vertical axis, and the high contrast points are shown in blue.
5.2.4 Conclusion
We have presented a novel method for determining the initial face orientation using the
symmetry of high contrast points. This method is fast, since the computations are based on the
point coordinates, not image pixels, and is more accurate than other fast methods, such as finding
the orientation of skin color pixels. While not as precise as more complex methods, such as
Active Appearance Models presented later, this is still useful as a way to quickly estimate a
starting pose for face tracking.
5.3 Extension of the Active Appearance Model for Face Tracking
An overview of Active Appearance Models (AAMs) was presented in Section 2.5.6.
Tracking with AAMs requires first building a model offline, then determining the model
parameters that match each input frame. The standard 2D AAM provides 2D position, size, in-
plane rotation, and shape parameters, but obtaining the full 3D orientation that we need for
augmentation requires a combined 2D and 3D AAM. In this section, we will present the existing
methods and our extensions for each task.
5.3.1 Building an Active Appearance Model
The active appearance model was introduced by Cootes et al. [20] as the next step from
their earlier work with Active Shape Models (ASMs) [21]. The approach creates linear models
of shape variation and appearance (gray level) variation to model the object.
The input to the process is a set of images with feature points marked (usually by hand)
on each image. Cootes et al. [20] used 122 points for their face images, while Matthews and
Baker [67] used 68 vertices. First the variation of the 2D point locations in the images is
analyzed to create a mean shape and orthogonal shape modes. The shape s is defined as the
coordinates of the v vertices of the mesh:

\mathbf{s} = (x_1\ y_1\ x_2\ y_2\ \ldots\ x_v\ y_v)^T .   (41)

Any shape in the input set can then be represented as

\mathbf{s} = \mathbf{s}_0 + \sum_{i=1}^{n} p_i \mathbf{s}_i ,   (42)

where \mathbf{s}_0 is the mean shape, and the n shape vectors \mathbf{s}_i and coefficients p_i represent the shape
variation. This is a 2D model, and will record 3D variations such as turning the head to the side
as changes in the projected 2D shape.
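As a small illustration of Equation 42 (a numpy sketch with made-up mode vectors, not data from our model):

```python
import numpy as np

# Equation 42: s = s0 + sum_i p_i * s_i, for a toy 3-vertex mesh.
# Shapes are stored as (x1, y1, x2, y2, x3, y3); the mode vectors are made up.
s0 = np.array([0.0, 0.0, 1.0, 0.0, 0.5, 1.0])          # mean shape
S = np.array([
    [0.1, 0.0, -0.1, 0.0, 0.0, 0.0],   # mode 1: spread the base horizontally
    [0.0, 0.1, 0.0, 0.1, 0.0, -0.2],   # mode 2: move the apex vertically
])

def synthesize_shape(p):
    """Linear shape model: mean shape plus weighted shape modes."""
    return s0 + p @ S

s = synthesize_shape(np.array([2.0, -1.0]))
```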
Each image is then warped to match the mean shape, and the intensity variations within
the mean face shape are processed to create a mean appearance and orthogonal appearance
modes. The appearance A(x) is the image defined over the pixels x inside the mean shape s_0. Any
appearance in the set can be represented as

A(\mathbf{x}) = A_0(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i A_i(\mathbf{x}), \quad \forall \mathbf{x} \in \mathbf{s}_0 ,   (43)

where A_0(\mathbf{x}) is the mean appearance, and the m appearance modes A_i(\mathbf{x}) and coefficients \lambda_i
model the appearance variation.
Cootes et al. created a model that combined shape and appearance modes, dubbed a
combined AAM, but both Matthews and Baker and our method treat shape and appearance
separately, and call this an independent AAM.
The input needed for creating the shape model is a set of shapes with corresponding
points identified. Our first contribution to AAMs is in collecting these shapes. Instead of
collecting images or video that show different people in a wide range of face poses and expressions,
and hand marking on the order of 100 points on each, we require only about 30 points marked on
a single frontal image for each different person. The 30 points are used to determine the global
translation and rotation as well as the shape and animation parameters to fit a modified version
of the Candide generic 3D face model [12] to the image. From these parameters, we can then
synthesize a wide variety of poses and expressions for each person from the model.
The Candide model consists of a set of base vertices, a list of triangles that connect these
vertices, and shape and animation units. Shape and animation units provide linear variation of
the base vertices. Mathematically, shape and animation units are equivalent, but shape units
represent rigid shape variations between individuals and should remain constant for a given
person. Animation units represent non-rigid expression changes, like raising the eyebrows or
opening the mouth.
We made some minor modifications to the Candide model for this application. To avoid
problems with hair or hats at the top of the head, we removed the vertices and associated
triangles above the eyebrows. This matches the face area that was used by Cootes et al. and by
Matthews and Baker, although both used denser sets of points along the edges of the face. We
also limited the animation units to match [25], leaving six animation parameters that control the
mouth and eyebrows. For shape parameters, we eliminated four of fourteen where insufficient
information was available in the 2D image.
In our implementation, for each input face image, the user marks points in a sequence.
The location of the next point to be marked is prompted by highlighting the point on a rendering
of the wireframe model. Figure 99(a) shows the image when all but the last point have been
marked. The prompt for the last point is shown in (b), and the resulting mesh fit on the face is
shown in (c). When the process is complete, a text file containing the shape and animation
parameters is saved, along with a 256x256 image file containing the texture map for the face, that is,
the result of warping each triangle in (c) to match the base shape in (b).
(a) (b) (c)
Figure 99: Offline generic face model fitting process. (a) shows the points identified so far, (b) shows the prompt for the last point, and (c) shows the mesh that has been fit overlaid on the image.
The fitting was done by solving for the translation (tx, ty), scale k and in-plane rotation φ,
parameterized as (a, b) = (k cos φ, k sin φ), shape parameters σi, and animation parameters τi,
with the linear system of equations:
\begin{bmatrix}
\tilde{v}_{1x} & -\tilde{v}_{1y} & 1 & 0 & \tilde{s}^{\sigma_1}_{1x} & \cdots & \tilde{s}^{\sigma_{10}}_{1x} & \tilde{s}^{\tau_1}_{1x} & \cdots & \tilde{s}^{\tau_6}_{1x} \\
\tilde{v}_{1y} & \tilde{v}_{1x} & 0 & 1 & \tilde{s}^{\sigma_1}_{1y} & \cdots & \tilde{s}^{\sigma_{10}}_{1y} & \tilde{s}^{\tau_1}_{1y} & \cdots & \tilde{s}^{\tau_6}_{1y} \\
\vdots & \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & & \vdots \\
\tilde{v}_{nx} & -\tilde{v}_{ny} & 1 & 0 & \tilde{s}^{\sigma_1}_{nx} & \cdots & \tilde{s}^{\sigma_{10}}_{nx} & \tilde{s}^{\tau_1}_{nx} & \cdots & \tilde{s}^{\tau_6}_{nx} \\
\tilde{v}_{ny} & \tilde{v}_{nx} & 0 & 1 & \tilde{s}^{\sigma_1}_{ny} & \cdots & \tilde{s}^{\sigma_{10}}_{ny} & \tilde{s}^{\tau_1}_{ny} & \cdots & \tilde{s}^{\tau_6}_{ny}
\end{bmatrix}
\begin{bmatrix} a \\ b \\ t_x \\ t_y \\ \sigma_1 \\ \vdots \\ \sigma_{10} \\ \tau_1 \\ \vdots \\ \tau_6 \end{bmatrix}
=
\begin{bmatrix} m_{1x} \\ m_{1y} \\ \vdots \\ m_{nx} \\ m_{ny} \end{bmatrix}   (44)

where (m_{ix}, m_{iy}) are the image coordinates of the ith marked point, (\tilde{v}_{ix}, \tilde{v}_{iy}) are the coordinates of
the base vertex in the Candide model (ignoring z) corresponding to the ith marked point,
(\tilde{s}^{\sigma_j}_{ix}, \tilde{s}^{\sigma_j}_{iy}) is the displacement for the ith vertex in the jth shape unit, and (\tilde{s}^{\tau_j}_{ix}, \tilde{s}^{\tau_j}_{iy}) is the
displacement for the ith vertex in the jth animation unit. In this discussion, 3D vectors will be
denoted with a tilde, to distinguish them from 2D vectors.
The input to the second step of the model generation process is a list of the images
processed in the first step, and the 3D shape and animation parameters and texture obtained from
each. For each input image, 100 2D shapes are generated by randomly changing the azimuth,
elevation, and animation parameters of the 3D model and projecting the vertex coordinates back
to 2D. The roll angle is not varied because it is removed when the shapes are aligned. Not only
does this provide many shapes that cover the whole envelope of possible parameter variation
after hand marking only one image, but the projected shape model contains all the points from
the 3D model, not just the few that were marked to establish the shape and animation parameters.
Following the procedure from Cootes et al. [21], the shapes must then be aligned and
normalized, so the 2D shape modes that are created capture only true shape change, not scale,
translation, or in-plane rotation. They use a modification of the Procrustes method to minimize a
weighted sum of squares of distances between equivalent points on different shapes.
First, weights for each vertex are calculated to give more importance to those that vary
the least among the set. The weight, wk, for the kth vertex is given by
w_k = \left( \sum_{j=0}^{n-1} V_{R_{kj}} \right)^{-1}   (45)

where R_{kj} is the distance between points k and j, and V_{R_{kj}} is the variance in this distance over the
set of shapes. A point that varies greatly will have a large sum of variances, resulting in a small
weight. A point that remains relatively stable will have a smaller sum of variances, giving a
greater weight, and so will have a greater effect on the shape normalization. The weights for each
point are collected in a diagonal matrix W.
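This weighting can be sketched in numpy as follows (an illustrative sketch with toy shapes; the names are ours, not the dissertation's):

```python
import numpy as np

def vertex_weights(shapes):
    """Equation 45: weight each vertex by the inverse of the summed
    variances of its distances to every other vertex across the shape set.

    shapes: array of shape (num_shapes, num_vertices, 2).
    Returns one weight per vertex; stable vertices get larger weights.
    """
    # Pairwise distances for every shape: (num_shapes, v, v).
    diffs = shapes[:, :, None, :] - shapes[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Variance of each distance R_kj over the set of shapes: (v, v).
    var = dists.var(axis=0)
    return 1.0 / var.sum(axis=1)

# Toy data: vertex 2 jitters between shapes while vertices 0 and 1 are fixed.
shapes = np.array([
    [[0, 0], [1, 0], [0.5, 1.0]],
    [[0, 0], [1, 0], [0.5, 1.4]],
    [[0, 0], [1, 0], [0.5, 0.6]],
])
w = vertex_weights(shapes)
```

The stable vertices receive larger weights than the jittering one, so they dominate the alignment.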
Aligning shape 1 with shape 2 consists of finding the rotation φ, scale s, and translation t
= (tx, ty) that minimizes
\left( \mathbf{x}_1 - \begin{bmatrix} s\cos\varphi & -s\sin\varphi \\ s\sin\varphi & s\cos\varphi \end{bmatrix} \mathbf{x}_2 - \mathbf{t} \right)^T \mathbf{W} \left( \mathbf{x}_1 - \begin{bmatrix} s\cos\varphi & -s\sin\varphi \\ s\sin\varphi & s\cos\varphi \end{bmatrix} \mathbf{x}_2 - \mathbf{t} \right) .   (46)
Solving this equation using least-squares results in the following linear system:
\begin{bmatrix} X_2 & -Y_2 & W & 0 \\ Y_2 & X_2 & 0 & W \\ Z & 0 & X_2 & Y_2 \\ 0 & Z & -Y_2 & X_2 \end{bmatrix}
\begin{bmatrix} s\cos\varphi \\ s\sin\varphi \\ t_x \\ t_y \end{bmatrix}
=
\begin{bmatrix} X_1 \\ Y_1 \\ C_1 \\ C_2 \end{bmatrix} ,   (47)

where

X_i = \sum_{k=0}^{n-1} w_k x_{ik}   (48)

Y_i = \sum_{k=0}^{n-1} w_k y_{ik}   (49)

Z = \sum_{k=0}^{n-1} w_k \left( x_{2k}^2 + y_{2k}^2 \right)   (50)

W = \sum_{k=0}^{n-1} w_k   (51)

C_1 = \sum_{k=0}^{n-1} w_k \left( x_{1k} x_{2k} + y_{1k} y_{2k} \right)   (52)

C_2 = \sum_{k=0}^{n-1} w_k \left( y_{1k} x_{2k} - x_{1k} y_{2k} \right) .   (53)
The process to align the set of shapes is iterative. First, each shape in the set is aligned
with the first. For each iteration, the mean shape is calculated from the aligned shapes, the new
mean shape is aligned with the first shape, and then all shapes are aligned to the mean. This
process is repeated until it converges.
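One pairwise alignment step can be sketched as follows (an illustrative numpy sketch of the Equation 47 solve; the toy shapes and transform are made up):

```python
import numpy as np

def align_pair(shape1, shape2, w=None):
    """Solve the 4x4 system of Equation 47 for a = s*cos(phi), b = s*sin(phi)
    and t mapping shape2 onto shape1. Shapes are (n, 2) vertex arrays."""
    w = np.ones(len(shape1)) if w is None else w
    X1, Y1 = (w * shape1[:, 0]).sum(), (w * shape1[:, 1]).sum()
    X2, Y2 = (w * shape2[:, 0]).sum(), (w * shape2[:, 1]).sum()
    Z = (w * (shape2 ** 2).sum(axis=1)).sum()
    W = w.sum()
    C1 = (w * (shape1 * shape2).sum(axis=1)).sum()
    C2 = (w * (shape1[:, 1] * shape2[:, 0] - shape1[:, 0] * shape2[:, 1])).sum()
    A = np.array([[X2, -Y2, W, 0.0],
                  [Y2, X2, 0.0, W],
                  [Z, 0.0, X2, Y2],
                  [0.0, Z, -Y2, X2]])
    a, b, tx, ty = np.linalg.solve(A, np.array([X1, Y1, C1, C2]))
    return a, b, np.array([tx, ty])

def apply_similarity(shape, a, b, t):
    R = np.array([[a, -b], [b, a]])
    return shape @ R.T + t

# Toy check: shape2 is shape1 under the inverse of a known similarity.
shape1 = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 2.0], [0.5, 1.5]])
M = np.array([[0.8, -0.6], [0.6, 0.8]])
t_true = np.array([2.0, -1.0])
shape2 = (shape1 - t_true) @ np.linalg.inv(M).T
a, b, t = align_pair(shape1, shape2)
```

Because the toy pair is related by an exact similarity, the least-squares solve recovers it exactly.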
Armed with a set of aligned shapes and a mean, we are ready to perform principal
component analysis (PCA) to find the shape variation modes. First the mean shape is subtracted
from each shape, and then a 2n by 2n covariance matrix is formed from the outer product of the
delta shape vectors. Singular value decomposition (SVD) on the covariance matrix yields
orthogonal eigenvectors, which describe the 2D shape mode variation. Figure 100 shows the
first three shape modes for our model.
Figure 100: The first three shape modes. The circles show the mean shape, and the lines show the magnitude and direction of the positive (green) and negative (red) displacements.
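The PCA step can be sketched in a few lines of numpy (an illustrative sketch on random stand-in shapes, not our face data):

```python
import numpy as np

def shape_pca(aligned_shapes, num_modes):
    """PCA on a set of aligned shapes, each flattened to a 2v-vector.
    Returns the mean shape and the top orthogonal shape modes."""
    X = aligned_shapes.reshape(len(aligned_shapes), -1)   # (num_shapes, 2v)
    mean = X.mean(axis=0)
    D = X - mean                        # delta shape vectors
    C = D.T @ D / len(D)                # 2v x 2v covariance matrix
    # SVD of the symmetric covariance matrix yields orthogonal eigenvectors
    # sorted by decreasing eigenvalue.
    U, S, Vt = np.linalg.svd(C)
    return mean, U[:, :num_modes].T

rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 6, 2))    # 50 random 6-vertex shapes (toy data)
mean, modes = shape_pca(shapes, num_modes=3)
```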
In order to create the mean appearance and the appearance modes, we first warp each
input image to the mean 2D shape determined in the previous step. This is done by using an
affine warp to map pixels in each triangle from the old image into the new image. Once all the
faces are the same shape, the brightness and contrast must be matched. To match image A with
image B, we want to find α and β to minimize
\sum \left( \alpha A + \beta \mathbf{1} - B \right)^2 .   (54)

This yields the linear system:

\begin{bmatrix} A_1 & 1 \\ A_2 & 1 \\ \vdots & \vdots \\ A_n & 1 \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
=
\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix} .   (55)
Premultiplying both sides by the transpose of the first matrix simplifies to:
\begin{bmatrix} A \cdot A & A \cdot \mathbf{1} \\ \mathbf{1} \cdot A & \mathbf{1} \cdot \mathbf{1} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} A \cdot B \\ \mathbf{1} \cdot B \end{bmatrix} , \quad \text{i.e.} \quad \begin{bmatrix} A \cdot A & n\bar{A} \\ n\bar{A} & n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} A \cdot B \\ n\bar{B} \end{bmatrix} ,   (56)

where \bar{A} and \bar{B} are the means of A and B respectively, and n is the number of pixels. Solving
for \alpha and \beta gives:

\alpha = \frac{A \cdot B - n\bar{A}\bar{B}}{A \cdot A - n\bar{A}^2} , \qquad \beta = \bar{B} - \alpha\bar{A} .   (57)
If \bar{B} is zero, this simplifies to

\alpha = \frac{A \cdot B}{A \cdot A - n\bar{A}^2} , \qquad \beta = -\alpha\bar{A} .   (58)
The matched image is therefore

A' = \alpha A + \beta \mathbf{1} = \alpha \left( A - \bar{A}\mathbf{1} \right) = \frac{A \cdot B}{A \cdot A - n\bar{A}^2} \left( A - \bar{A}\mathbf{1} \right) .   (59)
Following the procedure of Cootes et al. [20], we start by using the first image as an
estimate of the mean, setting its mean to zero and variance to one. During each iteration, each
image is aligned to the mean using Equation 59, and then the mean is recalculated from all the
images. This process is repeated until it converges.
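The closed-form fit of Equation 57 can be sketched as follows (illustrative numpy, with a toy image pair):

```python
import numpy as np

def match_brightness_contrast(A, B):
    """Find alpha, beta minimizing sum((alpha*A + beta - B)^2), per the
    closed form of Equation 57."""
    A, B = A.ravel().astype(float), B.ravel().astype(float)
    n = A.size
    Abar, Bbar = A.mean(), B.mean()
    alpha = (A @ B - n * Abar * Bbar) / (A @ A - n * Abar ** 2)
    beta = Bbar - alpha * Abar
    return alpha, beta

# Toy check: B is a known brightness/contrast change of A.
A = np.array([[10.0, 20.0], [30.0, 40.0]])
B = 1.5 * A + 7.0
alpha, beta = match_brightness_contrast(A, B)
```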
Once the mean is calculated, it is subtracted from all the images, and PCA is applied to
create the appearance modes. Figure 101 shows the mean appearance on the left and two of the
appearance modes.
Figure 101: The mean appearance and two appearance modes.
5.3.2 Tracking with an Active Appearance Model
Tracking consists of recovering the model parameters, generating a synthetic image that
resembles the input image as closely as possible. To guide the optimization process, Cootes et
al. [20] assumed a linear relation between the error image (the difference between the input
image and the synthesized image) and the incremental update to the model parameters. This
171
fitting process was reported to take an average of 4.1 seconds on a Sun Ultra. Matthews and
Baker [67] showed a counterexample where the error image is the same for two examples, yet
different increments to the parameters are needed. They then detail a gradient descent approach
to the nonlinear least squares optimization problem that is reported to run at 4.3 ms. per iteration
on a dual 2.4GHz P4 machine. The number of iterations needed depends on the accuracy of the
initial guess.
In the next sections, we will discuss the basic 2D AAM shape fitting process, then the
full system, including appearance variation, brightness matching, 2D shape normalization, and
3D model constraints.
The basic idea of AAM shape fitting is to minimize the square of the difference between
the mean appearance and the input image warped by shape parameters p. Following Matthews
and Baker [67], we use W(x; p) to denote the piecewise affine transform that maps each pixel x
inside the mean 2D shape s0 to a pixel in shape s, using the 2D shape modes obtained from the
method described previously, where
\mathbf{s} = \mathbf{s}_0 + \sum_{i=1}^{n} p_i \mathbf{s}_i .   (60)
That is, to warp a single pixel in x, determine which triangle it is in, and then use
Equation 60 to map each vertex of that triangle to a new location. Find the affine transform that
maps the three old vertices to the three new vertices, and apply that affine transform to the pixel
coordinates. The warp is assumed to be parameterized such that W(x; 0) = x.
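Finding the per-triangle affine transform is a small linear solve; an illustrative numpy sketch (not the dissertation's implementation):

```python
import numpy as np

def triangle_affine(src_tri, dst_tri):
    """Return the 2x3 affine transform mapping the three src vertices
    onto the three dst vertices. Triangles are (3, 2) arrays."""
    # Solve [x y 1] @ X = [x' y'] for the six affine coefficients.
    src_h = np.hstack([src_tri, np.ones((3, 1))])   # homogeneous (3, 3)
    return np.linalg.solve(src_h, dst_tri).T        # (2, 3)

def warp_point(M, p):
    """Apply the 2x3 affine transform to a 2D point."""
    return M @ np.array([p[0], p[1], 1.0])

src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = np.array([[2.0, 1.0], [4.0, 1.0], [2.0, 3.0]])
M = triangle_affine(src, dst)
```

Any point inside the source triangle, not just the vertices, is carried along by the same transform.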
The image intensity as a function of the pixel location is denoted I(x). Thus, we want to
minimize
\sum_{\mathbf{x} \in \mathbf{s}_0} \left[ I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) \right]^2 .   (61)
This is therefore a nonlinear least squares minimization problem, since it has the form
[70]
f(x) = \frac{1}{2} \sum_{j=1}^{m} r_j(x)^2 .   (62)

Defining the Jacobian of r as

J(x) = \left[ \frac{\partial r_j}{\partial x_i} \right]_{j=1,\ldots,m;\; i=1,\ldots,n} ,   (63)

for such problems,

\nabla^2 f(x) = J(x)^T J(x) + \sum_{j=1}^{m} r_j(x) \nabla^2 r_j(x) .   (64)

The Gauss-Newton method drops the second order term, approximating the Hessian as

\nabla^2 f(x) = J(x)^T J(x) .   (65)

For the standard Newton's method line search, the kth parameter update p_k^N is

p_k^N = -\left( \nabla^2 f_k \right)^{-1} \nabla f_k .   (66)

With this approximation for the Hessian, the parameter update p_k^{GN} is generated by

p_k^{GN} = -\left( J_k^T J_k \right)^{-1} J_k^T r_k .   (67)
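A generic Gauss-Newton step per Equation 67, demonstrated on a toy line-fitting residual (our own illustrative sketch, unrelated to the face model itself):

```python
import numpy as np

def gauss_newton_step(r, J):
    """Equation 67: delta = -(J^T J)^{-1} J^T r."""
    return -np.linalg.solve(J.T @ J, J.T @ r)

def fit_line(xs, ys, iters=3):
    """Toy use: fit y = a*x + b by Gauss-Newton. The residual is linear
    in the parameters, so a single step converges exactly."""
    p = np.zeros(2)
    for _ in range(iters):
        r = p[0] * xs + p[1] - ys                      # residuals
        J = np.stack([xs, np.ones_like(xs)], axis=1)   # Jacobian of r
        p = p + gauss_newton_step(r, J)
    return p

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0
p = fit_line(xs, ys)
```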
Obtaining the parameter update requires calculating the residual, rk, and the Jacobian, Jk.
To take advantage of the efficiency of the inverse compositional image alignment
method, we expand Equation 61 as follows:
\sum_{\mathbf{x} \in \mathbf{s}_0} \left[ I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{W}(\mathbf{x}; \Delta\mathbf{p})) \right]^2 .   (68)
Since the incremental parameters are applied to the second term while p is applied to the first
term, p must be updated by composing W(x; p) with W(x; Δp)^{-1}. Details of this operation are in
[67]. A Taylor series expansion around Δp = 0 gives

\sum_{\mathbf{x} \in \mathbf{s}_0} \left[ I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) - \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \Delta\mathbf{p} \right]^2   (69)

since W(x; 0) = x. The Jacobian with respect to Δp is -\nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}}, which does not depend on p, so it
only needs to be computed at initialization. We can obtain the parameter update by either setting the
derivative of Equation 69 to zero or substituting the Jacobian and residual in Equation 67:
\Delta\mathbf{p} = \left( \sum_{\mathbf{x}} \left[ \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \left[ \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right] \right)^{-1} \sum_{\mathbf{x}} \left[ \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \left[ I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) \right]   (70)
The first term is the Gauss-Newton approximation for the Hessian. The only term that depends
on p or the input image is the residual, so the gradient of the mean appearance \nabla A_0, the warp
Jacobian \partial\mathbf{W}/\partial\mathbf{p}, the resulting steepest descent images, and the inverse of the Hessian only need
to be computed once, at initialization. At each iteration, we only need to warp the input image with
the current estimate of p, subtract the mean image, find the dot product with the Jacobian, multiply
by the inverse Hessian, and compose Δp with p.
The warp Jacobian \partial\mathbf{W}/\partial\mathbf{p} is a pair of images (x and y) for each component of p. The value
at each pixel is found by determining the triangle that contains the pixel and forming a linear
combination of the displacements for the triangle vertices from shape vector s. Figure 102
shows the x and y gradients of the mean appearance, \nabla A_0, and Figure 103 shows the warp
Jacobians \partial\mathbf{W}/\partial\mathbf{p} for the first three 2D shape modes. Multiplying these two sets to get \nabla A_0\, \partial\mathbf{W}/\partial\mathbf{p}
results in Figure 104.
Figure 102: X and Y gradients of the mean appearance.
Figure 103: Warp Jacobians for the first three 2D shape modes. The top row is the x component, and the bottom row is the y component.
Figure 104: The steepest descent images obtained by multiplying the gradients in Figure 102 by the warp Jacobians in Figure 103.
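The precompute-once, iterate-cheaply structure described above can be illustrated with a toy one-dimensional translation warp (a minimal numpy sketch of inverse compositional alignment, not the face tracker itself):

```python
import numpy as np

# Template: a smooth 1D bump; the "input frame" is the template shifted by 1.5.
x = np.arange(100, dtype=float)
A0 = np.exp(-((x - 50.0) ** 2) / (2.0 * 10.0 ** 2))

def image(xq):
    return np.interp(xq - 1.5, x, A0)   # I(x) = A0(x - 1.5)

# Precomputed once: steepest-descent image and (scalar) Hessian.
sd = np.gradient(A0)    # grad(A0) * dW/dp, with dW/dp = 1 for translation
H = sd @ sd

p = 0.0
for _ in range(10):
    err = image(x + p) - A0   # warp the input by the current p, subtract template
    dp = (sd @ err) / H       # Gauss-Newton update (as in Equation 70)
    p -= dp                   # compose the warp with the inverse increment
```

Only the warp, subtraction, dot product, and scalar divide happen per iteration; everything else was computed up front.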
Several more sets of parameters are needed to complete our model. The first is
appearance variation. To model both shape and appearance variation, Matthews and Baker [67]
use the technique from [47], which rewrites the sum of squares as the L2 norm:
\left\| I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i A_i(\mathbf{x}) \right\|^2   (71)
This can be decomposed into the vectors projected into the linear subspace spanned by Ai and its
orthogonal complement:
\left\| I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i A_i(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)} + \left\| I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i A_i(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp}   (72)
Due to orthogonality, the appearance variation drops out of the first term, so it no longer depends
on λi. The expression can be minimized by finding p from the first term, then λ from the second
term, but we don’t need λ for our application. Solving for λ, then comparing with the
appearance and shape parameters needed to reconstruct the input images from the appearance
modes would allow different individuals to be recognized. Without appearance parameters, we
just need to minimize
\left\| I(\mathbf{W}(\mathbf{x}; \mathbf{p})) - A_0(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp}   (73)
This can be done as before, with the additional step at initialization modifying the
Jacobian to map it into this subspace:
\mathbf{J}_j(\mathbf{x}) = \nabla A_0 \frac{\partial \mathbf{W}}{\partial p_j} - \sum_{i=1}^{m} \left[ \sum_{\mathbf{x} \in \mathbf{s}_0} A_i(\mathbf{x}) \cdot \nabla A_0 \frac{\partial \mathbf{W}}{\partial p_j} \right] A_i(\mathbf{x})   (74)
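The projection of Equation 74 is a subtraction of the components along each appearance mode; an illustrative numpy sketch with tiny stand-in images:

```python
import numpy as np

def project_out(sd_images, appearance_modes):
    """Equation 74: remove the component of each steepest-descent image
    that lies in the span of the (orthonormal) appearance modes.

    sd_images: (num_params, num_pixels); appearance_modes: (m, num_pixels)."""
    coeffs = sd_images @ appearance_modes.T         # dot with each A_i
    return sd_images - coeffs @ appearance_modes    # subtract the projection

# Toy data: two orthonormal appearance modes over 4 "pixels".
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
sd = np.array([[1.0, 2.0, 3.0, 4.0]])
sd_perp = project_out(sd, A)
```

The result is orthogonal to every appearance mode, so appearance variation drops out of the fitting term.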
The next set of parameters is a global 2D shape similarity transform N(x; q). Since
building the shape model involved removing translation, scale, and in-plane rotation from the
shape data, our model needs to handle such transformations separately. Following Matthews and
Baker [67], we define
\mathbf{N}(\mathbf{x}; \mathbf{q}) = \begin{bmatrix} 1+a & -b \\ b & 1+a \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix} ,   (75)
where 1 + a = k cos φ, b = k sin φ, k is the scale, φ is the in-plane rotation, and tx and ty are the
translations. We can put this in the same form as the shape model by defining
\mathbf{s}_1^* = \mathbf{s}_0 = (x_1\ y_1\ \ldots\ x_v\ y_v)^T
\mathbf{s}_2^* = (-y_1\ x_1\ \ldots\ -y_v\ x_v)^T
\mathbf{s}_3^* = (1\ 0\ 1\ 0\ \ldots\ 1\ 0)^T
\mathbf{s}_4^* = (0\ 1\ 0\ 1\ \ldots\ 0\ 1)^T   (76)

and normalizing. Therefore

\mathbf{N}(\mathbf{x}; \mathbf{q}) = \mathbf{s}_0 + \sum_{i=1}^{4} q_i \mathbf{s}_i^* .   (77)
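Building the four similarity basis vectors of Equation 76 can be sketched as (an illustrative numpy sketch):

```python
import numpy as np

def similarity_basis(s0):
    """Equation 76: four basis shapes spanning the 2D similarity
    transforms of the mean shape s0, given as (x1, y1, ..., xv, yv)."""
    x, y = s0[0::2], s0[1::2]
    s1 = s0.copy()                                 # scale component
    s2 = np.empty_like(s0)                         # in-plane rotation
    s2[0::2], s2[1::2] = -y, x
    s3 = np.tile([1.0, 0.0], len(x))               # x translation
    s4 = np.tile([0.0, 1.0], len(x))               # y translation
    basis = np.stack([s1, s2, s3, s4])
    return basis / np.linalg.norm(basis, axis=1, keepdims=True)  # normalize

s0 = np.array([0.0, 0.0, 2.0, 0.0, 1.0, 2.0])
B = similarity_basis(s0)
```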
Note that N(x; 0) = x. Our function to minimize is now
\left\| I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) - A_0(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp}   (78)
The composition of N and W is achieved by first warping the mean shape by N, finding
the affine transformation for each triangle, then warping the mean shape by W, and applying the
affine transformations to the result. If a vertex is a member of more than one triangle, the
transformation for each triangle is applied, and the results are averaged.
To avoid precision problems possible when subtracting images of different ranges, we
added brightness and contrast parameters b = (α β)T. The new function to minimize is
\left\| (1+\alpha)\, I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) + \beta - A_0(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp}   (79)
Since the parameters so far only describe the 2D shape, and we wish to recover the 3D
rotation of the head, we consider the approach by Xiao et al. [106], who combined the 2D AAM
with a constraint that the 2D shape matches a projection of a non-rigid 3D model, but provided
few details. Since the 3D model projection parameters include the 3D model orientation, this
will allow us to recover the 3D head rotation. However, Xiao et al. used a 3D model created
from 900 frames of a video of an individual for tracking. Our augmented reality application
cannot handle a 30 second delay between first observing an individual and starting the
augmentation. Therefore, we used the Candide model instead. In addition to the benefits we
derived from using this model for creating the 2D AAMs, using it during tracking also allows us
to recover meaningful parameters. The 3D model built by Xiao et al.
doesn’t have useful labels for the non-rigid parameters, but the Candide model was built to use
MPEG-4 action units, so can be used for video compression as well as other applications such as
recognizing facial expressions.
The 3D model constraints are added as a second term, so our final function to minimize
is:
\left\| (1+\alpha)\, I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) + \beta - A_0(\mathbf{x}) \right\|^2_{\mathrm{span}(A_i)^\perp} + K \sum_i \left\| \mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q})_i - \left[ (1+f)\, \mathbf{R}(\varphi, \theta, \psi) \left( \tilde{\mathbf{s}}_0 + \sum_j \sigma_j \tilde{\mathbf{s}}_j + \sum_j \tau_j \tilde{\mathbf{a}}_j \right)_i + \mathbf{t} \right] \right\|^2 .   (80)
The first term describes the 2D AAM fitting and is evaluated for each pixel inside the mean
shape. The second term penalizes differences between the 2D shape and the projection of the 3D
shape, and is evaluated at each vertex. The parameter K makes this a soft constraint. The new
parameters are a vector r = (f, φ, θ, ψ, tx, ty)T, and σ and τ, which are the coefficients for the 3D
shape and animation parameters respectively. The scale factor is (1+f), R(φ, θ, ψ) is the 3x3
rotation matrix formed by roll angle φ, pitch angle θ, and yaw angle ψ:
\mathbf{R}(\varphi, \theta, \psi) =
\begin{bmatrix} \cos\varphi & -\sin\varphi & 0 \\ \sin\varphi & \cos\varphi & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} \cos\psi & 0 & \sin\psi \\ 0 & 1 & 0 \\ -\sin\psi & 0 & \cos\psi \end{bmatrix}
=
\begin{bmatrix}
\cos\varphi\cos\psi - \sin\varphi\sin\theta\sin\psi & -\sin\varphi\cos\theta & \cos\varphi\sin\psi + \sin\varphi\sin\theta\cos\psi \\
\sin\varphi\cos\psi + \cos\varphi\sin\theta\sin\psi & \cos\varphi\cos\theta & \sin\varphi\sin\psi - \cos\varphi\sin\theta\cos\psi \\
-\cos\theta\sin\psi & \sin\theta & \cos\theta\cos\psi
\end{bmatrix}   (81)
and the translation is t = (tx, ty)T. It is not strictly necessary to use q in the 3D term, but using it
allows the vertex warps from the 2D term to be reused. Note that r is parameterized so that r = 0
results in the identity transform.
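For concreteness, composing R from the roll, pitch, and yaw angles can be sketched as (illustrative numpy, using the axis convention above):

```python
import numpy as np

def rotation_matrix(roll, pitch, yaw):
    """R(phi, theta, psi): roll about z, pitch about x, yaw about y."""
    c, s = np.cos(roll), np.sin(roll)
    Rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    c, s = np.cos(pitch), np.sin(pitch)
    Rx = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    c, s = np.cos(yaw), np.sin(yaw)
    Ry = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return Rz @ Rx @ Ry

R = rotation_matrix(0.1, -0.2, 0.3)
```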
Expanding to show the incremental update yields:
\left\| (1+\alpha)\, I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) + \beta - (1+\Delta\alpha)\, A_0(\mathbf{N}(\mathbf{W}(\mathbf{x}; \Delta\mathbf{p}); \Delta\mathbf{q})) - \Delta\beta \right\|^2_{\mathrm{span}(A_i)^\perp}
+ K \sum_i \left\| \mathbf{N}(\mathbf{W}(\mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q}); \Delta\mathbf{p}); \Delta\mathbf{q})_i - \left[ (1+f+\Delta f) \begin{bmatrix} 1 & -\Delta\varphi & \Delta\psi \\ \Delta\varphi & 1 & -\Delta\theta \\ -\Delta\psi & \Delta\theta & 1 \end{bmatrix} \mathbf{R}(\varphi, \theta, \psi) \left( \tilde{\mathbf{s}}_0 + \sum_j (\sigma_j + \Delta\sigma_j)\, \tilde{\mathbf{s}}_j + \sum_j (\tau_j + \Delta\tau_j)\, \tilde{\mathbf{a}}_j \right)_i + \mathbf{t} + \Delta\mathbf{t} \right] \right\|^2 .   (82)
Taylor series expansion around [Δb Δq Δp Δr Δσ Δτ] = 0 results in:
\left\| (1+\alpha)\, I(\mathbf{N}(\mathbf{W}(\mathbf{x}; \mathbf{p}); \mathbf{q})) + \beta - A_0(\mathbf{x}) - A_0(\mathbf{x})\,\Delta\alpha - \Delta\beta - \nabla A_0 \frac{\partial \mathbf{N}}{\partial \mathbf{q}} \Delta\mathbf{q} - \nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \Delta\mathbf{p} \right\|^2_{\mathrm{span}(A_i)^\perp}
+ K \sum_i \left\| \mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q})_i + \frac{\partial \mathbf{N}}{\partial \mathbf{q}} \Delta\mathbf{q} + \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \Delta\mathbf{p} - \tilde{\mathbf{v}}_i^{xy} - \mathbf{t} - \begin{pmatrix} -\tilde{v}_{iy} \\ \tilde{v}_{ix} \end{pmatrix} \Delta\varphi - \begin{pmatrix} 0 \\ -\tilde{v}_{iz} \end{pmatrix} \Delta\theta - \begin{pmatrix} \tilde{v}_{iz} \\ 0 \end{pmatrix} \Delta\psi - \frac{\tilde{\mathbf{v}}_i^{xy}}{1+f}\, \Delta f - (1+f)\, \mathbf{R}(\varphi, \theta, \psi) \sum_j \tilde{\mathbf{s}}_j\, \Delta\sigma_j - (1+f)\, \mathbf{R}(\varphi, \theta, \psi) \sum_j \tilde{\mathbf{a}}_j\, \Delta\tau_j - \Delta\mathbf{t} \right\|^2 ,   (83)

where

\tilde{\mathbf{v}} = (1+f)\, \mathbf{R}(\varphi, \theta, \psi) \left[ \tilde{\mathbf{s}}_0 + \sum_i \sigma_i \tilde{\mathbf{s}}_i + \sum_i \tau_i \tilde{\mathbf{a}}_i \right] .   (84)
The Jacobian for the 2D portion (evaluated per pixel) is
\mathbf{J}_\alpha(\mathbf{x}) = -A_0(\mathbf{x})
\mathbf{J}_\beta(\mathbf{x}) = -1
\mathbf{J}_\mathbf{q}(\mathbf{x}) = -\nabla A_0 \frac{\partial \mathbf{N}}{\partial \mathbf{q}}
\mathbf{J}_\mathbf{p}(\mathbf{x}) = -\nabla A_0 \frac{\partial \mathbf{W}}{\partial \mathbf{p}}
\mathbf{J}_\mathbf{r}(\mathbf{x}) = \mathbf{J}_\sigma(\mathbf{x}) = \mathbf{J}_\tau(\mathbf{x}) = \mathbf{0}   (85)
Note that all these quantities can be computed at initialization. The Jacobian for the 3D portion
(evaluated per vertex) is
\mathbf{J}_\mathbf{b}(\mathbf{v}) = \mathbf{0}
\mathbf{J}_\mathbf{q}(\mathbf{v}) = K\, \frac{\partial}{\partial \mathbf{q}} \mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q})
\mathbf{J}_\mathbf{p}(\mathbf{v}) = K\, \frac{\partial}{\partial \mathbf{p}} \mathbf{N}(\mathbf{W}(\mathbf{s}; \mathbf{p}); \mathbf{q})
\mathbf{J}_\varphi(\mathbf{v}) = -K \left( -\tilde{v}_y,\ \tilde{v}_x \right)^T
\mathbf{J}_\theta(\mathbf{v}) = -K \left( 0,\ -\tilde{v}_z \right)^T
\mathbf{J}_\psi(\mathbf{v}) = -K \left( \tilde{v}_z,\ 0 \right)^T
\mathbf{J}_f(\mathbf{v}) = -K\, \frac{\tilde{\mathbf{v}}^{xy}}{1+f}
\mathbf{J}_\mathbf{t}(\mathbf{v}) = -K
\mathbf{J}_\sigma(\mathbf{v}) = -K (1+f)\, \mathbf{R}(\varphi, \theta, \psi)\, \tilde{\mathbf{s}}
\mathbf{J}_\tau(\mathbf{v}) = -K (1+f)\, \mathbf{R}(\varphi, \theta, \psi)\, \tilde{\mathbf{a}}   (86)
The Jacobians for the 3D term generally do depend on the model parameters, so they must be
recomputed each time, but since there are fewer vertices than pixels, this is not nearly as
expensive as recomputing the 2D Jacobians each iteration.
The Hessian must be recomputed each iteration as well, since it depends on the
Jacobians. As described in [4], the Hessian for the sum of the 2D and 3D terms is the sum of the
2D Hessian (which is constant) and the 3D Hessian (which is not). This combined Hessian must
then be inverted each iteration to solve for the incremental parameters.
5.3.3 AAM Experimental Results
The solution of any gradient descent algorithm is dependent on the initial guess. For
now, we initialize the translation and size using the center and width of the rectangle returned by
the face detection module, and initialize the orientation angles to zero. Further refinements will
be discussed later on. We apply progressive transformation complexity, as suggested by
Matthews and Baker [67], solving for the shape normalization parameters q, before the shape
variation parameters p. This gets the mesh in the correct location and orientation before
adapting to the non-rigid changes. Convergence is declared when the change in the parameters
between iterations is less than a threshold.
Figure 105 shows several iterations of the fitting process. Of these images, only the last
two without the wireframe overlay are computed as part of the process. The others are for
visualization and debug only. In each set, (a) shows the 2D mesh warped by the current
estimates for p and q overlaid on the original image. In addition, the result of the face detection
module is shown with a rectangle. The 3D mesh warped by the current estimates for r, σ, and τ
is overlaid on the original image and shown in (b). The input image warped by p and q is in (c),
with the mean shape mesh added for visualization. The error image, I(N(W(x; p); q)) − A_0(x), is
displayed in (d).
The first row shows the initial guess. Note that the mouth is open in (a) and closed in (b).
This is due to the mean 2D shape having the mouth part way open (because it averaged shapes
with the mouth closed and the mouth fully open) and the base 3D shape model having the mouth
closed. The animation parameters τ are adjusted during the first iteration so that the 2D and 3D
meshes look alike in subsequent frames. The second row shows the intermediate result after 5
iterations, when the q parameters (the 2D shape normalization) have converged. It takes 8 more
iterations for the p parameters to converge, as shown in the third row. Since we used the 3D
model to constrain the 2D shape to valid projections of the 3D shape, we can recover the 3D
position, size, and orientation of the resulting fit, providing the parameters needed for
augmentation.
(a) (b) (c) (d)
Figure 105: Iterations 0, 5, and 12 of fitting the AAM. Column (a) is the 2D shape on the original image with the face detection result shown by a rectangle. Column (b) is the 3D shape on the original image. The warped input image I(N(W(x; p); q)) with the mesh overlaid is shown in column (c), and the error image I(N(W(x; p); q)) − A_0(x) is shown in column (d).
The example above shows the fitting process for a single image. We can refine the initial
guess, and therefore reduce the number of iterations required by employing the initial pose
estimation technique presented in Section 5.2. In order to use AAMs to track a face efficiently
in video, we need to provide the best possible guess for starting each subsequent frame. Using
the results from frame i to initialize frame i + 1 resulted in an average of 9 iterations per frame.
By applying the translation obtained by point tracking (from Section 5.1.2), that dropped to 3
iterations per frame. Using the translation, rotation, and scale obtained from point tracking
further improved the initial guess so that only 2 iterations were required on average. The cost of
point tracking is negligible compared to a single iteration of AAM. We further reduced the
model complexity by making the 3D shape constant after the first frame. The 3D animation
units are still updated to reflect expression changes, but the 3D shape units that reflect the rigid
head shape are fixed after the initial fit for a given head instance.
5.3.4 Optimization
Extensive optimization was done with the aid of the profiler in Microsoft Visual Studio
6.0. This optimization achieved a speedup of nearly 50, and made the difference between an
interactive application and one that runs too slowly for interaction. In this section, we will
discuss the kinds of improvements that were made to achieve this performance boost. The
profiler results that follow were measured on a single-processor 1.7 GHz Pentium 4. For
repeatable testing of the algorithms without the overhead of capturing the live video and
displaying it, we used a timing test application that processed frames from a movie file, calling
the same functions as the live application. The use of a profiler is crucial to optimization in
order to identify the operations that take the most time and to verify the speed improvement
achieved at each step. For example, a 10% improvement in a function that takes 50% of the
processing time yields a 5% improvement in overall speed, but making a function that takes 1%
of the time 10 or even 100 times faster will not yield more than a 1% improvement in
performance of the whole application.
In the first profiler run, using the first 50 frames of the 720x480 resolution video, motion
detection was reported to take 538 ms. per frame, face detection took 92 ms. per frame, and the
AAM took 263 ms. per iteration (not including initialization). The first round of optimization
replaced the most frequently used simple functions with inline code, thus eliminating the
function call overhead. Good candidates for this optimization were class methods that simply
returned class member variables, because these functions were called frequently, yet were only
one or two lines of code. This optimization makes the code run faster while still maintaining the
object oriented control of the data within the class. One example is
CActiveAppearanceModel::GetTriangle(), which returns the indices of a triangle within the
AAM mesh. This function was called 79 million times, and took 11.4% of the total time.
Data was converted to more computationally efficient quantities, like 32-bit fixed point
integer images instead of floating point, and storing the variance for the motion model instead of
the standard deviation. The integer images allowed operations on the images (like dot products)
to be done with integer arithmetic, which is typically significantly faster than floating point
math. Care was taken to be sure that the precision was sufficient for the algorithm to still work,
and that the integer values did not overflow.
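The flavor of this conversion can be sketched as follows (a hypothetical Python illustration; the dissertation's code operated on 32-bit integer images in C++, and a wide integer type stands in for them here):

```python
import numpy as np

FRAC_BITS = 8  # fractional bits; chosen so precision suffices and sums fit

def to_fixed(img):
    """Convert a float image to fixed point (a wide integer type stands
    in for the 32-bit integer images of the real implementation)."""
    return np.round(img * (1 << FRAC_BITS)).astype(np.int64)

def fixed_dot(a_fixed, b_fixed):
    """Integer dot product; the raw sum carries 2*FRAC_BITS fractional
    bits, so scale back once at the end."""
    return int((a_fixed * b_fixed).sum()) / float(1 << (2 * FRAC_BITS))

a = np.array([0.5, 1.25, -2.0])
b = np.array([4.0, 0.25, 1.0])
approx = fixed_dot(to_fixed(a), to_fixed(b))
```

All of the per-pixel work stays in integer arithmetic; the single float division at the end restores the scale.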
The full resolution frame was needed for tracking, but not for motion detection or face
detection, so these two operations were done on 360x240 images. The additional step of
subsampling the high resolution image to generate the smaller image was less computationally
expensive than the time to perform these two operations on four times as many pixels.
Another type of improvement sacrificed accuracy for speed, after verifying that the
simpler operation still had sufficient fidelity for the algorithm to converge correctly. In
particular, after making some of the aforementioned optimizations, the OpenCV library function
cvGetRectSubPix() was taking the longest. This function uses interpolation to find the pixel
value at non-integer coordinates, and was used while warping the input image by the current
parameter estimate. Using the nearest neighbor pixel instead of performing the interpolation
resulted in a 12% speed improvement, with the algorithm still converging at the correct answer.
Not all attempts were this successful. We tried to reduce the size of the images used for the
gradient descent process, starting with the mean appearance, from 256x256 to 128x128, but
found that the algorithm tended to not converge at that resolution.
After the code was fast enough to run 100 frames for the timing test, motion detection
took 91 ms. per frame (6 times speedup), face detection took 33 ms. (2.8 times speedup), and the
AAM took 20 ms. per iteration (13 times speedup). The most expensive component of each
AAM iteration was the dot product between the error image and each of the steepest descent
images for the 2D shape fitting. Optimization was done on this function, ensuring all math was
integer and skipping the borders outside the envelope of the valid error image. This was an
example where there were tradeoffs between reducing and simplifying operations. Image
operations like finding the difference between the warped input image and the mean image or
finding the dot product between two images are only valid for the pixels inside the mean shape.
There are pixels around the edges that are always zero, and do not affect the result of the
operation. Operating on all the pixels results in simpler code, since no conditional test is
required to identify which pixels are inside the mean shape. On the other hand, testing a mask
image to determine which pixels are potentially non-zero disrupts the pipelined operation of the
processor in addition to the extra instruction required for each pixel, so each loop iteration takes
longer. In our case, we found that the code ran slower using the mask. On the other hand, it
doesn’t cost anything to start at the first non-zero pixel and end at the last non-zero pixel, instead
of the corners of the image, since that only changes the loop start and end values. This is what
we used to achieve the best performance of the image processing steps like dot products.
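The loop-bound idea can be sketched as (an illustrative Python sketch; the real code operated on C++ image buffers):

```python
import numpy as np

def row_bounds(mask):
    """Per row: first and one-past-last nonzero columns, computed once,
    so inner loops skip the always-zero borders with no per-pixel test."""
    bounds = []
    for row in mask:
        nz = np.flatnonzero(row)
        bounds.append((nz[0], nz[-1] + 1) if nz.size else (0, 0))
    return bounds

def masked_dot(a, b, bounds):
    """Dot product over the precomputed bounds only; just tighter loop
    limits, no mask lookup inside the loop."""
    total = 0.0
    for y, (x0, x1) in enumerate(bounds):
        total += float(np.dot(a[y, x0:x1], b[y, x0:x1]))
    return total

mask = np.array([[0, 1, 1, 0],
                 [0, 0, 0, 0],
                 [1, 1, 1, 1]])
a = np.arange(12, dtype=float).reshape(3, 4) * mask  # zero outside the mask
b = np.ones((3, 4))
result = masked_dot(a, b, row_bounds(mask))
```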
Memory allocation has traditionally been a time consuming process. Although modern
techniques have decreased this cost, for optimal performance it is better to allocate temporary
arrays needed each frame or each iteration of the algorithm at initialization. This saves both the
allocation and the deallocation time.
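A minimal illustration of this preallocation pattern (the class and buffer names are hypothetical, with NumPy standing in for the image library):

```python
import numpy as np

class AAMFitter:
    """Sketch of the preallocation pattern: scratch images needed every
    frame are allocated once here, not inside the per-frame loop."""

    def __init__(self, height, width):
        self.warped = np.empty((height, width), dtype=np.int32)
        self.error = np.empty((height, width), dtype=np.int32)

    def iterate(self, frame_region, mean_image):
        # Reuse the preallocated buffers (out=...) instead of letting
        # each call allocate and free new temporaries.
        np.copyto(self.warped, frame_region)
        np.subtract(self.warped, mean_image, out=self.error)
        return self.error
```

Each call writes into the same buffers, so neither allocation nor deallocation appears in the per-frame cost.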
The order of the operations was examined to avoid recalculating the same quantities. For
example, the vertex positions after warping by the current parameter estimate were needed for
both the 2D and 3D portions of the calculation. These vertex positions were calculated once and
used for both. A related optimization was to factor the equations efficiently. Each of the terms of the 3D Jacobian contains the factor K. In the process of computing the dot product between the
Jacobian and the warped vertices, the terms are summed. This can be implemented more
efficiently by computing the Jacobians without the factor K, then multiplying the dot product
result by K.
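The factoring rests on the identity sum(K·J_i · v_i) = K·sum(J_i · v_i), which can be checked numerically; J stands for the Jacobian terms computed without K and v for the warped vertex positions (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2.5                    # shared constant factor in every Jacobian term
J = rng.random((10, 3))    # Jacobian terms computed *without* the factor K
v = rng.random((10, 3))    # warped vertex positions

# Naive: multiply every Jacobian term by K, then take the dot product.
naive = np.sum((K * J) * v)

# Factored: a single multiplication by K after summing gives the same
# result, replacing one multiply per term with one multiply at the end.
factored = K * np.sum(J * v)

assert np.isclose(naive, factored)
```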
By the time the test was increased to 200 frames, motion detection took 16 ms. (33 times
speedup) and the AAM took 10.7 ms. per iteration (24 times speedup). A general equation
solving function was replaced with the explicit solution for finding the affine transformations for
the triangle warps. The function was called for every triangle in each iteration of the AAM
fitting process. This avoided validity checks on the input arguments, tests to determine the type
of solving algorithm needed, and a couple of levels of function call overhead. This particular
solution required inverting two 3x3 matrices, which can be done with closed form equations.
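The core of the closed-form triangle solve can be written out directly from the cofactors of a 3x3 matrix. This NumPy sketch shows the idea for one matrix (the dissertation's version inverted two such matrices in optimized C; the function name is illustrative):

```python
import numpy as np

def triangle_affine(src, dst):
    """Affine transform mapping triangle `src` onto triangle `dst`.

    With M = [[x0,y0,1],[x1,y1,1],[x2,y2,1]] built from the source
    vertices, the affine matrix is A = inv(M) @ D. Writing inv(M) out
    from its cofactors avoids a general equation solver: no argument
    validity checks, no algorithm-selection tests, no call overhead.
    """
    (x0, y0), (x1, y1), (x2, y2) = src
    det = x0 * (y1 - y2) - y0 * (x1 - x2) + (x1 * y2 - x2 * y1)
    inv_m = np.array([
        [y1 - y2, y2 - y0, y0 - y1],
        [x2 - x1, x0 - x2, x1 - x0],
        [x1 * y2 - x2 * y1, x2 * y0 - x0 * y2, x0 * y1 - x1 * y0],
    ]) / det
    return inv_m @ np.asarray(dst, dtype=float)  # 3x2 affine matrix A
```

Applying [x, y, 1] @ A to each source vertex reproduces the corresponding destination vertex.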
Use of appropriate data structures also contributed to this phase of optimization.
Computing the parameter update for the inverse compositional algorithm requires averaging the
warp for each vertex over all the triangles that contain the vertex. Since the connectivity of the
mesh does not change, a data structure listing the triangles that use each vertex allows for more
efficient execution than traversing the whole triangle list for each vertex.
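A sketch of that data structure (names are illustrative): the vertex-to-triangle lists are built once at initialization and reused every iteration.

```python
def build_vertex_to_triangles(triangles, n_vertices):
    """Precompute which triangles touch each vertex.

    The mesh connectivity never changes, so this list is built once at
    initialization; the per-iteration parameter update can then average
    the warp over exactly the relevant triangles instead of scanning the
    whole triangle list for every vertex.
    """
    vertex_to_tris = [[] for _ in range(n_vertices)]
    for tri_idx, (a, b, c) in enumerate(triangles):
        for v in (a, b, c):
            vertex_to_tris[v].append(tri_idx)
    return vertex_to_tris
```

Lookup per vertex then costs only the handful of incident triangles rather than a pass over the whole mesh.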
For further streamlining, we created a version of the face detection module that
terminates after finding the first face. This aids the track verification process, which confirms
that the object being tracked is actually a face. In this case, the location and size of the face are not needed. Because tracked regions are verified by this separate process, we remove the areas currently being tracked from the region where new faces may be detected.
In the last run, motion detection took 11.5 ms. per frame (47 times speedup), face
detection took 24.7 ms. per frame (3.7 times speedup), and the AAM took 5.7 ms. per iteration
(46 times speedup). As a final improvement, we decreased the update frequency of the major
components. On even frames, motion detection and face detection are done and the high contrast
points are used to update the tracking parameters for existing faces. On odd frames, the AAM
model is updated. The resulting track is smooth, and enables real time operation.
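The alternating schedule can be sketched as below; `tracker` is a hypothetical object bundling the components described above, with illustrative method names:

```python
def process_frame(frame_idx, frame, tracker):
    """Alternate the expensive components across frames: detection work
    on even frames, AAM refinement on odd frames."""
    if frame_idx % 2 == 0:
        motion = tracker.detect_motion(frame)
        tracker.detect_faces(frame, search_area=motion)
        tracker.update_tracks_from_contrast_points(frame)
    else:
        tracker.fit_aam(frame)
```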
We experimented with adding skin detection as an additional filter to reduce the search
area for new faces, but in a room with white walls, the expense of performing color-based skin
detection exceeds the savings.
On a 2.2 GHz AMD Opteron processor, the application runs at 10 to 15 Hz, tracking one
face in live 720x480 video. Since multiple factors affect the execution rate, including the
amount of motion in the scene and the rate of change of the tracked faces, it is difficult to give a
more precise execution rate.
5.3.5 Conclusion
As with any minimization process, it is possible to end up with a local minimum instead
of the global minimum, converging to an incorrect solution. Our integration of detection and
tracking, discussed in Section 5.1, handles this situation. If the AAM converges to the wrong
solution, the resulting region is likely to fail the face detector, so the face is removed from the
list. A new face will be detected in the next frame, reinitializing the process.
Although AAMs were initially too slow to be useful for real-time face tracking
applications, the evolution of the inverse compositional image alignment method as well as the
ever-increasing processing power in consumer PCs has made it practical. The original 2D-only
AAM also did not provide 3D head orientation, but adding 3D constraints to the original 2D
model has overcome that hurdle as well. We have implemented the existing AAM method and
extended it by simplifying the 2D AAM shape generation process so it requires orders of
magnitude less input data and hand processing. Our modification using an existing non-rigid 3D
head model allows meaningful facial expressions to be extracted from the tracking process,
which was not possible with the prior 2D+3D AAM, making this a useful tool for augmented
reality applications that track human faces.
5.4 Augmentation
Augmentation of a human face adds something that is attached to the head. It can be
external, such as a hat or glasses; on the surface, like a tattoo; or it can modify the face, for
example by substituting a different nose. The pose determined in the previous step must be
stable and accurate to position the virtual piece properly as the head moves.
We have used a simple eyeglass model and a wireframe graduation cap to demonstrate
the concept, but these could easily be replaced with any three dimensional model. The size,
position, and orientation of the face are applied as the virtual model is rendered on top of the
video image so it looks like it is attached to each face in the video. A sample frame is shown in
Figure 106.
Figure 106: Video frame augmented with cap and glasses.
5.5 Conclusions
Augmented reality is a rapidly expanding field that uses both computer vision and
computer graphics. Its interactive nature draws people in as they react to virtual objects added to
the real environment, but also creates challenges in achieving real-time update rates and precise
registration between virtual and real objects.
In our work, we have expanded the capabilities of augmented reality tracking systems
that use minimal hardware with no specialized environment. This simplified setup is suitable for
consumer applications targeted to people’s homes. It also places more of a burden on the
computer vision portion of the system to detect the applicable characteristics of the surroundings
solely from a single video input in an uncontrolled environment, where the background is
unknown and the lighting may vary.
This second augmented reality application detects and tracks human faces. Faces have a
variety of colors, shapes, and additions such as glasses and facial hair. They are also non-rigid, with local changes including jaw movement, eyes opening and closing, and lip motion. The faces
in the image are augmented in real time with virtual glasses, but this can easily be any virtual
object.
We have combined existing methods for face detection, motion detection, and tracking in
an integrated system that detects faces with higher precision and recall, handles a variable
number of faces, and maintains tracks of only those objects that are actually faces. Most face
tracking applications assume exactly one face is present that fills the majority of the screen, but
ours handles a much wider range of conditions.
We have presented a novel algorithm for computing the initial rotation of a detected face
quickly and accurately. We extended earlier symmetry measures to handle unstructured point
sets without the initial step of dividing the points into corresponding pairs. The use of high
contrast points in the detected face region speeds up the algorithm since image coordinates are
used instead of costly image correlations. The points used by the algorithm can be used later to
track the face.
The accuracy achieved by the method is sufficient for visual applications, i.e., those that
are judged by human eyes, including mixed reality and game control from video input.
Additional accuracy typically requires more complex computation, which is not feasible in real-
time applications.
We have implemented Active Appearance Models to refine the face pose estimation and
have presented extensions to the method that simplify model generation as well as provide
meaningful parameters as a result of the tracking to describe the 3D head orientation as well as
facial expressions. These parameters are suitable for video compression, such as MPEG-4.
This provides a critical step in the development of an augmented reality application that
automatically detects faces and determines their pose without prior calibration or manual
intervention.
6 CONCLUSIONS
6.1 Contributions
We have made several contributions while creating these virtual looking glass
applications, including
• Color space selection experiments. Theory and practice for color constancy in the
presence of varying illumination in different color spaces do not always agree. We have
performed experiments measuring the color values recorded by consumer cameras in
varying illumination. From this, a number of color spaces were compared to see how
well they remained constant as the light changed, yet still discriminated between
different colors. We found that some less well-known color spaces perform better than
the traditional ones.
• A real-time algorithm for tracking simple objects in monocular video. Many common
algorithms that track object silhouettes only work for objects with complex feature sets
and run too slowly for interactive use. Our tracking method uses color, edge, and
motion information to achieve real-time rates.
• Optimization methods for real time video processing. Careful optimization can make
the difference between an implementation being “fast enough” or not. We have
achieved speedups of two to ten on various functions involved in tracking a rectangle
and speedups close to 50 for motion detection and face tracking, even starting with an
optimized image processing library.
• Fast video deinterlacing. Consumer level video cameras are interlaced, with the even
and odd lines captured at different times. Moving objects appear to tear when displayed
on a noninterlaced computer display. Our deinterlacing algorithm is fast, maintains the
detail in stationary parts of the frame, and removes the tearing in the moving parts of the
frame.
• An extended continuous symmetry measure that handles not only shapes and graphs but general 2D point sets as well.
• Real-time initial face pose estimation in video. After a face is detected, its orientation
must be determined. This is often done manually, which is not suitable for a general
augmented reality application. Other methods perform costly correlations to find facial
features. We present a novel method for recovering face orientation using generic high
contrast points. Using the coordinates of these points results in a faster algorithm than
using appearance comparisons, and sufficient accuracy, outperforming methods based
on skin detection.
• Enhanced Active Appearance Models (AAMs) for face tracking. Existing methods for
AAMs require more than a hundred manually marked points on each of several hundred
images to create the 2D and 3D models used for tracking, and the parameters extracted
during tracking do not have meaningful labels. By incorporating a parameterized 3D
nonrigid head model, we reduced the manual labeling effort to 30 points on tens of
images, and derive meaningful parameters from the result, such as “jaw drop” or
“eyebrows raised”, while still maintaining the robustness afforded by AAMs.
• Robust face tracking for augmented reality. We integrate motion detection, face
detection, and tracking to create a system that can track multiple faces, filter out
spurious detections in dynamic regions, eliminate detections in static regions, and verify
the track of faces outside the native range of the face detector.
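The motion-adaptive idea behind the deinterlacing contribution above can be illustrated with a generic sketch (not necessarily the exact algorithm of this dissertation): keep the full-resolution pixel where the two fields agree, and interpolate vertically where they differ.

```python
import numpy as np

def deinterlace(frame, threshold=20):
    """Generic motion-adaptive deinterlacing sketch: where the two fields
    agree, the full-resolution pixel is kept (preserving detail in
    stationary areas); where they differ (motion), the second-field line
    is replaced by the average of the lines above and below it, removing
    the tearing on moving objects.
    """
    out = frame.astype(np.int32)
    for y in range(1, out.shape[0] - 1, 2):           # second-field lines
        interp = (out[y - 1] + out[y + 1]) // 2       # vertical average
        moving = np.abs(out[y] - interp) > threshold  # fields disagree
        out[y] = np.where(moving, interp, out[y])
    return out.astype(frame.dtype)
```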
6.2 Future Directions
The augmented reality application that tracks a rectangle described in Chapter 4 is well
suited to use in a mixed reality game. For example, the rectangle could be used as a virtual
paddle in a mixed reality tennis game. In order for this to work well, several improvements are
needed. First, the system needs to recognize when tracking is lost and initiate a recovery
process. If a part of the scene is mistakenly labeled as part of the rectangle, the color model is
updated to include the new colors, which can quickly lead to the whole frame being labeled as
part of the rectangle. In addition to efforts to avoid this situation in the first place, there needs to be logic to detect the failure and recover, using the shape model and the original trained color.
The augmented reality application that detects and tracks faces is also useful in a mixed
reality game situation, as well as human-computer interface and surveillance tasks. Face
detection still takes a large proportion of the available time. We have implemented several
methods to limit the area where detection is needed, including motion detection and skin
detection, but faster detection methods are needed. While the rectangular features used by the
cascaded AdaBoost system are simple to compute, the requirement for scanning all possible
locations at a range of scales still makes face detection in a high resolution image a slow process.
The active appearance model has been shown to be effective for rigid and non-rigid
tracking of objects such as faces. With our use of a parameterized head model, the non-rigid
motion parameters have semantic labels, so operations such as controlling a cursor with jaw and
eyebrow motions are possible with AAMs. However, the speed of the iterative fitting process is still
an issue on consumer hardware. The most computationally expensive operations of each
iteration are computing the error image and calculating the dot product of the warp Jacobians
and the error image. The error image is created by warping the input image and subtracting the
mean image. All of these steps can be done efficiently on a graphics processing unit (GPU).
The bottleneck in many non-graphics uses for a GPU is transferring the results back to the CPU,
since the graphics pipeline is optimized for transfers to the GPU. In this case, the images are
small enough (the mean images we used were 256 by 256) that porting to the GPU will likely
result in faster computation.
Another area in which AAMs can be improved is increasing the range of environments in
which the algorithm will work. The CMU group has been working to use AAMs to track faces
where parts of the face are occluded, whether self-occlusion from rotation or occlusion by another object [46][95]. There is also a need for the algorithm to work on lower resolution
images. Current work in that area includes [76] and [103].
Our current work has included face detection and tracking, but not face recognition.
However, many of the elements needed for face recognition are extracted during the process of
fitting the AAM model. The shape parameters recovered from fitting the parameterized head
model to the image provide anthropometric measurements like the height and separation of the eyes
and width of the mouth. These measurements are likely to remain constant for an individual in spite of environmental factors like lighting, normal appearance changes like glasses, and even intentional disguises. Pose changes usually make measuring these quantities challenging, but
our AAM fitting process accounts for the 3D orientation when fitting the shape parameters. The
other common technique for face recognition is to find the best match between the observed
image and a database of frontal faces normalized in size and brightness. By solving for the λ
(appearance) parameters in the active appearance model, we should be able to do even better
than the standard methods because the AAM appearances have been normalized for pose, face
shape, and expression as well as just size and brightness.
Other more general improvements to the system would be to track the whole body
instead of just the face, to handle a moving camera, and to recognize actions based on the
tracking, to name a few.
Speed improvements are a never-ending cycle, and robustness to occlusion, lighting, and
low resolution are ongoing challenges as well. There is much more that can be done in the area
of augmentation, including more complex graphics, shadows on the virtual objects, and objects
that conform to the face as it changes shape.
The areas explored here have wide application, both in augmented reality and in more
general computer vision fields. Face detection and tracking are widely used in many domains,
including surveillance and security applications, video conferencing, video coding (MPEG-4 is
object-based, not block-based), video indexing (a factor used in classifying scenes is the number
of people present), and human-computer interfaces. In a live teleconference, the speaker’s
identity could be shielded by replacing the texture map or shape with a different one. This
would allow the pose and facial expressions to be communicated using a different face. The
speed and accuracy requirements for these uses vary, but all benefit from the high standard of
interactive rates and precision needed for augmented reality.
The augmented mirror applications have the potential to bring augmented reality into consumers' homes. They have appeal both for pure entertainment and for subtly
conveying information, like the weather forecast (sun hat vs. rain hat) or whether your stock
portfolio is up or down (top hat vs. dunce cap). If extended to recognize individuals, it could put gender-appropriate hats on guests and different hats on different family members, with special hats for special days, like birthdays.
Going beyond entertainment, virtual hairstyles have been applied to images, but what if
the virtual hairstyle was 3D and interactive? Plastic surgeons could use an augmented mirror to
show what a patient would see in the mirror after surgery. Extending from the head to a full
body would open the door to a virtual dressing room, where the mirror shows interactively what
an outfit would look like. The possibilities are truly endless.
REFERENCES
[1] http://www.ai.mit.edu/projects/medical-vision/surgery/surgical_navigation.html
[2] http://artoolkit.sourceforge.net/
[3] M.F. Augusteijn and T.L. Skufca, “Identification of Human Faces through Texture-Based Feature Recognition and Neural Network Technology,” Proc. IEEE Conf. Neural Networks, pp. 392-398, 1993.
[4] S. Baker, R. Gross, and I. Matthews, “Lucas-Kanade 20 Years On: A Unifying Framework: Part 4,” Technical Report CMU-RI-TR-04-14, Robotics Institute, Carnegie Mellon University, 2004.
[5] S. Baker and I. Matthews, “Lucas-Kanade 20 Years On: A Unifying Framework: Part 1: The Quantity Approximated, the Warp Update Rule, and the Gradient Descent Approximation,” International Journal of Computer Vision, 56(3), pp. 221-255, March 2004.
[6] V. Bakic and G. Stockman, “Real-time Tracking of Face Features and Gaze Direction Determination,” Proc. IEEE Workshop on Applications of Computer Vision, pp. 256-257, 1998.
[7] K. Barnard, V. Cardei, and B. Funt, “A Comparison of Computational Color Constancy Algorithms – Part I: Methodology and Experiments with Synthesized Data,” IEEE Trans. on Image Processing, 11(9), pp. 972-983, Sept. 2002.
[8] K. Barnard, F. Ciurea and B. Funt, “Sensor sharpening for computational color constancy,” J. Opt. Soc. Am. A, 18(11), pp. 2728-2743, Nov. 2001.
[9] K. Barnard, G. Finlayson, and B. Funt, “Color Constancy for Scenes with Varying Illumination,” Computer Vision and Image Understanding, 65(2), pp. 311-321, 1997.
[10] K. Barnard, L. Martin, A. Coath, and B. Funt, “A Comparison of Computational Color Constancy Algorithms – Part II: Experiments With Image Data,” IEEE Trans. on Image Processing, 11(9), pp. 985-996, Sept. 2002.
[11] S. Basu, I. Essa and A. Pentland, “Motion Regularization for Model-Based Head Tracking,” Proc. International Conf. Pattern Recognition, Vol. 3, pp. 611-616, Aug. 1996.
[12] http://www.bk.isy.liu.se/candide/
[13] J.-Y. Bouguet, “Pyramidal Implementation of the Lucas Kanade Feature Tracker,” Intel Technical Report, included with OpenCV distribution, 2000.
[14] G. Bradski, A. Kaehler, and V. Pisarevsky, “Learning-Based Computer Vision with Intel’s Open Source Computer Vision Library,” Intel Technology Journal, 9(2), pp. 119-130, 2005.
[15] G. Buchsbaum, “A Spatial Processor Model for Object Colour Perception,” J. of the Franklin Institute, vol. 310, pp. 1-26, 1980.
[16] J.B. Burns, A.R. Hanson and E.M. Riseman, “Extracting Straight Lines,” IEEE Trans. Pattern Analysis and Machine Intelligence, 8(4), pp. 425-455, July 1986.
[17] J. Canny, “Finding Edges and Lines in Images,” Tech Report TR-720, Artificial Intelligence Lab, MIT, Cambridge, MA, 1983.
[18] F. Ciurea and B. Funt, “Tuning Retinex Parameters,” J. Electronic Imaging, 13(1), pp. 58-64, Jan. 2004.
[19] E.M. Coelho, B. MacIntyre, and S.J. Julier, “Supporting Interaction in Augmented Reality in the Presence of Uncertain Spatial Knowledge,” Proc. ACM Symposium on User Interface Software and Technology, pp. 111-114, 2005.
[20] T.F. Cootes, G.J. Edwards, and C.J. Taylor, “Active Appearance Models,” Proc. European Conference on Computer Vision, vol. 2, pp. 484-498, 1998.
[21] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, “Active shape models – their training and application,” Computer Vision and Image Understanding, 61(1), pp. 38-59, Jan. 1995.
[22] M. Corbalán, M.S. Millán, and M.J. Yzuel, “Color measurement in standard CIELAB coordinates using a 3CCD camera: correction for the influence of the light source,” Optical Engineering, 39(6), pp. 1470-1476, June 2000.
[23] T. Darrell, G. Gordon, J. Woodfill, and M. Harville, “A Virtual Mirror Interface using Real-time Robust Face Tracking,” Proc. 3rd Int’l Conf. on Face and Gesture Recognition, pp. 616-621, 1998.
[24] http://www.etc.cmu.edu/projects/magicmirror/index.html
[25] F. Dornaika and J. Ahlberg, “Fast and Reliable Active Appearance Model Search for 3-D Face Tracking,” IEEE Trans. Systems, Man, and Cybernetics – Part B: Cybernetics, 34(4), pp. 1838-1853, 2004.
[26] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry – Algorithms and Applications, Springer-Verlag, Berlin, 2000.
[27] R.O. Duda and P.E. Hart, “Use of the Hough Transform to Detect Lines and Curves in Pictures,” Communications of the ACM, 15(1), pp. 11-15, Jan. 1972.
[28] G.D. Finlayson, “Coefficient Color Constancy,” Ph.D. thesis, Simon Fraser University, School of Computing, 1995.
[29] G.D. Finlayson, M.S. Drew, and B.V. Funt, “Spectral sharpening: sensor transformations for improved color constancy,” J. Opt. Soc. Am. A, 11(5), pp. 1553-1563, 1994.
[30] G.D. Finlayson, S.D. Hordley, and P.M. Hubel, “Color by Correlation: A Simple, Unifying Framework for Color Constancy,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(11), pp. 1209-1221, Nov. 2001.
[31] G. Finlayson and G. Schaefer, “Colour indexing across devices and viewing conditions,” Proc. 2nd Int’l Workshop on Content Based Multimedia and Indexing, pp. 215-221, 2001.
[32] G. Finlayson and G. Schaefer, “Hue that is invariant to brightness and gamma,” Proc. British Machine Vision Conf., pp. 303-312, 2001.
[33] F. Fleuret and G. Geman, “Fast Face Detection with Precise Pose Estimation,” Proc. IEEE International Conf. Pattern Recognition, Vol. 1, pp. 235-238, 2002.
[34] J.D. Foley, A. van Dam, S.K. Feiner and J.F. Hughes, Computer Graphics Principles and Practice, Addison-Wesley, Reading, MA, 1997.
[35] G.L. Foresti, C. Micheloni and L. Snidaro, “A Robust Face Detection System for Real Environments,” Proc. IEEE Int’l Conf. on Image Processing, pp. 897-900, 2003.
[36] D. Forsyth, “A novel algorithm for color constancy,” Int’l Journal of Computer Vision, vol. 5, pp. 5-36, 1990.
[37] Y. Freund and R.E. Schapire, “A Decision-theoretic Generalization of On-line Learning and an Application to Boosting,” J. Computer and System Sciences, 55(1), pp. 119-139, 1997.
[38] B. Funt, F. Ciurea and J. McCann, “Retinex in MATLAB,” J. Electronic Imaging, 13(1), pp. 48-57, Jan. 2004.
[39] B.V. Funt and G.D. Finlayson, “Color Constant Color Indexing,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(5), pp. 522-529, May 1995.
[40] J. Gausemeier, M. Grafe, C. Matysczok, and R. Radkowski, “Optical Tracking Stabilization using Low-Pass Filters,” Proc. IEEE Augmented Reality Toolkit Workshop, pp. 16-17, 2003.
[41] R. Gershon, A.D. Jepson and J.K. Tsotsos, “From [R,G,B] to Surface Reflectance: Computing Color Constant Descriptors in Images,” Perception, pp. 755-758, 1988.
[42] J.-M. Geusebroek, R. van den Boomgaard, A.W.M. Smeulders, and H. Geerts, “Color Invariance,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(12), pp. 1338-1350, Dec. 2001.
[43] J.-M. Geusebroek, R. van den Boomgaard, A.W.M. Smeulders, and A. Dev, “Color and Scale: The Spatial Structure of Color Images,” Proc. Sixth European Conf. Computer Vision, vol. 1, pp. 331-341, 2000.
[44] T. Gevers and A.W.M. Smeulders, “Color Based Object Recognition,” Pattern Recognition, vol. 32, pp. 453-464, 1999.
[45] V. Govindaraju, “Locating Human Faces in Photographs,” Int’l J. Computer Vision, 19(2), pp. 129-146, 1996.
[46] R. Gross, I. Matthews, and S. Baker, “Constructing and Fitting Active Appearance Models with Occlusion,” Proc. IEEE Workshop on Face Processing in Video, June 2004.
[47] G. Hager and P. Belhumeur, “Efficient region tracking with parametric models of geometry and illumination,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(10), pp. 1025-1039, 1998.
[48] C. Harris and M.A. Stephens, “A Combined Corner and Edge Detector,” Proc. 4th Alvey Vision Conference, pp. 147-151, 1988.
[49] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[50] G. Healey and D. Slater, “Global Color Constancy: Recognition of Objects by Use of Illumination-invariant Properties of Color Distributions,” J. Opt. Soc. Am. A, 11(11), Nov. 1994, pp. 3003-3010.
[51] Y. Hu, L. Chen, Y. Zhou, and H. Zhang, “Estimating Face Pose by Facial Asymmetry and Geometry,” Proc. IEEE International Conf. on Automatic Face and Gesture Recognition, 2004.
[52] C.-J. Huang, M.-C. Ho, and C.-C. Chiang, “Feature-based Detection of Faces in Color Images,” IEEE Int’l Conf. on Image Processing, pp. 2027-2030, 2004.
[53] S.-H. Huang and S.-H. Lai, “Real-time Face Detection in Color Video,” Proc. IEEE International Multimedia Modeling Conference, pp. 338-345, 2004.
[54] C. Huang, B. Wu, H. Ai, and S. Lao, “Omni-directional Face Detection Based on Real Adaboost,” IEEE International Conf. on Image Processing, pp. 593-596, 2004.
[55] http://www.intel.com/research/mrl/research/opencv/
[56] M. Isard and A. Blake, “CONDENSATION – conditional density propagation for visual tracking,” Int. J. Computer Vision, 29(1), pp. 5-28, 1998.
[57] R. Kjeldsen, J. Kender, “Finding Skin in Color Images,” IEEE 2nd Intl. Conf. on Automatic Face and Gesture Recognition, pp. 312-317, 1996.
[58] M. La Cascia, S. Sclaroff, and V. Athitsos, “Fast, Reliable Head Tracking under Varying Illumination: An Approach Based on Registration of Texture-Mapped 3D Models,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(4), pp. 322-336, 2000.
[59] E.H. Land, “Recent Advances in Retinex Theory and Some Implications for Cortical Computations: Color vision and the Natural Image,” Proc. Nat. Acad. Sci., vol. 80, pp. 5163-5169, Aug. 1983.
[60] E.H. Land, “The Retinex Theory of Color Vision,” Scientific American, 237(6), pp. 108-129, Dec. 1977.
[61] E.H. Land and J.J. McCann, “Lightness and Retinex Theory,” Journal of the Optical Society of America, vol. 61, pp. 1-11, 1971.
[62] V. Lepetit, L. Vacchetti, D. Thalmann, and P. Fua, “Fully Automated and Stable Registration for Augmented Reality Applications,” Proc. Int’l Symposium on Mixed and Augmented Reality, pp. 93-102, 2003.
[63] V. Lepetit, L. Vacchetti, D. Thalmann, P. Fua, “Real-Time Augmented Face,” Int’l Symposium on Mixed and Augmented Reality, pp. 346-347, 2003.
[64] K.-L. Low and A. Ilie, “View Frustum Optimization To Maximize Object’s Image Area,” Technical Report TR02-024, Department of Computer Science, UNC at Chapel Hill, 2002.
[65] D.G. Lowe, “Object Recognition from Local Scale-Invariant Features,” Proc. Int’l Conf. Computer Vision, pp. 1150-1157, 1999.
[66] B.D. Lucas and T. Kanade, “An Iterative Image Registration Technique with an Application to Stereo Vision,” Proc. International Joint Conference on Artificial Intelligence, pp. 674-679, 1981.
[67] I. Matthews and S. Baker, “Active Appearance Models Revisited,” International Journal of Computer Vision, 60(2), pp. 135-164, Nov. 2004.
[68] P. Milgram, F. Kishino, “A Taxonomy of Mixed Reality Visual Displays”, IEICE Transactions on Information Systems, Vol. E77-D, No. 12, December 1994.
[69] H. Moravec, “Visual Mapping by a Robot Rover,” Proc. Int. Joint Conf. on Artificial Intell., pp. 598-600, 1979.
[70] J. Nocedal and S.J. Wright, Numerical Optimization, Springer, New York, 1999.
[71] E. Osuna, R. Freund and F. Girosi, “Training Support Vector Machines: An Application to Face Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 130-136, 1997.
[72] T.S. Perry, “All in the Game”, IEEE Spectrum, 40(11), pp. 31-35, November 2003.
[73] http://www.pvieurope.com/
[74] A.N. Rajagopalan, K.S. Kumar, J. Karlekar, R. Manivasakan, M.M. Patil, U.B. Desai, P.G. Poonacha and S. Chaudhuri, “Finding Faces in Photographs,” Proc. IEEE Int’l Conf. Computer Vision, pp. 640-645, 1998.
[75] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging: Acquisition, Display, and Image-Based Lighting, Morgan Kaufmann Publishers, San Francisco, CA, 2006.
[76] http://www.ri.cmu.edu/projects/project_448.html
[77] S. Romdhani and T. Vetter, “Efficient, Robust and Accurate Fitting of a 3D Morphable Model,” Proc. IEEE International Conference on Computer Vision, vol. 1, pp. 59-66, 2003.
[78] H.A. Rowley, S. Baluja and T. Kanade, “Neural Network-Based Face Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, 20(1), pp. 23-28, 1998.
[79] T. Sakai, M. Nagao and S. Fujibayashi, “Line Extraction and Pattern Detection in a Photograph,” Pattern Recognition, vol. 1, pp. 233-248, 1969.
[80] A. Samal and P. A. Iyengar, “Human Face Detection Using Silhouettes,” Int’l J. Pattern Recognition and Artificial Intelligence, 9(6), pp. 845-867, 1995.
[81] H. Schneiderman and T. Kanade, “Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 45-51, 1998.
[82] H. Schneiderman and T. Kanade, “A Statistical Method for 3D Object Detection Applied to Faces and Cars,” IEEE Conf. on Computer Vision and Pattern Recognition, pp. 746-751, 2000.
[83] P. Sinha, “Object Recognition via Image Invariants: A Case Study,” Investigative Opthalmology and Visual Science, 35(4), pp. 1735-1740, 1994.
[84] D. Slater and G. Healey, “The Illumination-Invariant Recognition of 3D Objects Using Local Color Invariants,” IEEE Trans. Pattern Analysis and Machine Intell., 18(2), Feb. 1996, pp. 206-210.
[85] S.M. Smith and J.M. Brady, “SUSAN – a new approach to low level image processing,” Int. J. Computer Vision, 23(1), pp. 45-78, 1997.
[86] http://www.sportvision.com/index.cfm?section=tv&cont_id=player&roster_id=34&personnel_id=992
[87] C. Stauffer and W.E.L. Grimson, “Learning Patterns of Activity using Real Time Tracking,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8), pp. 747-757, Aug. 2000.
[88] J. Shi and C. Tomasi, “Good features to track,” Proc. IEEE Conf. Computer Vision and Patt. Recognition, pp. 593-600, 1994.
[89] I. Skrypnyk, and D.G. Lowe, “Scene Modelling, Recognition and Tracking with Invariant Image Features,” Proc. IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 110-119, 2004.
[90] L. Spencer, “Writing Video Applications,” http://www.cs.ucf.edu/~lspencer/vid_app.pdf
[91] L. Spencer and R. Guha, “Determining Initial Head Orientation for Real-Time Video Input,” Proc. 7th International Conference on Computer Games: AI, Animation, Mobile, Educational & Serious Games (CGAMES), Angoulême, France, Nov. 2005.
[92] L. Spencer and R. Guha, “Augmented Reality in Video Games,” Proc. 6th International Conference on Computer Games: AI and Mobile Systems (CGAIMS), Louisville, KY, July 2005.
[93] H. Tamura, “Steady Steps and Giant Leap Toward Practical Mixed Reality Systems and Applications,” Proc. International Status Conf. on Virtual and Augmented Reality, pp. 3-12, 2002.
[94] G. Taubin and D. Cooper, “Object recognition based on moment (or algebraic) invariants,” in J. Mundy and A. Zisserman, eds., Geometric Invariance in Computer Vision (MIT Press, Cambridge, Mass., 1992), pp. 375-397.
[95] B. Theobald, I. Matthews, and S. Baker, “Evaluating Error Functions for Robust Active Appearance Models,” Proc. International Conference on Automatic Face and Gesture Recognition, April 2006.
[96] K. Toyama, “Prolegomena for Robust Face Tracking,” Microsoft Research Technical Report MSR-TR-98-65, 1998.
[97] M. Turk and A. Pentland, “Eigenfaces for Recognition,” J. Cognitive Neuroscience, 3(1), pp. 71-86, 1991.
[98] S. Uchiyama, K. Takemoto, K. Satoh, H. Yamamoto, and H. Tamura, “MR Platform: A Basic Body on Which Mixed Reality Applications Are Built,” Proc. IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 246-253, 2002.
[99] L. Vacchetti, V. Lepetit, and P. Fua, “Combining Edge and Texture Information for Real-time Accurate 3D Camera Tracking,” Proc. IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 48-56, 2004.
[100] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001.
[101] J. von Kries, “Beitrag zur Physiologie der Gesichtsempfindung,” Arch. Anat. Physiol., 2, pp. 505-524, 1878.
[102] http://www.webopedia.com
[103] Z. Wen and T.S. Huang, “Enhanced 3D Geometric-Model-Based Face Tracking in Low Resolution with Appearance Model,” Proc. IEEE International Conference on Image Processing (ICIP), vol. 2, pp. 350-353, Sept. 2005.
[104] M. Woo, J. Neider, and T. Davis, OpenGL Programming Guide, version 1.1, 2nd edition, Addison-Wesley, 1997.
[105] H. Wuest, F. Vial, and D. Stricker, “Adaptive Line Tracking with Multiple Hypotheses for Augmented Reality,” Proc. IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 62-69, 2005.
[106] J. Xiao, S. Baker, I. Matthews, and T. Kanade, “Real-Time Combined 2D+3D Active Appearance Models,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 535-542, June 2004.
[107] J. Xiao, J. Chai, and T. Kanade, “A closed-form solution to non-rigid shape and motion recovery,” Proc. European Conference on Computer Vision, Vol. 4, pp. 573-587, 2004.
[108] G. Yang and T.S. Huang, “Human Face Detection in Complex Background,” Pattern Recognition, 27(1), pp. 53-63, 1994.
[109] M.-H. Yang, D.J. Kriegman, and N. Ahuja, “Detecting faces in images: a survey,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(1), pp. 34-58, 2002.
[110] M.-H. Yang, D. Roth and N. Ahuja, “A SNoW-Based Face Detector,” Advances in Neural Information Processing Systems 12, pp. 855-861, MIT Press, 2000.
[111] A. Yilmaz and M.A. Shah, “Automatic Feature Detection and Pose Recovery for Faces,” Proc. Asian Conference on Computer Vision, 2002.
[112] K.C. Yow and R. Cipolla, “Feature-Based Human Face Detection,” Image and Vision Computing, 15(9), pp. 713-735, 1997.
[113] H. Zabrodsky, S. Peleg, and D. Avnir, “Continuous Symmetry Measures, IV: Chirality,” J. American Chem. Soc., vol. 117, pp. 462-473, 1995.
[114] H. Zabrodsky, S. Peleg, and D. Avnir, “Symmetry as a Continuous Feature,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(12), pp. 1154-1166, Dec. 1995.