
Camera Based Gesture Control for

an Information Appliance for the

Visually Impaired

Master's Thesis in Computer Science

Submitted by

Noam Yogev

born 11th November 1974 in Israel

completed at the

Institut für Informatik

AG Künstliche Intelligenz

Freie Universität Berlin

Prof. Dr. Raúl Rojas

Project commenced: 19th April 2011
Project completed: 10th August 2011


I affirm that I have written this Master's thesis independently and have used no sources or aids other than those cited. I further declare that this thesis has not yet been submitted as part of any other examination procedure. I agree that a copy of my Master's thesis may be made available for loan in the library.

Berlin, 10.VIII.2011

Noam Yogev


I would like to thank the informa team - Roman Guilbourd, Fabian Ruff, Bastian Hecht and Sime Pervan - for their help and their warnings, for staying cool, and for the good music; Tim Landgraf for the occasional inspiration and his contagious enthusiasm; all my friends, who with their questions have driven me to devise new answers; my mother, my father, my sister, my Rona.

Rona, Ilinoy, Mom, Dad, Noya and Talia - thank you!


"All media are extensions of some human faculty - psychic or physical ... the wheel is an extension of the foot, the book is an extension of the eye, clothing - an extension of the skin, electric circuitry - an extension of the central nervous system."

Marshall McLuhan


Contents

1 Introduction

2 Fundamentals
  2.1 Colour models
    2.1.1 RGB colour model
    2.1.2 HSI/HSV colour model
    2.1.3 YCrCb colour model
  2.2 White balance
    2.2.1 Differential RGB white balancing
    2.2.2 Retinex derived white balancing
  2.3 Histogram back projection
  2.4 Foreground-background segmentation
  2.5 Fisher's linear discriminant
  2.6 The Kalman filter
  2.7 Contours
    2.7.1 Contour extraction
    2.7.2 Polygon approximation
    2.7.3 Convex hull

3 Related work
  3.1 Gesture recognition using Kinect
  3.2 A camera-projector system
  3.3 A simple PC and web camera setup

4 Algorithm
  4.1 General approach and overview
    4.1.1 Motivation
    4.1.2 Design overview
  4.2 Detailed description
    4.2.1 Preliminary image processing
    4.2.2 Initial detection and feature extraction
    4.2.3 Tracking
  4.3 Alternative approach

5 Integration
  5.1 Navigation
  5.2 Output control
  5.3 Text input
  5.4 Navigation and output control

6 Evaluation

7 Conclusion

Bibliography


Chapter 1

Introduction

A camera based gesture control interface - in the following cbgci - for an information appliance for the blind and visually impaired is presented. Such an appliance, incorporating a common web camera for its main functionality as a reading device, is being developed by the informa* project group at the Artificial Intelligence Laboratory at the Freie Universität Berlin. cbgci is designed as a control interface for the informa device, utilizing the web camera as a user interface (UI) and thus eliminating the need for an external numeric keypad otherwise required for this purpose.

The introduction of graphical user interfaces to personal computing may be regarded as the first step towards more intuitive control concepts, examples of which may be found today on an ever widening variety of personal and business devices. Users open, close and resize windows, press virtual buttons, make their choices via tick-boxes, and specify desired values using virtual sliders and knobs. Even more intuitive schemes have been conjured in recent years, such as the backwards and forwards finger-swipe motions which may be used to control the behaviour of a web browser, or multi-touch possibilities such as the resizing and rotation of images. Multi-touch in itself may be considered the latest ripe stage in the evolution of tactile control technologies, preceded by touch pads and their more intuitive successors, touch screens.

The immediacy conveyed by the experience with such systems is twofold: not only can the user manipulate a virtual object as it is presented to her via the graphical interface, she also practically performs in the real world the very motion which corresponds with the desired virtual control†. Next to general ease of use and a satisfactory overall user experience, the most important emergent quality of this immediacy, especially with non-experienced users, is a steep learning curve [1].

Presenting blind and visually impaired users with the advantages of a gesture-controlled UI suggested itself during the development of the informa project. In connection with its initial functionality as a reading device, the informa system

*See informa.mi.fu-berlin.de and symplektikon.de for further details. †Assuming an adequate design.


distinguishes itself by relying on a fixed-mounted web camera instead of the flatbed scanner usually found on other platforms. Once the reading out of a text is desired, the web camera is used to acquire a high definition image of the document, which is then sent to a remote server. The server performs the necessary optical character recognition (OCR) and text-to-speech (TTS) tasks, and sends the audio output back to the informa client, which may then play it back for the user. Consequently, the resources of the web camera may be used to perform other tasks during the best part of the active usage of the platform.

On the other hand, the permanent Internet connection dictated by the abovementioned principle of operation suggests the utilization of the device as an information appliance in a much broader sense: various web content may be offered to the user, with the server providing the necessary parsing and TTS. Accessing a variety of content requires in its turn more elaborate control mechanisms, not called for by the initial bare-bones reading functionality. Moreover, depending only on technical arrangements on the server and on the policy of a commercial provider, the variety of content may grow, and with it potentially also the different modes of interaction with it, e.g. the possibility of actively initiating an on-line search.

Thus introducing a video-based gesture control mechanism may

- maximise the workload of the web camera as an integral system resource of the informa device while eliminating the need for external, supplementary hardware

- provide a programmable framework which would allow developers to cope flexibly with most use cases as they might arise in the course of expansion, and to design appropriate UIs

- prove advantageous for the user in comparison with other UI concepts as discussed above

From a user's point of view, the interaction with a camera based gesture control interface should not differ conceptually from the typical experience with a touch pad or a touch screen: once a gesture which is associated with a control action is performed within the camera's field of view (FOV), the corresponding process should be initiated by the system, preferably issuing feedback of sorts to acknowledge the fulfilment of the task (should the outcome of the performed action itself have no apparent manifestation). This pattern of user interaction and system behaviour will henceforth be referred to as a control sequence.

A prerequisite is the system's ability to clearly distinguish between idle periods (in terms of user interaction) and control sequences. Touch dependent designs are granted this distinction by their very principle of operation: a control sequence starts with a touch and ends with it; a double click, too, is realized by measuring the time between two consecutive taps, in themselves elementary control sequences. A camera based gesture control, on the other hand, is bound to determine the beginning and end of a control sequence from the


constantly flowing image data, which is not used solely for this purpose. In other words: whereas touch-based designs, when determining the temporal containment of a gesture, depend on a sensor input which is exclusively reserved for this task (the gesture's actual "payload" lies in the change of position, not in the existence of the touch), a camera based design has to distinguish idle time from control sequences relying on the same domain that may then convey the sequence's content, and consequently also deliver the signal for its termination: the video stream.

The challenge of fulfilling this prerequisite poses itself even more acutely in the current case: the informa device's basic functionality as a reading device rules out a guarantee of a known, contrasting background in front of which control gestures may be performed. It cannot be ruled out that at any given time an unknown document might lie on the device's surface, making up the background for any movement to be registered by the camera. A solution for this issue in this particular context could, however, pave the way for applying video-based gesture control in more general scenarios, e.g. in mobile applications.

cbgci is designed as an API. A software user can outline image regions and define the fundamentals of the gestures that may be performed in each one of them. The regions are then registered with a manager. While operating, video frames from the camera are sent to the manager, which in turn passes the image data to each of the regions for processing along the following outline:

- Each incoming frame is analysed to determine if it initiates a control sequence.

- In case a control sequence should be initiated, additional necessary data is extracted from the image and supportive structures are instantiated within the program.

- If a control sequence is already underway, the current frame is analysed further, and in combination with the system's previous state the gesture's continuation is constructed.

- At any time it may be decided to terminate an ongoing control sequence on the grounds of information extracted from the frame.

cbgci uses different image processing techniques for the fulfilment of each of these operational stages. On completion of this procedure an output is made available, indicating if and what gesture information could be extracted from the registered region within the frame under observation. Output from all registered regions is passed to the manager and made available to the software client, which may then interpret it as control data and consequently perform the associated actions.
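To make this outline concrete, the following is a minimal sketch of what such a region/manager API could look like. All names (GestureRegion, GestureManager, process, feed) are illustrative assumptions for this sketch, not the actual cbgci interface.

```python
# Minimal, illustrative sketch of a region/manager gesture API (hypothetical names).
from dataclasses import dataclass, field

@dataclass
class GestureRegion:
    """An image region in which gestures may be performed."""
    x: int
    y: int
    w: int
    h: int
    active: bool = False              # is a control sequence underway?
    state: dict = field(default_factory=dict)

    def process(self, frame):
        """Analyse the sub-image and return gesture output (or None)."""
        roi = frame[self.y:self.y + self.h, self.x:self.x + self.w]
        if not self.active:
            if self._initiates_sequence(roi):        # stage 1
                self.active = True
                self._init_support_structures(roi)   # stage 2
            return None
        if self._terminates_sequence(roi):           # stage 4
            self.active = False
            return None
        return self._continue_sequence(roi)          # stage 3

    # Placeholders standing in for the image-processing stages described above.
    def _initiates_sequence(self, roi): ...
    def _init_support_structures(self, roi): ...
    def _continue_sequence(self, roi): ...
    def _terminates_sequence(self, roi): ...

class GestureManager:
    """Collects per-region gesture output for the client application."""
    def __init__(self):
        self.regions = []

    def register(self, region: GestureRegion):
        self.regions.append(region)

    def feed(self, frame):
        """Pass one video frame to every registered region."""
        return {id(r): r.process(frame) for r in self.regions}
```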


On the structure of this thesis

Various technologies and algorithms utilized by cbgci will be discussed in chapter 2, encompassing image processing paradigms like histogram equalisation, automatic white balancing and foreground-background distinction, as well as the Kalman filter estimation model and different approaches to colour representation. Chapter 3 offers an overview of some works in the same field, pointing out similarities and differences with the presented system in terms of requirements, preconditions, approach and result. The system's algorithm will be discussed in chapter 4: following an outline of the workflow, details will be given on the various processing stages. Conceptual as well as software engineering challenges that were met during development will be discussed together with their implemented or suggested future solutions. A description of the integration of the system and its current and possible future usage scenarios within the framework of the informa device will be given in chapter 5. An evaluation of the presented system is given in chapter 6. The concluding chapter offers an outlook on future work and possibilities for further development of the presented system.


Chapter 2

Fundamentals

2.1 Colour models

Colour models are mathematical representations of colour. Using a distinct mapping function from a range of number tuples into a space of absolute colour values, the colour space of a respective model may be established [2]. Tkalcic and Tasic distinguish in [3] between three different categories:

- Colour spaces based on the human visual system (HVS)

- Application specific colour spaces

- CIE* colour spaces

The cbgci algorithm utilizes two of the colour spaces categorised as HVS colour spaces, namely RGB and HSI/HSV, as well as YCrCb, which is derived from the application specific YUV space.

2.1.1 RGB colour model

According to the trichromatic theory, human colour perception relies on the responses of three different photoreceptors in the retina, with sensitivity ranges correlating approximately with the red, green and blue light spectra [4, p. 396]. Accordingly, most capturing devices are constructed with red (R), green (G) and blue (B) light detectors, roughly corresponding to the retina cone-receptors for long (L), middle (M) and short (S) wavelengths. The light values recorded by the detectors are given by the sum over the respective sensitivity functions [3, p. 304]:

*The International Commission on Illumination, abbreviated after its French name Commission internationale de l'éclairage


$$R = \int_{300}^{830} S(\lambda)\,R(\lambda)\,d\lambda \qquad G = \int_{300}^{830} S(\lambda)\,G(\lambda)\,d\lambda \qquad B = \int_{300}^{830} S(\lambda)\,B(\lambda)\,d\lambda$$ (2.1)

where S(λ) is the light spectrum and R(λ), G(λ), B(λ) the respective sensitivity functions of the detectors within the visible light spectrum, with wavelengths in the range 300 nm ≤ λ ≤ 830 nm.
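As a small illustration of equation 2.1, once the spectra are available as sampled arrays the channel responses can be approximated by numerical integration. The sensitivity curves below are made-up placeholders, not measured camera data:

```python
import numpy as np

# Wavelength grid covering the range used in equation 2.1 (in nm).
lam = np.linspace(300, 830, 531)

# Placeholder spectra: a flat scene spectrum and Gaussian-shaped detector
# sensitivities centred roughly at blue, green and red wavelengths.
S = np.ones_like(lam)
sens = {
    "R": np.exp(-((lam - 600.0) / 50.0) ** 2),
    "G": np.exp(-((lam - 540.0) / 50.0) ** 2),
    "B": np.exp(-((lam - 450.0) / 50.0) ** 2),
}

# Discrete counterpart of R = ∫ S(λ) R(λ) dλ (and likewise for G and B).
response = {c: np.trapz(S * sens[c], lam) for c in ("R", "G", "B")}
print(response)
```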

A common representation of the RGB model is given with the aid of a Cartesian coordinate system, where each of the three axes corresponds with one of the RGB values, normalized to the range [0, 1]. The model spans a subspace in the form of a cube, in which the RGB primary colours are at the three axis-aligned corners (1, 0, 0), (0, 1, 0) and (0, 0, 1), the secondary colours cyan, magenta and yellow are at the corners (0, 1, 1), (1, 0, 1) and (1, 1, 0), and the grey scale, i.e. the points of equal RGB values, lies on the diagonal between black at the origin (0, 0, 0) and white at the corner farthest from the origin (1, 1, 1), as seen in figure 2.1 [4, p. 402]:

Figure 2.1 - The RGB colour space cube

Images encoded in RGB usually consist of three component images, one for each primary colour. The number of bits used for the representation of each pixel is called the pixel depth. Thus, a 24-bit RGB image consists of three component images, each with a pixel depth of 8 bits, resulting in a maximal number of (2^8)^3 = 16,777,216 ≈ 16.7 million encodable colours [4, p. 403].


Disadvantages of the RGB colour model are [3, p. 305]:

- Lack of standards for the sensitivity functions for the registration and reproduction (i.e., display or print of RGB images) stages. This results in practice in a device-dependent colour space.

- Perceptual non-uniformity, owing to the existence of metamers (colours of different spectra, and hence different RGB representation, but with the same perceptual value for a human viewer) on the one hand, and to the low correlation between the perceived difference of two colours and their Euclidean distance in RGB space on the other hand.

- Psychological non-intuitiveness, as RGB attributes are not the characteristics generally used by humans to distinguish one colour from another [4, p. 398]

2.1.2 HSI/HSV colour model

In order to meet the aforementioned psychological disadvantage of the RGB model, the HSI/HSV colour models were developed. The abbreviations stand for the different aspects of colours as perceived by a human observer [4, p. 398, 407] [3, p. 305]:

- Hue - the attribute associated with the dominant wavelength in a mixture of light waves, perceived by the observer as the dominant colour tone.

- Saturation - the purity, or level of non-whiteness, of the colour. Desaturating a pure colour, that is one of a single specific wavelength, may be done by adding light with power equally distributed over all wavelengths [5, p. 4]

- Intensity, or intensity Value - the brightness of the perceived light, which corresponds with the grey level of monochromatic images

Whereas the RGB model aims at imitating the sensory level of the HVS, HSI/HSV attempt to imitate its higher, perceptual level. Thus the former "is ideal for image colour generation (as in image capture by a colour camera or image display in a monitor screen)" [4, p. 407], whereas the latter is more suitable for tasks and algorithms based on natural, i.e. human, colour descriptors.

There are different approaches to the transform from the generative RGB representation into the perceptive HSV/HSI representation. The discrepancies lie mainly in the different definition of the intensity or intensity value aspect of the colour*. cbgci makes use of

*This manifests itself in the different names of the corresponding colour models, e.g. HSV for Hue, Saturation, Value, HSI for Hue, Saturation, Intensity and HLS for Hue, Lightness, Saturation.


the HSV colour model. The following formulae describe the transformation from the RGB to the HSV colour model as implemented within the OpenCV SDK* [6]:

$$V \leftarrow \max(R, G, B)$$ (2.2a)

$$S \leftarrow \begin{cases} \dfrac{V - \min(R, G, B)}{V} & \text{if } V \neq 0 \\ 0 & \text{otherwise} \end{cases}$$ (2.2b)

$$H \leftarrow \begin{cases} 60 \cdot (G - B) / (V - \min(R, G, B)) & \text{if } V = R \\ 120 + 60 \cdot (B - R) / (V - \min(R, G, B)) & \text{if } V = G \\ 240 + 60 \cdot (R - G) / (V - \min(R, G, B)) & \text{if } V = B \end{cases}$$ (2.2c)

with $H \leftarrow H + 360$ if $H < 0$, resulting in

$$0 \le V \le 1, \qquad 0 \le S \le 1, \qquad 0 \le H \le 360$$

Projected on a plane perpendicular to the neutral axis (i.e. the diagonal between the origin of coordinates and the point (1,1,1), corresponding to the grey scale axis between black and white), the corners of the RGB cube form a hexagon with red, yellow, green, cyan, blue and magenta at its corners. The hue value as calculated in equation 2.2c corresponds with the angle between the positive axis of this plane and the projection on this plane of a vector pointing from the origin in the direction of the RGB colour point. Figure 2.2 illustrates this relationship†.

Figure 2.2 - The hue plane as a projection of the RGB cube

*OpenCV was used for all the image manipulations within the project. †Image origin: http://commons.wikimedia.org/wiki/File:HSL-HSV_hue_and_chroma.svg


The resulting values for hue, saturation and intensity are then scaled to fit in the resolution of the chosen representation. When working with a pixel depth of 8 bits the full 360° resolution of hue values cannot be reproduced within the 256 available discrete values, in which case the hue is integer-divided, H ← H/2, which in turn introduces an inevitable loss of information.
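A direct per-pixel sketch of equations 2.2, including the 8-bit scaling just described, could look as follows; in practice OpenCV's own colour conversion performs this work:

```python
def rgb_to_hsv(r, g, b):
    """Convert normalized RGB in [0, 1] to (H, S, V) following eqs. 2.2."""
    v = max(r, g, b)
    delta = v - min(r, g, b)
    s = delta / v if v != 0 else 0.0
    if delta == 0:                    # hue is undefined for grey values
        h = 0.0
    elif v == r:
        h = 60.0 * (g - b) / delta
    elif v == g:
        h = 120.0 + 60.0 * (b - r) / delta
    else:                             # v == b
        h = 240.0 + 60.0 * (r - g) / delta
    if h < 0:
        h += 360.0
    return h, s, v

def hsv_to_8bit(h, s, v):
    """8-bit storage as described above: the hue is integer-divided (H <- H/2)."""
    return int(h) // 2, int(round(s * 255)), int(round(v * 255))
```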

The introduction of the HSI/HSV colour model does not relieve the device dependency inherent to the RGB model, as it consists of transformations of existing RGB values. In addition, operating with hue values is not straightforward due to the discontinuity around 360°. Finally, although the concept behind the model derives from human perception, there is little correlation between the representation within the model and the actual human notion of colour [3, p. 306].

2.1.3 YCrCb colour model

The introduction of colour TV broadcast dictated the need for an encoding scheme for the colour signals which would be backwards-compatible with the older receivers, which decode solely monochrome, i.e. luminance or grey-scale, images. The solution lies in encoding the chrominance (colour related) information separately, retaining a dedicated channel for pure luminance (light intensity) broadcast.

Extracting the luminance information of the image into the Y channel is done according to the specification of the used RGB model, wherein the contribution ratio of the individual colour components along the neutral axis of the RGB cube is defined. Two additional channels Cb and Cr* are then used to encode the chrominance information in terms of value differences. As the HVS sensitivity to luminance outranges its sensitivity to chrominance, it is possible to further compress the two chrominance carrying channels in favour of broadcast bandwidth without notable loss of quality [3, p. 307].

The following formulae describe the general form of the transformation from the RGB to the YCrCb colour model:

$$Y = K_R \cdot R + \overbrace{(1 - K_R - K_B)}^{K_G} \cdot G + K_B \cdot B$$ (2.3a)

$$C_R = \frac{1}{2} \cdot \frac{R - Y}{1 - K_R}$$ (2.3b)

$$C_B = \frac{1}{2} \cdot \frac{B - Y}{1 - K_B}$$ (2.3c)

where K_R = 0.299, K_B = 0.114 and K_G = 1 − K_R − K_B = 0.587 are the coefficients derived from the RGB space definition, which express the contribution of each of the colour ranges to the human perception of luminance.

*Standing for Chrominance Blue and Chrominance Red, respectively


Adding a padding δ = 128 in order to shift the C_R and C_B values into the [0..255] range, we obtain the transformation formulae for 8-bit RGB as implemented within the OpenCV SDK and used in the project:

$$Y = 0.299 \cdot R + 0.587 \cdot G + 0.114 \cdot B$$ (2.4a)

$$C_R = (R - Y) \cdot 0.713 + \delta$$ (2.4b)

$$C_B = (B - Y) \cdot 0.564 + \delta$$ (2.4c)

Rearranging equations 2.4 yields the back transformation from YCrCb to RGB:

$$R = Y + 1.403 \cdot (C_R - \delta)$$ (2.5a)

$$G = Y - 0.714 \cdot (C_R - \delta) - 0.344 \cdot (C_B - \delta)$$ (2.5b)

$$B = Y + 1.773 \cdot (C_B - \delta)$$ (2.5c)

As both transformations are linear, they may also be expressed in terms of matrix multiplications [5, p. 15]:

$$\begin{pmatrix} Y \\ C_R \\ C_B \end{pmatrix} = \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ 0.5 & -0.419 & -0.081 \\ -0.169 & -0.331 & 0.5 \end{pmatrix} \cdot \begin{pmatrix} R \\ G \\ B \end{pmatrix} + \begin{pmatrix} 0 \\ \delta \\ \delta \end{pmatrix}$$ (2.6a)

$$\begin{pmatrix} R \\ G \\ B \end{pmatrix} = \begin{pmatrix} 1 & 1.403 & 0 \\ 1 & -0.714 & -0.344 \\ 1 & 0 & 1.773 \end{pmatrix} \cdot \begin{pmatrix} Y \\ C_R - \delta \\ C_B - \delta \end{pmatrix}$$ (2.6b)
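As a quick check of equations 2.6, both transforms can be applied as plain matrix products. A small numpy sketch (the rounded coefficients make the round trip only approximate):

```python
import numpy as np

DELTA = 128.0  # padding for 8-bit images, as in equations 2.4

RGB_TO_YCRCB = np.array([[ 0.299,  0.587,  0.114],
                         [ 0.5,   -0.419, -0.081],
                         [-0.169, -0.331,  0.5  ]])

YCRCB_TO_RGB = np.array([[1.0,  1.403,  0.0  ],
                         [1.0, -0.714, -0.344],
                         [1.0,  0.0,    1.773]])

def rgb_to_ycrcb(rgb):
    """Forward transform (eq. 2.6a) for an (..., 3) float array of 8-bit RGB values."""
    ycrcb = rgb @ RGB_TO_YCRCB.T
    ycrcb[..., 1:] += DELTA
    return ycrcb

def ycrcb_to_rgb(ycrcb):
    """Backward transform (eq. 2.6b)."""
    shifted = ycrcb - np.array([0.0, DELTA, DELTA])
    return shifted @ YCRCB_TO_RGB.T

pixel = np.array([200.0, 120.0, 40.0])       # an arbitrary RGB value
print(ycrcb_to_rgb(rgb_to_ycrcb(pixel)))     # approximately [200, 120, 40]
```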

As is the case with the HSV/HSI models, the luminance information in the YCrCb model is kept separated and thus may also be manipulated independently of the chrominance components of the image*. Calculating the CbCr chrominance information does not involve the computational difficulties which arise from the angular representation of the hue channel. On the other hand it lacks the intuitive qualities of the hue-saturation representation.

2.2 White balance

Automatic white balancing (AWB) is the attempt to compensate for the loss and shift of colour gamut present on a (digitally) acquired image as compared with the human perception of the same scene: e.g. a white body may appear reddish when photographed

*Caveat: although conceptually similar, the luminance Y and the intensity V are obtained differently from the RGB data; compare equations 2.3a and 2.2a.


under low temperature, and bluish under high temperature light conditions [7]. In contrast to human colour constancy, i.e. the ability of the human visual system to adjust perceived colours independently of the prevailing lighting conditions, chromaticity and direction of scene illumination affect digitally captured images and cause hue shifts and colour casts. Bianco et al. state that different AWB solutions "require some information about the camera being used, and/or are based on assumptions about the statistical properties of the expected illuminants and surface reflectances" [8].

Aiming at a general, preferably hardware independent solution, cbgci relies only on the latter approach for its WB procedures. Meeting the different requirements and available computational capacities in its different modes of operation, cbgci utilizes two different WB algorithms: a solution based on colour adjustments in the RGB space described by Kim et al. in [7], and an approach combining steps proposed by Pilu and Pollard in [9] and by Kimmel, Elad et al. in [10], both derived from the Retinex model for human colour constancy. These two algorithms are described in the following sections.

2.2.1 Differential RGB white balancing

As mentioned above, AWB algorithms aim at removing perceived colour casts from images, originating from inadequate lighting conditions and/or an inherent colour bias of the capturing device. On the other hand, uniformly coloured objects that occupy large parts of an image may also be perceived as colour casts; compensating for such a shift would result in loss of colour integrity in the scene. Kim et al. proposed a method which decides the cause of a colour cast in an image in order to determine whether it should be compensated for or not. Their algorithm is a refinement of the Grey World AWB algorithm, which assumes a universal grey colour cast for every image and corrects the individual colour values accordingly [8].

The algorithm first integrates over the values of each channel in an RGB image representation. Given the integrated value I_X for channel X, an individual pixel response X_p may be expressed as:

$$X_p = I_X \cdot A_{X_p}$$ (2.7)

with A_{X_p} the appropriate channel gain for p. Any individual colour tone* in the RGB colour model may be further represented as a point on the plane defined by the colour difference values r−g and b−g (Figure 2.3).

*Comparable with the hue dimension in the HSV colour model


Figure 2.3 - The colour coordinate system [7, p. 431]

The underlying assumption is that the individually integrated colour values for the real-world scene are equal*, resulting in the target colour gains for the red and blue channels for the white colour point in the (r−g),(b−g) plane given by:

$$A_R = \frac{I_G}{I_R} \cdot A_G \qquad A_B = \frac{I_G}{I_B} \cdot A_G$$ (2.8)

By adjusting the red and the blue gain values A_R and A_B, the integrated colour values move towards the white point at the origin of the colour-differences plane, eliminating the cast.

Before such an adjustment may take place, though, it must be decided whether the colour cast is to be corrected at all. To this end a predefined AWB frame is imposed on the colour-differences plane. Only images whose colour casts lie within the frame's boundaries should be corrected; otherwise they are assumed to be dominated by a single primary colour and left untouched. A compensation has to be introduced for poorly illuminated images, in which colour casts of both causes (a dominating primary colour as well as acquisition distortion) might fall within the AWB frame, owing to the differential nature of the (r−g),(b−g) plane with small channel values. The algorithm uses a normalized colour difference D_N for this purpose:

$$D_N(X, G) = D(X, G) \cdot \frac{T_G}{I_G} \qquad \text{for } X \in \{R, B\}, \text{ and } D(X, G) = |X - G|$$ (2.9)

where T_G is a predefined target value for the green channel. Figure 2.4 shows the AWB frame and low luminance colour-difference points before and after normalization.

*This corresponds with the fundamental assumption of the grey-world model


Figure 2.4 - Colour difference points in low luminance (a) and after normalization (b), in relation to the AWB frame [7, p. 432]
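A minimal sketch of the grey-world style gain adjustment described above; the AWB-frame test is reduced to a single illustrative threshold, whereas the decision logic in [7] is more involved:

```python
import numpy as np

def differential_awb(img, frame_limit=40.0, target_green=128.0):
    """Grey-world style white balance with a crude AWB-frame test.

    img: float array of shape (H, W, 3) in RGB order, values in [0, 255].
    frame_limit: assumed half-width of the AWB frame in the (r-g, b-g) plane.
    """
    i_r, i_g, i_b = img.reshape(-1, 3).mean(axis=0)

    # Normalized colour differences (eq. 2.9) decide whether the cast is corrected.
    d_rg = abs(i_r - i_g) * target_green / i_g
    d_bg = abs(i_b - i_g) * target_green / i_g
    if d_rg > frame_limit or d_bg > frame_limit:
        return img                    # assume a dominating primary colour, leave untouched

    # Gains that pull the integrated values towards the white point (eq. 2.8).
    out = img.copy()
    out[..., 0] *= i_g / i_r
    out[..., 2] *= i_g / i_b
    return np.clip(out, 0, 255)
```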

2.2.2 Retinex derived white balancing

Proposed by Land in [11], the retinex theory offers a model aimed at explaining human colour constancy. Its basic assumption is that the recorded image intensity I at any given point is the product of the light reflectivity R, which is inherent to that part of the scene, with the illumination L at that point at the time of image acquisition:

$$I(x, y) = R(x, y) \cdot L(x, y)$$ (2.10)

Recovering the reflectivity, which is assumed to be a quality specific to the objects, shall theoretically restore the colours of the scene to their neutral state, without the distorting effects caused by illumination. Yet, as usually no direct information concerning the illumination conditions is at hand, further assumptions must be made.


Concentrating on grey scale intensity images, Pilu and Pollard assumed that the unknown illumination could be approximated by the low-frequency component of the image data, keeping in mind that most of the high frequency information in an image lies in surface boundaries which coincide with transitions between comparatively large areas of uniform colours, and that on the other hand illuminants are in the vast majority of cases uniform* [9, p. 6]. Low-passing the image is realised by applying Gaussian smoothing with a relatively large kernel. By reshaping equation 2.10, the assumed reflectivity of the scene R(x, y) may now be recovered as the ratio between the image intensity and the illumination L(x, y), approximated by $\hat{I}(x, y) = I_{smoothed}(x, y)$:

$$R(x, y) \approx I_{retinex}(x, y) = \frac{I(x, y)}{\hat{I}(x, y)}$$ (2.11)

The results of this manipulation are demonstrated in figure 2.5.

Figure 2.5 - Intensity curves of a grey scale image containing text and displaying unbalanced intensity (comparable with colour casts on colour images) (A), and the recovered reflectivity (B) [9, p. 7]

As mentioned above, this method was initially suggested for correcting grey scale images†. Theoretically the same procedure might be applied to the individual channels of an RGB image, thus recovering the scene's respective red, green and blue reflectivity, and resulting in a colour-corrected RGB image. As Kimmel et al. argue, such a procedure might in practice "cause colour artefacts that exaggerate colour shifts, or reduce colour saturation" [10, p. 14]. An alternative is to map the image into a different colour space which distinguishes the intensity from the chromatic information, such as the HSV or the YCrCb spaces, and then apply the retinex correction only to the intensity layer, transferring the image back to the RGB space should the image be displayed or if subsequent processing is required.

A further refinement may be introduced to the results of the simple retinex correction themselves. Kimmel et al. observe that the reflectance images returned by the retinex correction procedure are sometimes over-enhanced. Its basic motivation being an attempt to imitate the human ability to "see through" unbalanced illumination and to

*Although they might not illuminate the scene uniformly. †Pilu and Pollard suggested it in the context of image processing for text recognition.


recover the "true" colours of a scene, in reproducing a "pure" representation of the re-�ectance the model overlooks the fact that human sight does perceive shaded areas �i.e. areas of reduced illumination � in a scene as such. Moreover, removing the illu-mination entirely exposes noise that might lie in darker, and hence dynamically weakerregions of the original image. Adding a corrected version of the original illumination tothe re�ectance information might compensate for these drawbacks.Having the inferred illumination value L at hand* a gamma correction step is performed:

$$L'(x, y) = W \cdot \left(\frac{L(x, y)}{W}\right)^{1/\gamma}$$ (2.12)

where W is the white value, equal to 255 for 8-bit images, and γ a free parameter to be set empirically. Finally the reconstructed intensity values are given by:

$$I'(x, y) = L'(x, y) \cdot R(x, y) \approx L'(x, y) \cdot I_{retinex}(x, y) = \frac{L'(x, y)}{\hat{I}(x, y)} \cdot I(x, y) = W \cdot \frac{\left(\hat{I}(x, y)/W\right)^{1/\gamma}}{\hat{I}(x, y)} \cdot I(x, y)$$

$$I'(x, y) = \frac{I(x, y)}{\left(\hat{I}(x, y)/W\right)^{1 - 1/\gamma}}$$ (2.13)

For γ = ∞ the effect of zero illumination is produced, resulting in the pure reflectance image scaled to the interval [0, W], or $I' = W \cdot I/\hat{I} = W \cdot R$, whereas for γ = 1 the entire illumination is added back, rendering the whole procedure void with I' = I.
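A compact sketch of this retinex correction with partially restored illumination, using a Gaussian blur as the low-pass estimate of the illumination; scipy's gaussian_filter merely stands in here for whatever smoothing kernel an implementation actually uses:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_correct(intensity, sigma=30.0, gamma=2.0, white=255.0):
    """Apply equation 2.13 to a single-channel intensity image.

    intensity: float array in [0, white] (e.g. the V or Y channel).
    sigma: width of the Gaussian low-pass approximating the illumination.
    gamma: 1.0 leaves the image untouched; larger values remove more of the
           illumination; gamma -> infinity yields the pure reflectance.
    """
    illum = gaussian_filter(intensity, sigma)    # \hat{I}(x, y)
    illum = np.maximum(illum, 1e-6)              # avoid division by zero
    corrected = intensity / (illum / white) ** (1.0 - 1.0 / gamma)
    return np.clip(corrected, 0, white)
```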

2.3 Histogram back projection

Histograms are means of statistical representation of data: the value range of the observations is divided into consecutive, disjoint intervals or bins, each of which in turn holds the count of the data points whose values fall within the interval assigned to the bin. Thus the total sum of data points in all bins equals the total number of data points in the observation. Consequently, when normalized with the total number of data points, a bin's count stands for the relative probability for the class of data points represented by that bin to be present in the observation.

*As computed during the retinex correction: Kimmel et al. propose a different method for the extraction of the reflectance, while here the value will be the same as used in equation 2.11, i.e. L = Î.


When applied to digital images, histograms can capture and represent various characteristics of the data, either straightforward, such as the intensity of single channel pixels or the colour value(s) of multiple channel pixels, or derived, such as gradient attributes in contour maps [12, p. 193].

Using a histogram's nature as a representation of the distribution of values to evaluate a new observation is called back projection: a new data point is tagged with the probability of the occurrence of its value in the original set, i.e. with the normalized count of the bin in whose value range it falls. Consider a histogram H over the value range [a, b] with bins {b_1, ..., b_i} and values a = v_0 < v_1 < ... < v_i = b, so that for an initializing data set X:

$$b_j := \left|\left\{x \in X \mid v_{j-1} \le x < v_j\right\}\right|, \qquad 1 \le j \le i$$

Back-projecting a new data set Y using H yields for every data point y ∈ Y:

$$BP_H(y) = b_j \quad \text{with } j \text{ such that } v_{j-1} \le y < v_j$$ (2.14)

Histogram back projection is used by cbgci for the detection of skin coloured pixels, using a pre-set histogram of hue values.
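In practice this amounts to a per-pixel lookup in a normalized hue histogram. A minimal numpy sketch, with an assumed (made-up) skin-hue histogram:

```python
import numpy as np

def back_project(hue_img, hist, n_bins=180):
    """Tag each pixel with the probability of its hue under `hist`.

    hue_img: uint8 array of OpenCV-style hue values in [0, 179].
    hist: 1-D array of length n_bins, normalized so that it sums to 1.
    """
    bin_idx = (hue_img.astype(np.int32) * n_bins) // 180
    return hist[bin_idx]               # probability map, same shape as hue_img

# Example: a made-up skin-colour histogram concentrated at low hue values.
skin_hist = np.zeros(180)
skin_hist[0:25] = 1.0
skin_hist /= skin_hist.sum()
# probability_map = back_project(hue_channel, skin_hist)
```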

2.4 Foreground - background segmentation*

Motion manifests itself, self-evidently, in the foreground of a visible scene, thereby defining the scene's background as a matter of fact. Establishing a model for the background (BG) of an observed scene, i.e. the elements in the scene that are assumed to remain stationary, plays a crucial role in many systems which involve motion detection in image sequences. Each image is checked against the BG model to determine variations, which may consequently be classified as foreground (FG). These FG regions, standing for a good preliminary hypothesis for moving objects, may then be segmented out and passed on for analysis in further steps, according to the system's objectives [14, p. 1][15, p. 113-114][13, p. 1].

cbgci utilises the FG-BG segmentation functionality implemented within the OpenCV video surveillance utility, which follows the Adaptive Background Mixture Model algorithm (ABMMA) suggested by KaewTraKulPong and Bowden [16, 13]. The ABMMA algorithm models each background pixel with a weighted mixture of K, 3 ≤ K ≤ 5, Gaussian distributions. Thus the probability that a certain pixel displaying the colour x_N at time N belongs to the BG is given by:

$$p(x_N) = \sum_{j=1}^{K} W_j \, \mathcal{N}(x_N; \Theta_j)$$ (2.15)

*Outline and notation in this section follow [13]


where W_j is the weighting factor of the j-th Gaussian, represented through

$$\mathcal{N}(x; \Theta_j) = \mathcal{N}(x; \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_j|^{1/2}}\; e^{-\frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)}$$

with its mean at μ_j and the covariance Σ_j = σ_j² I.

Each one of the distributions stands for a different colour, while its weight represents the time proportion this colour is present in the image sequence. The basic assumption is that BG colours tend to display a longer and more static presence in the scene in comparison with FG colours; therefore the BG is considered to be comprised of the B, B ≤ K, most probable colours. The fitness value W_j/σ_j is used to sort the distributions, so that B is given by

$$B = \arg\min_b \left( \sum_{j=1}^{b} W_j > T \right)$$

for a threshold T.

Pixels in new incoming frames are checked against the prevailing model and marked as FG if their colour lies more than 2.5 standard deviations away from every one of the B distributions.

As the algorithm is designed to run without explicit prior knowledge about the scene and to be able to adapt in real time to changes in scene illumination and background composition, a selective update scheme for the BG model is implemented. Pixels are marked as BG in case they fit in one of the B Gaussians, and are then used to update this distribution. Otherwise a new Gaussian component is initialized with the new value as its mean and a large covariance, and added into the mixture with a low initial weighting parameter, as a seed for a possible new BG colour distribution.

The update scheme follows the Estimation-Maximisation paradigm, as described in equations 2.16(a-c) for an initial window of L frames, and in equations 2.17(a-c) for all following frames. Here ω_j stands for the j-th Gaussian, and

$$p(\omega_j \mid x_{N+1}) = \begin{cases} 1 & \text{if } \omega_j \text{ is the first Gaussian into which } x_{N+1} \text{ fits} \\ 0 & \text{otherwise} \end{cases}$$


$$W_j^{N+1} = W_j^N + \frac{1}{N+1}\left(p(\omega_j \mid x_{N+1}) - W_j^N\right)$$ (2.16a)

$$\mu_j^{N+1} = \mu_j^N + \frac{p(\omega_j \mid x_{N+1})}{\sum_{i=1}^{N+1} p(\omega_j \mid x_i)}\left(x_{N+1} - \mu_j^N\right)$$ (2.16b)

$$\Sigma_j^{N+1} = \Sigma_j^N + \frac{p(\omega_j \mid x_{N+1})}{\sum_{i=1}^{N+1} p(\omega_j \mid x_i)}\left(\left(x_{N+1} - \mu_j^N\right)\left(x_{N+1} - \mu_j^N\right)^T - \Sigma_j^N\right)$$ (2.16c)

$$W_j^{N+1} = W_j^N + \frac{1}{L}\left(p(\omega_j \mid x_{N+1}) - W_j^N\right)$$ (2.17a)

$$\mu_j^{N+1} = \mu_j^N + \frac{1}{L}\left(\frac{p(\omega_j \mid x_{N+1})\, x_{N+1}}{W_j^{N+1}} - \mu_j^N\right)$$ (2.17b)

$$\Sigma_j^{N+1} = \Sigma_j^N + \frac{1}{L}\left(\frac{p(\omega_j \mid x_{N+1})\left(x_{N+1} - \mu_j^N\right)\left(x_{N+1} - \mu_j^N\right)^T}{W_j^{N+1}} - \Sigma_j^N\right)$$ (2.17c)

The authors argue that the fast converging update scheme for the first L frames improves the stability of the system and its accuracy in the initial stage; on the other hand, switching to an L-recent window update for later incoming frames prioritises recent data and allows the system to adapt quickly to changes in the observed scene.
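To illustrate the L-recent window update, the following toy sketch maintains such a mixture for a single greyscale pixel. It is deliberately simplified compared with the OpenCV implementation (no sorting by fitness, no pruning), and it uses the common approximation of a per-component learning rate α/W_j for the mean and variance updates rather than the exact form of equations 2.17b and 2.17c:

```python
import numpy as np

class PixelMixture:
    """Toy per-pixel background mixture (greyscale), in the spirit of eqs. 2.17."""
    def __init__(self, k=3, window=100, init_var=900.0, mahal_thresh=2.5):
        self.w = np.full(k, 1.0 / k)           # component weights W_j
        self.mu = np.linspace(0.0, 255.0, k)   # component means
        self.var = np.full(k, init_var)        # component variances sigma_j^2
        self.alpha = 1.0 / window              # 1/L (L-recent window)
        self.mahal = mahal_thresh              # FG test: ~2.5 standard deviations

    def update(self, x):
        """Feed one pixel value; return True if it is classified as foreground."""
        dist = np.abs(x - self.mu) / np.sqrt(self.var)
        matches = dist < self.mahal
        is_fg = not matches.any()
        if is_fg:
            # Seed a new component: wide variance, low weight (see text above).
            j = int(np.argmin(self.w))
            self.mu[j], self.var[j], self.w[j] = x, 900.0, self.alpha
        else:
            j = int(np.argmax(matches))        # first matching component
            p = np.zeros_like(self.w)
            p[j] = 1.0
            self.w += self.alpha * (p - self.w)              # eq. 2.17a
            rho = self.alpha / self.w[j]                     # approximate learning rate
            self.mu[j] += rho * (x - self.mu[j])
            self.var[j] += rho * ((x - self.mu[j]) ** 2 - self.var[j])
        self.w /= self.w.sum()
        return is_fg
```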

2.5 Fisher's linear discriminant

When dealing with observations which are known to belong to two* classes, "special interest attaches to certain linear functions of the measurements by which the populations are best discriminated" [17, p. 179]. For two classes† $X_1 := \{\vec{x}^{\,1}_1, \ldots, \vec{x}^{\,1}_{l_1}\}$ and $X_2 := \{\vec{x}^{\,2}_1, \ldots, \vec{x}^{\,2}_{l_2}\}$ the following characteristics are defined:

$$\vec{m}_{i \in \{1,2\}} := \frac{1}{l_i} \sum_{j=1}^{l_i} \vec{x}^{\,i}_j \qquad \text{(class means)}$$

$$S_B := (\vec{m}_1 - \vec{m}_2)(\vec{m}_1 - \vec{m}_2)^T \qquad \text{(between-class scatter matrix)}$$

$$S_W := \sum_{i=1,2} \sum_{\vec{x} \in X_i} (\vec{x} - \vec{m}_i)(\vec{x} - \vec{m}_i)^T \qquad \text{(within-class scatter matrix)}$$ (2.18)

*Fisher's discriminant is applicable also in multi-class settings. This not being the case with the method's application within cbgci, the elaboration stays confined to the two-class scenario.

†Notation following [18]


Fisher's linear discriminant $\vec{\omega}$ is the vector which maximizes the ratio P(w) between the projected between-class and within-class scatter matrices [18, p. 2]:

$$P(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad \vec{\omega} = \arg\max_{\vec{w}} \left( \frac{w^T S_B w}{w^T S_W w} \right)$$ (2.19)

In other words, the Fisher discriminant is the hyperplane which produces the strongest class separation of the projections of the feature vectors onto it, as these projections display both maximal between-class scatter and minimal within-class scatter. Solving $\partial P / \partial w = 0$ for $\vec{\omega}$ yields the discriminant vector:

$$\vec{\omega} = S_W^{-1} \cdot (\vec{m}_1 - \vec{m}_2)$$ (2.20)

Having the discriminant at hand, it is still necessary to set the threshold c which separates the one-dimensional distribution of the projections of the class vectors $\vec{x}$ onto $\vec{\omega}$, so that a discrimination

$$\vec{x} \in \begin{cases} X_1 & \text{if } \vec{\omega} \cdot \vec{x} < c \\ X_2 & \text{otherwise} \end{cases}$$ (2.21)

could be made. A good hypothesis may be the point on $\vec{\omega}$ equidistant from the projections of both class means, in which case c is given by:

$$c = \vec{\omega} \cdot \frac{1}{2}(\vec{m}_1 + \vec{m}_2)$$ (2.22)

Fisher's discriminant is used in the cbgci algorithm for colour-based pixel classification.
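A compact two-class sketch of equations 2.18-2.22 in numpy; the two training sets are assumed to be arrays of feature vectors, e.g. per-pixel colour values:

```python
import numpy as np

def fit_fisher(x1, x2):
    """Return (w, c) for two classes given as (n_i, d) arrays of feature vectors."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    # Within-class scatter (eq. 2.18); a tiny ridge keeps the inverse stable.
    sw = (x1 - m1).T @ (x1 - m1) + (x2 - m2).T @ (x2 - m2)
    sw += 1e-6 * np.eye(sw.shape[0])
    w = np.linalg.solve(sw, m1 - m2)       # eq. 2.20
    c = w @ (m1 + m2) / 2.0                # eq. 2.22
    return w, c

def classify(x, w, c):
    """Threshold test (eq. 2.21); with w = Sw^-1 (m1 - m2), class 1 projects above c."""
    return x @ w > c                       # True -> class 1, False -> class 2
```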

2.6 The Kalman filter*

Estimating the state of a process relying only on measurements which are available at the current time is subject to distortion due to channel noise, measurement errors and dropouts. Introduced by Rudolf E. Kalman in [20], the Kalman filter is a method which, under certain assumptions regarding the nature of the observed process and of the measurements, "provides an efficient computational (recursive) means to estimate the state of a process, in a way that minimizes the mean of the squared error" [19, p. 1]. The prerequisites that must be met for the Kalman filter to be applicable are:

- The modelled process is linear

- The noise or error associated with the measurement is uncorrelated (white)

*Outline and notation in this section follow [19] and [12, p. 350-358]


- The noise or error associated with the measurement has a normal (Gaussian) distribution

The state vector x_k ∈ R^n at time t = k of a time discrete process with n dimensions that meets these prerequisites may therefore be described with:

$$x_k = A x_{k-1} + B u_{k-1} + w_k \qquad \text{with } p(w_k) = \mathcal{N}(0, Q_k)$$ (2.23)

where x_{k-1} is the previous state of the process, A an n×n transfer matrix describing the linear transformation the state undergoes during the time step, u_{k-1} ∈ R^l, l ≤ n, a vector of external control values applied to the state's features at time k−1 via the n×l control matrix B, and w_k the Gaussian process uncertainty, or noise, defined through the n×n covariance matrix Q_k.

A measurement carried out at time t = k may be expressed with:

$$z_k = H x_k + v_k \qquad \text{with } p(v_k) = \mathcal{N}(0, R_k)$$ (2.24)

where H is an m×n, m ≤ n, matrix expressing those features of the state of the process that are measured in effect, and v_k the Gaussian measurement uncertainty, or noise, defined through the m×m covariance matrix R_k.

In practice a difference is drawn between the a priori estimate of the process's state (the most accurate estimation in the light of the knowledge available up to time point k), marked with $\hat{x}_k^-$, and the a posteriori estimate which incorporates the measurement z_k into the prediction, marked with $\hat{x}_k$. Accordingly, two error covariances are defined:

$$P_k^- = E\!\left[e_k^- {e_k^-}^T\right] \quad \text{for the a priori error } e_k^- = x_k - \hat{x}_k^-$$

$$P_k = E\!\left[e_k e_k^T\right] \quad \text{for the a posteriori error } e_k = x_k - \hat{x}_k$$ (2.25)

Thus the a posteriori state estimate may be expressed as a linear combination of the a priori estimate and a weighted difference of the actual and the predicted measurement:

$$\hat{x}_k = \hat{x}_k^- + K_k\left(z_k - H \hat{x}_k^-\right)$$ (2.26)

The n×m blending factor matrix K_k is set so as to minimize the a posteriori error covariance. This prerequisite is ensured through

$$K_k = P_k^- H^T \left(H P_k^- H^T + R\right)^{-1} = \frac{P_k^- H^T}{H P_k^- H^T + R}$$ (2.27)

and the a posteriori covariance matrix itself is eventually expressed with:


$$P_k = (I - K_k H)\, P_k^-$$ (2.28)

Equipped with a definition of the transition matrix A, the control matrix B, the matrix H which defines the linear relation between an internal state and its measurable manifestation, and a notion about the process and measurement covariance matrices Q and R, the Kalman filter, expressed through equations 2.23, 2.26, 2.27 and 2.28, delivers optimal estimates of the future behaviour of a linearly characterized time-discrete process, and the possibility to recursively optimize the estimation with available measured data.

Details concerning the application of the Kalman filter within the cbgci algorithm are given in chapter 4. The CvKalman structure, together with the associated functions available from the OpenCV API, was used for the integration of the Kalman filter functionality in the system.
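For reference, the predict/update cycle defined by equations 2.23 and 2.26-2.28 fits in a few lines of numpy. This is a generic sketch, not the CvKalman-based code used in the system:

```python
import numpy as np

class Kalman:
    def __init__(self, A, H, Q, R, x0, P0, B=None):
        self.A, self.H, self.Q, self.R, self.B = A, H, Q, R, B
        self.x, self.P = x0, P0

    def predict(self, u=None):
        """A priori estimate (process model of eq. 2.23 without the noise term)."""
        control = self.B @ u if (self.B is not None and u is not None) else 0
        self.x = self.A @ self.x + control
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x

    def update(self, z):
        """Blend in the measurement z (eqs. 2.26-2.28)."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                # eq. 2.27
        self.x = self.x + K @ (z - self.H @ self.x)             # eq. 2.26
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P    # eq. 2.28
        return self.x

# Example: constant-velocity model for tracking a 1-D position.
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
kf = Kalman(A, H, Q=0.01 * np.eye(2), R=np.array([[4.0]]),
            x0=np.zeros(2), P0=np.eye(2))
kf.predict()
kf.update(np.array([1.2]))
```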

2.7 Contours

A contour is "an outline representing or bounding the shape or form of something" [21]. The ability to make out contours in images may therefore play a key role in the process of discerning and then tracking certain objects of interest in streaming video. The following subsections describe the methods used by cbgci for the extraction and further manipulation of contours.

2.7.1 Contour extraction

The visual manifestations of contours are at sharp transitions, i.e. edges between otherwise homogeneous areas in an image. Yet an intensity transition alone is not sufficient for determining a contour; it is necessary to assemble contiguous edge pixels into curves in the image [12, p. 222].

The function FindContours() from the OpenCV API, which was used for this very task within cbgci, operates on binary images*. Thus edge pixels are given as a matter of fact at sites of binary transitions. Iterating from pixel to pixel, the function assembles them into contours according to their connectedness. At the same time it keeps track of the contours' hierarchy: a contour divides the plane into the area outside its perimeter (its background) and the area, or the shape, it encloses. The latter portion itself might also display an intensity transition, which may be interpreted as a "hole" in the shape, given the binary nature of the image. The boundary of this hole is the interior contour of the shape defined by the first, exterior contour. The same relationship could recur within the hole, giving rise to a line of contours, all nested within the first. Such hierarchies

*An image wherein a pixel may take values from a binary set, e.g. {0, 1}, or {0, 255} for 8-bit images.


are retrieved* by FindContours() in a manner that reflects their inner relationship and enables the caller to traverse them accordingly. Figure 2.6 illustrates the functionality of FindContours() [12, p. 235].

Figure 2.6 - The boundaries of the bright areas A, B, C and D in the upper image will be marked as contours (c) or holes (h), and tagged according to their respective parent contour, as shown below
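With the modern Python binding the same functionality is exposed as cv2.findContours. A small usage sketch, assuming the OpenCV 4.x return convention:

```python
import cv2
import numpy as np

# A binary test image: a bright rectangle containing a dark "hole".
binary = np.zeros((200, 200), np.uint8)
cv2.rectangle(binary, (40, 40), (160, 160), 255, -1)
cv2.rectangle(binary, (80, 80), (120, 120), 0, -1)

# RETR_CCOMP organises the result into exterior contours and their holes.
contours, hierarchy = cv2.findContours(binary, cv2.RETR_CCOMP,
                                        cv2.CHAIN_APPROX_SIMPLE)
for i, cnt in enumerate(contours):
    parent = hierarchy[0][i][3]          # -1 for exterior contours
    kind = "hole" if parent != -1 else "contour"
    print(i, kind, "area:", cv2.contourArea(cnt))
```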

2.7.2 Polygon approximation

For an efficient description of a shape, or a part of it, it may suffice to operate on an approximation of the shape's outline rather than on its detailed depiction. This can be achieved by omitting some of the points comprising the shape's contour. The criterion governing the decision which of the contour vertices should be kept and which may be dropped has to be adjusted according to the situation.

Douglas and Peucker suggested in 1973 an algorithm for reducing the number of points on a curve while maintaining an approximation within a pre-set tolerance [22]. For a curve c := {p_1, ..., p_n} the algorithm begins by searching for a point p_i displaying the largest distance from the segment [p_1, p_n] between the first and the last point, and proceeds recursively with the two remaining point sets {p_2, ..., p_{i-1}}, {p_{i+1}, ..., p_{n-1}} and

*If so desired by the user; the level of nested contours to be extracted from an image is determined at the function call.


the segments [p_1, p_i], [p_i, p_n] respectively. The algorithm terminates when for a segment [p_j, p_k] no point on the enclosed curve can be found whose distance from the segment is greater than a certain ε ≥ 0. The algorithm is given in pseudocode 2.1, and its mode of operation on an open curve is illustrated in figure 2.7.

Pseudocode 2.1 The Douglas-Peucker curve approximation algorithm

Algorithm: DP-Alg(contour := {p_1, ..., p_n}, ε ≥ 0)
1. dmax ← 0, index ← 0
2. for i = 2 ... (n − 1) do
3.   dcurr ← OrthDistance(p_i, [p_1, p_n])
4.   if dcurr ≥ ε and dcurr > dmax then
5.     dmax ← dcurr
6.     index ← i
7.   end if
8. end for
9. if index = 0 then
10.   return {p_1, p_n}
11. else
12.   return {DP-Alg({p_1, ..., p_index}, ε), DP-Alg({p_index, ..., p_n}, ε)}
13. end if

Being of the divide and conquer class of algorithms, the running time of the Douglas and Peucker approximation algorithm when operating on a curve c := {p_1, ..., p_n} can be expressed as

$$T(n) = 2\,T\!\left(\frac{n}{2}\right) + f(n)$$

The function f, to be executed in each recursive call, consists mainly of traversing all the input points, therefore $f(n) \in \Theta(n) \equiv \Theta(n^{\log_2 2})$. Consequently, by applying the Master theorem we get for the overall running time of the algorithm:

$$T(n) \in \Theta(n \log n)$$ (2.29)

The OpenCV function cvApproxPoly() implements the algorithm proposed by Douglas and Peucker in 1973 [12, p. 245], and is used by cbgci. When operating on closed curves the function initializes by searching for the pair of points on the contour which are farthest from one another, continuing along the guidelines of the original algorithm.
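In the Python binding the routine is exposed as cv2.approxPolyDP; a short usage sketch with an arbitrarily chosen tolerance:

```python
import cv2
import numpy as np

# A densely sampled circle standing in for an extracted contour.
t = np.linspace(0, 2 * np.pi, 400, endpoint=False)
circle = np.stack([100 + 80 * np.cos(t), 100 + 80 * np.sin(t)], axis=1)
contour = circle.astype(np.float32).reshape(-1, 1, 2)

# The tolerance epsilon is chosen relative to the contour's perimeter.
eps = 0.01 * cv2.arcLength(contour, closed=True)
approx = cv2.approxPolyDP(contour, eps, closed=True)
print(len(contour), "->", len(approx), "vertices")
```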

2.7.3 Convex hull

Being a set of points in a two-dimensional space, a contour admits a convex hull.


Figure 2.7 - Douglas-Peucker algorithm operating on an open curve [23]

Following the general definition, for a contour comprised of the points {p_1, ..., p_n} the convex hull S satisfies

$$S := \{p_1, \ldots, p_k\} \supseteq \{p_1, \ldots, p_n\} \quad \text{and} \quad \sum_{i=1}^{k} \alpha_i p_i \in S \ \text{ for any } \{\alpha_1, \ldots, \alpha_k\},\ \alpha_i \ge 0, \text{ such that } \sum_{i=1}^{k} \alpha_i = 1$$

The OpenCV function cvConvexHull2() returns the convex hull of an input contour using a version of the scanning algorithm proposed by R. L. Graham in 1972 [24]. The algorithm is outlined in pseudocode 2.2.

The contour is input as an array sorted with ascending x-coordinates of its points. The algorithm cycles through the array with an offset, commencing with the point with the smallest y-value and the two consecutive points. During each iteration the subroutine 3pOrient is called to check whether the segment (p2, p3), between the last point which was introduced to the hull and the next candidate point, displays a smaller angle with the x-axis compared with the segment (p1, p2) between the point before last and the last point on the hull (a "right turn"), or whether the points are collinear. In both cases the last hull point is dropped in exchange for the candidate point (see line 10, with special-case treatment when reaching the root indices in lines 7-8). In the case of a "left turn" the candidate point is appended to the hull (lines 13-14). The next iteration is performed with the next candidate point on the contour. On termination the algorithm returns the M points which comprise the convex hull of the contour.


Pseudocode 2.2 The Graham algorithm for convex hull construction

Subroutine: 3pOrient(p1, p2, p3)
    return (p3.y − p2.y)·(p2.x − p1.x) − (p2.y − p1.y)·(p3.x − p2.x)

Algorithm: ConvexHull[p1, . . . , pn]
Require: [p1, . . . , pn] array sorted with ascending pi.x
 1. j ← arg minᵢ(pi.y)
 2. Hull[i] ← p((j+i) mod n)   with i = 1 . . . n
 3. M ← 2
 4. for i = 3 . . . n do
 5.     while 3pOrient(Hull[M − 1], Hull[M], Hull[i]) ≤ 0 do
 6.         if M = 2 then
 7.             Hull[M] ← Hull[i]
 8.             i++
 9.         else
10.             M ← M − 1
11.         end if
12.     end while
13.     M++
14.     Hull[M] ← Hull[i]
15. end for
16. return Hull, M

On termination the algorithm returns the M points which comprise the convex hull of the contour. Figure 2.8 illustrates the mode of operation of the Graham scan algorithm.

Figure 2.8 – Graham scan: starting at P, the algorithm appends points A and B to the hull, and drops C when it implies a right turn when checking for D [25]

Sorting the points is generally performed within O(n log n); and although it may appear that the algorithm implicitly exhibits two nested 1 . . . n loops (hinted at by the reverse tracing in the case of "right turns"), in practice each point can be visited at most twice: it is then either retained or dropped. Hence we get for the overall running time of Graham's algorithm on a contour with n points:

T(n) ∈ O(n log n)    (2.30)


After establishing the convex hull, the areas in which it deviates from the contour, in other words the convexity "defects" of the contour, can be determined and described. The OpenCV library offers the function cvConvexityDefects() for the accomplishment of this task. For each such deviation the function documents its starting and its ending points, as well as the point within it at which the contour's pathway is farthest from the hull, together with this point's distance [12, p. 260]. This is accomplished in a running time within O(n) for a contour with n points: each point has to be checked at most against the two hull points between which the contour deflection containing it lies; thus the function has to cycle once through all the points of the contour.
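As a rough illustration, the following sketch obtains a contour's hull and its convexity defects through the newer OpenCV Python interface; the synthetic binary mask, the variable names and the depth scaling by 256 follow the usual OpenCV conventions and are not the exact cbgci code.

import cv2
import numpy as np

# Synthetic binary mask with a single blob standing in for a segmented hand.
mask = np.zeros((200, 200), dtype=np.uint8)
cv2.fillPoly(mask, [np.array([[30, 170], [30, 60], [60, 20], [90, 60],
                              [120, 20], [150, 60], [150, 170]], dtype=np.int32)], 255)

# Extract the outer contour of the blob (OpenCV >= 4 return signature assumed).
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contour = max(contours, key=cv2.contourArea)

# Convex hull as point indices, as required by cv2.convexityDefects().
hull_idx = cv2.convexHull(contour, returnPoints=False)
defects = cv2.convexityDefects(contour, hull_idx)

if defects is not None:
    for start_i, end_i, far_i, depth_fp in defects[:, 0]:
        # Depth is returned as an 8.8 fixed-point value; convert to pixels.
        print("defect depth in pixels:", depth_fp / 256.0)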


Chapter 3

Related work

Computers and computer-controlled systems enjoy an ever-rising penetration into everyday life. The search for new forms of human interaction with computers derives from the rationale of keeping hardware costs low and system assembly robust, from the growing variety of fields and scenarios in which these systems operate, and from the wish to simplify this interaction and make it intuitive.

When operating traditional UI equipment such as alphanumeric keyboards, pointing devices or game controllers, the user accommodates herself to a certain extent to the technical limitations of the interface: commands are communicated in manners which are unique to the interaction with the specific hardware, and are alien to natural human interaction. Exchanging such equipment in favour of perceptive interfaces may be rewarded with granting computers such "abilities which so far have been a prerogative of human-human communication" [26, p. 28].

In this sense Marco Porta defines in [26] a vision-based computer interface as a system that, via an integrated camera, seeks to

"autonomously acquire information about the user, in order to interprethis/her explicit or implicit "natural" body-related (i.e. based on body pos-tures or movements) commands" [there, p. 30].

Hand gestures make up a significant portion of the various visible manifestations of natural human communication. Hassanpour et al. describe them as a "means to express or emphasize an idea or convey a manipulative command to control an action" [27, p. 2]. Hence interfacing with computer systems using hand gestures may benefit both from their being used otherwise for similar purposes, and from the possibility to adopt the repertoire of ideas and commands commonly associated with certain gestures.

A hand gesture may be described as a temporal, spatial and conceptual phenomenon. In order to extract its "payload", i.e. the idea or the command it conveys, it should be registered by the perceptive interface in all its dimensions. Zabulis et al. discern three corresponding layers which comprise hand-interactive systems [28, p. 2]:


• The detection layer, operating in the spatial dimension and extracting features in the image sequence that may be attributed to a hand or a finger

• The tracking layer, operating in the temporal dimension and associating features in successive frames to possible integral movements

• The recognition layer, which makes up the conceptual dimension by combining the spatiotemporal data into known, interpretable gestures

How these layers and their interconnections are realised depends both on the specification of the application which the gesture-interface system is designed for and on the hardware at its disposal, and these two elements themselves are often strongly related. Three such systems are described in the following sections, together with the solutions implemented in them to meet the different usage scenarios and available hardware.

3.1 Gesture recognition using Kinect

Kinect is a motion sensing input device introduced in 2010 by Microsoft as an interface for its Xbox 360 video game console. Kinect allows players to control the game using body motions and vocal commands. To accomplish its movement and gesture recognition capabilities Kinect is equipped with an RGB camera and an infrared depth sensor. While relying on ambient light for its RGB camera, an infrared laser device is integrated into Kinect, which is used to project a specially designed light pattern: the deformations in this pattern, as recorded by the infrared sensor, are decoded to reveal the depth structure of the scene [29, 30]. The data streams from the Kinect device include, beside the raw RGB and depth data from both video sensors*, also skeletal models which are extracted from the images on board the device [31]. However, it is reported that such skeletal model tracking is available only when the entire body of the user can be captured by Kinect's sensors [32]. When in use as a game controller for the Xbox 360 console, additional computations are performed on board the console to accomplish gesture recognition. Matthew Tang presented a system which uses a Kinect as an input device for a common PC and is able to identify "grasp" and "drop" gestures [33]. Tang distinguished three stages within the algorithm of the system:

• Segmentation of hand pixels from the input image

• Extraction of features from the identified pixels in order to classify the hand posture

• Recognition of the specific gesture

*Kinect also provides an audio data stream from its four-element microphone array, which is used for the voice control functionality


The identification and segmentation of hand pixels is accomplished using both the RGB and the depth data available. In a preparatory stage the RGB image is colour-balanced. There follows the evaluation of pixels, using a trained Bayesian network which expresses the probability dependencies between RGB and depth values and the identification of the pixel as hand or non-hand. It is important to note that these operations are performed in a small region of interest within the entire image, in which the skeletal model marks the user's forearm; moreover, the Bayesian network expresses a restriction on the depth region in which the system expects to detect a hand in the first place: nearer or farther occurrences will not be detected. The segmentation stage is completed with the application of a morphological closure on the resulting probability image, following the intuition about the spatial contiguousness of human skin.

There follows the extraction of a set of features from the labelled pixels in order to be able to determine in which posture the inferred hand appears in the image. An a-priori assumption about the position of the user of the system at a certain distance from the sensor is accepted as a scale invariance for the features; rotation invariance is provided using the skeletal model: the position of the forearm corresponding to the labelled hand pixels is retrieved and its rotation angle is used to rectify the image.

The scale invariance assumption enables the system to limit the region of interest containing the user's arm in each frame to an area of 64 × 64 pixels. This area is divided into subregions of 8 × 8 pixels, and for each pixel in a subregion the discrete derivatives in the x and y directions are calculated. The feature F extracted from each subregion r is an ordered 4-tuple, comprised of the sums of the derivatives in each direction and of the sums of their absolute values:

F(r) = ( ∑(x,y)∈r dx(x, y),  ∑(x,y)∈r |dx(x, y)|,  ∑(x,y)∈r dy(x, y),  ∑(x,y)∈r |dy(x, y)| )
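A minimal numpy sketch of such subregion features follows; the use of simple forward differences for dx and dy and the 8 × 8 grid over a 64 × 64 patch are assumptions made purely to illustrate the 4-tuple defined above, not a reimplementation of Tang's code.

import numpy as np

def subregion_features(patch):
    """Return the 4-tuple (sum dx, sum |dx|, sum dy, sum |dy|) per 8x8 subregion."""
    # Forward differences as a simple stand-in for the discrete derivatives.
    dx = np.diff(patch.astype(np.float32), axis=1, append=patch[:, -1:])
    dy = np.diff(patch.astype(np.float32), axis=0, append=patch[-1:, :])
    features = []
    for y in range(0, 64, 8):
        for x in range(0, 64, 8):
            rx, ry = dx[y:y+8, x:x+8], dy[y:y+8, x:x+8]
            features.append((rx.sum(), np.abs(rx).sum(), ry.sum(), np.abs(ry).sum()))
    return np.array(features)          # shape (64, 4): one row per subregion

patch = np.random.randint(0, 256, (64, 64), dtype=np.uint8)   # placeholder hand patch
print(subregion_features(patch).shape)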

This definition is a modification of the SURF features proposed by Bay et al. [34], [33, p. 3]. Using these features and a Support Vector Machine the system is able to distinguish between open and closed hand palms in the labelled images. Tang attributes the success of this method, in comparison with other approaches, to the SURF features being originally designed for small image regions, as is the case in his proposed system. The final recognition of gestures relies on the tags determined by the extracted features. Tang's system "understands" two higher-level gestures:

Dropping - a closed hand followed by an open hand

Grasping - an open hand followed by a closed hand

A Hidden Markov Model that expresses the transition probability between hand postures is used to suppress noisy transitions, which are otherwise present when relying solely on the feature labelling.


Tang presents for the results of his system a mean square deviation of 0.0907 from a human labelling of the same frame sequence. Apart from the use of Kinect, which delivers rich data from two essentially independent domains (RGB and depth) and additionally inferred metadata (the skeletal model) with no computational load for the host system, the algorithm relies on crucial invariance assumptions for its proper operation.

3.2 A camera-projector system

Working through a multimedia presentation involves continuous interaction with the computer hosting it. Several solutions have been presented which aim at freeing the user from the use of standard input devices for this interaction. Licsar and Sziranyi criticised some of these solutions for requiring additional hardware such as touch-sensitive displays or laser pointers, for involving non-intuitive handling such as the encoding of commands with the number of raised fingers or the restriction of the performance of gestures to a predesignated area, and for poor performance under real-world conditions such as non-uniform background colour or a cluttered projected image. The system Licsar et al. suggest in [35] is based on coupling the projector with a video camera for the detection of the gestures, and is claimed to correctly detect and interpret a large gesture vocabulary in the presence of cluttered background. The system's input is the image captured by the camera, and its workflow consists of the following loop:

• Frame grabbing

• Background segmentation

• Gesture analysis

• Command execution

• Display-image update and/or refreshing of the projection

The segmentation process relies on the reflectance difference between the screen used for the projection and the human skin of the hand and arm in its foreground, which the authors estimate at 30%. The captured image deviates from the image data which is used to create the projection, owing to deformation as a consequence of the geometry of the projection situation, and as a result of colour distortion caused by the colour transfer functions of the projector and the camera. A colour lookup table and geometrical warping equations are established in a preliminary calibration phase and used to internally synthesize a background image. A comparison of the synthesized image with the grabbed frame reveals the areas that display poorer reflectance, which are then segmented and regarded as forearm and hand regions.


As the main attention is given to gestures performed by the palm and fingers, the forearm is segmented out as well, relying on the geometrical characteristics of the wrist: the fairly constant width of the outline typical of the forearm changes abruptly at the transition point at the base of the palm. Consequently the segmentation process delivers the contour of the wrist and hand palm. The various segmentation stages are demonstrated in figure 3.1.

Figure 3.1 – Segmentation steps (from left to right): input image, captured image, segmented arm, and the extracted palm contour [35, p. 86, Fig. 2]

In order to analyse and classify it as one of the gestures in the vocabulary, means have to be devised for the description of the extracted contour and its comparison with the available class descriptors. Licsar et al. use modified Fourier Descriptors* to characterise the contours: by retracing the open wrist and palm outline a closed contour is created, on the representation of which a DFT is performed. The authors were able to experimentally reduce the number of Fourier coefficients to the first six, maintaining a sufficient description of the contours while efficiently masking noise. For two curves described via their coefficients F¹ₙ and F²ₙ the implemented metric is given with

Dist(F¹ₙ, F²ₙ) = σ( |F¹ₙ| / |F²ₙ| ) + σ( |F²ₙ| / |F¹ₙ| )
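The sketch below illustrates the general idea of comparing contours through a handful of Fourier coefficients; keeping the first six FFT magnitudes and reading σ as the standard deviation over the coefficient ratios is one plausible interpretation of the metric above, not a verified reimplementation of Licsar et al.

import numpy as np

def fourier_descriptor(contour_xy, n_coeffs=6):
    """First n_coeffs Fourier magnitudes of a closed contour given as (N, 2) points."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]   # complex contour representation
    spectrum = np.fft.fft(z)
    # Skip the DC term (translation) and normalise by the first harmonic (scale).
    mags = np.abs(spectrum[1:n_coeffs + 1])
    return mags / mags[0]

def descriptor_distance(f1, f2):
    """Symmetric distance between two descriptors via the std of coefficient ratios."""
    return np.std(f1 / f2) + np.std(f2 / f1)

# Toy usage with two synthetic contours.
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
ellipse = np.stack([1.5 * np.cos(t), 0.7 * np.sin(t)], axis=1)
print(descriptor_distance(fourier_descriptor(circle), fourier_descriptor(ellipse)))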

It was observed that users perform each gesture for a period of 1 to 2 seconds. Following this observation, the system determines the command to be executed by finding the most likely gesture along the time window in which a gesture is detected. The authors found this approach to outperform straightforward classification relying on per-frame identification, thus smoothing the system's response.

Licsar et al. identify the advantage of their solution in the simplicity and flexibility of operation it introduces (the system is reported to "understand" nine different gestures), without such drawbacks as requiring the user to wear or operate additional hardware or the introduction of unintuitive gestures. Yet the system depends on a certain spatial arrangement of its components (projector, camera, and user) in order to establish the visual feedback loop which ensures its correct operation in the first place, and on a crucial calibration phase before each employment.

*For further details about modified Fourier Descriptors and 2D shape representation see e.g. [36]


3.3 A simple PC and web camera setup

Manresa et al. proposed in 2000 a system which utilizes a common web camera as a visual interface and a user's hand to control a PC running a video game [37]. Their goal was to offer a solution which would provide an acceptable real-time experience for random users and would require neither invasive measures such as coloured gloves or hand markings nor a coloured or otherwise predefined background for its proper operation. The system's algorithm goes through the following steps when analysing each incoming frame:

• Hand segmentation

• Tracking and determination of the position of the hand in the frame

• Feature extraction and gesture recognition

The segmentation criterion chosen by the authors was colour, owing to its invariance against the shape and position of the hand, and to the computational simplicity of its manipulation. In a preparation phase before each employment, the user must place her hand in front of the camera within a predefined training region, enabling the system to learn its skin colour features. The pixels in the training region are transformed from the RGB into the HSL colour space*, and with the pixels' hue and saturation values a Gaussian model is computed that represents the skin-colour probability density function. For any new candidate pixel x⃗ = (H, S), with H and S its hue and saturation values respectively, the probability that it represents a skin pixel is therefore given with

P(x⃗) = 1 / √((2π)² |Σ|) · exp( −½ (x⃗ − x̄) Σ⁻¹ (x⃗ − x̄)ᵀ )

where x̄ is the mean and Σ the covariance of the probability density function.
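A compact numpy sketch of such a skin-colour model is shown below; the synthetic training pixels and the two candidate pixels are placeholder values meant only to mirror the Gaussian formulation above.

import numpy as np

def fit_skin_model(hs_samples):
    """Fit mean and covariance of (H, S) training pixels, shape (N, 2)."""
    mean = hs_samples.mean(axis=0)
    cov = np.cov(hs_samples, rowvar=False)
    return mean, cov

def skin_probability(hs_pixels, mean, cov):
    """Evaluate the 2D Gaussian density for an array of (H, S) pixels."""
    diff = hs_pixels - mean
    inv_cov = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(cov))
    exponent = -0.5 * np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return norm * np.exp(exponent)

# Toy usage: train on synthetic "skin" samples, then score two candidate pixels.
train = np.random.normal(loc=[12.0, 120.0], scale=[3.0, 20.0], size=(500, 2))
mean, cov = fit_skin_model(train)
print(skin_probability(np.array([[13.0, 110.0], [90.0, 40.0]]), mean, cov))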

In each incoming frame the probability for each pixel to be a representation of a skin region is calculated. Applying a connected-components algorithm to an image representation of these probabilities and grouping the appropriate scores, the system creates blob(s) which can be identified as skin areas belonging to a hand. Instability in hue values in poorly lit pixels is eliminated by filtering out image areas with low saturation, which improves the robustness of the system against background and lighting changes. In case the system does after all lose track of the operator's hand, it can be calibrated again.

*The HSL colour space is similar to the HSV space discussed in section 2.1.2. The difference lies in the transform from RGB to the channel which represents the value in HSV and the lightness in HSL: V ← max(R, G, B) (see equation 2.2a), whereas L ← ½ · (max(R, G, B) + min(R, G, B)).


The features describing each blob at a time point t are its position in the image, its width and height, and the angle of its orientation within the frame*. A hypothesis is constructed about the state of each blob in the new incoming frame according to the blob's state in former frames. In particular, the hypothesis involves a transition only for the blob's position p, which is expressed through a second-order autoregressive process with information from previous frames:

p(t + 1) − p(t) = p(t) − p(t − 1)

Pixels in each of the detected blobs in the new frame are tested for their distance from the ellipse implied by the width and height of the hypothesis. If the distance is equal to or less than zero, the pixel lies inside the ellipse or on its boundary, is considered to support the hypothesis, and is added to the representation of the new hand state [37, p. 99]. The system is designed to recognize eight gestures, as depicted in figure 3.2.

Figure 3.2 – Manresa et al.: gesture alphabet and valid transitions [37, p. 100, Fig. 3]

*The authors do not elaborate on the method used to determine the orientation. It may be assumed that the first and second principal components of the blob are taken for its height and width, respectively (compare [37, p. 101, figure 4]). It follows that the angle between each of the blob's dimensions and the corresponding 2D axis of the frame (e.g. width and the x-axis) determines the rotation angle of the blob.


Recognition is done on the basis of the blob features which were extracted in the previous tracking stage, and of the convex hull of the hand contour and the related convexity defects (see section 2.7.3 on page 25):

• Through averaging over the depths of the convexity defects the start gesture is recognized, as it displays a higher value than all other gestures.

• stop is recognised when the difference between the axes of the ellipse circumscribing the blob is smaller than a fixed threshold.

• The front and back gestures are recognised by comparing the area of the blob to an initial value, which is computed when the user first places her hand in the move position.

• When in the move state, left and right gestures are determined when the orientation angle of the blob exceeds an angular threshold in each direction.

• The no hand state is satisfied when no hand is detected, or if the detected blobs are too small or fall out of the frame.

Manresa et al. report a very high recognition rate of 99% after running tests with 24 users. The possibility to reinitialise the skin-colour fingerprint whenever the system's performance deteriorates adds to the overall stability experienced throughout an employment session. It is not mentioned whether direct feedback information, e.g. in the form of a window showing the input image, is available to the users in a way which might assist them in maintaining their hand within the camera's field of view and help them "improve" their gestures in case of failed recognition. The authors do mention a certain training period that is required before the testers could interact successfully with the system. The implemented gesture vocabulary is presumably tailored to meet the specifications of the video game which was used for the test, though no detailed information is given about it. The authors do not report how the system behaves when a detected sequence of gestures does not comply with any of the predefined valid transitions. The position feature used to characterise detected blobs is used only for hypothesis validation and blob detection; according to the given definitions the gestures in the vocabulary are indifferent to 2D translations. Manresa et al. did not present a general approach to the problem of gesture recognition; by extracting a small but robust set of features they did, however, implement a stable solution: an application-specific camera-based UI.


Chapter 4

Algorithm

4.1 General approach and overview

4.1.1 Motivation

The motivation for the design of the cbgci system was the informa project. The initial aim of informa was to provide blind and visually impaired users with a low-cost yet reliable alternative to common reading devices on the basis of a thin-client approach. The hardware assembly consists of a common web camera mounted on a fixed arm, and a small computer, as seen in figure 4.1(a). This setup was a mere front-end, used to capture images of text documents placed on the flat surface under the camera, forward them to a remote server which takes over the computationally demanding OCR and TTS processing tasks, and play back the received audio files. It is conceivable that the requirements for a control UI within a simple realisation of this initial informa concept could be satisfied with a user interface consisting of a single button: as the sole functionality would have been the reading of documents, a single unique signal is all the system would require to trigger an image capture, send it to the server and play back the TTS-generated audio file once received.

In the course of its development the project moved towards offering a more comprehensive solution, best described as a multimedia gateway [38]: the network connection, which enabled the basic functionality of informa by delegating the OCR and TTS tasks to a remote server, thus reducing the role of the device itself to the capturing of images and the playback of the received audio, may also be utilised to connect to the internet and retrieve information, access services and use web-based applications.

This new design objective necessitates a more versatile and flexible mode of interaction: the user needs to be able to choose between the services offered by the system and, if applicable, interface with them as well.


(a) informa main assembly and numeric control pad. The camera is mounted in the arm, pointing down   (b) A closer look at the numeric control pad
Figure 4.1 – The second informa prototype

Moreover, control over the audio output of the informa platform should be possible too, allowing the user to manipulate both the volume and the speed* of the playback.

The UI which was chosen for the second informa prototype was a solution based on a common USB numeric keypad. With most of the keys of the original setup removed, a group of four keys was organised to resemble a directional pad (D-pad); another key served as a home button, and two further pairs of keys controlled the playback behaviour, as seen in figure 4.1(b).

The D-pad reflected the logical design of the menu system used to access the services and functions: the menu is constructed as a recursive choice tree with all the available services lined up in the uppermost level, and with varying depths according to the nature of each individual service, as illustrated in figure 4.2. At any level the user could cycle through the different choices using the left and right controls of the D-pad, make a choice with the down key or return one level back by hitting the up key. Hitting the home button at any time would return the system to the main menu. Making a choice that resulted in one of the leaves of the choice tree would initiate the desired action, and upon termination return to the last choice point. The user is permanently informed by a machine voice, announcing the possible options as they are traversed or reciting the textual content requested by the user.

*This is a standard feature of many existing reading solutions for the blind and visually impaired; test subjects reported during the informa field test that their experience with such systems, and presumably also an acquired higher acoustic awareness, enables them to use this functionality and thus save time.


Figure 4.2 – A flow diagram showing the logical structure of the informa user menu. The diagram is not exhaustive: the variety of services and functions is theoretically infinite.

As the D-pad controls are practically soft buttons, i.e. contextless controls which either mirror the relations within the conceptual structure of the system's menu or take over a simple triggering functionality in certain situations, they support the modular, interchangeable and expandable principle guiding the informa design. A new functionality or service may be inserted at any point along the main menu, or a new node at any point in the respective depth of a sub-menu, without having to adapt the UI or to introduce new buttons*.

However, this advantage holds only for those services which may be offered as multiple choices between equivalent possibilities, or for those functions which may be controlled, or rather triggered, with a single action.

*Such amendments and expansions could in fact be carried out in a manner perfectly transparent to the user: the device only sends the navigation commands to the server; a change of the structure or the content of the menu takes place only on the server, and it is its reactions that would change rather than the client's, i.e. the informa device's, behaviour.


For example, if the data represented by the menu items were organised in a database, there would be no possibility to access it along different relations between entries without following a representation of each such relation, which would have to be explicitly defined as a tree search path.

Moreover, the nature of a control concept which is wrapped around navigation in cyclic choice menus can prove as tedious to use as it is simple: the need to repeatedly step through many menu items in order to get to the desired functionality is cumbersome and frustrating. Although the design theoretically allows for personalisation of the order of menu items, the problem is intrinsic to the linear structure of the menu levels. The feedback from the system informing the user of her current position within the menu structure is acoustic and therefore delivered one item at a time, as opposed to the simultaneous assessment of multiple choices available to a user without sight impairment operating a graphical UI; this exacerbates the problem.

To overcome these limitations a new UI had to be developed. It should be backward compatible with the existing menu structure, and at the same time be flexible and generic, allowing for the realisation of different, service-specific control concepts. Utilising the informa on-board camera as an input device for user interaction in the form of gestures could satisfy both these requirements: the modus operandi of the D-pad could be emulated through an adequate design, while the variety of gestures that may be recognised and interpreted by such a system, and with it the flexibility in UI design, is theoretically unlimited. The elimination of the need for the extra hardware, the numeric pad, is a welcome by-product: the informa device could be fully functional relying on a minimal hardware setup.

cbgci was developed in order to meet these challenges. The main design constraints were, on the one hand, the limited computation power (the informa on-board computer was assembled to meet the minimal requirements of a thin client, which only delivers requests to the server and plays back the returned audio files) and, on the other hand, the objective of creating a gesture interface that is robust against such elements as changes of lighting, different users or unpredictable background, and at the same time is flexible and expandable. Figure 4.3 illustrates the workflow of cbgci. Some general aspects of the system's design are discussed in the following section.

4.1.2 Design overview

Vision-based user interfaces are designed to capture and interpret relevant events which take place in the field of vision of a visual sensor, a camera. As such, hand gesture interfaces aim at recognising hand gestures performed by the user and interpreting them as commands. Hassanpour et al. point out the dynamic nature of human gestures, stressing that they are spatial as well as temporal phenomena. In addition, they mention the three phases which make up a hand gesture [27, p. 2]:

Preparation – bringing the hand from its resting position to the starting point of the gesture

Nucleus – the hand motion which conveys the intended concept, the "payload" of the gesture

Retraction – returning the hand to its resting position

Figure 4.3 – A workflow diagram describing the main cbgci algorithm

The three technical layers (detection, tracking and recognition) which comprise a vision-based, hand-gesture driven interface (see section 3 on page 28, as well as [26, p. 31]) must eventually provide the interfaced system with the appropriate internal representation of the concept which the performed gesture stands for. A correct recognition of the gesture is therefore dependent on a correct distinction of the nucleus phase from the preceding preparation phase and the following retraction phase. cbgci relies on the camera which is integrated into the informa device as an input sensor. The camera is mounted on a fixed arm so that its field of view and the rectangular upper surface of the device are congruent.


This setup is dictated by the original functionality of the platform: on this even surface, with the dimensions of a standard A4 sheet, the user should place the text documents she wishes the system to analyse and read out. As it fills the field of view of the camera, this surface could fulfil the task of a detection trigger: once a finger or a hand is detected against its background, the beginning of a gesture could be signalled, which would terminate once the foreground object leaves the frame. At the same time the discernible boundaries of the surface might serve as a tactile threshold: the visually impaired user could expect her hand or finger movements to be detected by the system and interpreted as gestures as long as they rest or move on the surface, and only then. While the second functionality, serving as a guideline for the user, is straightforward and practically a by-product of the physical design of the platform, the gesture control system cannot always rely on the surface to serve as an invariant background, on the basis of which gestures could be detected using only FGBG methods. It is much more likely for the camera to capture different backgrounds, including such that are unfamiliar to the system, on different occasions on which the user may be using gestures, as these might be performed on the surface of a document which is about to be, or has already been, processed by the informa reading functionality. A gesture interface system for this platform must therefore apply a combined approach for its detection and tracking phases, i.e. for the continuing decision about the presence and current location of a gesturing hand or finger.

Many proposed solutions for the task of hand or finger tracking apply colour-based approaches, relying on pre-set and/or dynamically inferred spectral characteristics of the skin, and in some cases assuming prerequisites such as certain garment colours which enhance the contrast with the skin, special coloured markers on hands or fingers, or a predefined and constant background [26, p. 55], [28, p. 3]. Looking for the shape of a hand or a finger in the captured image is a straightforward approach. However, it requires knowledge about the sought shapes and a reliable segmentation of the relevant image areas, which is difficult to sustain in the presence of cluttered background and sub-optimal illumination; occlusion and perspective-related shape distortion challenge this approach further [28, p. 5]. Shape-based tracking techniques may also be applied using only ad-hoc knowledge about shapes which were detected using other methods. Detection through motion may be applied too, though relying solely on this approach requires that the only moving objects in incoming frames be the objects of interest, i.e. hands or fingers [26, p. 61], [28, p. 9]. cbgci uses a mixture of these paradigms to detect gestures. Combining the different approaches compensates for the obstacles each of them might encounter if applied separately: colour instability due to unpredictable lighting conditions and background texture, no prior knowledge about the expected shapes, and limited computational resources. The necessary image processing algorithms are applied to the incoming frame, providing the system with the information necessary for detection.


Once a positive detection occurs, certain features are extracted from the image segment that has been detected. While processing the following frames this information, concerning the colour, shape and position of the object, provides the necessary basis for efficiently tracking it until it is no longer detected. At the same time, the extracted features and their temporal variations throughout the gesture's life-span are used for its identification and, where applicable, for the interpretation of its conceptual payload. Image processing techniques are applied for these purposes, along with the modelling of the spatial behaviour of the tracked object using the Kalman filter. At the end of the processing cycle of each frame the relevant gesture features, if they were found, stand ready for interpretation. Access to these features is gained through an intermediate layer, the tracking manager. The software user can register different image regions with the manager and define in which form the gesture features should be delivered: as absolute or relative position information, as a motion gradient direction, or as discrete localisation within a pre-set cell grid. This concept supports the flexible application of the cbgci system in different scenarios:

• The higher meaning of the gesture information is conferred by the client software, freeing cbgci from this task.

• Different commands or controls can be emulated using different regions of the image, which correspond to different regions on the informa device's surface. These can be dynamically registered and de-registered according to context.

• Confining the computationally expensive image processing tasks to the defined image regions maximizes the efficiency of the application. Moreover, some gesture features that suffice for the emulation of a specific command or control might be acquired with less computational effort; selective use of the different detection and tracking methods at the disposal of cbgci in such cases may increase the efficiency further.

The software client, i.e. the image-acquiring and processing program within the informa system, provides the cbgci tracking manager with incoming frame data. Consequently it can step through the tracking areas registered with the tracking manager and interpret the gesture information available for each of the areas as a command. Different options and services which the informa user might step through while working with the device, and which might require different control schemes according to the functionality they offer, can be dynamically emulated through continuous reshaping and adaptation of the image regions registered with the manager and the way they are interpreted. Examples of this behaviour are given in chapter 5. Details of the cbgci algorithm are discussed in the following section, including the various image processing techniques and the modelling and estimation methods used for efficient and conclusive detection and tracking. Section 4.3 presents an alternative implementation of the detection and tracking algorithm, together with the motivation that led to its development.


4.2 Detailed description

As described above, the cbgci system core takes over the detection and tracking phases of gesture recognition. The following sections provide detailed descriptions of the various data processing and analysis measures applied by the system during its operation, outlining their roles within the detection and tracking mechanisms. It is important to point out that all the stages of the algorithm are performed individually for each region which was registered with the tracking manager. Thus the system can adapt more accurately to environmental changes that affect certain regions of incoming frames more than others, e.g. colour casts caused by shading, while saving the resources which would otherwise be wasted on processing portions of the frame which were not designated for the performance of control gestures.

The control concept which the described implementation of cbgci supports is based on the definition of one or more areas in the image which act as horizontal or vertical virtual sliders. The user interacts with the system by moving a finger along the main axis of an area, or by simply placing a finger within a certain region if the latter emulates a virtual button. The various processing stages were designed for optimal performance in this scenario; this manifests itself mainly in heuristic decisions regarding thresholds and filtering approaches, which are realised so that the system can successfully detect and track fingers in the designated areas. Chapter 5 provides an elaborate description of the usage of these basic virtual slider elements in the implementation of specific control overlays for the informa device.

4.2.1 Preliminary image processing

In order to discern a gesturing hand or finger in a frame, cbgci uses characteristic colour features as well as motion information (see 4.1.2 on page 40). Establishing a statistical background model and colour-balancing the frame are two preliminary stages which prepare the frame and the frame-related data for these purposes.

Automatic white balance

For its detection mechanism cbgci relies on a pre-set characteristic hue profile of human skin (see section 4.2.2). When transforming image data from an RGB to an HSV representation, poor saturation may result in distorted hue values, as minor differences in magnitude between the RGB channels may result in a 60° hue shift (compare equations 2.2a and 2.2c). This phenomenon is most problematic when an object with a relatively smooth surface colour appears in the frame within a region which displays a lighting gradient: although it is obvious that the hue response for the object should not vary substantially, it might display strong irregularities because of the difference in the light cast.


Using a retinex-derived white balance method, following the solutions by Pilu and Pollard and by Kimmel et al., cbgci attempts to remove the influence of the illuminant on the scene and to form a hypothesis about the original colours of the reflectant surfaces. As mentioned in section 2.2.2 (page 14), these methods should operate on the light-intensity related channel within a luminance-chrominance colour representation rather than on the individual channels of an RGB representation. Applying the retinex-WB method to an HSV representation of the image would leave the hue channel untouched, and render the effort futile. Therefore the frame is transferred from its original RGB to a YCrCb representation, of which the Y channel undergoes the retinex-WB and the additional γ-correction. Following this, the colour-balanced YCrCb image is transformed back to RGB and further to the HSV representation if initial detection is needed (as described in the following section 4.2.2). Due to the nature of the YCrCb→RGB transform, which involves a linear combination of the Y channel with the shifted Cr and Cb channels (compare equations 2.5 and 2.6), the WB manipulation affects each of the RGB channels, and thus also the resulting hue channel after an RGB→HSV transform. Figure 4.4 demonstrates the effect of colour balancing on the input frame.

(a) Original input frame   (b) After white balance
Figure 4.4 – Effect of retinex-WB and gamma correction

In frames following a positive detection, i.e. during tracking, only the colour-balanced YCrCb frame is required for further analysis; therefore the transform back to HSV is not performed (see further section 4.2.3 on page 48).
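The colour-space handling described above can be sketched as follows; the per-frame gamma curve stands in for the full retinex-based balancing of the Y channel (which involves a more elaborate estimation of the illuminant), so the snippet only illustrates the RGB→YCrCb→RGB→HSV round trip and the tracking-time shortcut.

import cv2
import numpy as np

def colour_balance(frame_bgr, tracking=False, gamma=0.8):
    """Illustrative colour-balance step: correct Y in YCrCb, then convert as needed."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)

    # Placeholder for the retinex-derived white balance: a simple gamma curve on Y.
    y = ycrcb[:, :, 0] / 255.0
    ycrcb[:, :, 0] = np.clip((y ** gamma) * 255.0, 0, 255)
    ycrcb = ycrcb.astype(np.uint8)

    if tracking:
        # During tracking only the balanced YCrCb data is needed.
        return ycrcb, None

    # For initial detection the balanced image is taken back to RGB and on to HSV.
    balanced_bgr = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    hsv = cv2.cvtColor(balanced_bgr, cv2.COLOR_BGR2HSV)
    return ycrcb, hsv

frame = np.full((120, 160, 3), 90, dtype=np.uint8)     # dummy frame
ycrcb, hsv = colour_balance(frame)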

FGBG-Model

cbgci utilises an implementation of the Adaptive Background Mixture Model algorithm suggested by KaewTraKulPong and Bowden for its FGBG model (see section 2.4). Each registered region is barred from tracking for the first 30 incoming frames: these are used exclusively to establish an initial version of the model.


After the initialisation period the model keeps being continuously updated with incoming frames, yet not during tracking sequences. This prevents the model from gradually regarding pixels that belong to the tracked objects as background pixels, even if the objects remain immobile for some time. When tracking ceases, new frames are once again used to update the model.
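A rough sketch of this update policy with OpenCV's stock background subtractors is given below; cv2.createBackgroundSubtractorMOG2 is used here merely as a readily available mixture-model implementation (the contrib module cv2.bgsegm offers a variant closer to KaewTraKulPong and Bowden), and the 30-frame warm-up and learning-rate values only mirror the behaviour described above.

import cv2
import numpy as np

fgbg = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

INIT_FRAMES = 30
frame_count = 0
tracking_active = False          # would be toggled by the detection/tracking logic

def process(frame_bgr):
    global frame_count
    frame_count += 1

    if frame_count <= INIT_FRAMES:
        # Warm-up: only build the model, no tracking yet.
        fgbg.apply(frame_bgr)
        return None

    # learningRate=0 freezes the model while a finger is being tracked,
    # so a resting finger is not absorbed into the background.
    rate = 0 if tracking_active else -1       # -1 lets OpenCV pick its default rate
    return fgbg.apply(frame_bgr, learningRate=rate)

mask = process(np.zeros((120, 160, 3), dtype=np.uint8))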

Figure 4.5 demonstrates the functionality of the FGBG model when a hand appears on the solid background which was learned during previous frames.

(a) Background   (b) Hand in foreground   (c) Image foreground   (d) Foreground mask
Figure 4.5 – Extracting a foreground mask

4.2.2 Initial detection and feature extraction

The system searches the foreground pixels, as determined by the FGBG model, for spectral and morphological characteristics that could indicate that an image region might belong to a hand or a finger. Positive detection is followed by the extraction of features which are used both for internal tracking mechanisms and as gesture output data. Consequently the system switches to active tracking mode.


Histogram back projection

Generally cbgci operates only on foreground pixels, where motion will manifest itself when it occurs. Many phenomena might cause a colour shift of a pixel, causing it to be temporarily registered as foreground with the FGBG model: moving objects in the frame, a camera shift, as well as colour casts, lighting variations and noise. The aim of a gesture recognition system in this context is to recognize the foreground regions which correspond to hand or finger skin, filtering out those which were marked as foreground for any other reason. The colour balancing carried out in the preliminary stage is aimed at suppressing colour shifts and casts caused by illumination, so that the colour data available for the detection stage is closer to the proper chromaticity of the reflectants in the scene. Operating on an HSV representation of the frame makes it possible to further detach the chromaticity from the intensity value of each pixel: histogram back projection is applied to the hue channel of the HSV foreground image, using an off-line learned histogram of hand-skin hue values. Following this, the back projection response is checked against a threshold in order to filter out pixels displaying poor values, i.e. practical negatives. The resulting binary mask is then ready for further analysis. This mask represents the system's hypothesis of skin-coloured regions in the foreground of the incoming frame. An example of this hypothesis and its deviation from the mere foreground is given in figure 4.6.
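The following snippet sketches this step with the OpenCV Python bindings; the 24-bin hue histogram, the random skin training patch and the threshold of 50 are illustrative stand-ins for the off-line learned histogram and the tuned threshold used by cbgci.

import cv2
import numpy as np

# Off-line stage: learn a hue histogram from skin sample pixels (placeholder data).
skin_patch = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
skin_hsv = cv2.cvtColor(skin_patch, cv2.COLOR_BGR2HSV)
skin_hist = cv2.calcHist([skin_hsv], [0], None, [24], [0, 180])
cv2.normalize(skin_hist, skin_hist, 0, 255, cv2.NORM_MINMAX)

def skin_mask(frame_bgr, fg_mask, thresh=50):
    """Back-project the learned hue histogram onto the foreground of a frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    response = cv2.calcBackProject([hsv], [0], skin_hist, [0, 180], 1)
    response = cv2.bitwise_and(response, response, mask=fg_mask)
    _, mask = cv2.threshold(response, thresh, 255, cv2.THRESH_BINARY)
    return mask

frame = np.random.randint(0, 255, (120, 160, 3), dtype=np.uint8)
fg = np.full((120, 160), 255, dtype=np.uint8)       # dummy "everything is foreground"
mask = skin_mask(frame, fg)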

Determining connected components

The binary mask, the result of the histogram back projection of the colour-balanced foreground of the tracking region, stands for the system's hypothesis about the location of moving objects with a skin-hue resemblance in the frame. The detection of a marked blob in the binary mask as a finger is consequently done according to its size; in particular, the ratio between the blob's area and that of the tracking region must exceed a threshold for the blob to be positively detected as a finger. Relying on this simple feature enables tracking with different finger postures and with one or more fingers, without requiring the system to have prior knowledge of these shapes and to indulge in intensive and costly shape recognition. In order to make out blobs, contours are marked along the transitions between 1s and 0s in the binary mask. Each contour is approximated with a polygon, and the polygon's area* is checked with respect to the abovementioned ratio. The contour, or rather its polygon approximation, is retained only if the ratio lies above the threshold. As mentioned above, the basic elements of the implemented control concept are virtual sliders, along which the user slides her finger. The implemented heuristic considers only the first encountered contour that passes the area-ratio threshold as a detected finger.

*Area calculation is performed by the OpenCV function cvContourArea(), through discrete integration along the contour.


(a) Original frame   (b) Image foreground   (c) Back projection response   (d) Thresholded back projection as mask
Figure 4.6 – Back projection with skin-hue histogram. Compare the foreground hypotheses before and after back projection

The only feature of the detected finger which is needed for the gesture interpretation within the control concept is its position along the axis of the virtual slider. The position is determined according to the centroid Pc = (xc, yc) of the detected blob, which is calculated using its 0th and 1st moments*:

Pc = ( m10 / m00 , m01 / m00 )    (4.1)

where mij = ∑x ∑y x^i y^j I(x, y)
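A compact sketch of this blob selection and centroid computation with the newer OpenCV Python interface follows; the area-ratio threshold of 2% and the ε used for the polygon approximation are arbitrary illustrative values, not the thresholds tuned for cbgci.

import cv2
import numpy as np

def detect_finger(mask, area_ratio_thresh=0.02):
    """Return the centroid of the first blob whose area exceeds the given ratio."""
    region_area = mask.shape[0] * mask.shape[1]
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for contour in contours:
        poly = cv2.approxPolyDP(contour, 0.01 * cv2.arcLength(contour, True), True)
        if cv2.contourArea(poly) / region_area < area_ratio_thresh:
            continue
        m = cv2.moments(poly)
        if m['m00'] == 0:
            continue
        return (m['m10'] / m['m00'], m['m01'] / m['m00'])   # centroid (xc, yc)
    return None

mask = np.zeros((120, 160), dtype=np.uint8)
cv2.rectangle(mask, (40, 30), (90, 100), 255, -1)           # stand-in finger blob
print(detect_finger(mask))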

Upon positive detection a Kalman filter is initialised with the coordinates of the centroid; it is used for smoothing the recorded movement and for bridging over detection dropouts during tracking in the following frames.

*As the operation is carried out in this case on the positive regions of a binary mask, we receive for the 0th and 1st moments: m00 = ∑x ∑y 1, m10 = ∑x ∑y x, m01 = ∑x ∑y y


The Kalman filter also expresses a linearity hypothesis about the nature of the finger's motion, holding in its internal state additional values for the velocities vx and vy in the x and y directions, respectively. Consequently the filter's internal state has four dimensions:

x_{k−1} = (xc, yc, vx, vy)ᵀ

and its transfer matrix takes the form:

        [ 1  0  ∆t  0 ]
    A = [ 0  1  0  ∆t ]        (4.2)
        [ 0  0  1   0 ]
        [ 0  0  0   1 ]

where ∆t = 1/fps [s] is the time which elapses between two frames at a frame rate of fps frames per second. The dot product of the transfer matrix with the internal state vector delivers the expected displacement of the centroid within one frame (before integrating the process uncertainty, and without any control values, which are absent in this setting; compare equation 2.23).
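A minimal OpenCV sketch of such a constant-velocity Kalman filter is given below; the noise covariances, the initial centroid and the frame rate of 30 fps are placeholder values assumed purely for illustration.

import cv2
import numpy as np

fps = 30.0
dt = 1.0 / fps

kf = cv2.KalmanFilter(4, 2)                  # state (x, y, vx, vy), measurement (x, y)
kf.transitionMatrix = np.array([[1, 0, dt, 0],
                                [0, 1, 0, dt],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3      # placeholder tuning
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1  # placeholder tuning

# Initialise with the centroid of the detected blob.
kf.statePost = np.array([[80.0], [60.0], [0.0], [0.0]], dtype=np.float32)

# Per frame: predict, then correct with the measured centroid (if a blob was found).
prediction = kf.predict()
estimate = kf.correct(np.array([[82.0], [59.0]], dtype=np.float32))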

Establishing a colour fingerprint

Positive detection in the previous step was based upon the histogram back projection response of the foreground regions. The histogram against which the image data is checked is pre-set, representing a general notion about the hue value distribution of human skin*. Moreover, the back projection operation is computationally demanding, as it involves a lookup for every pixel in the image. In order to be able to track the detected object in the following frames with higher efficiency, a colour "fingerprint" is extracted: the pixels within and those outside the boundaries of the detected object are treated as two separate classes, from which a Fisher discriminant vector is calculated according to equation 2.20 (see also [39, p. 105, 107]). The discriminant vector is retained during the entire tracking sequence which follows the current initial detection, and is used instead of the back projection mechanism. The detection accuracy is improved, as the hypothesis can be established with the de-facto colour information rather than on the basis of the pre-set histogram; from the computational point of view, the operations required for the classification (following equations 2.20 and 2.21) are less intensive than the lookup operations necessary for a back projection. The discriminant vector is extracted from the YCrCb representation of the image data.


*In particular, the hue histogram used in the described implementation comprises 24 bins for the discrete resolution of 180° of the hue channel within OpenCV.
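The two-class Fisher discriminant used as the colour fingerprint can be sketched in a few lines of numpy, as below; the synthetic YCrCb samples for the two classes and the small regularisation term are illustrative assumptions, while the exact formulation follows equation 2.20 of this thesis.

import numpy as np

def fisher_discriminant(object_pixels, background_pixels):
    """Fisher discriminant vector w ∝ Sw^-1 (mu_obj - mu_bg) for YCrCb pixel classes."""
    mu_obj = object_pixels.mean(axis=0)
    mu_bg = background_pixels.mean(axis=0)

    # Within-class scatter, lightly regularised to keep the inverse stable.
    s_obj = np.cov(object_pixels, rowvar=False)
    s_bg = np.cov(background_pixels, rowvar=False)
    s_within = s_obj + s_bg + 1e-6 * np.eye(3)

    w = np.linalg.solve(s_within, mu_obj - mu_bg)
    return w / np.linalg.norm(w)

# Toy usage with synthetic YCrCb samples for the detected finger and its surroundings.
obj = np.random.normal([150, 145, 110], 8, size=(400, 3))
bg = np.random.normal([90, 120, 130], 15, size=(800, 3))
print(fisher_discriminant(obj, bg))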


At this point, and in the case of a positive detection, the system holds information about the position of the tracked object within the tracking region and about the object's colour fingerprint. The active tracking status of the tracking region, together with the tracking position information, is available to the tracking manager, and hence to its client. Controlling the client using the user's finger may begin.

4.2.3 Tracking

After initial detection the region is in tracking mode, in which it remains until the tracked object can no longer be detected. During tracking the FGBG model is not updated, as mentioned above in section 4.2.1 (p. 44). Thus the foreground portion of the image data continues to contain the moving finger. Colour balancing takes place, but without the transformation from the YCrCb to the HSV representation. Accordingly, the YCrCb data of the foreground of the tracking region is available for the tracking mechanism.

Colour based sub-segmentation

The identification of image regions which belong to the tracked finger is done on the basis of colour characteristics during tracking as well. Yet in this context the classification is not done using histogram back projection, but with the Fisher discriminant which was extracted from the image region at the beginning of the tracking sequence: the YCrCb image matrix is dot-multiplied with the discriminant vector. The product matrix, containing the classification score for each pixel, is then thresholded with half of the bandwidth (128 when dealing with 8-bit images) to form the binary classification mask (compare equation 2.22). The pixels marked with 1 belong to the positive class, i.e. display colour characteristics similar to those sampled from the initially detected object. Following this, the colour discriminant mask and the image foreground mask undergo a binary AND operation. The pixels marked with 1 in the resulting binary image belong to moving objects in the scene which also display a colour of good similarity with the initially detected object. A moving finger in the frame, should it exist, would be comprised of these candidate pixels.
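A short numpy illustration of this per-pixel scoring and masking follows; rescaling the projected scores so that 128 acts as the class boundary is an assumption made here to reproduce the "half of the bandwidth" threshold, since the exact scaling of the scores is defined by equations 2.21 and 2.22 of the thesis.

import numpy as np

def fingerprint_mask(ycrcb, w, fg_mask):
    """Score every pixel with the Fisher vector w and AND the result with the FG mask."""
    scores = ycrcb.reshape(-1, 3).astype(np.float32) @ w        # per-pixel projection
    # Rescale the scores into 0..255 so that 128 acts as the class boundary (assumed).
    scores = 255.0 * (scores - scores.min()) / max(scores.ptp(), 1e-6)
    colour_mask = (scores.reshape(ycrcb.shape[:2]) >= 128).astype(np.uint8) * 255
    return np.bitwise_and(colour_mask, fg_mask)

ycrcb = np.random.randint(0, 255, (120, 160, 3), dtype=np.uint8)
w = np.array([0.8, 0.5, -0.3], dtype=np.float32)
fg = np.full((120, 160), 255, dtype=np.uint8)
mask = fingerprint_mask(ycrcb, w, fg)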

Discerning and tracking

Picking the finger out of the candidate pixels is performed in the same manner as described above (section 4.2.2, p. 45): by finding the first blob of connected components that exceeds the area-ratio threshold. The centre of mass of this blob is considered the system's current measurement of the presence of the finger in the current frame. The measurement is synthesised with the prediction of the Kalman filter (following equation 2.26) to form the current cbgci hypothesis, which is used both to update the filter and to deliver all the relevant output, as described in the following section.


If no blob is found that meets the area threshold requirements, the tracking sequence is terminated. The information that no tracking takes place in the particular tracking region is made available to the client of the tracking manager. The Fisher discriminant vector is reset. cbgci resumes the updates of the background model of the region, and new frames once more undergo the detection branch of the system's workflow. Between tracking sequences each tracking area retains its last output value; reading the output values is possible at all times and is independent of the tracking status.

Providing control-relevant output

A tracking area can be initialised and registered with the cbgci tracking manager with different configurations, which determine the manner in which the area's output is affected by the tracked movement of the finger, and in what form it is delivered back to the client. As tracking areas are designed to virtually emulate control sliders or faders, the configuration gives the software user the possibility to adjust each of the registered tracking areas to the needs of the application.

As a first step when a tracking area is initialised, its relevant axis is determined: following the slider model, the relevant axis coincides with the larger dimension of the rectangular portion of the frame within which the area is defined. The area should therefore be used by moving the finger along this axis.

Three basic possibilities for transforming the finger's location into meaningful output values are available (a sketch of the gridded mode follows the list):

Absolute: Delivering absolute values is straightforward: the finger's location along the relevant axis is returned to the client, with pixel precision. The absolute configuration allows the user to control continuous variables within known boundaries.

Gridded: The client may register a tracking area with n grid cells. Upon registration the length of the relevant axis is divided equally into n cells, and the tracking information contains the ordinal of the grid cell within which the tracked finger is located. In order to suppress erratic output when the user places her finger near or over the virtual boundary between cells, a transition threshold is implemented: only if the tracked point moves deep enough into an adjacent cell will the change be registered and output. Gridded settings are convenient for controlling or accessing discrete options or values. Furthermore, they offer a solution to the problematic user experience with cyclic menu browsing addressed in section 4.1.1 on page 38: the user can place her finger directly on the position of a certain menu item within the grid*.

Relative: By configuring the tracking area to output relative values, the client can emulate a "soft slider": during tracking the output values are altered relative to the point where initial detection occurred. Further, it is possible to set a sensitivity factor, which determines the relation between the speed of the registered motion and that of the change in the output values. Decreasing the sensitivity can assist in suppressing erratic responses of the cbgci system, while increasing it can help the user cover large value spans with less effort. A relative response provides an experience similar to that of a common computer mouse, and is suitable for conveniently controlling or accessing large and varying value spans.

*Tactile aids may further enhance this advantage; see section 5.4 on page 56.
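The gridded mode with its transition threshold can be sketched as follows; the struct layout, the fraction-of-a-cell convention for the threshold and the member names are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cmath>

// Sketch of a gridded axis with hysteresis: a cell change is only reported
// once the finger has moved deep enough into the adjacent cell.
struct GriddedAxis {
    int   cells;            // number of grid cells n
    float axisLength;       // length of the relevant axis in pixels
    float transition;       // fraction of a cell the finger must penetrate
    int   current = -1;     // last reported cell ordinal (-1: no tracking yet)

    int update(float fingerPos) {
        const float cellLen = axisLength / cells;
        const int raw = std::min(cells - 1, int(fingerPos / cellLen));
        if (current < 0) { current = raw; return current; }     // first contact
        if (raw != current) {
            // boundary between the old cell and the cell the finger is now in
            const float border = (raw > current ? raw : raw + 1) * cellLen;
            if (std::abs(fingerPos - border) > transition * cellLen)
                current = raw;                   // deep enough: register the change
        }
        return current;
    }
};
```

A region registered with, say, five cells and a transition of 0.2 would then report a new ordinal only after the finger has travelled a fifth of a cell past the boundary.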

Examples of the usage of the different configurations are given in chapter 5.

4.3 Alternative approach

Poor or unstable lighting conditions, as well as problematic camera behaviour, may affect the performance of the detection and tracking phases dramatically. Irregularities or fluctuations in the illumination intensity prevent the establishment of a solid FGBG model, as the model might register changes in the colour and intensity values of pixels although in fact no change in the scene has taken place; detection and tracking may also suffer directly from instability and distortion of the registered colour, which undermine the proper function of both the hue-histogram back projection and the Fisher colour discriminant vector methods.

Similar effects may result from certain camera characteristics. Some web cameras feature automatic mechanisms that regulate exposure time and WB. Keeping an even distribution of the registered illumination levels, as is the aim of the automatic exposure functionality, is generally welcome, since it enhances the contrast of the RGB image and, as a result, the value distribution of any other representation of the image data (see also [4, ch. 3]). The problem lies in the behaviour of these systems: they react of their own accord to changes in the illumination as registered by the camera's sensor. If the cause is a change in the scene's illumination, the automatic adaptation might indeed support a stable representation of the scene. On the other hand, if the camera reacts with a change of its exposure settings to an alteration in the scene's overall reflectance (e.g. due to an object being introduced into or removed from the scene while the light source is kept constant), the result might be a counter-productive distortion of the image data. The same applies to the automatic WB mechanisms in the context of chromaticity.

Some web cameras supply the software user with methods to control and turn off such automatic features, whereas others do not include such methods in their configuration interfaces. It was observed during the development of cbgci that in the latter case the detection of fingers in the frame suffers from both false-positive and false-negative results: the resulting cross-channel intensity fluctuations and colour irregularities disrupt the construction of a stable background model and the operation of the colour-related detection methods.


An alternative approach had to be devised to enable the cbgci system to detect and track a user's finger under these conditions. As image instability is the very obstacle that has to be overcome, the system cannot rely in this context on an on-line trained, continuously updating background model; a static model, invariant to lighting changes, must be used. Detection cannot be based on colour characteristics either, as their constancy in the input data is likewise not guaranteed; neither the hue-based initial detection nor the YCrCb-based Fisher discriminant could guarantee reliable results.

In order to neutralise the instability in the colour response, the entire processing is done on a grey-scale representation of the input image data. The input RGB image data is transformed into a single-channel grey-scale image Y (compare equation 2.4a):

Y = 0.299 · R + 0.587 · G + 0.114 · B

The dynamic background model is replaced with an implicit static one: the white, homogeneous surface of the informa device. Pixels displaying intensity responses, i.e. grey-scale values, that are lower than those of the background may consequently be considered to represent a foreground object. This clearly poses a constraint on the usage scenario: for proper operation of the system, the portion of the image that is registered as a tracking area with the manager must remain free of all objects other than the user's finger.


Figure 4.7: Alternative algorithm, the different processing stages: the original input frame (a) is transformed to grey scale (b), thresholded (c) and binarised (d); after morphological erosion the largest blob of connected components is marked (e).

In order to enhance the system's robustness against colour casts or lighting gradients in the original image, which might influence the grey-scale response of the background, a threshold T is derived from the maximum and minimum grey values in the image:

T = 1/3 · (max(I) + 2 · min(I))        (4.3)


with which the grey-scale pixels are low-passed. Further, in order to suppress noise, whether hardware dependent* or resulting from prior processing, the mask of low-intensity pixels undergoes morphological erosion with a 3 × 3 kernel with its anchor at its centre†.

The resulting mask represents the system's hypothesis about the presence of foreground objects and their eventual location. cbgci utilises here the same method as described above in section 4.2.2 on page 45 to outline, area-threshold and eventually locate a blob it would then consider to be the user's finger. This measure, together with the grey-scale thresholding and the erosion of the resulting mask, helps to ease the constraint pointed out above: only objects that are dark enough in comparison with the informa surface and that are big enough in comparison with the size of the tracking area will be detected as a finger and trigger a tracking sequence.

The mechanisms used in this alternative implementation for initial detection and for detection during a tracking sequence are identical to those described in section 4.2. Visualisations of the different processing stages are shown in Figure 4.7. Upon positive initial detection the tracking area's Kalman filter is initialised, and is used during the following tracking sequence in the same manner as described above on page 45 and on page 48. The control-relevant output is presented to the client in the same way as described on page 49.

Giving up the elaborate image processing and manipulation stages that make up the previously described cbgci algorithm in favour of the alternative approach also has a positive effect on the performance of the system, both in CPU time and in memory consumption; further details are given in chapter 6.

*Notice that in this case hardware-dependent, non-correlated noise from all three RGB channels is combined into one single grey-scale channel.

†Erosion is done using the OpenCV function cvErode(), [12, pp. 115–118].
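Taken together, the alternative processing path can be condensed into a few OpenCV calls; the sketch below assumes a BGR input frame and illustrative names, and leaves the subsequent blob analysis to the mechanism of section 4.2.2.

```cpp
#include <opencv2/opencv.hpp>

// Sketch of the grey-scale-only detection path: convert to grey scale,
// threshold with T from equation 4.3 (keeping dark pixels), and erode
// with a 3x3 kernel anchored at its centre.
cv::Mat darkObjectMask(const cv::Mat& bgr)
{
    cv::Mat grey;
    cv::cvtColor(bgr, grey, cv::COLOR_BGR2GRAY);      // Y = 0.299R + 0.587G + 0.114B

    double minVal, maxVal;
    cv::minMaxLoc(grey, &minVal, &maxVal);
    const double T = (maxVal + 2.0 * minVal) / 3.0;   // equation 4.3

    cv::Mat mask;
    cv::threshold(grey, mask, T, 255, cv::THRESH_BINARY_INV);   // keep pixels below T

    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));
    cv::erode(mask, mask, kernel);                    // suppress single-pixel noise
    return mask;                                      // candidate foreground mask
}
```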


Chapter 5

Integration

Like any other input device, cbgci has to be nested in the main runtime loop of the system it is to serve as a UI. Accordingly, a cbgci tracking manager is initialised within the grab program. This program runs on the informa device, controls image capturing and distributes image data to potential software clients, local or remote. All the tracking areas necessary for controlling the different functions and services on board the informa system may be registered with the manager. During runtime, each frame grabbed from the camera is submitted to the manager, which passes it on to the registered regions for tracking and consequently makes any registered control data available to software clients.

The interpretation of the control information received via cbgci lies in principle with the system within which cbgci is embedded, informa in this case. The system may interpret the same input information in different ways according to its current control context. Alternatively, clients may alter the control concept to meet different contexts, either by registering and deregistering tracking regions, or by registering overlapping tracking regions and selectively relating to their output. The control information received from the manager is region-related, and each region can be identified by the proper name with which it was initialised, enabling selective access and manipulation.
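The shape of such an integration can be sketched as below; the TrackingManager interface shown here is a deliberately simplified stand-in with assumed method names (addRegion, process, isTracking, output), not the actual cbgci API.

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
#include <string>
#include <vector>

// Simplified stand-in for the tracking-manager interface (assumed names).
struct TrackingRegion { std::string name; cv::Rect roi; };
struct TrackingManager {
    std::vector<TrackingRegion> regions;
    void addRegion(const std::string& n, const cv::Rect& r) { regions.push_back({n, r}); }
    void process(const cv::Mat&) { /* per-region detection and tracking happens here */ }
    bool isTracking(const std::string&) const { return false; }   // placeholder
    int  output(const std::string&) const { return 0; }           // placeholder
};

int main()
{
    cv::VideoCapture cam(0);
    cam.set(cv::CAP_PROP_FRAME_WIDTH, 320);
    cam.set(cv::CAP_PROP_FRAME_HEIGHT, 240);

    TrackingManager manager;
    manager.addRegion("menu",   cv::Rect(20, 210, 280, 30));   // strip nearest the user
    manager.addRegion("volume", cv::Rect(290, 20, 30, 180));   // right-hand strip

    cv::Mat frame;
    while (cam.read(frame)) {
        manager.process(frame);                 // distribute the frame to all regions
        if (manager.isTracking("menu"))
            std::cout << "menu output: " << manager.output("menu") << std::endl;
    }
    return 0;
}
```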

The following sections describe some of the informa control scenarios which were realised using cbgci.

5.1 Navigation

A principal control context of the informa platform is navigation. The user must be able to navigate between the different services and menu levels, articulate her choices, and return to the uppermost level from anywhere within the virtual menu tree (see section 4.1.1 on page 36). cbgci enables a straightforward emulation of the D-pad keys used to navigate through the informa menu:


Figure 5.1: Tracking regions layout for navigation

left and right: A strip along the side of the frame that corresponds to the short side of the informa surface nearest to the user is configured as a tracking region with relative output. Moving her finger along the strip, the user can scroll through a list of menu items either to the left or to the right: the sign of the gradient of the relative motion is interpreted as a direction command. Setting the value of the sensitivity factor allows the control to be tuned to a convenient reaction speed, i.e. the distance the finger must cover in order to switch to the next or previous menu item.

up, down and home: Three tracking regions are defined at three different locations on the informa surface. These are initialised as gridded regions with only one grid cell. As all three virtual buttons should act as triggers, the only relevant control information is the beginning of a tracking sequence: placing a finger inside the designated surface portion is interpreted as a button hit. Because the region has only one grid cell, no other control output can be generated until tracking in the region ends, i.e. until the user pulls her finger back from the region.

The layout of the tracking regions is shown in figure 5.1. An intermediate client interprets the control data and passes it on to the menu management system, using the same command tokens that are used when working with the D-pad. Maintaining this interface renders the deployment of cbgci transparent from the perspective of the menu management system.
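A minimal sketch of how such an intermediate client might turn the relative output of the left/right strip into discrete direction commands is given below; the accumulator logic, the sensitivity handling and the token names are assumptions, not the informa implementation.

```cpp
#include <cstdlib>
#include <string>

// Sketch: accumulate relative finger motion and emit a direction token once
// the configured travel distance (the sensitivity) has been covered.
struct MenuScroller {
    int step;               // pixels of finger travel per menu step
    int accumulated = 0;    // motion collected since the last emitted command

    // 'delta' is the change in the region's relative output since the last frame
    std::string update(int delta) {
        accumulated += delta;
        if (std::abs(accumulated) < step)
            return "";                              // not enough travel yet
        std::string token = accumulated > 0 ? "RIGHT" : "LEFT";
        accumulated = 0;                            // one command per crossing
        return token;                               // forwarded to the menu system
    }
};
```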

5.2 Output control

All feedback from the informa platform is conveyed to the user audibly, over integrated speakers. Functions, options and services, as well as all the output that is generated during runtime, are announced by a computerised voice. The user should have control over the volume and the speed of recitation (see footnote on page 36). The virtual slider paradigm is most suitable for this application. As seen in figure 5.2, two tracking regions are registered, along the side nearest the user and along the right side of the informa surface, and are used to control the speed and the volume respectively. Each region is declared to deliver either absolute or gridded output:


Figure 5.2: Tracking regions layout for output control

In the first case the output must be scaled by the ratio between the length of the region's relevant side in pixels and the resolution of the system controls for speed and volume (typically 0% . . . 100% or 0 . . . 255). In the second case the maximal number of grid cells is dictated by the minimal width of a cell that is still practical to use with the user's finger; each grid cell is then associated with a discrete volume or speed control value. The values, adapted accordingly by an intermediate client, are passed on to the OS sound system or to the media player which is used for playback.

The emulated up and home buttons are retained at the same locations as described in the previous section, following the original design of the informa menu navigation: the user should always have the option to leave a running service and return to the previous choice menu, or alternatively all the way up to the main informa menu.
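For the absolute configuration, the scaling mentioned above amounts to a simple linear mapping; the following sketch assumes a 0..100 target range and illustrative names.

```cpp
#include <algorithm>

// Sketch: map a finger position on the relevant axis (in pixels) to a
// control value in 0..controlMax, e.g. a volume percentage.
int scaleToControl(int fingerPos, int axisPixels, int controlMax = 100)
{
    if (axisPixels <= 1)
        return 0;
    int value = fingerPos * controlMax / (axisPixels - 1);   // linear mapping
    return std::max(0, std::min(controlMax, value));         // clamp to valid range
}
```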

5.3 Text input

When offering dynamic and interactive content, the informa platform must provide possibilities to input text and other symbols. In combination with the platform's capability to recognise and audibly output text, a textual interface can make most of the content available on the internet accessible to the user. Using an online search engine is one example of a possible service: browsing through search results, choosing the desired item and listening to a recitation of the text on the web site are theoretically possible with the old informa setup, but without a possibility to actively input text the user could not submit her search query. Retaining the up and home virtual buttons, four additional control areas are introduced (a sketch of the interpreting client follows the list):

Character slider: A strip along the right side of the surface is configured as a tracking region with relative output. Sliding her finger along the region, the user can browse through the different characters until the desired character is reached: according to the finger movements the character set is stepped through, and the current character is called out by the computer voice*.

Choose character: Once the desired character is reached, the choose virtual button is "pressed", and the character is added to the current string, which is managed by an intermediate client.

Delete last character: The last character in the string can be removed.

Enter text: Once all the characters have been input, hitting the virtual enter text button completes the text (a single word, or multiple words, i.e. a string containing spaces) and gives the intermediate client the signal to forward it to the respective service or function.

*The necessary audio data with the audio tags of all the available characters is stored locally.
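A sketch of the intermediate client that assembles the string from these four controls might look as follows; the character set, the method names and the wrap-around behaviour of the slider are assumptions for illustration.

```cpp
#include <string>
#include <vector>

// Sketch of the string-assembling client behind the text input controls.
struct TextInputClient {
    std::vector<char> charset{'a','b','c','d','e','f','g','h','i','j','k','l','m',
                              'n','o','p','q','r','s','t','u','v','w','x','y','z',' '};
    int         index = 0;   // character currently announced by the voice output
    std::string buffer;      // string assembled so far

    void slide(int steps) {                       // character slider (relative output)
        const int n = static_cast<int>(charset.size());
        index = ((index + steps) % n + n) % n;    // step through the set, wrapping
    }
    void choose() { buffer += charset[index]; }   // "choose character" button
    void erase()  { if (!buffer.empty()) buffer.pop_back(); }   // "delete last character"
    std::string enter() {                         // "enter text": hand over and reset
        std::string s = buffer;
        buffer.clear();
        return s;
    }
};
```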

The layout of the tracking regions for the realised text input interface is depicted in figure 5.3. Using an external keyboard for the same purpose would counter the motivation of keeping the informa hardware minimal, while incorporating a speech recognition system in order to input text as spoken tokens would be expensive both to develop (or otherwise acquire) and in terms of the computational demands it would impose during runtime.

Figure 5.3: Tracking regions layout for text input

cbgci enables the realisation of a text input environment, with the following positive side effects:

- No additional hardware is required.

- No extra computational burden is introduced.

- Flexible and modular: the available symbols and characters can be altered, complemented and reorganised according to needs, either general (such as deployment in different languages) or particular and function dependent (e.g. offering only numeric and arithmetic symbols for a calculator service).

- The interface follows the same principles as the other control schemes realised using cbgci. Thus the user is not required to adapt to a different "feel" when using the various informa functions.

5.4 Navigation and output control

The alternative, rudimentary version of the cbgci technology described in section 4.3 was used to realise a basic control scheme for the informa platform. Applying this technology requires the area designated for gesture control to be free of all other foreground objects save for the user's finger during a tracking session. As a consequence, registering many tracking areas for use with this technology would significantly hinder the functionality of informa as a reading device, as the regions used for tracking could not be used for placing documents. It is therefore necessary to alter the UI concept, so that the portion of the surface of the informa device which is reserved for gesture control can be reduced to a minimum.

Figure 5.4: A minimal layout of tracking regions for combined control of navigation and output parameters, using the alternative cbgci algorithm

The choice of services and functions is trimmed down, and the informa menu is reorganised to consist of a single depth level. It is then possible to navigate through the menu and control the output parameters of the platform by gesturing with two fingers over a single strip along the side of the device nearest the user. As can be seen in figure 5.4, the strip is divided into two adjacent tracking regions: the left portion is used to switch between the different controls, while the right portion is used to determine the relevant value. The interpreting intermediate client stands by to register a choice once two fingers are present, one within the left and one within the right tracking region. The choice is made and registered when the user draws both her fingers back.

The design reduces the flexibility of the interface, primarily in regard to the menu control: extensions of the choice of services and functions, which could be carried out transparently if the region were declared to deliver relative output, must involve reprogramming when it is declared as gridded. On the other hand, applying protrusions or grooves to the surface of the device in correspondence with the boundaries of the declared grid cells could increase the ease of use of the interface.


Chapter 6

Evaluation

The cbgci API enables the development of a gesture control interface for the informa platform. This interface emulates the current mode of operation of the platform, consisting of navigation within menus and between menu levels, articulation of choices and control over audio output parameters. Since the concept of virtual menu trees is flexible and expandable in itself (compare section 4.1.1 on page 37), cbgci thereby maintains the flexibility of the informa client unit, supporting the transparency of any maintenance and expansion of the menu structure while ensuring robust and seamless usability.

Each of the systems that were reviewed in chapter 3 introduced a vocabulary of gestures. The systems have to detect the performance of a gesture and correctly recognise it within their respective vocabulary before they can interpret it and deliver or execute the corresponding command. Elaborate recognition algorithms were implemented in each of the systems in order to fulfil these tasks. At the same time, the flexibility of these concepts is held at bay by the explicitly defined set of recognisable gestures: this set being exclusive, any modification or extension of the gesture vocabulary, and with it of the command vocabulary, dictates an explicit adaptation of the recognition algorithm. Furthermore, depending on the features that are used to distinguish the gestures, it is conceivable that some systems might very quickly exhaust their possibilities of extending their gesture vocabulary. This is presumably the case with the solution reviewed in section 3.3: the distinction criteria between the start, move and stop gestures are tailor-made and leave scarce room for additional, well-distinguished definitions of new gesture classes.

In relation to Porta's definition of a vision-based computer interface ([26, p. 30], see quote on page 27), these reviewed systems may be considered attempts to draw nearer to an understanding of complex concepts expressed through explicit gestures: grasp and drop, as they are interpreted by the system discussed in section 3.1, various presentation-related commands in the solution presented in section 3.2, and the game controls in section 3.3.

From the same perspective, cbgci seems to "understand" less: the system interprets linear movements of the user's finger, in pre-defined regions within the field of view of the camera.


The "idea" or the "manipulative command" expressed by these gestures (as mentioned by Hassanpour [27, p. 2], see quote on page 27) relates to concepts within the controlled computer system rather than within the domain of the user's wishes and desires. A gesture does not imply in itself a distinct domain of commands within which it is valid. The meaning of a gesture is determined by the context within which it is performed (itself defined also by the region in which it is performed) and not by the way or manner of its performance. In this sense it may be argued that cbgci still demands that the user accommodate herself to the design of the controlled system and not vice versa. Nevertheless, cbgci's advantages lie in the reduction of its gesture vocabulary to a necessary minimum:

- Reducing to a minimum the set of extracted features that are necessary for detection, tracking and correct interpretation of performed gestures increases the system's stability and robustness.

- The UI is easier to learn: the user can concentrate on exploring the possibilities of the system she is operating rather than on training herself to perform the control gestures correctly.

- As the gestures gain their meaning from the context within which they are performed, the system could be integrated as a UI in any environment where controls can be expressed as triggers or parameter browsers.

Rather than an attempt to make the computer "understand" gestures that might be considered intuitive in a certain context, cbgci enables the user to intuitively manipulate general prototypes of control modules (sliders with either discrete or continuous value ranges, endless or cyclic controls, as could be realised with tracking regions with relative output, and buttons) that may in turn be installed in different contexts and configurations.

All the control scenarios described in chapter 5 were implemented, tested, and proved fully functional. Informal laboratory tryouts showed that cbgci performs well with different users and lighting conditions. A field test with blind and visually impaired participants has yet to take place. Nevertheless, as the system merely emulates control concepts and mechanisms that are familiar from many other home appliances, and does not require its users to learn and practise specific gestures, it is conceivable that an informa interface realised with cbgci would become transparent to its users after initial guidance. Following a field test of the second informa prototype, which was carried out with thirty blind and visually impaired users between April and October 2010, no notable criticism was received from the test participants concerning the orientation within the menu structure or the use of the numeric control pad in itself. Navigation within the menu tree with a cbgci emulation of the functionality of the numeric pad should not lead to a different or negative user experience.


Beside the declared functional motivation, the informa hardware was a major aspect that influenced the development of cbgci. This influence is twofold:

1. The behaviour and capabilities of the web camera in use determine the quality of the input image the system has at its disposal.

2. Processing speed and memory capacity are limited by the installed hardware.

While a deciding factor for the quality of the OCR output, as manifested through better character-recognition rates, higher image resolution proved to have no influence on the detection and tracking performance. cbgci was developed with an input image size of 320 × 240 pixels, independent of the maximum resolution available from the different web cameras that were tested*.

Software control of the different camera parameters on device initialisation and during runtime is an important issue, which dictated some aspects of the cbgci design. Many web cameras, including the models which were tested for installation on the informa platform and during the development of cbgci, are equipped with automatic WB and exposure-time optimising mechanisms. These features usually contribute positively to user experience in common home and office applications, as they aim at maintaining an optimal distribution of brightness and colour values within the respective ranges of the image data. At the same time, and as mentioned in section 4.3, the continuous operation of these features may cause sudden variations in image values, both in overall brightness, caused by automatic exposure adaptation, and in chromaticity, as a result of automatic WB, which in turn undermine the proper function of the brightness- and colour-dependent portions of the cbgci algorithm.

The entire informa project, including cbgci, runs within a Linux distribution. Web camera integration, control and addressing is accomplished through the "video for Linux" (v4l) interface with the available driver configuration for the specific camera model. These configurations are not standardised: while v4l virtually enables control over all possible device features, driver implementations for specific cameras do not necessarily grant access to all of them. This is the case with some of the web camera models whose drivers do not allow the software user to control and turn off the automatic WB and exposure features, thus forcing the client system, cbgci in this case, to cope with the resulting unstable image stream. The alternative algorithmic approach discussed in section 4.3 was developed in order to still be able to offer some of the cbgci functionality under such circumstances.

Another problem caused by the use of the v4l interface concerns the image size which the camera should deliver: the resolution is set at the time of device initialisation, and cannot be altered without freeing the device's handle and terminating the calling program.

*The cameras that were used for development and testing are: Logitech Quickcam Pro 9000, HP Elite Autofocus, and Logitech HD 1080p.


While an image resolution of 320 × 240 is sufficient for the proper operation of cbgci, the OCR functionality relies on the highest available resolution for optimal results. When running informa with cbgci, any occasional call of the reading function of the platform necessitates the termination and restart of the grab program* with a different resolution configuration for the camera. Terminating and restarting grab is a time-consuming process, mainly due to the release and re-initialisation of the camera, and thus a major disadvantage from the perspective of user experience.

The development of a camera within the informa project, which would have offered control over all its features during both initialisation and runtime, had to be abandoned for various reasons. Using informa with cbgci while still offering the platform's document-reading functionality is at this point only possible with the abovementioned deterioration of reaction time and user comfort.

cbgci relies on a certain level of light intensity to ensure its proper function. However, the system operates successfully under lighting conditions in which the OCR software used within informa already fails to fulfil its task. Separating colour information from light intensity values, relying on an FGBG model, or alternatively using relative differences of grey values as in the alternative approach, are the elements supporting this robustness. Additionally, OCR needs high-contrast images with optimal focus in order to deliver acceptable results, whereas cbgci is fairly indifferent to image contrast as long as colour constancy is maintained, which is the aim of the integrated WB mechanism.

Nevertheless, the informa field test highlighted the fact that many blind and visually impaired users do not maintain high illumination levels in their homes, if any at all, as they personally do not require light for the accomplishment of their daily tasks. As a consequence, an on-board light source for the illumination of the surface of the device will be an integral part of the design of the next informa prototype. This would ensure stable and reliable operation of the platform, including that of the cbgci control interface, at all times, and without the user having to change her habits.

cbgci was tested on the second prototype of the informa device: a computer with an Intel Atom CPU with a clock rate of 1.6 GHz, 512 MB cache and 1 GB of RAM, running a Linux distribution. The grab program, encapsulating a cbgci tracking manager with the tracking region layout described in section 5.3, consumes 7 MB of memory with a 43% CPU load when operating on a stream of 320 × 240 pixel images at 10 frames per second. This allows for smooth operation, mainly because the informa platform does not run any other demanding tasks: the entire content processing is done by the server, save for the playback of the remotely generated audio output. When employing the alternative approach of the algorithm with the tracking region layout described in section 5.4, memory consumption is reduced to 6 MB at a 14% CPU load.

Nevertheless, optimisation of the computational costs is desired, especially for the deployment of cbgci within the UI of other, computationally more intensive environments.

*grab controls image acquisition and administrates the cbgci tracking manager as well as any intermediate client involved in the interpretation of the control data; see chapter 5 on page 53.


A dedicated camera, which could take over the WB task from the cbgci algorithm and at the same time provide it with the YCrCb differential colour representation required for its proper operation, could relieve the main CPU of that processing stage.


Chapter 7

Conclusion

Initially motivated as a proof of concept, cbgci in its current state enables the realisation of a comprehensive solution for the user interface of the informa platform. Relying on a video stream from a common web camera, the entire cbgci functionality is realised via two basic structures:

Tracking region: The functional element, which performs detection and tracking of the user's finger within a pre-defined region of the input frame. The tracking region can be configured to deliver its output in a variety of formats, and is addressable through a unique name.

Tracking manager: The control element, used by client software to exchange data with the cbgci system. The manager passes input video frames from the client to the tracking regions registered with it and consequently provides the client software with the resulting tracking data.

Maintaining a tracking manager within a main image-acquisition loop equips client software with a toolbox for the modular creation of UI setups: virtual sliders and trigger buttons can be created and destroyed (i.e. tracking regions registered and de-registered with the manager) on demand and configured to deliver the appropriate output formats (absolute, relative or gridded); combining these elements into control interfaces can meet the requirements of many different control scenarios.

All the scenarios that are required for controlling the current informa design were successfully realised, using tracking regions in adequate configurations and layouts (see chapter 5). Moreover, a convenient interface for text input was created using cbgci (there, section 5.3). This can be seen as a demonstration of the advantages the system offers developers and users of other frameworks too: introducing new functions to an appliance or changing its concept of user interaction need not carry the introduction of new UI hardware in its wake, or the need to memorise new command setups for soft buttons. Instead, a suitable layout of virtual control elements could be created that would meet the new needs while being intuitive and thus easy to learn and use.


Neither new hardware nor training is required. The sole prerequisite for the use of cbgci is that the framework be equipped with a camera which is free to serve as an input device for control gestures.

cbgci performs well in terms of computational efficiency: satisfactory CPU loads were achieved on the informa device with a complex tracking region layout, and even better results were measured with the alternative algorithmic approach to the tracking and detection problem; in both cases a small memory footprint was maintained (see chapter 6 on page 62).

Owing to the integrated WB functionality, cbgci performs well in normal indoor lighting conditions and is tolerant enough of shadows and colour casts to enable smooth operation. The combination of an on-line updating FGBG model with general, ad-hoc colour-based filtering provides a simple and reliable mechanism for the detection and tracking of hands or fingers, and allows for a heterogeneous usage of the informa surface.

Future work

Introducing further technical refinements, incorporating additional image processing and feature recognition methods, and using cbgci in different contexts could form a body of future work on the basis of the current algorithmic concept:

Preprocessing stage: In addition to the implemented WB functionality, the incorporation of shadow-suppressing mechanisms, e.g. as suggested by Amato et al. in [40], may increase the segmentation accuracy and in turn lead to better detection results.

Finger and hand characterisation: With suitable shape descriptors at its disposal, cbgci might maximise the information harvested from the contours of the segmented blobs. An optimal solution would support detection on the basis of a limited set of easily extracted features, such as high gradient changes at fingertips or characteristic curves between thumb and index finger.

Refined detection and tracking: Rich information from the preprocessing and feature extraction stages would allow the system to detect and distinguish, e.g., between one, two or several fingers, or even identify a thumb, an index and a middle finger*.

Multitouch: Being able to track different fingers within a single tracking region, or to identify a hand posture according to the orientation of the different fingers, would make it possible to emulate multi-touch control interfaces with cbgci.

*Realised in a monitored environment and on the basis of a simplified hand segmentation mechanism, the system described by Malik in [41] demonstrates similar ideas.


New control scenarios: Equipped with these capabilities, cbgci could become an interesting tool for the development of UI applications also for users without vision impairment. Possible control scenarios could be:

- Controlling the functions of a digital camera with gestures performed within its field of view.

- Operating computer systems without the need to touch either a keyboard or a screen: an interesting solution for commercial applications such as ATMs or check-in terminals, as well as in the context of sterile environments such as operating theatres.

- Computer and video gaming.

- Controlling augmented reality glasses: with a gesture that overlaps with the appropriate region in the projection on the lens, the user could make a choice or control a function.

- etc.

Relying on a simple and robust set of features for its detection and tracking algorithm, cbgci provides a versatile toolbox for the realisation of flexible, extensible and user-friendly control interfaces for devices equipped with cameras. With this solid basis, the abovementioned suggestions for future development are surely but some of the possible paths along which both cbgci and its applications could thrive.


Bibliography

[1] Y. S. Choi, C. D. Anderson, J. D. Glass, and C. C. Kemp, "Laser Pointers and a Touch Screen: Intuitive Interfaces for Autonomous Mobile Manipulation for the Motor Impaired," in ASSETS'08, pp. 225–232, 2008.

[2] Anonymous Authors, "Colour space." Website, 2.VI.2011. http://en.wikipedia.org/wiki/Colour_space.

[3] M. Tkalcic and J. F. Tasic, "Colour spaces - perceptual, historical and applicational background," in Proceedings of the IEEE Region 8 EUROCON 2003 - Computer as a Tool, pp. 304–308, 2003.

[4] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Third Edition. Pearson Prentice Hall, 2008.

[5] C. Poynton, "Frequently Asked Questions about Color." Website. http://www.poynton.com/.

[6] Open source community, "OpenCV 2.2 C++ Reference." Website. http://opencv.jp/opencv-2.2_org/cpp/.

[7] Y. Kim, J.-S. Lee, A. Morales, and S.-J. Ko, "A video camera system with enhanced zoom tracking and auto white balance," IEEE Transactions on Consumer Electronics, vol. 48, no. 3, pp. 428–434, 2007.

[8] S. Bianco, F. Gasparini, and R. Schettini, "Combining strategies for white balance," in Proc. Digital Photography III, IS&T/SPIE Symposium on Electronic Imaging, 2007.

[9] M. Pilu and S. Pollard, "A light-weight text image processing method for handheld embedded cameras," in BMVC'02, 2002.

[10] R. Kimmel, M. Elad, D. Shaked, R. Keshet, and I. Sobel, "A Variational Framework for Retinex," International Journal of Computer Vision, vol. 52, no. 1, pp. 7–23, 2003.


[11] E. H. Land, "The Retinex Theory of Color Vision," Scientific American, vol. 237, pp. 108–128, December 1977.

[12] G. Bradski and A. Kaehler, Learning OpenCV. O'Reilly Media, Inc., 2008.

[13] P. KaewTraKulPong and R. Bowden, "An Improved Adaptive Background Mixture Model for Realtime Tracking with Shadow Detection," in 2nd European Workshop on Advanced Video Based Surveillance Systems, AVBS01, Kluwer Academic Publishers, September 2001.

[14] F. Porikli and O. Tuzel, "Human Body Tracking by Adaptive Background Models and Mean-Shift Analysis," in IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, March 2003.

[15] T. P. Chen, H. Haussecker, A. Bovyrin, R. Belenov, K. Rodyushkin, A. Kuranov, and V. Erushimov, "Computer Vision Workload Analysis: Case Study of Video Surveillance Systems," Intel Technology Journal, vol. 9, pp. 109–118, May 2005.

[16] Open source community, "OpenCV Video Surveillance / Blob Tracker Facility." Website. http://opencv.willowgarage.com/wiki/VideoSurveillance.

[17] R. A. Fisher, Sir, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, vol. 7, pp. 179–188, 1936.

[18] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher Discriminant Analysis With Kernels." Website, 1999. http://courses.cs.tamu.edu/rgutier/cpsc689_f08/mika1999kernelLDA.pdf.

[19] G. Welch and G. Bishop, "An Introduction to the Kalman Filter," tech. rep., University of North Carolina, Chapel Hill, 1995, revised 2006. TR95-041.

[20] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME-Journal of Basic Engineering, no. 82 (Series D), pp. 35–45, 1960.

[21] C. Soanes and A. Stevenson, eds., Oxford Dictionary of English. Oxford University Press, second, revised ed., 2005.

[22] D. H. Douglas and T. K. Peucker, "Algorithms for the reduction of the number of points required to represent a digitized line or its caricature," Cartographica, vol. 10, no. 2, pp. 112–122, 1973.

[23] Anonymous Authors, "Colour space." Website, 9.VI.2011. http://upload.wikimedia.org/wikipedia/commons/9/91/Douglas_Peucker.png.


[24] R. L. Graham, "An efficient algorithm for determining the convex hull of a finite planar set," in Information Processing Letters, no. 1, pp. 132–133, North-Holland Publishing Company, 1972.

[25] Anonymous Authors, "Graham scan." Website, 13.VI.2011. http://en.wikipedia.org/wiki/File:Graham_Scan.svg.

[26] M. Porta, "Vision-based user interfaces: methods and applications," International Journal of Human-Computer Studies, vol. 57, pp. 27–73, July 2002.

[27] R. Hassanpour, S. Wong, and A. Shahbahrami, "Vision-Based Hand Gesture Recognition for Human Computer Interaction: A Review," in IADIS International Conference Interfaces and Human Computer Interaction, 2008.

[28] X. Zabulis, H. Baltzakis, and A. Argyros, "Vision-based Hand Gesture Recognition for Human-Computer Interaction." Website. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.159.2565&rep=rep1&type=pdf.

[29] PrimeSense company website. http://www.primesense.com/.

[30] M. Schramm, joystiq, website, June 19, 2010. http://www.joystiq.com/2010/06/19/kinect-how-it-works-from-the-company-behind-the-tech/.

[31] "Kinect for Windows SDK." Website. http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk/about.aspx.

[32] "Kinect for Windows SDK beta launches, wants PC users to get a move on." engadget, website. http://www.engadget.com/2011/06/16/microsoft-launches-kinect-for-windows-sdk-beta-wants-pc-users-t/.

[33] M. Tang, "Recognizing Hand Gestures with Microsoft's Kinect." Website. http://www.stanford.edu/class/ee368/Project_11/Reports/Tang_Hand_Gesture_Recognition.pdf.

[34] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.

[35] A. Licsar and T. Sziranyi, "Hand Gesture Recognition in Camera-Projector System," in Computer Vision in Human-Computer Interaction (N. Sebe, M. Lew, and T. Huang, eds.), vol. 3058 of Lecture Notes in Computer Science, pp. 83–93, Springer Berlin / Heidelberg, 2004.

[36] J. De Vylder and W. Philips, "Improved Fourier descriptors for 2-d shape representation,"


[37] C. Manresa, J. Varona, R. Mas, and F. J. Perales, "Hand Tracking and Gesture Recognition for Human-Computer Interaction," Electronic Letters on Computer Vision and Image Analysis, vol. 5, no. 3, pp. 96–104, August 2005.

[38] "symplektikon." Website. http://www.symplektikon.de/.

[39] A. Blake and M. Isard, "Active Contours." Website, 1998. http://research.microsoft.com/en-us/um/people/ablake/contours/.

[40] A. Amato, M. G. Mozerov, A. D. Bagdanov, and J. Gonzalez, "Accurate moving cast shadow suppression based on local color constancy detection," IEEE Transactions on Image Processing, vol. 20, no. 20, October 2011 (in press).

[41] S. Malik, "Real-time Hand Tracking and Finger Tracking for Interaction." Website, 2003. http://www.cs.toronto.edu/~smalik/downloads/2503_project_report.pdf.