
COMBINING OBJECT DETECTION AND BRAIN COMPUTER INTERFACING: A NEW WAY OF SUBJECT-ENVIRONMENT INTERACTION

Arne Robben, Nikolay Chumerin, Nikolay V. Manyakov, Adrien Combaz, Marijn van Vliet and Marc M. Van Hulle

K.U.Leuven, Laboratorium voor Neuro- en Psychofysiologie, Campus Gasthuisberg, Herestraat 49, B-3000 Leuven, Belgium

ABSTRACT

In this paper we present an application that combines two research topics: object detection and brain-computer interfacing. The goal is, by combining these disciplines, to construct an alternative medium for a person to interact with his or her environment. In contrast to conventional modes of interaction, our application does not depend on the subject's motor control. As such, it is particularly useful for patients suffering from severe motor impairments, allowing them to engage with their environment at a higher level.

1. INTRODUCTION

The application proposed in this paper connects the domain of object detection with the research topic of Brain-Computer Interfacing (BCI) in order to construct an alternative way for a subject to interact with their surroundings. The algorithm behind our application starts from a digital picture, taken in the subject's environment, containing objects with a certain functionality that the subject might want to select. The subject-environment interaction is then established by a two-fold strategy: first, these functional objects are detected in the image and presented to the user; second, the subject is enabled to select an object of his or her choice. The detection of objects in the picture is accomplished by object detectors trained on these specific objects. The selection process is accomplished by an EEG-based Brain-Computer Interface (BCI).

AR and AC are supported by a specialization grant from the Agentschap voor Innovatie door Wetenschap en Technologie (IWT, Flemish Agency for Innovation through Science and Technology). NC is supported by the European Commission (IST-2007-217077). NVM is supported by the Flemish Regional Ministry of Education (Belgium) (GOA 10/019). MMVH is supported by research grants received from the Excellence Financing program (EF 2005) and the CREA Financing program (CREA/07/027) of the K.U.Leuven, the Belgian Fund for Scientific Research - Flanders (G.0588.09), the Interuniversity Attraction Poles Programme - Belgian Science Policy (IUAP P6/054), the Flemish Regional Ministry of Education (Belgium) (GOA 10/019), the European Commission (STREP-2002-016276, IST-2004-027017, and IST-2007-217077), and the SWIFT prize of the King Baudouin Foundation of Belgium.

Contrary to more common selection procedures, the ability of motor control is not a necessity for our application, because BCIs directly measure brain activity and establish a communication pathway between the brain and a computer, bypassing the need for muscular activity. Because of this feature, BCIs can significantly improve the quality of life of patients suffering from impairments such as amyotrophic lateral sclerosis, stroke (CVA), brain/spinal cord injury, and multiple sclerosis.

Object detection is a challenging branch in the domain of computer vision. The task is, given an image, to conclude on the presence of a specified object and, if present, to determine its location in the image. Research on detectors for faces, cars, motorcycles, pedestrians, road signs and many more has already led to successful and reliable applications [1, 2, 3, 4]. Most techniques are based on the matching of local features of an image to a database of features (a codebook) derived from a training set of images of an object. Among these local features are grayscale patches, Haar-like features, local shape contexts, SIFT, SURF, etc. [2, 5, 6, 7].

For this application we made codebooks of Scale-Invariant Feature Transform (SIFT) descriptors. These local features are 128-dimensional vectors, proposed by David Lowe in 1999, constructed in such a way that they are invariant to scale and rotation [6]. They are also claimed to be partially invariant to substantial affine distortion, change in 3D viewpoint, addition of noise, and change in illumination [8]. We constructed three codebooks: one for a coffee thermos, one for a cup with an apple print and one for a white CRT monitor (see Fig. 1), objects which might appear in everyday environments. Our methods are, however, similarly applicable to any choice of object.
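As an illustration of how such a codebook can be assembled, the sketch below extracts SIFT descriptors from training views with OpenCV and stacks them; this is a minimal sketch using an off-the-shelf SIFT implementation rather than the hand-built pipeline of Section 2.1, and the file names and downscaling width are assumptions.

```python
# Minimal sketch: build a SIFT codebook from training views with OpenCV.
# File names and the target width are illustrative assumptions;
# cv2.SIFT_create requires a reasonably recent OpenCV.
import cv2
import numpy as np

def extract_sift(image_path, target_width):
    img = cv2.imread(image_path)
    scale = target_width / img.shape[1]
    img = cv2.resize(img, None, fx=scale, fy=scale)      # downscale first (Sec. 2.1)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()                             # 128-dimensional descriptors
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors

# A codebook is simply the stacked descriptors of all training views.
views = ["thermos_view01.jpg", "thermos_view02.jpg"]     # hypothetical file names
codebook = np.vstack([extract_sift(p, 300)[1] for p in views])
```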

Our application works with electroencephalograms (EEG) recorded from the scalp while a subject focuses on visual stimuli on a computer screen. If such visual stimuli flicker at sufficiently high rates (≥ 6 Hz), the individual transient visual responses in the EEG overlap, resulting in a steady-state signal, observable mostly over the occipital area [9]. This paradigm is called the steady-state visual evoked potential (SSVEP).


Fig. 1. The cup, thermos and monitor for which codebooks were constructed.

Not only can the stimulus frequency f be recovered from the signal; its harmonics 2f and 3f are often embedded in it as well.

When there are multiple targets, as in our case, the strategy is to let the corresponding stimuli flicker at different frequencies. Intuitively, a search for prominent frequencies in the power spectral density (PSD) of the EEG signal would then be sufficient to decide which frequency the subject was focusing on (as illustrated in Fig. 2). It is, however, often not that easy. One problem is that the amplitude of a typical EEG spectrum is inversely proportional to frequency, which hinders the search for prominent peaks. The major problem, though, is due to the nature of EEG recordings: a lot of noise and other ongoing brain activity are present in the signal. Standard techniques for dealing with this problem are to record over a long time interval, to average over several time intervals, or to make use of preliminary training. For our application we use a different approach, inspired by the method proposed in [10], which does not require a preliminary training stage.

Fig. 2. Standard SSVEP decoding approach: (A) a subject looks at Target 1, flickering with frequency f1; (B) noisy EEG signals are recorded; (C) taking the Fourier transform over a sufficiently large window shows dominant peaks at f1, 2f1 and 3f1.

Two main results of our studies will be treated: the performance of our object detection system and the performance of the SSVEP decoder. A study with 7 subjects was carried out to estimate the accuracy of the SSVEP classifier.
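To make the naive approach of Fig. 2 concrete, the following sketch scores each candidate frequency by the Welch PSD at its fundamental and first two harmonics and picks the maximum. It illustrates the intuition only, not the decoder of Section 3; the Welch parameters and function names are our own assumptions.

```python
# Sketch of the naive PSD approach from Fig. 2 (not the method of Section 3).
import numpy as np
from scipy.signal import welch

def psd_decode(eeg, fs, target_freqs):
    """eeg: 1-D occipital EEG segment; fs: sampling rate in Hz."""
    freqs, psd = welch(eeg, fs=fs, nperseg=int(2 * fs))
    scores = []
    for f in target_freqs:
        # Sum the PSD at the fundamental and its first two harmonics.
        score = sum(psd[np.argmin(np.abs(freqs - h * f))] for h in (1, 2, 3))
        scores.append(score)
    return target_freqs[int(np.argmax(scores))]
```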

2. METHODS: OBJECT DETECTION

2.1. Acquiring SIFT features

Our object detector is mainly based on matching new SIFT features, found in a scene image, to SIFT features previously extracted from pictures of the cup, thermos and monitor under different viewpoints and stored in a codebook.

Acquiring SIFT features from an image starts with converting the image to grayscale. The next step is the detection of interest points (or keypoints) in the image: salient points with rich local information. To obtain these interest points, an image pyramid is constructed by stacking layers of images. The bottom of this pyramid is the convolution of the image with a Gaussian kernel with isotropic covariance σ²I, where σ = 1.6. Each subsequent layer is obtained by smoothing the previous one with a Gaussian kernel such that the effective standard deviation grows by a factor of 2^(1/s) per layer. The pyramid is thus built up incrementally, layer by layer, until, after s layers, the original standard deviation is doubled. This set of layers is called an octave; in our implementation we use s = 3. The top image of the octave is then downsampled by taking every second pixel in each row and column, and this downsampled image forms the bottom layer of a new octave, which is built up as before. In this way an image pyramid is constructed.

For each octave a different set of images is now computed, called the difference-of-Gaussians: each is the difference between a layer and the layer underneath it (except for the bottom layer in the octave), so every octave yields s − 1 difference-of-Gaussian images. Interest points (keypoints) are then defined as local minima/maxima in these difference-of-Gaussians over all layers.
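A minimal sketch of one octave of this construction, under the stated parameters (σ = 1.6, s = 3) and assuming scipy's gaussian_filter for the convolutions:

```python
# Sketch: one octave of the Gaussian pyramid and its difference-of-Gaussians.
import numpy as np
from scipy.ndimage import gaussian_filter

def build_octave(image, sigma=1.6, s=3):
    layers = [gaussian_filter(image, sigma)]
    k = 2.0 ** (1.0 / s)                        # per-layer scale factor
    for i in range(1, s + 1):
        # Incremental blur so that layer i has total std sigma * k**i
        # (Gaussian standard deviations compose in quadrature).
        total = sigma * k ** i
        prev = sigma * k ** (i - 1)
        inc = np.sqrt(total ** 2 - prev ** 2)
        layers.append(gaussian_filter(layers[-1], inc))
    dogs = [layers[i + 1] - layers[i] for i in range(len(layers) - 1)]
    next_base = layers[-1][::2, ::2]            # every 2nd pixel: next octave's base
    return layers, dogs, next_base
```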

Next, for each interest point, a gradient orientation and magnitude are computed based on local grayscale image properties of the difference-of-Gaussians layer where the interest point was found. This is done not only for the interest point but also for every pixel in a rectangular neighborhood around the keypoint. The gradient magnitudes of these pixels are weighted by a two-dimensional Gaussian distribution with isotropic variance, centered on the interest point, so as to give more importance to nearby pixels than to pixels farther from the keypoint. To achieve orientation invariance, the gradient orientations of these neighboring pixels are rotated relative to the orientation of the interest point.

The rectangular neighborhood is then divided into 4 × 4 subregions.


Fig. 3. Construction of codebooks and relative keypoint locations from training images at different scales.

Each subregion is assigned a histogram of 8 orientation bins, covering the 360-degree range of orientations. The gradient orientations of the pixels in these subregions are then accumulated into these bins by summing the weighted gradient magnitudes. In the end there are 8 × 16 = 128 orientation bins, each containing a sum of gradient magnitudes. These orientation bins make up the 128-dimensional descriptor. By normalizing the descriptors to unit length, the effects of illumination change are said to be reduced (see [8]).
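The descriptor assembly can be sketched as follows, assuming the Gaussian-weighted gradient magnitudes and the rotated orientations of a 16 × 16 neighborhood have already been computed; the patch size and function name are our assumptions.

```python
# Sketch: turn a 16x16 gradient patch around a keypoint into a 128-dimensional
# descriptor (4x4 subregions x 8 orientation bins).
import numpy as np

def descriptor_from_patch(magnitudes, orientations):
    """magnitudes, orientations: 16x16 arrays; orientations in [0, 2*pi)."""
    desc = np.zeros((4, 4, 8))
    bins = (orientations / (2 * np.pi) * 8).astype(int) % 8
    for y in range(16):
        for x in range(16):
            # Accumulate the weighted magnitude into the subregion's histogram.
            desc[y // 4, x // 4, bins[y, x]] += magnitudes[y, x]
    desc = desc.ravel()
    return desc / max(np.linalg.norm(desc), 1e-12)   # unit length (Section 2.1)
```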

The images used in this study were all taken with a digital camera and have a resolution of 3072 × 2304 pixels each. In such large images many small gradients are detected, even if these gradients are due to small texture variations, reflected light or any form of unwanted noise. What we really want are keypoints found at the same scale as the objects in the image: on the contours of objects, on logos, on handles, etc. This is why images, both for training and detection, are first downscaled appropriately. Good performance was obtained when images were initially downsampled to 300 × 225, 1000 × 750 and 700 × 527 pixels for the thermos, cup and monitor respectively.

2.2. Training a detector

For training, around 15 pictures were taken of each object from different viewpoints. The objects were manually segmented from the background and downscaled as mentioned earlier, so each training image only contains information related to the object. Although we also downscale new scene images according to the object that is sought, it is not unlikely that the object will appear smaller in these scene images than in the training images. Even though the SIFT descriptors are scale-invariant, we do not want, as mentioned before, interest points arising from too-small image gradients.

Fig. 4. Detecting the thermos: (A) SIFT features are extracted and matched against the codebooks, (B) votes for the center of the thermos are cast, (C) the maximal vote is selected as our guess for the object location.

Therefore we define a set α of reasonable scales at which the object might occur in a scene image. A scale in this sense is defined as the height of the object in a scene divided by the height of the scene image. For the thermos, for example, α is a set of 8 scales ranging from 26% to 40%. For each scale α and for all training images, SIFT descriptors are computed and stored in a codebook Cα. Finally, the center c of each object in each training image is manually marked, and the location loc_{k,α} of each keypoint k in the image at scale α, relative to c, is stored (as illustrated in Fig. 3).
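In Python, the training stage then reduces to the following sketch, where training_images and the helper sift_at_scale are hypothetical placeholders for the data and pipeline described above.

```python
# Sketch of the training stage of Section 2.2. Assumed context:
# training_images is a list of (image, center) pairs, each a segmented training
# view with the manually marked object center c; sift_at_scale is a
# hypothetical helper returning keypoints and descriptors at scale alpha.
import numpy as np

alphas = [0.26 + 0.02 * i for i in range(8)]      # e.g. the thermos: 26%..40%
codebooks = {}                                    # alpha -> [(descriptor, loc_{k,alpha})]
for alpha in alphas:
    entries = []
    for img, center in training_images:
        kps, descs = sift_at_scale(img, alpha)
        for kp, d in zip(kps, descs):
            entries.append((d, np.array(kp.pt) - center))   # location relative to c
    codebooks[alpha] = entries
```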

2.3. Detecting an object in a scene

Once the codebooks Cα and the relative keypoint locations loc_{k,α} have been computed for an object, the recognition of this object in a scene image can begin. First a scene image is loaded and rescaled as explained at the end of Section 2.1. After converting the resulting image to grayscale, SIFT descriptors are computed and matched, for each α, against the descriptors in the codebook Cα (matching in the sense of [8]). In order to recognize an object in the scene, a voting space Vα is created for each scale α; this is a zero-valued matrix with the same size as the image. We now fill these voting spaces with votes for the center of the object in the scene. The idea is to eventually compare all votes over all scales, find the maximum vote and compare it to a user-defined threshold. In this way we are able to conclude on both the presence and the location of the object.

When the descriptor of a keypoint k in the scene matches the descriptor of a keypoint k′ in Cα, a vote can be cast for the center of the object in Vα: we simply subtract from the position of k the relative position to the center that we stored for k′, so the vote goes to the location k − loc_{k′,α}. To incorporate the uncertainty of this vote, we add a two-dimensional Gaussian distribution N(k − loc_{k′,α}, σ²I) to Vα, with σ around 0.10 times the height of the downscaled scene image.

All keypoints in the scene image are iteratively processed in this manner (see Fig. 4) and votes are accumulated in Vα. Finally the maximal vote over all scales α is used as our guess for the location of the object, and our application draws a colored dot on the object. The value of this maximum can be treated as a confidence measure for the guess, especially when it is normalized with respect to Vα. By comparing this normalized maximum with a user-defined threshold t, the detector is able to decide on the presence of the object in the scene image.
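A sketch of this voting procedure follows; the descriptor matcher match_descriptor (implementing the ratio test of [8]) is left as a hypothetical helper, and dividing the peak by the total vote mass is one plausible reading of the normalization mentioned above.

```python
# Sketch of the voting stage of Section 2.3.
import numpy as np
from scipy.ndimage import gaussian_filter

def detect(scene_kps, scene_descs, codebooks, image_shape, t):
    """Returns (confidence, center); center is None if confidence <= t."""
    best_conf, best_center = 0.0, None
    for alpha, entries in codebooks.items():
        votes = np.zeros(image_shape[:2])          # the voting space V_alpha
        for kp, d in zip(scene_kps, scene_descs):
            rel = match_descriptor(d, entries)     # hypothetical matcher; returns
            if rel is None:                        # stored loc_{k',alpha} or None
                continue
            cx, cy = np.round(np.array(kp.pt) - rel).astype(int)  # vote at k - loc
            if 0 <= cy < votes.shape[0] and 0 <= cx < votes.shape[1]:
                votes[cy, cx] += 1.0
        # Blurring the vote map realizes the Gaussian uncertainty N(., sigma^2 I).
        votes = gaussian_filter(votes, sigma=0.10 * image_shape[0])
        conf = votes.max() / max(votes.sum(), 1e-12)   # normalized confidence
        if conf > best_conf:
            best_conf = conf
            best_center = np.unravel_index(votes.argmax(), votes.shape)
    return best_conf, (best_center if best_conf > t else None)
```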

3. METHODS: SSVEP DECODING

3.1. EEG Data acquisition and filtering

The EEG recordings were performed using a prototype of an ultra-low-power 8-channel wireless EEG system, which consists of an amplifier coupled with a wireless transmitter and a USB stick receiver, developed by Imec¹ [11]. The data is transmitted with a sampling frequency of 1000 Hz for each channel. We used a brain cap with large filling holes and sockets for active Ag/AgCl electrodes (ActiCap, Brain Products). The recordings were made with eight electrodes located over the occipital pole, at positions P3, Pz, P4, PO9, O1, Oz, O2, PO10, according to the international 10-20 system. The reference electrode and ground were placed on the left and right mastoid, respectively.

The raw EEG signals are high-pass filtered with a cutoff frequency of 3 Hz, using a fourth-order zero-phase digital Butterworth filter, so as to remove the DC component and low-frequency drifts. A notch filter is also applied to remove the 50 Hz powerline interference.
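With scipy, this preprocessing can be sketched as follows; the notch quality factor is an assumed value, the rest follows the text.

```python
# Sketch of the preprocessing of Section 3.1: 4th-order zero-phase Butterworth
# high-pass at 3 Hz plus a 50 Hz notch.
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

fs = 1000.0                                    # sampling rate (Section 3.1)
b_hp, a_hp = butter(4, 3.0, btype="highpass", fs=fs)
b_n, a_n = iirnotch(50.0, Q=30.0, fs=fs)       # Q = 30 is an assumed value

def preprocess(eeg):
    """eeg: (T, 8) array of raw EEG samples."""
    x = filtfilt(b_hp, a_hp, eeg, axis=0)      # filtfilt gives zero phase
    return filtfilt(b_n, a_n, x, axis=0)
```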

3.2. Experiment design

To assess the accuracy of our SSVEP classifier, an experiment was set up. Seven healthy subjects (6 male, 1 female, aged 22-33, mean age 25.6, 6 left handed, 1 right handed) participated in the experiment. Because our application was trained on 3 objects, 3 flickering dots were presented to the subject, each flickering at a different frequency f1, f2 or f3. The visualization of the stimuli was implemented via the Psychophysics Toolbox Extensions for MATLAB [12], as illustrated in Fig. 5.

¹ Interuniversity Microelectronics Centre

The subjects were asked to focus on each dot for 10 seconds; after a pause, this task was repeated once more. The choice of stimulation frequencies is, in terms of accuracy, very subject-dependent. For this reason the experiment was performed 3 times for each subject: the first time with low frequencies 6, 8 and 10 Hz, the second time with middle frequencies 10, 12 and 15 Hz, and the last time with high frequencies 18, 20 and 25 Hz. The accuracy was computed for all 3 combinations as a function of the length of the EEG recordings, and the results of the best combination are reported in Section 4.

Fig. 5. After detecting the objects in the scene image (A), the dots are enlarged and positioned as in (B); now the actual stimulation can begin.

3.3. Spatial filtering and classification

We implemented a spatial filter called the Minimum Energy Combination, as described in [10]: a linear combination of the signals from our 8 channels is sought which decreases the level of noise at the frequencies of interest, i.e., the target frequencies f1, f2, f3 and their harmonics 2f1, 3f1, 2f2, 3f2, 2f3 and 3f3. This can be done in two steps. First, consider the (T × 8) matrix X containing the recorded EEG data, with the T samples of each channel in its columns. A (T × 18) matrix A is then constructed with the functions sin(2πh f_i t) and cos(2πh f_i t) in its columns, evaluated at the samples t ∈ {1, ..., T}, where h ∈ {1, 2, 3} indexes the harmonics of f_i, i ∈ {1, 2, 3}. By multiplying X with the (T × T) projection matrix P_A = A(AᵀA)⁻¹Aᵀ and subtracting the result from X, the matrix X̃ = X − P_A X is obtained: an 8-channel signal like X but without the information at the target frequencies and their harmonics. X̃ can be considered as the components of the original signal which are not related to the visual stimulation.
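A sketch of this first step; note that P_A need not be formed explicitly as a T × T matrix.

```python
# Sketch of step 1 of the Minimum Energy Combination (Section 3.3): build the
# (T x 18) reference matrix A and project the stimulation frequencies out of X.
import numpy as np

def remove_stimulus(X, freqs, fs):
    """X: (T, 8) EEG; freqs: (f1, f2, f3); returns X_tilde = X - P_A X."""
    T = X.shape[0]
    t = np.arange(1, T + 1) / fs               # sample times in seconds
    cols = []
    for f in freqs:
        for h in (1, 2, 3):                    # fundamental plus two harmonics
            cols.append(np.sin(2 * np.pi * h * f * t))
            cols.append(np.cos(2 * np.pi * h * f * t))
    A = np.column_stack(cols)                  # (T, 18)
    # P_A X = A (A^T A)^{-1} A^T X, applied without forming P_A itself.
    PAX = A @ np.linalg.solve(A.T @ A, A.T @ X)
    return X - PAX
```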

The second step is to find a linear combination of the columns of X which minimizes the variance of these non-interesting components. This can be obtained by performing a principal component analysis on X̃: the principal components correspond to the eigenvectors of the covariance² matrix Σ = E[X̃ᵀX̃].

² E[·] denotes the statistical expectation. By definition Σ = E[X̃ᵀX̃] − E[X̃ᵀ]E[X̃], but because of our filtering the data is centered around zero, so the last term vanishes.


The first principal component v1 points in the direction of maximum variance of the data, the second component v2 lies in the direction of maximum variance in the space orthogonal to v1, and so on. In this way, an orthogonal projection of the data onto the first principal components carries as much variance as possible to a lower-dimensional space. Here, the reverse is of interest: projecting X̃ (the matrix containing the non-interesting information) onto the last principal components yields a lower-dimensional signal like X̃ with much less variance. Multiplying X by the (8 × k) matrix V_k, whose columns are the k eigenvectors with the lowest eigenvalues, gives the linear combination of X which minimizes the variance of these non-interesting components; we write S = X V_k. We choose k as the maximal integer for which

\[
\frac{\sum_{i=8-k}^{8} \lambda_i}{\sum_{i=1}^{8} \lambda_i} < 0.1,
\]

where λ_i is the eigenvalue corresponding to the eigenvector v_i, i = 1, ..., 8.
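The second step then amounts to an eigendecomposition of the covariance matrix followed by the selection rule above; a sketch:

```python
# Sketch of step 2: PCA on X_tilde and selection of the smallest-variance
# directions whose eigenvalues sum to less than 10% of the total.
import numpy as np

def minimum_energy_weights(X_tilde):
    """Columns of the returned matrix are the k lowest-variance directions."""
    Sigma = X_tilde.T @ X_tilde              # covariance; data is already zero-mean
    lam, V = np.linalg.eigh(Sigma)           # eigh returns ascending eigenvalues
    k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), 0.1))
    k = max(k, 1)                            # keep at least one combination
    return V[:, :k]

# The spatially filtered signal is then S = X @ minimum_energy_weights(X_tilde).
```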

To classify the stimulation frequency, a test statistic T_i is calculated for each target frequency f_i, i ∈ {1, 2, 3}, making use of S. Formally,

\[
T_i = \sum_{h=1}^{3} \sum_{s=1}^{k} \frac{P_s(hf_i)}{\sigma_s^2(hf_i)},
\]

where P_s(hf_i) and σ_s²(hf_i) are estimates of, respectively, the signal power and the noise power at the h-th harmonic of the target frequency f_i, computed from the s-th column of S. Classification is then as simple as taking the index i for which the test statistic T_i is maximal.
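The decision rule can be sketched as below, with power_at and noise_at the estimators described in the remainder of this section.

```python
# Sketch of the classification rule: compute T_i over all harmonics and
# filtered channels, then pick the maximum.
import numpy as np

def classify(S, freqs, fs):
    """S: (T, k) spatially filtered signal; returns the decoded target index."""
    scores = []
    for f in freqs:
        T_i = sum(power_at(S[:, s], h * f, fs) / noise_at(S[:, s], h * f, fs)
                  for h in (1, 2, 3)           # harmonics
                  for s in range(S.shape[1]))  # filtered channels
        scores.append(T_i)
    return int(np.argmax(scores))
```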

The signal power estimate P_s(hf_i) can be computed as

\[
P_s(hf_i) = \left( \sum_{t=1}^{T} s(t)\sin(2\pi h f_i t) \right)^{2} + \left( \sum_{t=1}^{T} s(t)\cos(2\pi h f_i t) \right)^{2},
\]

with s(t) the t-th sample of the s-th column of S; this is simply the squared magnitude of the discrete Fourier transform of that column at frequency hf_i.
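A direct implementation of this estimate, with sample times expressed in seconds so that the frequency argument is in Hz:

```python
# Sketch of the signal power estimate P_s(h f_i): the squared DFT magnitude of
# one filtered channel at the given frequency.
import numpy as np

def power_at(s_col, freq, fs):
    T = len(s_col)
    t = np.arange(1, T + 1) / fs
    return (np.dot(s_col, np.sin(2 * np.pi * freq * t)) ** 2 +
            np.dot(s_col, np.cos(2 * np.pi * freq * t)) ** 2)
```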

Estimating the noise power is harder. Following [13], we fit an autoregressive (AR) model to X̃V_k, i.e., to the spatially filtered data with all flicker information excluded. An AR model can be viewed as a filter (acting through convolution), which in the frequency domain amounts to ordinary products between the transforms of the signal and the filter coefficients. Since the prediction error of the AR model is assumed to be uncorrelated white noise, its power spectral density is flat, with a magnitude determined by the variance of this noise signal. The Fourier transform of the regression coefficients a_j (estimated, for example, via the Yule-Walker equations) thus shows how the frequency content of a particular signal contributes to the white-noise variance. More formally,

\[
\sigma_s^2(hf_i) = \frac{\pi T}{4} \, \frac{\hat{\sigma}^2}{\left| 1 - \sum_{j=1}^{p} a_j \exp\left(-2\pi\sqrt{-1}\, j h f_i / F_s\right) \right|},
\]

where T is the length of the signal, σ̂² is an estimate of the white-noise variance, p is the order of the regression model and F_s is the sampling frequency (1000 Hz).
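A sketch of this noise estimate, with the Yule-Walker equations solved by hand; the model order p = 20 is an assumed value, as the text does not specify it.

```python
# Sketch of the AR-based noise estimate: fit an AR(p) model to a filtered
# channel via the Yule-Walker equations, then read off the formula above.
import numpy as np

def yule_walker(x, p):
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)   # autocovariance
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, r[1:p + 1])          # AR coefficients a_1..a_p
    sigma2 = r[0] - a @ r[1:p + 1]              # white-noise variance estimate
    return a, sigma2

def noise_at(s_col, freq, fs, p=20):            # p = 20 is an assumption
    a, sigma2 = yule_walker(s_col, p)
    T = len(s_col)
    j = np.arange(1, p + 1)
    denom = np.abs(1 - np.sum(a * np.exp(-1j * 2 * np.pi * j * freq / fs)))
    return (np.pi * T / 4) * sigma2 / denom
```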

4. RESULTS AND DISCUSSION

4.1. The object detector

Measuring the performance of a detection system is not trivial. The performance is influenced by the lighting conditions of a scene image, the amount of detail of the object, the amount of detail in the scene image responsible for triggering spurious interest points, the scale of the sought object in the scene, and many more factors. Comparison with other studies is also not straightforward: most widespread datasets contain images of faces, cars, pedestrians, etc., whereas the goal of our application is to detect objects found in the immediate environment of a patient. Therefore we took, for the cup, monitor and thermos, respectively 100, 97 and 120 scene images, not used for training, inside our lab under both natural and artificial light, in different rooms and from different viewpoints, always with other objects in close proximity. The scale of the object inside the scene image was somewhat restricted; for example, the height of the thermos was always around 0.26 to 0.4 times the height of the image. In this way the set α of reasonable scales of Section 2.2 could be constructed. That this choice is not too restrictive can be seen, for example, in Figs. 4 and 5, where the thermoses, one in the foreground and one in the background, lie within this range.

By visually inspecting the output of the object detectors, we counted as correct detections (true positives) those detections where the winning vote for the object center (as defined in Section 2.3) falls on a pixel belonging to the object; if the vote does not fall on the object, it is a false positive. As in [2], the numbers of true and false positive detections are used to compute the precision, recall and receiver operating characteristic (ROC) while varying the detection threshold proposed in Section 2.3. The result is shown in Fig. 6.

Again it should be emphasized that this performance measure is not absolute. It remains a big challenge to constrain the parameters which influence the accuracy of an object detector while remaining general enough to find the object in a large variety of scene images. In testing the classifier we found that the choice of α (the set of reasonable scales from Section 2.2) is extremely influential. Despite these difficulties, our results do show the potential of this kind of object detector for our application.

4.2. The SSVEP classifier

The accuracy of the SSVEP classifier was assessed off-line as a function of the duration of the EEG recordings for the 3 combinations of frequencies. More precisely, we shifted a window of fixed size w over the recorded data, presenting the data in the window to the classifier.


Fig. 6. Left and middle: accuracy of the object detectors when varying the threshold as described in Section 2. Right: accuracy of the SSVEP classifier as a function of the data duration t in seconds.

Windows covering a gaze transition from one frequency to another were discarded. The accuracy is then the number of correct predictions divided by the total number of predictions made by the classifier. We did this for window sizes w = 0.2, 0.4, ..., 5 seconds; the result is shown in Fig. 6. From 2.2 seconds onward, the accuracy of the classifier is above 90% for all subjects.
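The evaluation loop can be sketched as follows, with decode_window a hypothetical wrapper around the preprocessing, spatial filtering and classification steps above, and the window step size an assumption.

```python
# Sketch of the off-line evaluation of Section 4.2: slide a fixed window w over
# the recording, classify each window, and report the fraction correct.
import numpy as np

def windowed_accuracy(eeg, labels, w_sec, fs, freqs):
    """eeg: (T, 8) recording; labels: per-sample true target index."""
    w = int(w_sec * fs)
    correct = total = 0
    for start in range(0, len(eeg) - w + 1, w):     # non-overlapping (assumed)
        seg_labels = labels[start:start + w]
        if len(np.unique(seg_labels)) != 1:
            continue                  # discard windows covering a gaze transition
        pred = decode_window(eeg[start:start + w], freqs, fs)
        correct += int(pred == seg_labels[0])
        total += 1
    return correct / max(total, 1)
```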

With more object detectors available, the selection task becomes more difficult: the minimum energy combination method is less precise for 4 or more targets than for 3 (more test statistics have to be compared). One solution is to group objects together and perform a tree search: if a group of objects is selected, a new selection phase begins in which each individual object from this group becomes selectable.

5. CONCLUSION

A new application was presented which can improve the interaction of motor-impaired patients with their environment. It allows the patient to select objects by means of object detectors and brain-computer interfacing. Improvements in both research domains will directly lead to an improvement of our application.

6. ACKNOWLEDGMENT

The authors are also grateful to Refet Firat Yazicioglu, Tom Torfs and Cris Van Hoof from the Interuniversity Microelectronics Centre (Imec) in Leuven for providing the wireless EEG system.

7. REFERENCES

[1] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," CVPR, 2001.

[2] S. Agarwal and D. Roth, "Learning a Sparse Representation for Object Detection," in Proc. Seventh European Conf. Computer Vision, 2002, vol. 4, pp. 113–130.

[3] B. Leibe, A. Leonardis, and B. Schiele, "Robust Object Detection with Interleaved Categorization and Segmentation," IJCV, vol. 77, 2008.

[4] G. Piccioli et al., "Robust method for road sign detection and recognition," Image and Vision Computing, vol. 14, pp. 209–223, 1996.

[5] S. Belongie, J. Malik, and J. Puzicha, "Shape Matching and Object Recognition Using Shape Contexts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509–522, 2002.

[6] D. Lowe, "Object Recognition from Local Scale-Invariant Features," in Proceedings of the International Conference on Computer Vision, 1999, pp. 1150–1157.

[7] H. Bay, A. Ess, T. Tuytelaars, and L. J. V. Gool, "Speeded-up robust features (SURF)," CVIU, vol. 110, no. 3, pp. 346–359, 2008.

[8] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[9] G. Bin et al., "VEP-Based Brain-Computer Interfaces: Time, Frequency, and Code Modulations," IEEE Computational Intelligence Magazine, vol. 4, no. 4, pp. 22–26, 2009.

[10] O. Friman, I. Volosyak, and A. Graser, "Multiple Channel Detection of Steady-State Visual Evoked Potentials for Brain-Computer Interfaces," IEEE Transactions on Biomedical Engineering, vol. 54, pp. 742–750, 2007.

[11] R. Yazicioglu et al., "Low-Power Low-Noise 8-Channel EEG Front-End ASIC for Ambulatory Acquisition Systems," in Proceedings of the 32nd European Solid-State Circuits Conference, IEEE, 2006, pp. 247–250.

[12] D. H. Brainard, "The Psychophysics Toolbox," Spatial Vision, vol. 10, pp. 433–436, 1997.

[13] S. Kay, Modern Spectral Estimation: Theory and Application, Prentice Hall, 1999.