
2009 IEEE-RIVF International Conference on Computing and Communication Technologies, Danang City, Vietnam, July 13–17, 2009

On Evaluating Sport Event Recognition using Bag-of-Words Model

Vo Dinh Phong, Tran Ngoc Trung, Le Hoai Bac
Department of Computer Science

University of Science, Ho Chi Minh City, Vietnam

{vdphong,tntrung,lhbac}@fit.hcmuns.edu.vn

Abstract—This paper presents extensive experiments on sport event images using the Bag-of-Words (BoW) model. We propose a simple but effective combination of feature extraction and visual dictionary formation to boost the performance of a Naïve Bayes classifier based on the BoW model. Despite not being a novel idea, our algorithm offers encouraging performance in the event recognition domain; moreover, to a certain degree, it can compete with typical works in event image clustering. Two challenging datasets are used to uncover facts that must be considered when designing a feature-sharing framework for event recognition.

Index Terms—action recognition, still image, bag-of-words

I. INTRODUCTION

How can we recognize action? People tend to think, theoretically, that an action is defined by a sequence of body-part movements over a time span. However, the answer lies in our brain: action can be recognized effectively by the brain's inference engine, which automatically fills in missing information. Therefore, not only videos but also still images can convey action. Having seen motion patterns before, the brain can implicitly understand what movements were performed before and after the moment a picture was captured. This is our motivation for exploring a new way of thinking about action recognition based on still images.

Action recognition is currently one of the most interesting problems in the vision community. There are so many research threads that we cannot give a sufficiently detailed review; interested readers are encouraged to refer to [3]. Briefly, there are several approaches to the theme. The most popular approach recognizes actions using discriminative models, with methods ranging from patch-based and part-based to template-based ones. The second approach uses explicit dynamical models; these generative models try to capture structured motions between body parts. A lesser-known approach exploits parsing techniques from natural language processing to parse actions into movelets. The last approach, which is our choice, is dynamics-free. We argue that, without reference to the whole action span, a computer can mimic humans in recognizing still images. This claim is particularly true in the case of event images, say, sports or movies.

In this paper, we address the recognition task on sport news photographs using the traditional bag-of-words method. A thorough experimental study is conducted to investigate the effectiveness of the bag-of-words model in recognizing sport still images. The model serves as a medium for better understanding how different sport topic images are from each other when only unordered and unstructured visual words are used. The bag-of-words model has been applied successfully to object recognition; however, it has not been applied to action categorization. Our contribution is twofold: (i) we analyze the BoW model on two large sport image datasets; (ii) we attempt to increase the discrimination between classes by introducing a universal vocabulary and specific vocabularies. Our conclusions can serve as evidence for developing a learning model that shares common features between action classes.

The paper is organized as follows. In Section 1, we introduce the problem domain and our contribution. In Section 2, related work on still image action recognition is reviewed. Section 3 presents the steps of the proposed method. The experimental setup and results are presented in Section 4. We summarize our work in the last section, where future developments are also mentioned.

II. PREVIOUS WORK

This literature review is restricted to typical works relevant to still image event recognition. Different body poses in skating and baseball still images were first clustered in [12]. The most notable difference between [12] and our method, as well as between [?] and [?], [?], is that images with similar body poses are grouped together; in other words, clusters are created within each event class. It is easy to see that our problem is much more challenging. In their work, Wang et al. [12] used a technique for deformable matching of the edges of a pair of images, which is then used to measure the distance between images. The matching algorithm operates on the Euclidean distance map transformed from the binary Canny edge image. This holistic approach achieves a performance of approximately 50%. Approaching from the opposite perspective, [2] constructs an integrated generative model that allows recognizing complicated social events; this method is superior to ours because of its more complex model and its carefully annotated dataset.

III. METHODOLOGY

The proposed method is straightforward. Three steps are involved in the construction of the BoW representation: (i) detection of interest points, (ii) computation of local descriptors, and (iii) quantization of the local descriptors. These steps are presented in the next two sections.
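To make the pipeline concrete, here is a minimal sketch (ours, not the authors' code) of step (iii) in Python, assuming steps (i)-(ii) have already produced an array of 128-D SIFT descriptors; the array sizes are illustrative.

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Step (iii): assign each local descriptor to its nearest visual word."""
    # Squared Euclidean distances via ||a-b||^2 = ||a||^2 - 2ab + ||b||^2.
    d2 = ((descriptors ** 2).sum(1)[:, None]
          - 2.0 * descriptors @ vocabulary.T
          + (vocabulary ** 2).sum(1)[None, :])
    return d2.argmin(axis=1)

def bow_histogram(descriptors, vocabulary):
    """Collect one image's word assignments into a K-bin histogram,
    the final bag-of-words representation."""
    words = quantize(descriptors, vocabulary)
    return np.bincount(words, minlength=vocabulary.shape[0])

# Toy demo with random data standing in for real SIFT output.
rng = np.random.default_rng(0)
descs = rng.normal(size=(500, 128))   # 500 hypothetical 128-D descriptors
vocab = rng.normal(size=(600, 128))   # K = 600 visual words (cluster centers)
print(bow_histogram(descs, vocab).shape)   # -> (600,)
```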

A. Feature Detection

At this stage, visual information is selected. Interest point detectors have invariance characteristics (scale and/or rotation and/or affine invariance), are compact, and generalize well. Quite regularly, the Difference-of-Gaussian (DoG) blob detector or the Harris-Laplace corner detector is used. Empirical results [?] showed that a sparse set of interest points achieves high accuracy on man-made objects like cars, buildings, toys, and simply textured objects, whereas dense grid or pyramid representations have been used successfully in the context of scene recognition [?], [?]. Furthermore, [9] conducted a thorough review of various kinds of interest point detectors (see Fig. 1). We use both strategies in order to evaluate their effectiveness on sport event images, on which they had never been tested before. After extracting salient points, the local regions surrounding the points are encoded into descriptors using SIFT [7]. A practical issue is whether SIFT is suitable for event images, whose subjects are human bodies at various scales and postures.
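For illustration, both sampling strategies can be sketched with OpenCV's SIFT implementation (cv2.SIFT_create, available in OpenCV 4.4+); the file name, grid step, and patch size below are placeholder choices, not the settings used in the paper.

```python
import cv2

img = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder image
sift = cv2.SIFT_create()

# Sparse strategy: let the DoG detector pick salient interest points,
# then describe the regions around them.
kp_sparse, desc_sparse = sift.detectAndCompute(img, None)

# Dense strategy: skip detection and place keypoints on a regular grid,
# computing a SIFT descriptor at every fixed location.
step, size = 10, 16   # illustrative grid spacing and patch size (pixels)
kp_dense = [cv2.KeyPoint(float(x), float(y), size)
            for y in range(step, img.shape[0] - step, step)
            for x in range(step, img.shape[1] - step, step)]
kp_dense, desc_dense = sift.compute(img, kp_dense)

print(len(kp_sparse), "sparse vs.", len(kp_dense), "dense descriptors")
```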

Figure 1: Examples of multi-scale sampling methods. 1st row: (1) Harris-Laplace (HL) with a large detection threshold. (2) HL with threshold zero; note that the sampling is still quite sparse. (3) Laplacian-of-Gaussian. (4) Random sampling. 2nd row: grid sampling from coarse to fine scale. Generally speaking, it is similar to pyramid sampling. Courtesy of [?], [?].

The inherent shortcoming of bag-of-words methods is their ignorance of structure, such as spatial arrangement, perspective projection, or time sequence. Consequently, local correlation is a simple remedy to implement. The input image is divided into dense grid layers: within one layer, all grid cells are equal across the whole image, and at each higher layer the grid size is increased, usually by a power of 2. From each cell's center, a circular region is encoded into a descriptor. These regions are chosen so that neighboring local regions in the same layer overlap, as do superposed regions across layers. Despite the fact that we cannot "see" how such multi-scale grid sampling contributes location information to the final decision, it increases overall system performance [?], [?]. We also adopt this approach, though it may well end differently here because the storyteller is the data itself.
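A sketch of the multi-scale overlapping grid described above; the 10x10/20x20/30x30 layers match the setting reported later in Section IV-D, while the overlap factor is our own illustrative choice.

```python
import numpy as np

def multiscale_grid(height, width, cells_per_side=(10, 20, 30), overlap=1.5):
    """Return (x, y, radius) sampling circles for several dense grid layers.

    Each layer splits the image into n x n equal cells; the circular region
    centered on each cell is enlarged by `overlap` so that neighboring
    regions intersect within a layer and superpose across layers."""
    regions = []
    for n in cells_per_side:
        cw, ch = width / n, height / n          # cell width and height
        radius = overlap * max(cw, ch) / 2.0    # enlarged -> overlapping
        for row in range(n):
            for col in range(n):
                regions.append(((col + 0.5) * cw, (row + 0.5) * ch, radius))
    return regions

print(len(multiscale_grid(196, 196)))   # 100 + 400 + 900 = 1400 regions
```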

B. Visual word formation

After being extracted and represented as SIFT descriptors, the vectors become the input of a clustering algorithm. We use the K-means algorithm for its simplicity and speed; Euclidean distance is chosen to measure the dissimilarity between two vectors. Like scene images, event images contain texture or flat regions that repeat at many locations; therefore, the raw descriptors can be quantized down to a half or a quarter of their number at the image level and the event level without losing informative features. The resulting cluster centers are called intermediate visual words. The intermediate visual words are then pooled and clustered into K words, called the universal vocabulary. To increase the discriminative power of the dictionary on similar events (e.g., cycling vs. racing, tennis vs. cricket, soccer vs. football), we try two solutions: (i) a coarse-fine vocabulary, and (ii) a universal-specific vocabulary.
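The two-stage quantization can be sketched as follows with scikit-learn's KMeans; `descriptors_by_event` is an assumed dict mapping each event name to its stacked SIFT descriptors, and the cluster counts are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_universal_vocabulary(descriptors_by_event, n_intermediate=200, k=100):
    """First compress each event's raw descriptors into intermediate visual
    words, then pool the intermediate words of all events and cluster them
    into the K-word universal vocabulary."""
    intermediate = {}
    for event, descs in descriptors_by_event.items():
        km = KMeans(n_clusters=n_intermediate, n_init=4, random_state=0)
        intermediate[event] = km.fit(descs).cluster_centers_
    pooled = np.vstack(list(intermediate.values()))
    universal = KMeans(n_clusters=k, n_init=4, random_state=0)
    return universal.fit(pooled).cluster_centers_, intermediate
```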

The first solution is interpreted as follows: descriptors at scales from coarse to fine are clustered together, with the hope that coarse vectors will be shared among classes while fine vectors characterize individual classes. An alternative is to construct three separate vocabularies (coarse, mid, and fine); likelihoods are then estimated independently in each vocabulary and combined in the posterior probabilities.
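One plausible reading of the combination step is sketched below: per-vocabulary Naive Bayes log-likelihoods are summed (assuming independence across the coarse, mid, and fine vocabularies) before the posterior is formed. The argument shapes are our assumptions, not the paper's specification.

```python
import numpy as np

def combined_log_posterior(word_hists, log_like_tables, log_prior):
    """Combine independently estimated likelihoods from several vocabularies.

    word_hists:      list of word-count histograms, one per vocabulary.
    log_like_tables: matching list of (M classes x K_v words) log P(v|C).
    log_prior:       length-M vector of log class priors.
    Returns unnormalized log posteriors over the M classes."""
    score = log_prior.copy()
    for hist, table in zip(word_hists, log_like_tables):
        score += table @ hist   # sum_k hist[k] * log P(v_k | C_j)
    return score
```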

With the second solution, after clustering the universal words, some intermediate words in each event are chosen such that they lie far from the universal words. The criterion is very simple: sort by distance and cut off the small values.
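A sketch of this selection criterion (the helper is ours; n_specific=50 matches the setting reported in Section IV-E):

```python
import numpy as np
from scipy.spatial.distance import cdist

def pick_specific_words(intermediate, universal, n_specific=50):
    """Keep the intermediate words of one event that lie farthest from the
    universal vocabulary: sort by nearest-universal-word distance and cut
    off the small values."""
    dist = cdist(intermediate, universal).min(axis=1)  # to closest universal word
    farthest_first = np.argsort(dist)[::-1]
    return intermediate[farthest_first[:n_specific]]
```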

C. The Bag-of-Words model

The bag-of-words approach is motivated by an analogy to learning methods that use the bag-of-words representation for text categorization. The idea of adapting the text categorization approach to visual categorization is not new; however, [1] revived it once the concept of textons became widely acknowledged. Briefly, a bag of words corresponds to a histogram of the number of occurrences of particular image patterns in a given image. The major advantages of the method are its simplicity and low computational cost.

In this paper we use the bag-of-words representation with a Naïve Bayes classifier, and the training and testing stages are presented from the perspective of a generative model; this system is one of the simplest generative models. Formally, assume we have a set of $N$ labeled images $\mathcal{I} = \{I_i\}_{i=1}^{N}$, a set of object classes $C_j$, $j = 1..M$, and a visual vocabulary $V = \{v_k\}_{k=1}^{K}$. A generative model means that all images can be "generated" (or synthesized) from visual words; therefore the likelihood of a word $v_k$ belonging to class $C_j$ must be computed over all possible pairs $(v_k, C_j)$. Assuming the training data is well sampled over the image space, the likelihoods can be computed by counting the appearances of visual words in the dataset as follows:

$$P(v_k \mid C_j) = \frac{1 + \sum_{i=1}^{N} \mathbf{1}\{I_i \in C_j\}\,\zeta(k, i)}{K + \sum_{s=1}^{K} \sum_{i=1}^{N} \mathbf{1}\{I_i \in C_j\}\,\zeta(s, i)}$$

in which $\zeta(k, i)$ is the number of times word $v_k$ occurs in image $I_i$, and $\mathbf{1}\{I_i \in C_j\}$ equals 1 if its condition is satisfied and 0 otherwise. Note that we use Laplace smoothing to eliminate zeros in the probabilities. Assuming independence between words for simplicity, the likelihood that image $I_i$ belongs to class $C_j$ is computed as a chain product of visual word likelihoods:

$$P(I_i \mid C_j) = \prod_{k=1}^{K} P(v_k \mid C_j)^{\zeta(k, i)}$$


The posterior probability $P(C_j \mid I_i)$ can then be computed easily using Bayes' theorem:

$$P(C_j \mid I_i) = \frac{P(C_j)\, P(I_i \mid C_j)}{\sum_{m=1}^{M} P(C_m)\, P(I_i \mid C_m)}$$
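Under these assumptions the whole classifier reduces to a few lines. The sketch below (ours) implements the Laplace-smoothed estimates above in log space, which avoids numerical underflow in the chain product; the demo data is random.

```python
import numpy as np

def train_nb(hists, labels, n_classes):
    """hists: (N, K) word-count matrix; labels: length-N class indices.
    Returns log priors and an (M, K) table of log P(v_k | C_j)."""
    n, k = hists.shape
    log_like = np.empty((n_classes, k))
    log_prior = np.empty(n_classes)
    for j in range(n_classes):
        counts = hists[labels == j].sum(axis=0)   # word counts within class j
        log_like[j] = np.log((1.0 + counts) / (k + counts.sum()))  # Laplace
        log_prior[j] = np.log((labels == j).mean())
    return log_prior, log_like

def predict_nb(hist, log_prior, log_like):
    """argmax_j of log P(C_j) + sum_k hist[k] * log P(v_k | C_j)."""
    return int(np.argmax(log_prior + log_like @ hist))

# Demo with random counts: 60 images, 8 event classes, a 100-word vocabulary.
rng = np.random.default_rng(0)
hists = rng.poisson(2.0, size=(60, 100))
labels = rng.integers(0, 8, size=60)
log_prior, log_like = train_nb(hists, labels, n_classes=8)
print(predict_nb(hists[0], log_prior, log_like))
```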

IV. RESULTS

A. Dataset

We test the algorithm on two datasets, both about sports. The first dataset, called set A, is taken from [2] and is a collection of 8 classes (Fig. 3). The second dataset, called set B, was collected by us from the Internet and resized to 196×196; sample images are shown in Fig. 2. Set B is very challenging since an image is labeled with class "X" iff people can figure out from it alone (not from other pictures) what kind of event it depicts, without any restriction on the acquisition configuration. Both datasets share common characteristics:

• Images are captured from a variety of views;
• Subjects of various sizes and positions exist in the same image;
• The number of instances of the same object category varies widely even within the same event category;
• Body poses are highly varied within the same category and only slightly varied among categories.

Figure 2: Our dataset (set B) contains 8 event classes: boxing (151 images), cricket (188 images), cycling (114 images), football (179 images), golf (185 images), racing (89 images), soccer (179 images), tennis (178 images).

Figure 3: The Princeton dataset (set A) [2] contains 8 event classes: badminton (194 images), bocce (137 images), croquet (210 images), polo (181 images), rockclimbing (194 images), rowing (250 images), sailing (189 images), snowboarding (190 images).

Note that, unlike [2], we do not annotate a ground truth for each image; the purpose is to see how the sampling strategies perform without any guidance other than the event label.

B. Experimental setup

We run the experiments on a laptop with a 2.26 GHz CPU and 2 GB of RAM. Input images are kept at their natural dimensions but resized if needed. Feature detection is performed using Lowe's implementation [7] and the VLFeat toolbox [11]; although there is a small difference in their detection results, it does not cause any degradation in the final result. We faced restrictions in computational resources, so everything had to be tailored to fit the situation. In particular, K-means can run with at most K = 600, and holdout validation is used for accuracy estimation, in which 1/6 of the data is designated as the training set and the remaining 5/6 as the test set. This ratio puts our method in an unfair comparison against other methods. For better investigation, we also compute the confusion matrix, in which columns indicate ground truths and rows indicate false positives.
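The evaluation protocol can be sketched as follows (our illustration with synthetic stand-in data; note that scikit-learn's confusion_matrix puts ground truth on rows, the transpose of the convention stated above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(600, 100))      # stand-in BoW histograms
y = rng.integers(0, 8, size=600)           # stand-in labels for 8 events

# Holdout split as reported: 1/6 for training, the remaining 5/6 for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=1 / 6, stratify=y, random_state=0)

y_pred = rng.integers(0, 8, size=len(y_te))  # placeholder classifier output

cm = confusion_matrix(y_te, y_pred)
cm = cm / cm.sum(axis=1, keepdims=True)      # row-normalize to rates
print(np.diag(cm).mean())                    # average per-class accuracy
```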

C. Optimal vocabulary size

How many words does the classifier require to perform best? We ran a series of experiments to find the correlation between performance and vocabulary size. For brevity, we show here the result for set B, as depicted in Fig. 4a. The visual words are "flat" and "single", meaning that we do not apply any vocabulary enrichment (neither coarse-to-fine nor universal-specific). At first glance, there is no major trend among events: just one event, "boxing", increases steadily from 20% to 50%, with only minor variations in the other events. Generally, increasing the vocabulary size does not help the system improve accuracy much.

D. Coarse-to-fine vocabulary

The coarse-to-fine approach investigates images under three layers of dense grids; grids of different sizes (10×10, 20×20, and 30×30) were applied to the images. Fig. 4b shows the confusion matrix for set A. This approach is inferior to the universal-specific approach in Fig. 4c; in other words, adding event-specific words is more effective than adding multi-scale words. This claim is confirmed again in the next section.

E. Universal & Specific vocabulary

This section is intended to verify our hypothesis that sharing features can help discriminate event classes. In all experiments, K-means is used to build the universal and the specific vocabularies; the sizes given below were found to perform well. The first experiment is on set A with 8 classes, a train/test ratio of 0.2, and 200 (universal words) + 50 (specific words) × 8 = 600 words. The confusion matrix is shown in Fig. 4c. From the sample images in Fig. 2 and Fig. 3, the events "snowboarding", "sailing", and "rowing" share the common themes of blue sky and water surface. Interestingly, the events "croquet" and "sailing" share a common spatial arrangement (water surface vs. grass field), so that (sailing, croquet) = 16%. A similar scenario happens between "rockclimbing" and "bocce".

The second experiment is on set B with 8 classes, a train/test ratio of 0.5, and 100 (universal words) + 50 (specific words) × 8 = 500 words. The universal vocabulary is smaller than in the first experiment: generally speaking, images in set B are smaller than in set A, so fewer words are needed to describe an image. The confusion matrix is shown in Fig. 4d. We can see that "football" and "soccer" seem quite similar in appearance, yet there is a drastic performance gap between them (67% vs. 16%). The event "football" is confused with "cricket" (10%), "soccer" (8%), and "tennis" (8%), which seem intuitively similar; meanwhile, the event "soccer" is confused with many others. This can be explained by the "soccer" words having lost their characteristics. Following "soccer", the event "golf" is in the same situation. An expected confusion is between "racing" and "cycling" (24%).

Figure 4: Experimental results. (a) Searching over K (100-600 words) to find the optimal vocabulary size on set B: average accuracy (%) vs. vocabulary size, plotted per event (boxing, cricket, cycling, football, golf, racing, soccer, tennis); the result shows that the original BoW model does not converge on event images. (b) Confusion matrix for set A, coarse-to-fine approach; the average accuracy is 40.88%. (c) Confusion matrix for set A, universal + specific approach; the average accuracy is 46.88%. (d) Confusion matrix for set B (8 classes), universal + specific approach; the average accuracy is 39.2%. (Plot and confusion-matrix graphics are not reproduced in this transcript.)

F. Discussion

The above analysis exposes a competition between event classes: if one event achieves high accuracy, some other events must lose accuracy. The hidden fact becomes clear that the vector quantization process exploits some characteristics while losing others; with a fixed vocabulary, there is no room for the losers. We emphasize that this effect is not random at all: after repeating the holdout validation many times, we see that the dominating classes remain unchanged. Consequently, some confusions cannot be explained by visual intuition, or can be explained only with high uncertainty. The analysis also indicates that the universal-specific approach is more effective than the coarse-to-fine approach.

V. CONCLUSIONS

In this paper we applied the bag-of-words model, with which the vision community is already familiar, to a new theme: event images. Accompanied by significant modifications in feature extraction and visual dictionary formation, our method achieves a favorable outcome. Two extensive and challenging sport datasets were tested and analyzed. The proposed method, which is simple and effective, confirms our expectations, and the initial achievements can compete against [2]. We found that the coarse-to-fine approach does not work well relative to the universal-specific approach. Event classes are fairly similar to each other, and the SIFT descriptor cannot generalize well; however, this characteristic of the data can benefit feature-sharing methods. As evidence, the universal-specific approach gains mutual information over classes while preserving the individual features of each class that vector quantization would otherwise lose.

The paper also shows that features should be extracted by a more powerful method. Besides, a dataset with detailed annotations should be available for better guidance. The above experiments show that neighborhood correlation information did not help to solve the inherent drawback of BoW. Our future plan is to estimate depth from the grayscale image and then assign "depth" information to each descriptor depending on its location.

REFERENCES

[1] Chris Dance, Jutta Willamowski, Lixin Fan, Cedric Bray, and Gabriela Csurka. Visual categorization with bags of keypoints. In ECCV International Workshop on Statistical Learning in Computer Vision, 2004.

[2] Li-Jia Li and Li Fei-Fei. What, where and who? Classifying events by scene and object recognition. In IEEE 11th International Conference on Computer Vision, pages 1–8, 2007.

[3] David A. Forsyth, Okan Arikan, Leslie Ikemoto, James O'Brien, and Deva Ramanan. Computational studies of human motion: part 1, tracking and motion synthesis. Found. Trends Comput. Graph. Vis., 1(2-3):77–254, 2005.

[4] Frédéric Jurie and Bill Triggs. Creating efficient codebooks for visual recognition. In ICCV, pages 604–610, 2005.

[5] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.

[6] I. Laptev and P. Pérez. Retrieving actions in movies. In Proc. Int. Conf. Comp. Vis. (ICCV'07), Rio de Janeiro, Brazil, October 2007.

[7] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[8] Donald Metzler. Beyond bags of words: effectively modeling dependence and features in information retrieval. SIGIR Forum, 42(1):77, 2008.

[9] Eric Nowak, Frédéric Jurie, and Bill Triggs. Sampling strategies for bag-of-features image classification. In Proc. ECCV, pages 490–503. Springer, 2006.

[10] J. Sivic, B.C. Russell, A. Zisserman, W.T. Freeman, and A.A. Efros. Unsupervised discovery of visual object class hierarchies. pages 1–8, 2008.

[11] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.

[12] Yang Wang, Hao Jiang, Mark S. Drew, Ze-Nian Li, and Greg Mori. Unsupervised discovery of action classes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1654–1661, 2006.

[13] John M. Winn, Antonio Criminisi, and Thomas P. Minka. Object categorization by learned universal visual dictionary. In ICCV, pages 1800–1807, 2005.
