Inconsistency-based active learning for support vector machines



Pattern Recognition 45 (2012) 3751–3767



Ran Wang a, Sam Kwong a,*, Degang Chen b

a Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
b Department of Mathematics and Physics, North China Electric Power University, 102206 Beijing, PR China

Article info

Article history:

Received 8 August 2011

Received in revised form 19 March 2012
Accepted 24 March 2012
Available online 4 April 2012

Keywords:

Active learning

Concept learning

Inconsistency

Sample selection

Support vector machine

0031-3203/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.patcog.2012.03.022
* Corresponding author. Tel.: +852 2788 7704; fax: +852 2788 8614.
E-mail address: [email protected] (S. Kwong).

Abstract

In classification tasks, active learning is often used to select a set of informative examples from a large unlabeled dataset. The objective is to learn a classification pattern that can accurately predict the labels of new examples by using the selection result, which is expected to contain as few examples as possible. The selection of informative examples also reduces the manual labeling effort, data complexity, and data redundancy, thus improving learning efficiency. In this paper, a new active learning strategy with pool-based settings, called inconsistency-based active learning, is proposed. This strategy is built under the guidance of two classical works: (1) the learning philosophy of the query-by-committee (QBC) algorithm; and (2) the structure of the traditional concept learning model: from-general-to-specific (GS) ordering. By constructing two extreme hypotheses of the current version space, the strategy evaluates unlabeled examples with a new sample selection criterion, the inconsistency value, and the whole learning process can be implemented without any additional knowledge. Besides, since active learning is favorably applied to the support vector machine (SVM) and its related applications, the strategy is further restricted to a specific algorithm called inconsistency-based active learning for SVM (I-ALSVM). By building up a GS structure, the sample selection process in our strategy is formed by searching through the initial version space. We compare the proposed I-ALSVM with several other pool-based methods for SVM on selected datasets. The experimental results show that, in terms of generalization capability, our model exhibits good feasibility and competitiveness.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Active learning, unlike the traditional methods which adopt the passive mode of "learning from examples" [1], is a revised supervised learning scheme that employs "selective sampling" [2]. In some real-world machine learning applications such as text classification [3], image retrieval [4-6], or music classification [7,8], labeled examples are scarce whereas unlabeled ones are abundant, and the manual labeling work is quite expensive. Active learning is a popular way to alleviate the human labeling effort by allowing the learner to select a limited number of informative examples with a certain query strategy [9]; the objective is to learn a classification pattern that achieves accurate predictions on new examples with the selection result. By setting up the selection criterion, one assumes that active learning can exercise at least some control over the input domain.

Much research on active learning has been done in recent years. Regarding the way of accessing unlabeled data, active learning has been investigated under two frameworks:


the stream-based framework [10-12] and the pool-based framework [13-15]. Depending on whether the learner is allowed to select examples in groups, active learning has been studied in two modes: single mode and batch mode [16,17]. Once the learning framework and the selection mode are fixed, how to design a proper selection criterion becomes the key problem. Certain selection criteria are often accompanied by certain query strategies. The three most commonly used query strategies are version space reduction [3], expected error reduction [18-20], and uncertainty reduction [21-24]. Besides, the diversity [3], density [25], and relevance [26] criteria are also widely adopted.

Some recent works focus on particular applications such as image annotation or video indexing. In [27], semantic-gap oriented active learning is proposed for image annotation; it incorporates the semantic-gap measure into the sample selection strategy to make the learning accord better with the motivation of user feedback. In [28], active learning is employed for video indexing, in which the sample's local structure is simultaneously exploited, and a unified sample selection approach is then developed for iterative learning. Another group of works aims to enlarge the implementation domain of active learning beyond the supervised environment and the binary case. In [25,29], active learning is performed under an unsupervised


environment in which the clustering technique is used to evaluate the density of regions. In [30-32], the multi-class or multi-label cases are discussed. Other related works include multiple-instance active learning [6], which aims to learn a classification pattern from structured data, and multiple-view active learning [33,34], which conducts the learning from different aspects.

It is well known that query-by-committee (QBC) [11] is a classical stream-based learning strategy which filters and labels the most uncertain examples by maximizing the average information gain. It randomly generates 2k hypotheses of the current version space to form a committee, with each hypothesis representing a committee member. When examples come one by one from a produced sequence, QBC queries all the members about each example's label, and decides whether to retain it or discard it according to its disagreement degree. QBC was first proposed based on the Gibbs algorithm, and is a powerful framework that has been adopted by many other existing strategies. However, it is hardly utilized in many practical cases, for two reasons. First, stream-based learning loses its feasibility in some special tasks since the examples' order in the produced sequence is fixed, so the learner cannot evaluate the current example based on full-scale knowledge of the training set. Second, the number of hypotheses to be generated is difficult to determine due to the random mechanism; when the value of k is large, the learning becomes a costly and impractical task. Freund et al. [10] discussed the simplest case, in which the committee is of size two. By randomly generating two hypotheses during each iteration, it rejects the examples which receive equal predictions from them. Gilad et al. [35] further reduced the computational complexity of this implementation by projecting the examples onto a low-dimensional space. Even so, the two randomly generated hypotheses still have little control of the input domain, and cannot represent the whole version space well.
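The disagreement degree that QBC relies on can be made concrete with a vote-entropy measure over the committee's predictions. The sketch below is our own illustration, not code from the paper; the function name and the +1/-1 vote representation are assumptions:

```python
import math

def vote_entropy(votes):
    """Vote entropy of a committee's +1/-1 predictions for one example.

    Higher entropy means stronger disagreement among the members,
    hence a more informative example under the QBC philosophy.
    """
    p_pos = votes.count(+1) / len(votes)
    entropy = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy
```

A unanimous committee yields entropy 0 and an evenly split one yields the maximum of 1 bit; a stream-based learner would retain an example whenever this value exceeds some threshold.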

From-general-to-specific (GS) ordering is a traditional concept learning model proposed by Mitchell [36]. It allows us to represent the version space in a uniform way, and to observe the hypotheses in a GS order. To get a better representation of the version space, the GS order could possibly be used. However, this paradigm has not received much attention in past years. To the best of our knowledge, we are the first to adopt this model in the area of active learning, except for [1], in which the authors employed a similar notion for neural network training.

In [37], we made some preliminary attempts to design a sampling process by adopting the "most general" and "most specific" concepts. In this paper, we further develop a new model, called inconsistency-based active learning (I-AL). It employs the learning philosophy of QBC, and follows the GS ordering. By generating two extreme hypotheses, it obtains a better representation of the current version space. Then, by adopting a new selection criterion, the inconsistency value, it selects informative examples and performs the learning in a pool-based environment. Finally, by repeating these two processes, a GS learning structure is formed.

On the other hand, the support vector machine (SVM) is favorably incorporated with active learning in order to generate a compact training set by selecting informative examples [4,8,3]. However, traditional QBC is hardly utilized by SVM due to the above-mentioned limitations. Thus, we are encouraged to investigate whether the sampling mechanism of QBC can better serve its learning.

The rest of the paper is organized as follows: in Section 2, a brief review of pool-based active learning and its related works on SVM is given. In Section 3, the concept learning model of GS ordering, as well as the proposed learning strategy, is introduced. In Section 4, the proposed strategy is restricted to a particular scheme for SVM, and some analyses are described in detail. In Section 5, experimental results on selected benchmark datasets, as well as comparisons with several other methods, are shown to assess the feasibility of our approach. Conclusions are given in Section 6.

2. A review of pool-based active learning and SVM

The framework of pool-based active learning is composed of five parts ⟨oracle T, training set X, learner/classifier S, learning algorithm f, query algorithm q⟩, in which X contains two types of data: a small part of labeled data L and a major part of unlabeled data U, and T provides two kinds of information: example and label.

Pool-based learning is an incremental task which mainly includes two operations: learning and selecting. First, the learning algorithm f builds a learner S by using L. Then, the learner S selects examples from U, regardless of the examples' individual order, by using the query algorithm q. Afterwards, the selected examples are labeled by the oracle T, deleted from U, and added to L. This process repeats until the pre-defined stopping criterion is met.
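The incremental loop above can be sketched as follows; this is our own skeleton, with all function names (`fit`, `query`, `oracle`) chosen for illustration:

```python
def pool_based_active_learning(L, U, oracle, fit, query, budget):
    """Skeleton of the pool-based loop: learn on L, query from U,
    have the oracle T label the pick, move it from U to L, repeat.

    L: list of (x, y) pairs; U: list of unlabeled x; oracle(x) gives
    the true label; fit(L) returns a learner; query(learner, U)
    picks one example. All names here are ours, for illustration.
    """
    for _ in range(budget):                  # pre-defined stopping criterion
        if not U:                            # or stop when U is exhausted
            break
        learner = fit(L)                     # learning operation
        x = query(learner, U)                # selecting operation
        U.remove(x)                          # delete the pick from U ...
        L.append((x, oracle(x)))             # ... label it and add it to L
    return fit(L)
```

Plugging in a concrete learner and query criterion (e.g. the margin-based or inconsistency-based criteria discussed later) specializes this skeleton into a full strategy.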

In real-world applications such as image retrieval or text classification, the selected examples' labels are always provided by some human experts, or users, who can offer correct information.

SVM is a famous classification technique based on statistical learning theory [38]. For a binary classification problem, it aims to produce an optimal separating hyperplane which maximizes the margin between the two classes. Suppose that there is a binary classification problem on the training set L = {(x_i, y_i)}_{i=1}^{l} ⊂ R^m × {+1, −1}. If L is linearly separable in the input space, the optimal hyperplane exists with the form w^T x + b = 0, where w is an m-dimensional vector and b is a constant. By the Lagrange method, this optimal hyperplane can be obtained through a quadratic programming (QP) problem, and the final classifier is determined as f(x) = sign(w^T x + b). In case the data are not linearly separable in the input space, there are two approaches to handle the situation. The first solution is the soft-margin SVM, in which a slack variable ξ_i is introduced for each training example. The other solution is to map the data from the input space into a higher dimensional feature space, by a nonlinear mapping function φ: x → φ(x), which makes them linearly separable. One can always use the kernel trick [39], of the form ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j), to represent the information of the feature space, where k(x_i, x_j) is a kernel function satisfying Mercer's theorem. Finally, the classifier is derived as f(x) = sign(Σ_{i=1}^{l} y_i α_i k(x_i, x) + b), where the α_i are the Lagrange multipliers.

It is worth noting that when training the SVM classifier, the complexity of determining the optimal hyperplane mainly lies in solving the QP problem. When the size of the training set increases, both the spatial and time complexity increase at least linearly. In addition, the optimal hyperplane is determined only by the so-called support vectors; other examples have no effect on the result but increase the complexity. It is therefore necessary to generate a compact dataset by removing the non-informative examples. However, random sampling cannot efficiently solve this problem since the learner does not have any control of the input domain. Thus, active learning is a favorable choice to improve the generalization and efficiency of SVM.
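Once trained, the kernel classifier above is just a weighted sum over the support vectors. A minimal sketch of evaluating f(x) = sign(Σ y_i α_i k(x_i, x) + b), assuming a Gaussian kernel; the function names and toy parameter values are ours:

```python
import math

def rbf_kernel(xi, xj, gamma=1.0):
    # Gaussian kernel: k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def svm_decision(x, support, alphas, labels, b, kernel=rbf_kernel):
    """Decision value h(x) = sum_i y_i * alpha_i * k(x_i, x) + b;
    the predicted class f(x) is the sign of this value."""
    return sum(y * a * kernel(s, x)
               for s, a, y in zip(support, alphas, labels)) + b
```

For instance, with two support vectors at 0 and 2 carrying opposite labels and equal multipliers, points near 0 get a positive decision value and points near 2 a negative one. Only the support vectors enter this sum, which is exactly why removing non-support examples leaves the classifier unchanged.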

Pool-based active learning was first applied to SVM in [40], using a greedy optimal strategy. Afterwards, different kinds of active learning models for SVM were proposed. Campbell et al. [18] proposed the margin-based strategy, which selects the examples closest to the current decision boundary at each learning iteration. Tong [41,3] later demonstrated that the learner should query an example which can bisect the current version space, i.e., the example x_i whose output decision value is closest to zero, and this demonstration corresponds exactly to the margin-based


strategy. The margin-based strategy is a very competitive system which is hard to beat in many applications. Cheng et al. [4] proposed a Co-SVM strategy which treats the data from two views with different features, then selects the examples which are not consistent between the two views; this strategy achieved good performance on image retrieval tasks. Wang et al. [24] introduced an entropy-based method that selects the examples with the maximum neighborhood entropy. Cheng et al. [42] developed a criterion to eliminate non-informative examples incrementally by assigning each example a weight according to its confidence factor. Wang et al. [43] raised a method with adaptive model selection, and Gorisse et al. [44] proposed a sub-linear strategy with an approximate k-NN search, etc. So far, all these works have exhibited satisfactory performance on their respective learning domains.

3. Inconsistency-based active learning

In this section, we first introduce the famous concept learning model: from-general-to-specific (GS) ordering. Then, we propose a new pool-based active learning strategy guided by a version space searching process with a new sample selection criterion.

3.1. Version space

The idea of version space was first proposed by Mitchell in [45], and is considered a method for forming general descriptions of concepts learned from a sequence of positive and negative training examples. In active learning, version space is a very important notion for marking reliable partially learned concepts and finding new informative training examples.

Given a pool-based active learning task, let L = {(x_i, y_i)}_{i=1}^{l} ⊂ R^m × {+1, −1} be the initial training set which contains l labeled examples, U = {x_{l+i}}_{i=1}^{u} ⊂ R^m be the selection pool which contains u unlabeled examples, and X be the whole training set, where X = L ∪ U.

For the set of examples X = L ∪ U and the concept class (hypothesis class) C = {h : x ∈ X → {+1, −1}}, the version space (hypothesis space) of C over X is the subset of C which contains all the hypotheses that can correctly classify the so-far labeled examples in X, denoted by H_X = {h ∈ C such that h correctly classifies all x ∈ L}.

When the determinants of the hypothesis are all composed of logical phrases, such as the one described in [36, pp. 20–51], the version space is a finite set which contains discrete concepts. However, for many hypothesis classes, the version space can also be treated as a parameter space which is continuous and contains infinite concepts [3]. In this research, we only focus on a special type of hypothesis class, for which there exists a subset of H_X that includes the optimal hypotheses obtained by considering different label combinations of all x ∈ U, and we call this subset the selective version space, denoted by H*_X. Under our problem environment, H*_X originally has size 2^u, since U contains u unlabeled examples.

Fig. 1. Illustrative example. (a) Data points in the plane. (b) Some of the hypotheses in the current version space. (c) The four hypotheses in the selective version space.

Fig. 1 illustrates the idea of H*_X with the settings that the data points lie in the plane, and the hypotheses are two-dimensional SVMs. At first, three negative points and four positive points are labeled, and the version space is an infinite set which contains all the linear separators within the disagreement region. However, if there exist two unlabeled examples in the selection pool, only four separators are included in H*_X.

3.2. A concept learning model: GS ordering

Mitchell [36] described a concept learning model, called GS ordering, which considers the learning process as a search in the concept set, guided by an order relation defined among the concepts being learned. This model is originally defined over the concept form of boolean-valued functions. Here, we extend it to real-valued hypotheses/classifiers whose sign is taken as the output class label. Focusing on the binary-class problem, and given that C is the real-valued hypothesis class, we now discuss the learning model under our pool-based active learning environment.

Definition 1. Suppose that H_X is the version space of C over X, and h1, h2 ∈ H_X are two hypotheses. If and only if h1 and h2 satisfy (∀x ∈ X) [(sign{h2(x)} = +1) → (sign{h1(x)} = +1)], then h1 is more-general-than-or-equal-to h2, denoted by h1 ≥_g h2, and h2 is more-specific-than-or-equal-to h1, denoted by h2 ≤_s h1.

Regarding this definition, a common explanation can be given: h1 is more-general-than-or-equal-to h2 when the following statement holds: for every example x in the training set X, if h2 classifies it as positive, h1 will also classify it as positive. In a stricter case, if and only if (h1 ≥_g h2) ∧ ¬(h2 ≥_g h1), then h1 is strictly more general than h2 (denoted by h1 >_g h2), and h2 is strictly more specific than h1 (denoted by h2 <_s h1).
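On a finite training set, Definition 1 can be checked directly. The sketch below is our own illustration for real-valued hypotheses; the function names are ours, and we treat sign{h(x)} = +1 as h(x) > 0:

```python
def more_general_or_equal(h1, h2, X):
    """Definition 1 on a finite training set X: h1 >=_g h2 iff every
    x in X that h2 labels positive is also labeled positive by h1.
    h1 and h2 map an example to a real decision value."""
    return all(h1(x) > 0 for x in X if h2(x) > 0)

def strictly_more_general(h1, h2, X):
    # h1 >_g h2 iff h1 >=_g h2 holds but h2 >=_g h1 does not
    return (more_general_or_equal(h1, h2, X)
            and not more_general_or_equal(h2, h1, X))
```

Note that the check depends on the fixed finite set X, which matches the first constraint discussed below: on a continuous sample space, no such exhaustive comparison is possible.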

As aforementioned, for the model described in [36], the determinants of the hypotheses are all composed of logical phrases, where both the sample space and the version space are finite and discrete sets; thus the comparison of the GS relationship between two hypotheses is easy and intuitive. However, when it is extended to real-valued hypotheses, further constraints and explanations should be clearly given:

• Since the sample space is usually a continuous and infinite set, there is no guarantee that the output of one hypothesis will be covered by another. Thus, the GS relationship needs to be defined on some finite, discrete, and fixed training sets, rather than on the sample spaces.

• Usually, H_X contains infinite hypotheses. On one hand, building up a GS structure on infinite hypotheses is impossible; on the other hand, not all the hypotheses are comparable. Thus, we only consider the ones that are included in H*_X.

When we consider all the examples of U as positive (negative) and train a classifier h1 (h2) from L ∪ U, intuitively h1 >_g h2 and h2 <_s h1. Then, as we gradually assign labels to examples of U and


shift them into L, h1 (h2) becomes more and more specific (general). Based on this idea, we can build up a GS order of the referred hypotheses. Note that generally the words positive and negative are symmetric, and the words general and specific are also symmetric. In other words, the order could also be considered as from-specific-to-general (SG).

Cohn et al. [1] applied a similar notion to neural network training, where a most general and a most specific neural network are designed. Then, they described how these two networks might be used to select examples from the region of uncertainty where they disagree on some unlabeled ones. However, this paradigm has not been adopted by other learning strategies so far.

3.3. Version space searching process

In fact, for a hypothesis class whose H*_X exists, a learning structure can be discovered according to the GS order. As shown in Fig. 2, a hierarchical architecture of H*_X can be formed. In Fig. 2, each point represents a hypothesis of H*_X, and a GS transition is formed between the two extreme cases by arranging all the 2^u hypotheses. In detail, the highest hypothesis is the most general one, which considers all x ∈ U as positive; the ones in the second row consider just one x ∈ U as negative, etc.; and the lowest one considers all x ∈ U as negative. It is worth noting that not all pairs of hypotheses have a more-general-than or more-specific-than relation, even for two which are located in different rows. However, each hypothesis can find the ones which are more general or more specific than it by tracing up or down in the structure. Generally speaking, if two points can be connected with a set of arrows in Fig. 2, then they have a more-general-than or more-specific-than relation; otherwise they do not.

A learning process is established by searching through H*_X. Two boundaries of the searching, the general boundary Gb and the specific boundary Sb, are defined as the current searching positions, initialized as the two extreme cases shown in Fig. 2. Unlabeled examples are continuously selected from U, labeled, and added to L. If the selected example is positive, Sb is forced to become more general, whereas Gb remains unchanged; if the selected example is negative, Gb is forced to become more specific, whereas Sb remains unchanged. Without a stopping criterion, the two boundaries become closer and closer. Finally, when no more examples exist in U, the two boundaries converge at one hypothesis h*, and the searching is finished.

Fig. 2. Version space searching process. (The figure depicts the hierarchical architecture of the selective version space, from the initial general boundary down to the initial specific boundary, with the finally converged hypothesis and the hypotheses referred to during searching.)

In this case, once an example is selected and labeled, the number of hypotheses in H*_X is reduced by half, since only half of them can correctly classify it, and an integral part of the structure in Fig. 2 is retained.

3.4. Inconsistency value of unlabeled example

Usually, for any pool-based active learning strategy, the learning structure in Fig. 2 objectively exists. Nevertheless, the key idea of active learning is to achieve a classification pattern by selecting as few examples as possible; thus, we need to focus on two main problems:

• In many cases, the size of U is large; thus the learning needs to be stopped after a certain number of queries has been made.

• For different strategies, when the learning modes and related settings are fixed, the designed selection criteria will be the main distinctions.

Traditional QBC involves 2k randomly generated hypotheses of H_X to judge the unlabeled examples, but this process incurs huge expense if k is big, and does not have enough control of the input domain if k is small. It is worth noting that in the learning process, the two extreme hypotheses are representative and are used to represent the searching boundaries. Thus, if just two hypotheses are to be retained, these two extreme ones should be the first choice, and the selection criterion should also be designed based on them.

Intuitively, the informative examples regarding two hypotheses should be the ones on which they are inconsistent. This thinking is in accord with QBC. However, under a pool-based environment, a more comparable criterion should be established. Fortunately, for real-valued hypotheses, further attempts can be made. As we know, for a real-valued hypothesis h and an unlabeled example x, f(x) = sign{h(x)} ∈ {+1, −1} is always taken as the final output label, and h(x) ∈ (−∞, +∞) is often treated as the decision value, where a higher absolute value |h(x)| demonstrates a higher certainty of x belonging to the corresponding class. This assertion is true especially for margin-based hypotheses such as SVM.

We still follow our pool-based active learning environment; let h1 and h2 be the two extreme hypotheses of the current H*_X. Since h1 is the most general one and h2 is the most specific one, it is clear that h1 will classify x ∈ U as positive and h2 will classify x ∈ U as negative. Note that h1 and h2 are trained in exactly the same space with the same settings; thus h1(x) and h2(x) are comparable. In this case, |h1(x)| and |h2(x)| can be treated as the certainties of x belonging to the positive class and the negative class, respectively. Denote by Pr+(x) and Pr−(x) the probabilities of x having a positive label and a negative label; thus we have Pr+(x) = |h1(x)| / (|h1(x)| + |h2(x)|) and Pr−(x) = |h2(x)| / (|h1(x)| + |h2(x)|), where Pr+(x) + Pr−(x) = 1.

Now, we propose a new sample selection criterion in Definition 2, called the inconsistency value, for real-valued hypotheses.

Definition 2. Suppose that h1 and h2 are the two extreme hypotheses of H*_X, where h1 >_g h2 and h2 <_s h1. Given an unlabeled example x ∈ U, its inconsistency value i(x) regarding h1 and h2 is defined as Eq. (1):

i(x) = −Pr+(x) log Pr+(x) − Pr−(x) log Pr−(x)    (1)

where i(x) ∈ [0, 1].

In a word, the inconsistency value of an unlabeled example regarding the two extreme hypotheses is defined as the entropy of its label probability. Obviously, when the difference between Pr+(x) and Pr−(x) becomes smaller, i(x) becomes bigger, and vice versa. Thus, if an example has higher label uncertainty, it should have higher inconsistency.
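Eq. (1) can be computed directly from the two decision values. The sketch below is our own; we assume the base-2 logarithm, which keeps i(x) within [0, 1] for the binary case:

```python
import math

def inconsistency(h1_x, h2_x):
    """Inconsistency value of Eq. (1).

    h1_x, h2_x: decision values of the most general and the most
    specific hypotheses on x. Pr+ = |h1(x)| / (|h1(x)| + |h2(x)|),
    Pr- = 1 - Pr+; entropy is taken in base 2 so i(x) lies in [0, 1].
    """
    a, b = abs(h1_x), abs(h2_x)
    p_pos = a / (a + b)
    return sum(-p * math.log2(p) for p in (p_pos, 1.0 - p_pos) if p > 0)
```

When the two certainties are equal the value peaks at 1; the more one hypothesis dominates, the closer i(x) falls toward 0.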


3.5. Inconsistency-based active learning

Based on Definitions 1 and 2 and the version space searching process, we now propose a general active learning strategy, called inconsistency-based active learning (I-AL), which can possibly be applied to various kinds of supervised learning algorithms.

We follow the general framework of pool-based active learning described in Section 2. First, all the examples in U are supposed to be positive, which results in set L1; classifier h1 is trained on L ∪ L1 according to f. Then, all the examples in U are supposed to be negative, which results in set L2; classifier h2 is trained on L ∪ L2 according to f. Afterwards, h1 and h2 are used to investigate all the examples in U. Finally, the most inconsistent ones regarding h1 and h2 (for real-valued classifiers such as SVM, the ones with the biggest value of Eq. (1)) are selected, submitted to the domain expert for labeling, added to L, and deleted from U. This process repeats until a pre-defined threshold is reached or no more examples exist in U.

Generally speaking, following the GS learning structure, we apply the inconsistency value as a sample selection criterion for active learning. Different from traditional QBC, I-AL uses two extreme hypotheses to evaluate the unlabeled examples. These two extreme hypotheses can be easily generated by taking all the unlabeled examples as positive or as negative, and they have a more-general-than relation. The examples that are most inconsistent between them are treated as informative, and thus are likely to be selected.

4. Inconsistency-based active learning for SVM

In this section, we restrict I-AL to a particular SVM-based scheme, followed by some related analyses.

4.1. Algorithm description and feasibility analysis

When the hypothesis class of I-AL is set as SVM, we need to guarantee that H*_X exists. However, in most cases, this condition cannot be satisfied in the input space. Fortunately, it is possible to analyze the model in the feature spaces induced by some kernels.

Note that ⟨φ(xi), φ(xj)⟩ = k(xi, xj) is the kernel trick, where xi and xj are two input patterns, k(xi, xj) is a kernel function, and φ: x → φ(x) is the associated mapping. Applying the kernel trick to the training of SVM always involves calculating the kernel matrix defined in Definition 3.

Definition 3 (Schölkopf and Smola [39]). Given a function k: χ² → K (where χ is a nonempty set and K is the mapped set) and patterns x1, ..., xn ∈ χ, the n × n matrix K with elements K_{i,j} := k(xi, xj) is called the Gram matrix (or kernel matrix) of k with respect to x1, ..., xn.

Fig. 3. The 2-d illustrative example. (a) Data distribution in the model (ideal case). (b) Data distribution in the model (real case). (c) Comparison of three unlabeled examples.

As the most widely used one, we discuss the Gaussian rbf kernel here, which is expressed as k(xi, xj) = exp(−||xi − xj||² / 2σ²).

Theorem 1 (Schölkopf and Smola [39]). Suppose that x1, ..., xn ∈ χ are distinct points and σ ≠ 0. The matrix K given by the Gaussian rbf kernel, K_{i,j} := exp(−||xi − xj||² / 2σ²), has full rank.

Theorem 1 shows an important property of the Gaussian rbf kernel: with no restriction on the number of input patterns, and provided that all the input patterns are distinct, the associated mapping φ maps them into a feature space in which all the mapped points are linearly independent. Obviously, Theorem 1 gives a sufficient condition for {x1, ..., xn} being linearly separable in feature space: linear independence implies that any binary partition of the dataset is linearly separable in feature space. Besides the Gaussian rbf kernel, the linear independence property of other kernels can be analyzed as in [46]. Thus, in the feature spaces induced by certain kernels, H*_X exists, and the GS learning structure can be formed.

During each learning iteration, two linear hyperplanes, i.e., h1(x) = 0 and h2(x) = 0, are generated, where h1 >_g h2 and h2 <_s h1. We call h1(x) = 0 the general hyperplane and h2(x) = 0 the specific hyperplane. The first iteration generates the two extreme cases, whereas as learning proceeds, h1 and h2 become more and more similar. It is worth noting that, for a two-class problem, the most specific hyperplane is the counterpart of the most general hyperplane: the specific and general hyperplanes are exchanged if the positive and negative classes are exchanged.

Furthermore, h1(x) = 0 and h2(x) = 0 separate the whole feature space into four regions: (1) h1(x) > 0 and h2(x) > 0; (2) h1(x) < 0 and h2(x) < 0; (3) h1(x) > 0 and h2(x) < 0; (4) h1(x) < 0 and h2(x) > 0. From the fact that h1 >_g h2, one can deduce that no example should exist in region 4, whereas the positively labeled examples should locate in region 1, the negatively labeled examples in region 2, and the unlabeled examples in region 3. Fig. 3(a) shows the model in two-dimensional space.

In fact, the data distribution in Fig. 3(a) is just an ideal case, in which the training accuracy of the two hyperplanes can reach 1. Although in theory the data can be linearly separated with certain kernels, in real implementations this condition often cannot be satisfied, since we always apply a soft-margin SVM to avoid over-fitting. Thus, certain training errors may appear, which results in the case shown in Fig. 3(b). Under these circumstances, the examples located in regions 1 and 2 are not considered; only the ones in regions 3 and 4 are taken to form the selection pool. If the training error can be kept at a low level, we can deduce that most of the unlabeled examples are in region 3.
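For illustration (an assumption of ours, not code from the paper; boundary cases with h(x) = 0 are resolved arbitrarily here), the region membership and the resulting selection pool can be read off from the pair of decision values:

```python
def region(h1_x, h2_x):
    """Which of the four regions of Section 4.1 an example falls in,
    given its decision values under the general hyperplane h1 and the
    specific hyperplane h2."""
    if h1_x > 0 and h2_x > 0:
        return 1  # classified positive by both hyperplanes
    if h1_x < 0 and h2_x < 0:
        return 2  # classified negative by both hyperplanes
    if h1_x > 0 and h2_x < 0:
        return 3  # disagreement region: unlabeled examples mostly live here
    return 4      # h1 < 0 and h2 > 0: appears only through training errors

# Only regions 3 and 4 enter the selection pool.
pool = [x for x in [(3.0, -2.0), (1.0, 2.0), (-1.5, 0.5)] if region(*x) in (3, 4)]
print(pool)  # -> [(3.0, -2.0), (-1.5, 0.5)]
```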

The particular scheme, i.e., inconsistency-based active learning for SVM (I-ALSVM), is described in Algorithm 1.



One may argue that, from the perspective of ensemble learning, the two extreme hypotheses are poor and unconvincing for achieving good performance. In fact, the key points of our model and of ensemble learning are totally different. Although the model starts with two hypotheses, these hypotheses are only used to design a sample selection criterion. In other words, they are temporary tools for selecting samples rather than constituent parts of the final learner. Thus, there is no need to evaluate the model from the perspective of ensemble learning.

Algorithm 1. Inconsistency-based active learning for SVM.

Input:
  Initial training set L = {(xi, yi)}_{i=1}^{l} ⊂ R^m × {+1, −1}, which contains l labeled examples;
  Unlabeled pool U = {x_{l+i}}_{i=1}^{u} ⊂ R^m, which contains u unlabeled examples;
  A pre-defined constant n* representing the number of selected examples during each iteration.
Output:
  SVM classifier h trained on the final training set.

1:  while U is not empty do
2:    if stopping criterion is met then
3:      stop;
4:    else
5:      All x ∈ U are supposed to be positive, which results in set L1;
6:      SVM classifier f1(x) = sign{h1(x)} is trained on L ∪ L1, where h1(x) = 0 is the general hyperplane;
7:      All x ∈ U are supposed to be negative, which results in set L2;
8:      SVM classifier f2(x) = sign{h2(x)} is trained on L ∪ L2, where h2(x) = 0 is the specific hyperplane;
9:      The unlabeled examples which are classified differently by h1 and h2 form the current selection pool U′;
10:     Assign each x ∈ U′ an inconsistency value by Definition 2;
11:     Sort all the examples in U′ by inconsistency value in descending order;
12:     Select the first n* examples from U′ to form U*;
13:     Submit the examples in U* for labeling;
14:     Let U = U − U* and L = L ∪ U*;
15:   end if
16: end while

4.2. Inconsistency value of unlabeled example

With regard to Algorithm 1, the inconsistency value of an unlabeled example can be further analyzed. The two hypotheses in each iteration, i.e., h1 and h2, are respectively represented as Eqs. (2) and (3)

h1(x) = w1ᵀφ(x) + b1    (2)

h2(x) = w2ᵀφ(x) + b2    (3)

where w1, w2 are real vectors and b1, b2 are real constants. The distance of an example φ(x) to an SVM hyperplane h(x) = wᵀφ(x) + b = 0, denoted d(x), can be represented as d(x) = |h(x)|/||w||. Since ||w|| = 2/margin, we have |h(x)| = 2d(x)/margin. This further affirms that |h1(x)| and |h2(x)| are comparable: they are the distances of x to the SVM hyperplanes, scaled by the corresponding margin values. Since the examples in regions 3 and 4 are considered, Pr+(x) and Pr−(x) can be evaluated as Eqs. (4) and (5) respectively

Pr+(x)_{Region 3} = Pr−(x)_{Region 4} = |w1ᵀφ(x) + b1| / (|w1ᵀφ(x) + b1| + |w2ᵀφ(x) + b2|)    (4)

Pr−(x)_{Region 3} = Pr+(x)_{Region 4} = |w2ᵀφ(x) + b2| / (|w1ᵀφ(x) + b1| + |w2ᵀφ(x) + b2|)    (5)

The inconsistency value of x regarding h1 and h2 can then be computed.

The case in Fig. 3(b) is further considered in Fig. 3(c) as a two-dimensional illustration. In order to give a direct impression of this selection criterion, we retain three examples x1, x2, and x3, which are assumed to have the following decision values under h1 and h2:

(1) x1: h1(x1) = 5 and h2(x1) = −5, giving i(x1) = 1;
(2) x2: h1(x2) = 1 and h2(x2) = −5, giving i(x2) ≈ 0.6500;
(3) x3: h1(x3) = 10 and h2(x3) = −1, giving i(x3) ≈ 0.4395.

As can be seen, x1 is the most inconsistent one, where Pr+(x1) = Pr−(x1) = 1/2; x2 has a comparatively high certainty of belonging to the negative class, where Pr−(x2) = 5/6 > Pr+(x2) = 1/6; and x3 is the least inconsistent one, having a very high certainty of belonging to the positive class, where Pr+(x3) = 10/11 > Pr−(x3) = 1/11. Thus, x1 should be selected from the three as the most informative example.

4.3. Hyperplane changing trends

Suppose that in the whole learning process of Algorithm 1, N iterations are conducted. In the n-th iteration (n = 1, 2, ..., N), two SVM hypotheses are generated, known as the general one and the specific one. We define the general one of the n-th iteration as h_n^g in Eq. (6), and the specific one of the n-th iteration as h_n^s in Eq. (7)

h_n^g(x) = Σ_{i=1}^{l+u} y_{n,i}^g α_{n,i}^g k(x_i, x) + b_n^g    (6)

h_n^s(x) = Σ_{i=1}^{l+u} y_{n,i}^s α_{n,i}^s k(x_i, x) + b_n^s    (7)

where l + u is the total number of training examples; y, α, and b are the corresponding values of the label, the Lagrange multiplier, and the bias of the SVM hyperplane; and k(·,·) is the kernel function.

In order to make further investigations, the SVM trained on the whole training set (L ∪ U), denoted h*, is also calculated, as in Eq. (8). It should be clarified that in most real cases h* cannot be obtained in advance, since the label information of U is not available; however, we use well-prepared data to analyze the model, so h* can be generated

h*(x) = Σ_{i=1}^{l+u} y_i α_i k(x_i, x) + b*    (8)

We aim to investigate the changing trends, during the learning process, of the general and specific hyperplanes, as well as of the current classification hyperplane. The general and specific hypotheses are defined in Eqs. (6) and (7), while the current classification one can be defined as h_n(x) = Σ_{i=1}^{l+u} y_{n,i} α_{n,i} k(x_i, x) + b_n, where n = 1, ..., N.

The criterion we propose for these investigations is the included angle between two hyperplanes in feature space. The kernel trick is employed, since information in feature space can be expressed by inner products in the input space. As an example, we discuss the included angle between h_n^g(x) = 0 and h_n^s(x) = 0; any other hyperplane combination can be evaluated in the same way. The included angle between two linear hyperplanes equals the included angle between their normal vectors. In the n-th iteration, the normal vectors of the general hyperplane and the specific hyperplane are

Fig. 5. Changing trends between hyperplanes (included angle vs. number of added training examples). (a) Changing of θ_n^{g*}. (b) Changing of θ_n^{s*}. (c) Changing of θ_n^{gs}. (d) Changing of θ_n.


Fig. 4. Generated dataset for algorithm investigation.

respectively represented as Eqs. (9) and (10)

w_n^g = Σ_{i=1}^{l+u} y_{n,i}^g α_{n,i}^g φ(x_i)    (9)

w_n^s = Σ_{i=1}^{l+u} y_{n,i}^s α_{n,i}^s φ(x_i)    (10)

Thus, the cosine of their included angle, denoted θ_n^{gs}, is computed as Eq. (11)

cos(θ_n^{gs}) = ⟨w_n^g, w_n^s⟩ / (||w_n^g|| · ||w_n^s||)
              = Σ_{i=1}^{l+u} Σ_{j=1}^{l+u} y_{n,i}^g y_{n,j}^s α_{n,i}^g α_{n,j}^s k(x_i, x_j) / ( sqrt(Σ_{i=1}^{l+u} Σ_{j=1}^{l+u} y_{n,i}^g y_{n,j}^g α_{n,i}^g α_{n,j}^g k(x_i, x_j)) · sqrt(Σ_{i=1}^{l+u} Σ_{j=1}^{l+u} y_{n,i}^s y_{n,j}^s α_{n,i}^s α_{n,j}^s k(x_i, x_j)) )    (11)

Hence, θ_n^{gs} can be calculated by an inverse cosine transform. Finally, a set of values θ_n^{gs}, n = 1, 2, ..., N, is obtained, and a figure can be drawn to show the relative changing trend between the two hyperplanes.
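Eq. (11) reduces the feature-space computation to three double sums over the Gram matrix. A minimal sketch of this computation (pure Python; the toy Gram matrix and dual coefficients below are illustrative assumptions, not values from the paper):

```python
import math

def cos_angle(y_a, alpha_a, y_b, alpha_b, K):
    """Cosine of the included angle between two kernel-expansion normal
    vectors w_a = sum_i y_a[i]*alpha_a[i]*phi(x_i) and w_b likewise,
    given the Gram matrix K[i][j] = k(x_i, x_j)  (cf. Eq. (11))."""
    n = len(K)
    c_a = [y_a[i] * alpha_a[i] for i in range(n)]   # dual coefficients of w_a
    c_b = [y_b[i] * alpha_b[i] for i in range(n)]   # dual coefficients of w_b
    dot = sum(c_a[i] * c_b[j] * K[i][j] for i in range(n) for j in range(n))
    na = math.sqrt(sum(c_a[i] * c_a[j] * K[i][j] for i in range(n) for j in range(n)))
    nb = math.sqrt(sum(c_b[i] * c_b[j] * K[i][j] for i in range(n) for j in range(n)))
    return dot / (na * nb)

# Identical expansions give a zero included angle (cosine 1).
K = [[1.0, 0.5], [0.5, 1.0]]                        # a toy 2x2 Gram matrix
c = cos_angle([+1, -1], [1.0, 1.0], [+1, -1], [1.0, 1.0], K)
theta = math.degrees(math.acos(min(1.0, c)))        # clamp against rounding
print(theta)  # -> 0.0
```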

A two-dimensional artificial dataset, shown in Fig. 4, is generated for these investigations. Two classes of examples are randomly distributed in two clusters. The hyperplanes are trained with the Gaussian rbf kernel, and four parameter combinations are tested: (1) σ = 1, C = 1; (2) σ = 32, C = 1; (3) σ = 1, C = 100; (4) σ = 32, C = 100. About 10% of the data are randomly selected from each class as the initial training set L, and the rest are taken as the selection pool U. During each iteration, one example is selected, deleted from U, and added to L. Fifty repetitions are conducted for each parameter setting, and the average is computed.

The included angles between (1) h_n^g(x) = 0 and h_n^s(x) = 0, denoted θ_n^{gs}; (2) h_n^g(x) = 0 and h*(x) = 0, denoted θ_n^{g*}; (3) h_n^s(x) = 0 and h*(x) = 0, denoted θ_n^{s*}; and (4) h_n(x) = 0 and h*(x) = 0, denoted θ_n, are recorded. Without a stopping criterion, the learning continues until no example remains in U.

Fig. 5 clearly shows the investigation results. As can be observed, in most cases the included angle between hyperplanes in feature space becomes smaller and smaller as the number of added training examples increases, and finally reaches zero. Although the curves may not be strictly monotonically decreasing, a global decreasing trend can be guaranteed, especially for the first and third parameter settings. For the current classification hyperplane, Fig. 5(d) exhibits a decreasing trend in all cases, which implies that as learning proceeds, the current classification hyperplane gets closer to the final hyperplane. However, it is worth noting that this investigation cannot fully represent the difference between two hyperplanes, since it only considers the included angle and neglects the position, which is very hard to measure in feature space.

The averaged accuracy of the hyperplanes during learning is reported in Table 1. We see that for each parameter setting, although the training accuracy of the general and specific hyperplanes cannot reach 100%, a certain level is maintained, which guarantees that during each iteration a certain number of unlabeled examples is available for selection.

5. Experimental comparisons

In this section, we conduct some experimental comparisons toshow the feasibility of the proposed model.

Table 3. Mean accuracy and final accuracy (%).

Dataset     I-ALSVM       C-ALSVM       RS-SVM        SVM-active    Co-SVM        Pool-QBC      I-modified
            Mean   Final  Mean   Final  Mean   Final  Mean   Final  Mean   Final  Mean   Final  Mean   Final
Australian  53.32  54.70  54.33  55.35  53.05  53.81  45.51  44.25  52.65  52.55  53.90  54.67  54.75  55.61
Haberman    72.10  74.30  69.08  73.48  71.72  74.20  71.94  74.07  72.60  74.43  68.65  73.84  72.73  74.43
Soybean     70.30  81.07  61.15  69.11  72.56  81.58  78.79  91.52  69.08  75.62  65.37  69.71  80.26  91.61
Pima        72.78  74.42  71.48  72.82  72.42  73.91  73.31  74.99  72.19  75.34  71.68  73.40  73.43  74.69
Ecoli       85.32  88.15  82.57  83.67  85.33  87.28  86.78  88.27  84.84  87.43  85.13  87.34  86.86  88.21
Yeast       83.45  86.57  78.06  82.11  84.30  86.56  85.59  87.46  78.41  80.91  80.60  86.64  84.08  86.72
Spam        66.47  70.60  61.28  61.98  68.73  73.00  69.30  74.21  65.18  68.53  65.38  67.79  69.86  74.17
Optdigits   93.09  95.59  86.57  89.17  90.15  93.70  94.31  97.33  91.25  94.25  86.39  91.68  94.70  97.97
Average     74.60  78.17  70.57  73.46  74.78  78.00  75.69  79.01  73.27  76.13  72.14  75.63  77.08  80.42

Table 2. Selected UCI benchmark datasets for performance comparison.

Dataset Dim # Class k Initial set size Pool size Test set size

Australian 15 2 5 10 542 138

Haberman 4 2 2 4 241 61

Soybean 36 19 5 10 536 137

Pima 9 2 10 20 594 154

Ecoli 8 8 5 10 259 67

Yeast 9 10 3 6 1181 297

Spam 58 2 10 20 3661 920

Optdigits 65 10 10 20 3803 1797

Table 1. Averaged accuracy (%) of hyperplanes during learning.

Setting          h* TrainAcc  h* TestAcc  h^g TrainAcc  h^s TrainAcc

s¼ 1, C¼1 99.9989 99.9985 86.9536 86.3816

s¼ 32, C¼1 89.4206 85.1884 80.6046 86.4774

s¼ 1, C¼100 100.000 100.000 89.4325 89.2077

s¼ 32, C¼100 99.8948 99.9604 82.3121 88.6951

5.1. Learning strategies for performance comparison

Several other pool-based learning strategies for SVM, as wellas the proposed one, are listed in this section for performancecomparison.

(1) Random sampling (RS-SVM). Unlabeled examples are selected randomly during each iteration, until the pre-defined stopping criterion is reached.

(2) Active learning SVM (SVM-active) [18,3]. This strategy can also be considered a simple distance-based, or margin-based, method. It selects the examples closest to the current class boundary. Note that the decision value of an unlabeled example x can be computed as h(x) = wᵀx + b. With the distance measurement d(x) = |wᵀx + b|/||w||, one knows that the distance of an unlabeled example to the hyperplane is proportional to its absolute decision value; this also holds in feature space. Thus the strategy is easy to implement in a pool-based environment: the algorithm picks the n* examples with the smallest |h(x)| from the selection pool during each iteration.

(3) Active learning with two views (Co-SVM) [4]. This strategy separates the data features into two subsets, each representing a view of the data. Two SVM classifiers, h1(x) and h2(x), are learned in these two feature subspaces, and the unlabeled examples are evaluated by them; the ones classified differently by the two classifiers are selected for labeling. In our implementation, the features are randomly separated into two subsets. During each iteration, the first n* examples with the highest absolute differences between the two views, i.e., |h1(x) − h2(x)| × |sign(h1(x)) − sign(h2(x))|, are selected.

(4) Pool-based QBC (Pool-QBC-SVM). Traditionally, the QBC algorithm is performed in a stream-based environment; whether it performs well in a pool-based environment is still unclear. In our experiments, we implement QBC with pool-based settings. During each iteration, two hypotheses of the current version space, h1(x) and h2(x), are randomly generated, and the examples on which h1(x) and h2(x) disagree are considered for labeling. Since the number of disagreed examples may not be exactly n*, similarly to Co-SVM, the first n* examples with the highest values of |h1(x) − h2(x)| × |sign(h1(x)) − sign(h2(x))| are chosen. In addition, for SVM, how to sample the two random hypotheses is also important. Different from stream-based settings [47], we apply two methods: the first randomly assigns labels to all the unlabeled examples twice and trains two SVMs on the whole training set with the same parameter setting; the second randomly selects two parameter settings and trains two SVMs on the current labeled set.
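For illustration, the ranking quantity shared by Co-SVM and Pool-QBC-SVM can be written as a small helper (hypothetical function name, not from the paper's code); the |sign(h1(x)) − sign(h2(x))| factor zeroes out every example on which the two classifiers agree:

```python
def disagreement(h1_x, h2_x):
    """|h1(x) - h2(x)| * |sign(h1(x)) - sign(h2(x))|: positive only when
    the two classifiers assign different labels to x."""
    sign = lambda v: 1 if v > 0 else -1
    return abs(h1_x - h2_x) * abs(sign(h1_x) - sign(h2_x))

print(disagreement(0.8, -0.3))  # disagreement: |1.1| * |2| = 2.2
print(disagreement(0.8, 0.3))   # agreement: score 0
```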

(5) Inconsistency-based active learning (I-ALSVM). Algorithm 1 is implemented.


Fig. 6. Performance comparison of different learning strategies. (a) Australian (50 runs), (b) haberman (50 runs), (c) soybean (50 runs), (d) pima (50 runs), (e) ecoli (50 runs), (f) yeast (50 runs), (g) spam (10 runs), (h) optdigits (10 runs).

(6) Consistency-based active learning (C-ALSVM). We modify the selection criterion in Algorithm 1 to choose the least inconsistent examples, i.e., those with the smallest inconsistency values; they can also be considered the most consistent examples. Comparing C-ALSVM with I-ALSVM demonstrates the correctness of the inconsistency selection criterion in the proposed model.

Fig. 7. Deviation of different strategies during learning. (a) Australian (50 runs), (b) haberman (50 runs), (c) soybean (50 runs), (d) pima (50 runs), (e) ecoli (50 runs), (f) yeast (50 runs), (g) spam (10 runs), (h) optdigits (10 runs).

(7) Modified I-ALSVM (I-ALSVM-modified). We expect the proposed model to show some performance improvement over QBC, since its idea originates from the QBC learning structure. However, as [41] mentioned, the margin-based strategy is a very strong method that is difficult to defeat in many applications. In order to further

Fig. 8. CPU time records of different learning strategies. (a) Australian (50 runs), (b) haberman (50 runs), (c) soybean (50 runs), (d) pima (50 runs), (e) ecoli (50 runs), (f) yeast (50 runs), (g) spam (10 runs), (h) optdigits (10 runs).

enhance the proposed strategy, we make some minor modifications: during each learning iteration, we consider a larger set of unlabeled examples with high inconsistency values, and select from them the first n* examples that are closest to the current decision boundary.

5.2. Experiments on benchmark datasets

The methods are implemented and performed on eightselected benchmark datasets obtained from UCI machine learning

Fig. 9. Included angle between h_n(x) = 0 and h*(x) = 0. (a) Ecoli and (b) soybean.

Fig. 10. The first 100 training examples of the MNIST dataset.

repository.1 As listed in Table 2, the datasets Australian, haberman, pima, and spam are binary. The datasets soybean, ecoli, yeast, and optdigits are multi-class and have been converted into binary problems for implementation. For each dataset, if there are no pre-defined training and testing sets, 20% of the data are randomly selected as the testing set, and the remaining 80% are taken as the training set.

An active learning task often begins with a randomly sampled initial training set, but the selected examples may be quite unbalanced. We adopt the k-means clustering technique to handle this problem. The positive training data and the negative training data are each partitioned into k subsets using the k-means clustering algorithm. Since in an active learning task the initial labeled examples are always quite limited, we select one example from each of the k clusters. Finally, k positive and k negative examples are taken as the initial training set L, and the remaining ones form the selection pool U. We originally set k = 10, but empty clusters may be generated; in such cases we reduce the value of k for some datasets. Detailed information is listed in Table 2.
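A rough sketch of this class-balanced initialization, assuming a plain Lloyd's k-means (the helper names and the toy dataset are illustrative, not the paper's MATLAB code):

```python
import math, random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        centroids = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def initial_set(pos, neg, k):
    """Pick one representative per cluster and per class: the point
    nearest each of the k centroids, giving k positive + k negative seeds."""
    seeds = []
    for points in (pos, neg):
        for c in kmeans(points, k):
            seeds.append(min(points,
                             key=lambda p: sum((a - b) ** 2 for a, b in zip(p, c))))
    return seeds

# Two well-separated clusters per class.
pos = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
neg = [(9.0, 0.0), (9.1, 0.0), (0.0, 9.0), (0.1, 9.0)]
print(initial_set(pos, neg, k=2))
```

With k = 2 this yields one seed from each of the four clusters, avoiding an initial set drawn entirely from one region of a class.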

For comparison purposes, the Gaussian rbf kernel is used consistently with parameter C = 100 for all the learning strategies. In order to show a detailed learning process, n* = 1 example is selected during each iteration. The learning stops when the number of added training examples reaches 100; in real application problems, we assume that this is the largest number of examples that users are willing to label. For the datasets Australian, haberman, soybean, pima, ecoli, and yeast, we conduct 50 repetitions of the learning, set the kernel parameter σ = 1, and adopt the first sampling method for QBC. Since the computational cost of this setting is quite expensive on bigger data, for the datasets spam and optdigits we conduct 10 repetitions of the learning, set the kernel parameter σ = 32, and adopt the second sampling method for QBC.

We perform the experiments under MATLAB 7.9.0 with the ``svmtrain'' and ``svmpredict'' functions of libsvm, executed on a computer with a 3.16-GHz Intel Core 2 Duo CPU, a maximum of 4.00 GB memory, and a 64-bit Windows 7 system.

Table 3 reports the mean accuracy during the learning and the final accuracy when the learning stops. Figs. 6 and 7 further demonstrate the accuracy and deviation over the entire learning process. Generally, several key points can be concluded. (1) It is easy to observe that for all the tasks except Australian, the performance of I-ALSVM shows an obvious advantage over C-ALSVM. This observation demonstrates the correctness of selecting the most inconsistent examples in the proposed model. (2) On most of the selected datasets, the proposed model achieves better performance than the traditional QBC algorithm. One observes that under a pool-based environment the traditional QBC does not perform very well, and the random sampling of the hypotheses is not very effective, while by considering the two extreme hypotheses the performance can be improved to some extent. (3) SVM-active is a very strong strategy; in fact, the proposed model cannot outperform it in many cases. However, with some minor modifications, the performance can be further improved. In addition, SVM-active does not perform well on all the tasks either, such as the dataset Australian, on which a decreasing learning trend is exhibited.

1 ftp://ftp.ics.uci.edu/pub/machine-learning-databases.

We also report the CPU time costs of the different learning strategies in Fig. 8. The time costs of the proposed model are higher than those of the others, except the QBC algorithm with the first sampling manner. The reason is obvious: during each learning iteration, it trains two classifiers on the whole training set. We expect that in practical application problems this drawback is somewhat mitigated, since the cost of the manual labeling work is much higher than that of training. Unfortunately, due to limited resources, it is impossible for us to carry out such a study in the real world; we can only test our model on well-prepared datasets.

During each learning iteration, the current hypothesis h_n(x) = Σ_{i=1}^{l+u} y_{n,i} α_{n,i} k(x_i, x) + b_n, trained on the current training set, is also recorded. The included angle between h_n(x) = 0 and h*(x) = 0, i.e., θ_n, can be calculated to investigate the changing trend of the current SVM hyperplane. Usually, calculating the included angle between two hyperplanes in feature space involves a large amount of computation, and when the dataset becomes big this calculation becomes very expensive. Thus, we only select two very small datasets, ecoli and soybean, to conduct this



Fig. 11. MNIST handwritten digits recognition task with 162 features. (a) Accuracy and (b) deviation.


Fig. 12. MNIST handwritten digits recognition task with six features. (a) Accuracy and (b) deviation.

Table 4. Learning results: MNIST handwritten digits recognition with 162 features.

Result I-ALSVM C-ALSVM RS-SVM SVM-active Co-SVM Pool-QBC I-modified

MeanAcc (%) 94.99 93.03 94.51 96.63 93.68 92.08 96.53

FinalAcc (%) 97.76 93.50 96.39 99.04 97.54 97.31 98.27

Time (s) 29.92 13.35 0.31 1.62 1.83 2.31 30.79

Table 5. Learning results: MNIST handwritten digits recognition with six features.

Result        I-ALSVM   C-ALSVM   RS-SVM   SVM-active   Co-SVM   Pool-QBC   I-modified
MeanAcc (%)   90.53     73.83     91.29    90.80        50.45    87.22      92.82
FinalAcc (%)  91.82     89.93     93.36    92.83        53.66    90.07      94.84
Time (s)      11.28     3.05      0.02     0.47         0.45     0.79       12.54

R. Wang et al. / Pattern Recognition 45 (2012) 3751–3767

investigation. We observe from Fig. 9 that during the whole learning process, the included angle between $h^{*}(x)=0$ and $h_n(x)=0$ becomes smaller and smaller, and if $h_n(x)=0$ converges faster to $h^{*}(x)=0$, the learning will possibly follow a better trend.
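Since each hyperplane normal is $w=\sum_i \alpha_i y_i \varphi(x_i)$, the included angle can be computed from kernel evaluations alone, with no explicit feature mapping. A minimal numpy sketch (the function names and the RBF kernel choice are illustrative, not the paper's implementation):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def hyperplane_angle(X1, c1, X2, c2, kernel=rbf_kernel):
    """Included angle (radians) between two kernel hyperplanes
    h(x) = sum_i c_i k(x_i, x) + b, where c_i = alpha_i * y_i.
    Uses <w1, w2> = c1^T K(X1, X2) c2 in feature space."""
    cross = c1 @ kernel(X1, X2) @ c2
    n1 = np.sqrt(c1 @ kernel(X1, X1) @ c1)
    n2 = np.sqrt(c2 @ kernel(X2, X2) @ c2)
    return float(np.arccos(np.clip(cross / (n1 * n2), -1.0, 1.0)))
```

Each angle evaluation costs three Gram-matrix products over the support sets, which is why the computation becomes expensive on large datasets.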

5.3. Experiments on handwritten digits data

We test our strategy on MNIST,2 which is a medium-size dataset. The task is to distinguish the handwritten digits 0–9, as shown in Fig. 10. This dataset contains 60,000 training examples and

2 http://yann.lecun.com/exdb/mnist.

10,000 testing examples from approximately 250 writers. We consider digit "6" as positive and the others as negative. The original black-and-white handwritten digit images from NIST were normalized to fit in a 20 × 20 pixel box while preserving their aspect ratio. Then, with an anti-aliasing technique and a normalization algorithm, the images were further centered in 28 × 28 gray-level images. Thus, the raw information of each example in MNIST is composed of 784 gray-level pixels, with each pixel value in {0, ..., 255}. We use the gradient-based method3 presented in

3 http://www.cs.berkeley.edu/smaji/projects/digits.



[48,49] to extract the gradient histogram features; finally, a 2172-dimensional feature vector is obtained for each example.

Since realizing SVM-based active learning on 60,000 examples with such a high-dimensional feature vector is usually impossible on an ordinary computer, we conduct some feature selection on the 2172 features. All the features are divided into 22 subsets, with the first 21 subsets containing 100 features each and the last subset containing 72 features. First, forward sequential feature selection is performed on each subset, and altogether 162 features are selected. Then, a second-step selection is performed on the 162 features, and six features are further retained.
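The greedy search underlying forward sequential selection can be sketched as follows; the centroid-gap criterion below is only a toy stand-in for the cross-validated accuracy a real selector would optimize:

```python
import numpy as np

def forward_select(X, y, n_keep, score):
    """Greedy forward sequential feature selection: repeatedly add the
    feature that maximizes `score` on the currently chosen subset."""
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < n_keep:
        scored = [(score(X[:, chosen + [f]], y), f) for f in remaining]
        _, best = max(scored)
        chosen.append(best)
        remaining.remove(best)
    return chosen

def centroid_gap(Xs, y):
    # toy criterion: distance between the two class centroids
    return float(np.linalg.norm(Xs[y == 1].mean(0) - Xs[y == -1].mean(0)))
```

Running this per 100-feature subset and then once more over the pooled survivors mirrors the two-step procedure described above.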

In order to give an overall evaluation, we conduct two sets of experiments for this dataset, with lower-dimensional features (six features) and higher-dimensional features (162 features), respectively. With the 162 features, the initial training set contains two positive examples and two negative examples selected by the k-means clustering technique, and the learning stops when 40 new examples

[Figure: testing accuracy versus number of added training examples; curves for I-ALSVM, Co-SVM, and Pool-QBC-SVM.]

Fig. 13. Forest CoverType recognition task with 54 features. (a) Accuracy and (b) deviation.

[Figure: testing accuracy versus number of added training examples.]

Fig. 14. Forest CoverType recognition task with five features. (a) Accuracy and (b) deviation.

Table 6. Forest CoverType dataset information.

Class               # Records
Spruce-fir          211,840
Lodgepole pine      283,301
Ponderosa pine      35,754
Cottonwood/willow   9493
Aspen               17,367
Krummholz           20,510

are added; with the six features, the initial training set includes 10 positive examples and 10 negative examples, and the learning stops when 100 new examples are added. The SVM parameter C is set to 100, and for QBC, the second hypothesis sampling method is adopted. The averaged performances over 10 runs are shown in Figs. 11 and 12 and Tables 4 and 5. Note that "Time" in Tables 4 and 5 represents the averaged time cost for selecting one new example during the learning process.
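A plausible reading of this k-means seeding (our sketch, not the authors' code) clusters each class separately and takes the example nearest to each centroid:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # plain Lloyd's algorithm with random initial centroids
    rng = np.random.RandomState(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        lab = d.argmin(1)
        for j in range(k):
            if (lab == j).any():
                C[j] = X[lab == j].mean(0)
    return C

def initial_training_set(X, y, per_class=2):
    """Seed the learner with `per_class` examples per class: run k-means
    inside each class and take the example nearest to each centroid."""
    idx = []
    for cls in (1, -1):
        cand = np.where(y == cls)[0]
        C = kmeans(X[cand], per_class)
        for c in C:
            idx.append(int(cand[((X[cand] - c) ** 2).sum(1).argmin()]))
    return sorted(set(idx))
```

Seeding from cluster centers rather than purely at random gives every method the same representative starting set.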

We observe that with the higher-dimensional features, all the referred methods achieve quite similar results, and the performance of SVM-active is slightly better. A possible explanation is that, with such a high-dimensional feature vector, a fairly small number of examples can already achieve acceptable performance, so the selection of informative examples may have no obvious influence on the learning. With the lower-dimensional features, however, some obvious differences appear. We see that the proposed method can outperform the others. In fact, under this condition, the performances of Co-SVM and Pool-QBC are even worse than the random sampling method, and SVM-active exhibits a very unstable trend. However, although the

[Figure: deviation versus number of added training examples for the 54-feature task; curves for I-ALSVM, Co-SVM, and Pool-QBC-SVM.]

[Figure: deviation versus number of added training examples for the five-feature task.]

Table 7. Corel datasets information.

Dataset    # Positive examples   # Negative examples   # Features   Train set size   Test set size
Elephant   762                   629                   230          417              974
Fox        647                   673                   230          396              924
Tiger      544                   676                   230          366              854


[Figure: testing accuracy and deviation versus number of added training examples for the elephant, fox, and tiger tasks.]

Fig. 15. Content-based image retrieval tasks. (a) Elephant-accuracy, (b) elephant-deviation, (c) fox-accuracy, (d) fox-deviation, (e) tiger-accuracy, and (f) tiger-deviation.


proposed one shows the most effective result under this setting, its time cost is much higher.

5.4. Experiments on noisy data

When the SVM margin is small, the behavior of the proposed strategy is not clear. We therefore test our algorithm on a noisy dataset: Forest CoverType.4 This dataset contains 581,012 examples with 54 features, and each example is associated with one of the seven forest cover types listed in Table 6 [50,51]. In our experiment, we randomly select 10% of the data for the learning, where "Spruce-Fir" is taken as positive and the other types as negative. To further reduce the computations, a forward sequential feature selection is performed, and five features are retained. We also conduct two sets of experiments for this dataset, with lower-dimensional features (five features) and higher-dimensional

4 http://kdd.ics.uci.edu/databases/covertype/covertype.html.

features (54 features), respectively; the experimental settings are similar to those for the MNIST dataset.
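As a hedged sketch of this preprocessing (binarize "Spruce-Fir" against the rest and keep a random 10% of the data; the helper name and the seed are our own):

```python
import numpy as np

def one_vs_rest_subsample(X, y, positive_class, frac=0.1, seed=0):
    """Binarize a multi-class problem (positive_class -> +1, the rest -> -1)
    and keep roughly a `frac` fraction of the examples at random."""
    rng = np.random.RandomState(seed)
    keep = rng.rand(len(y)) < frac
    yb = np.where(y == positive_class, 1, -1)
    return X[keep], yb[keep]
```

The same helper covers the MNIST setting as well (digit "6" against the rest), with `frac=1.0`.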

As stated in [52], when using SVM to solve this problem, the number of support vectors is large, and running the algorithms on this dataset is very time consuming. Thus, for this task, we only consider three strategies: I-ALSVM, Co-SVM, and QBC with the second sampling method. The learning structures of these three strategies are quite similar, which makes their learning processes more comparable. By similar learning structure, we mean that during each iteration, two classifiers are trained, a selection criterion is designed based on these two classifiers, and then the examples on which they most disagree are labeled. The averaged learning performances over 10 runs are shown in Figs. 13 and 14.

We see that for this task, the QBC algorithm with the second hypothesis sampling method gives a very unsatisfactory performance, showing a decreasing trend. For Co-SVM, the improvement is quite marginal, while I-ALSVM is the most effective strategy among the three.



5.5. Simulation of image retrieval task

Three Corel multiple-instance image datasets,5 namely Elephant, Fox, and Tiger, are used to simulate content-based image retrieval tasks. The goal is to distinguish these three kinds of animals from other background pictures. In order to guarantee a sufficient number of training examples, we treat each instance, rather than each bag, as an example. About 30% of the data are used for training and the remaining 70% for testing. Details of these datasets are listed in Table 7. Each task begins by randomly selecting one positive example and one negative example as the initial training set. The learning strategy selects one new example each time and queries the user about its label. The learning stops when the number of queries reaches 100. The Gaussian kernel is used with parameters s = 32 and C = 100. I-ALSVM, Co-SVM, and QBC are conducted.
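The retrieval protocol above is an instance of the generic pool-based loop; a minimal sketch follows, where the `fit` and `select` callables are placeholders for an SVM trainer and for any of the three selection criteria, not the authors' code:

```python
import numpy as np

def active_learning_loop(X_pool, oracle, seeds, fit, select, budget=100):
    """Generic pool-based active learning: start from seed indices, then
    repeatedly fit a model, pick one informative pool example, and query
    the oracle (the user) for its label."""
    labeled = list(seeds)
    labels = {i: oracle(i) for i in labeled}
    for _ in range(budget):
        model = fit(X_pool[labeled], np.array([labels[i] for i in labeled]))
        pool = [i for i in range(len(X_pool)) if i not in labels]
        if not pool:
            break
        i = select(model, X_pool, pool)   # one query per iteration
        labels[i] = oracle(i)
        labeled.append(i)
    return labeled, labels
```

With `budget=100` and two random seed examples, this reproduces the query schedule of the Corel experiments for any plugged-in learner and criterion.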

Fig. 15 shows the averaged performance over 50 runs. As can be seen, the proposed I-ALSVM achieves a better performance than the other two. The results exhibit good feasibility of the proposed strategy, which may motivate further research on topics such as multiple-instance active learning [6] and multiple-view active learning [33,34].

6. Conclusion

In this paper, a new pool-based active learning strategy called inconsistency-based active learning (I-AL), as well as a specific algorithm called inconsistency-based active learning for SVM (I-ALSVM), which uses the inconsistency value of an unlabeled example as the selection criterion, is proposed. I-AL can be regarded as a strategy that extends the learning philosophy of QBC. Instead of considering several randomly generated hypotheses of the current version space, it conducts the learning process by generating two extreme ones. These two extreme hypotheses can be easily produced by taking all the unlabeled examples as positive or as negative, so the strategy can be implemented without any additional knowledge. Experimental results show the feasibility of the algorithm. Compared with several other pool-based strategies, it can possibly achieve better generalization capability, but its time complexity may be higher.
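The construction of the two extreme hypotheses can be sketched as follows; a nearest-centroid learner stands in for the SVM trainer, and a 0/1 disagreement stands in for the real-valued inconsistency value, so this is only a schematic reading of the strategy:

```python
import numpy as np

def centroid_clf(X, y):
    # stand-in learner: classify by the nearer class centroid
    mp, mn = X[y == 1].mean(0), X[y == -1].mean(0)
    return lambda Z: np.where(
        np.linalg.norm(Z - mn, axis=1) > np.linalg.norm(Z - mp, axis=1), 1, -1)

def extreme_hypotheses(X_lab, y_lab, X_unl, fit=centroid_clf):
    """Two extreme hypotheses of the current version space: one trained
    with all unlabeled examples taken as positive, one with all taken
    as negative."""
    Xa = np.vstack([X_lab, X_unl])
    h_plus = fit(Xa, np.concatenate([y_lab, np.ones(len(X_unl))]))
    h_minus = fit(Xa, np.concatenate([y_lab, -np.ones(len(X_unl))]))
    return h_plus, h_minus

def inconsistency(h_plus, h_minus, X_unl):
    # simplified criterion: an example is inconsistent when the two
    # extreme hypotheses disagree on it
    return (h_plus(X_unl) != h_minus(X_unl)).astype(float)
```

Examples on which the two extreme hypotheses disagree lie in the still-undecided part of the version space, and querying them is what shrinks it fastest.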

Acknowledgements

This work is partially supported by City University of Hong Kong Research Grant 9610025, and the National Natural Science Foundation of China under Grant 71171080.

References

[1] D. Cohn, L. Atlas, R. Ladner, Improving generalization with active learning, Machine Learning 15 (2) (1994) 201–221.

[2] L. Atlas, D. Cohn, R. Ladner, M. El-Sharkawi, R. Marks II, Training connectionist networks with queries and selective sampling, Advances in Neural Information Processing Systems (NIPS) 2 (1989) 566–573.

[3] S. Tong, D. Koller, Support vector machine active learning with applications to text classification, Journal of Machine Learning Research 2 (2002) 45–66.

[4] J. Cheng, K. Wang, Active learning for image retrieval with Co-SVM, Pattern Recognition 40 (1) (2007) 330–334.

[5] R. Liu, Y. Wang, T. Baba, D. Masumoto, S. Nagata, SVM-based active feedback in image retrieval using clustering and unlabeled data, Pattern Recognition 41 (8) (2008) 2645–2655.

[6] D. Zhang, F. Wang, Z. Shi, C. Zhang, Interactive localized content based image retrieval with multiple-instance active learning, Pattern Recognition 43 (2) (2010) 478–484.

5 http://www.cs.columbia.edu/andrews/mil/datasets.html.

[7] G. Chen, T. Wang, L. Gong, P. Herrera, Multi-class support vector machine active learning for music annotation, International Journal of Innovative Computing, Information and Control 6 (2010) 921–930.

[8] M. Mandel, G. Poliner, D. Ellis, Support vector machine active learning for music retrieval, Multimedia Systems 12 (1) (2006) 3–13.

[9] D. Angluin, Queries and concept learning, Machine Learning 2 (4) (1988) 319–342.

[10] Y. Freund, H. Seung, E. Shamir, N. Tishby, Selective sampling using the query by committee algorithm, Machine Learning 28 (2) (1997) 133–168.

[11] H. Seung, M. Opper, H. Sompolinsky, Query by committee, in: Proceedings of the 5th Annual Workshop on Computational Learning Theory, 1992, pp. 287–294.

[12] X. Zhu, P. Zhang, X. Lin, Y. Shi, Active learning from stream data using optimal weight classifier ensemble, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 40 (6) (2010) 1607–1621.

[13] A. McCallum, K. Nigam, Employing EM in pool-based active learning for text classification, in: Proceedings of the 15th International Conference on Machine Learning (ICML), 1998, pp. 350–358.

[14] M. Sugiyama, S. Nakajima, Pool-based active learning in approximate linear regression, Machine Learning 75 (3) (2009) 249–274.

[15] S. Wang, J. Wang, X. Gao, X. Wang, Pool-based active learning based on incremental decision tree, in: Proceedings of the 2010 International Conference on Machine Learning and Cybernetics (ICMLC), vol. 1, 2010, pp. 274–278.

[16] Y. Guo, D. Schuurmans, Discriminative batch mode active learning, Advances in Neural Information Processing Systems (NIPS) 20 (2007) 593–600.

[17] S. Hoi, R. Jin, M. Lyu, Batch mode active learning with applications to text categorization and image retrieval, IEEE Transactions on Knowledge and Data Engineering 21 (9) (2009) 1233–1248.

[18] C. Campbell, N. Cristianini, A. Smola, Query learning with large margin classifiers, in: Proceedings of the 17th International Conference on Machine Learning (ICML), 2000, pp. 111–118.

[19] M. Lindenbaum, S. Markovitch, D. Rusakov, Selective sampling for nearest neighbor classifiers, Machine Learning 54 (2) (2004) 125–152.

[20] N. Roy, A. McCallum, Toward optimal active learning through sampling estimation of error reduction, in: Proceedings of the 18th International Conference on Machine Learning (ICML), 2001, pp. 441–448.

[21] V. Iyengar, C. Apte, T. Zhang, Active learning using adaptive resampling, in: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 91–98.

[22] D. Lewis, J. Catlett, Heterogeneous uncertainty sampling for supervised learning, in: Proceedings of the 11th International Conference on Machine Learning (ICML), 1994, pp. 148–156.

[23] M. Li, I. Sethi, Confidence-based active learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (8) (2006) 1251–1261.

[24] R. Wang, S. Kwong, Sample selection based on maximum entropy for support vector machines, in: Proceedings of the 2010 International Conference on Machine Learning and Cybernetics (ICMLC), vol. 3, 2010, pp. 1390–1395.

[25] H. Nguyen, A. Smeulders, Active learning using pre-clustering, in: Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

[26] A. Tavares da Silva, A. Xavier Falcão, L. Pini Magalhães, Active learning paradigms for CBIR systems based on optimum-path forest classification, Pattern Recognition 44 (12) (2011) 2971–2978.

[27] J. Tang, Z. Zha, D. Tao, C. Tat-Seng, Semantic-gap oriented active learning for multi-label image annotation, IEEE Transactions on Image Processing 21 (4) (2012) 2354–2360.

[28] Z. Zha, M. Wang, Y. Zheng, Y. Yang, R. Hong, T. Chua, Interactive video indexing with statistical active learning, IEEE Transactions on Multimedia 14 (1) (2012) 17–27.

[29] W. Hu, W. Hu, N. Xie, S. Maybank, Unsupervised active learning based on hierarchical graph-theoretic clustering, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39 (5) (2009) 1147–1161.

[30] A. Joshi, F. Porikli, N. Papanikolopoulos, Multi-class active learning for image classification, in: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 2372–2379.

[31] A. Joshi, F. Porikli, N. Papanikolopoulos, Multi-class batch-mode active learning for image classification, in: Proceedings of the 2010 IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 1873–1878.

[32] G. Qi, X. Hua, Y. Rui, J. Tang, H. Zhang, Two-dimensional multilabel active learning with an efficient online adaptation model for image classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (10) (2009) 1880–1897.

[33] Z. Wang, S. Chen, D. Gao, A novel multi-view learning developed from single-view patterns, Pattern Recognition 44 (10–11) (2011) 2395–2413.

[34] Q. Zhang, S. Sun, Multiple-view multiple-learner active learning, Pattern Recognition 43 (9) (2010) 3113–3119.

[35] R. Gilad-Bachrach, A. Navot, N. Tishby, Query by committee made real, Advances in Neural Information Processing Systems (NIPS) 18 (2005) 443–450.

[36] T. Mitchell, Machine Learning, McGraw Hill, 1997.

[37] R. Wang, S. Kwong, Q. He, Active learning based on support vector machines, in: Proceedings of the 2010 IEEE International Conference on Systems, Man and Cybernetics (SMC), 2010, pp. 1312–1316.

[38] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, 2000.

[39] B. Scholkopf, A. Smola, Learning with Kernels, The MIT Press, 2002.

[40] G. Schohn, D. Cohn, Less is more: active learning with support vector machines, in: Proceedings of the 17th International Conference on Machine Learning (ICML), 2000, pp. 839–846.



[41] S. Tong, Active Learning: Theory and Applications, Ph.D. Thesis, Citeseer, 2001.

[42] S. Cheng, F. Shih, An improved incremental training algorithm for support vector machines using active query, Pattern Recognition 40 (3) (2007) 964–971.

[43] Z. Wang, S. Yan, C. Zhang, Active learning with adaptive regularization, Pattern Recognition 44 (10–11) (2011) 2375–2383.

[44] D. Gorisse, M. Cord, F. Precioso, SALSAS: Sub-linear active learning strategy with approximate k-NN search, Pattern Recognition 44 (10–11) (2011) 2343–2357.

[45] T. Mitchell, Version Spaces: An Approach to Concept Learning, Technical Report, DTIC Document, 1978.

[46] D. Chen, Q. He, X. Wang, On linear separability of data sets in feature space, Neurocomputing 70 (13–15) (2007) 2441–2448.

[47] S. Ho, H. Wechsler, Query by transduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (9) (2008) 1557–1571.

[48] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.

[49] S. Maji, J. Malik, Fast and Accurate Digit Classification, Technical Report UCB/EECS-2009-159, EECS Department, University of California, Berkeley.

[50] J. Blackard, D. Dean, et al., Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture 24 (3) (1999) 131–152.

[51] B. Meyer, Forest Cover Type Prediction.

[52] M. Trebar, N. Steele, Application of distributed SVM architectures in classifying forest data cover types, Computers and Electronics in Agriculture 63 (2) (2008) 119–130.

Ran Wang received her Bachelor's degree from the School of Information Science & Technology, Beijing Forestry University, Beijing, China, in 2009. She is currently a Ph.D. candidate in the Department of Computer Science, City University of Hong Kong. Her research interests focus on machine learning and its related applications.

Sam Kwong received his B.Sc. degree and M.A.Sc. degree in electrical engineering from the State University of New York at Buffalo, USA, and the University of Waterloo, Canada, in 1983 and 1985, respectively. In 1996, he obtained his Ph.D. from the University of Hagen, Germany. From 1985 to 1987, he was a diagnostic engineer with Control Data Canada, where he designed the diagnostic software to detect manufacturing faults of the VLSI chips in the Cyber 430 machine. He later joined Bell Northern Research Canada as a Member of Scientific Staff. In 1990, he joined the City University of Hong Kong as a lecturer in the Department of Electronic Engineering. He is currently a Professor in the Department of Computer Science. His research interests are in pattern recognition, evolutionary algorithms, and video coding.

Degang Chen received the M.S. degree from Northeast Normal University, Changchun, Jilin, China, in 1994, and the Ph.D. degree from Harbin Institute of Technology, Harbin, China, in 2000. He worked as a Postdoctoral Fellow with Xi'an Jiaotong University, Xi'an, China, from 2000 to 2002, and with Tsinghua University, Beijing, China, from 2002 to 2004. Since 2006, he has worked as a professor at North China Electric Power University, Beijing, China. His research interests include fuzzy groups, fuzzy analysis, rough sets, and SVM.