
Learning How to Actively Learn: A Deep Imitation Learning Approach

Ming Liu, Wray Buntine, Gholamreza Haffari

Faculty of Information Technology, Monash University

{ming.m.liu, wray.buntine, gholamreza.haffari}@monash.edu

Abstract

Heuristic-based active learning (AL) methods are limited when the data distributions of the underlying learning problems vary. We introduce a method that learns an AL policy using imitation learning (IL). Our IL-based approach makes use of an efficient and effective algorithmic expert, which provides the policy learner with good actions in the encountered AL situations. The AL strategy is then learned with a feedforward network, mapping situations to the most informative query datapoints. We evaluate our method on two different tasks: text classification and named entity recognition. Experimental results show that our IL-based AL strategy is more effective than strong previous methods using heuristics and reinforcement learning.

1 Introduction

For many real-world NLP tasks, labeled data is rare while unlabelled data is abundant. Active learning (AL) seeks to learn an accurate model with minimal annotation cost. It is inspired by the observation that a model can achieve better performance if it is allowed to choose the data points on which it is trained. For example, the learner can identify the areas of the space where it does not have enough knowledge, and query those data points which bridge its knowledge gap.

Traditionally, AL is performed using engineered heuristics in order to estimate the usefulness of unlabeled data points as queries to an annotator. Recent work (Fang et al., 2017; Bachman et al., 2017; Woodward and Finn, 2017) has focused on learning the AL querying strategy, as engineered heuristics are not flexible enough to exploit characteristics inherent to a given problem. The basic idea is to cast AL as a decision process, where the most informative unlabeled data point needs to be selected based on the history of previous queries. However, previous works train the AL policy with a reinforcement learning (RL) formulation, where the rewards are provided at the end of sequences of queries. This makes learning the AL policy difficult, as the policy learner needs to deal with the credit assignment problem. Intuitively, the learner needs to observe many pairs of query sequences and the resulting end-rewards to be able to associate single queries with their utility scores.

In this work, we formulate learning AL strategies as an imitation learning problem. In particular, we consider the popular pool-based AL scenario, where an AL agent is presented with a pool of unlabelled data. Inspired by the Dataset Aggregation (DAGGER) algorithm (Ross et al., 2011), we develop an effective AL policy learning method by designing an efficient and effective algorithmic expert, which provides the AL agent with good decisions in the encountered states. We then use a deep feedforward network to learn the AL policy to associate states to actions. Unlike the RL approach, our method can get observations and actions directly from the expert's trajectory. Therefore, our trained policy can make better rankings of unlabelled datapoints in the pool, leading to more effective AL strategies.

We evaluate our method on text classification and named entity recognition. The results show our method performs better than strong AL methods using heuristics and reinforcement learning, in that it boosts the performance of the underlying model with fewer labelling queries. An open source implementation of our model is available at: https://github.com/Grayming/ALIL.


2 Pool-based AL as a Decision Process

We consider the popular pool-based AL setting where we are given a small set of initial labeled data, a large pool of unlabelled data, and a budget for getting the annotation of some unlabelled data by querying an oracle, e.g. a human annotator. The goal is to intelligently pick those unlabelled data for which, if the annotations were available, the performance of the underlying re-trained model would be improved the most.

The main challenge in AL is how to identify and select the most beneficial unlabelled data points. Various heuristics have been proposed to guide the unlabelled data selection (Settles, 2010). However, there is no one AL heuristic which performs best for all problems. The goal of this paper is to provide an approach to learn an AL strategy which is best suited for the problem at hand, instead of resorting to ad-hoc heuristics.

The AL strategy can be learned by attempting to actively learn on tasks sampled from a distribution over the tasks (Bachman et al., 2017). The idea is to simulate the AL scenario on instances of the problem created using available labeled data, where the label of some part of the data is kept hidden. This makes it possible to have an automatic oracle that reveals the labels of the queried data, resulting in an efficient way to quickly evaluate a hypothesised AL strategy. Once the AL strategy is learned on simulations, it is then applied to real AL scenarios. The more related the tasks in the real scenario are to those used to train the AL strategy, the more effective the AL strategy will be.

We are interested in training a model $m_{\phi}$ which maps an input $x \in \mathcal{X}$ to its label $y \in \mathcal{Y}_x$, where $\mathcal{Y}_x$ is the set of labels for the input $x$ and $\phi$ is the parameter vector of the underlying model. For example, in the named entity recognition (NER) task, the input is a sentence and the output is its label sequence, e.g. in the IBO format. Let $D = \{(x, y)\}$ be a support set of labeled data, which is randomly partitioned into labeled $D^{lab}$, unlabelled $D^{unl}$, and evaluation $D^{evl}$ datasets. Repeated random partitioning creates multiple instances of the AL problem. At each time step $t$ of an AL problem, the algorithm interacts with the oracle and queries the label of a datapoint $x_t \in D^{unl}_t$. As the result of this action, the following happens:

• The automatic oracle reveals the label $y_t$;

• The labeled and unlabelled datasets are updated to include and exclude the recently queried data point, respectively;

• The underlying model is re-trained based on the enlarged labeled data to update $\phi$; and

• The AL algorithm receives a reward $-\mathrm{loss}(m_{\phi}, D^{evl})$, which is the negative loss of the current trained model on the evaluation set, defined as

$$\mathrm{loss}(m_{\phi}, D^{evl}) := \sum_{(x, y) \in D^{evl}} \mathrm{loss}(m_{\phi}(x), y)$$

where $\mathrm{loss}(y', y)$ is the loss incurred due to predicting $y'$ instead of the ground truth $y$.
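To make the reward concrete, the following minimal Python sketch computes the per-step reward as the negative evaluation loss; it assumes a generic `model` object exposing a per-example `loss(x, y)` method, so the names are illustrative rather than the authors' implementation.

```python
def evaluation_loss(model, D_evl):
    """Sum of per-example losses of the current model on the evaluation set."""
    return sum(model.loss(x, y) for x, y in D_evl)

def reward(model, D_evl):
    """Reward received after a query: the negative evaluation loss."""
    return -evaluation_loss(model, D_evl)
```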

More formally, a pool-based AL problem is a Markov decision process (MDP), denoted by $(S, A, \Pr(s_{t+1}|s_t, a_t), R)$, where $S$ is the state space, $A$ is the set of actions, $\Pr(s_{t+1}|s_t, a_t)$ is the transition function, and $R$ is the reward function. The state $s_t \in S$ at time $t$ consists of the labeled dataset $D^{lab}_t$ and unlabelled dataset $D^{unl}_t$ paired with the parameters of the currently trained model $\phi_t$. An action $a_t \in A$ corresponds to the selection of a query datapoint, and the reward function is $R(s_t, a_t, s_{t+1}) := -\mathrm{loss}(m_{\phi_t}, D^{evl})$.

We aim to find the optimal AL policy prescribing which datapoint needs to be queried in a given state to get the most benefit. The optimal policy is found by maximising the following objective over the parameterised policies:

$$\mathbb{E}_{(D^{lab}, D^{unl}, D^{evl}) \sim \mathbb{D}} \Big[ \mathbb{E}_{\pi_{\theta}} \Big[ \sum_{t=1}^{B} R(s_t, a_t, s_{t+1}) \Big] \Big] \qquad (1)$$

where $\pi_{\theta}$ is the policy network parameterised by $\theta$, $\mathbb{D}$ is a distribution over possible AL problem instances, and $B$ is the maximum number of queries made in an AL run, a.k.a. an episode. Following Bachman et al. (2017), we maximise the sum of the rewards after each time step to encourage anytime behaviour, i.e. the model should perform well after each label query.

3 Deep Imitation Learning to Train the AL Policy

The question remains as to how we can train the policy network to maximise the training objective in eqn 1. Typical learning approaches resort to deep reinforcement learning (RL) and provide the training signal at the end of each episode to learn the optimal policy (Fang et al., 2017; Bachman et al., 2017), e.g., using policy gradient methods. These approaches, however, need a large number of training episodes to learn a reasonable policy, as they need to deal with the credit assignment problem, i.e. the discovery of the utility of individual actions in the sequence based on the reward achieved at the end of the episode. This exacerbates the difficulty of finding a good AL policy.

We formulate learning the AL policy as an imitation learning problem. At each state, we provide the AL agent with a correct action, computed by an algorithmic expert. The AL agent uses the sequence of states observed in an episode, paired with the expert's sequence of actions, to update its policy. This directly addresses the credit assignment problem, and reduces the complexity of the problem compared to the RL approaches. In what follows, we describe the ingredients of our deep imitation learning (IL) approach, which is summarised in Algorithm 1.

Algorithmic Expert. At a given AL state $s_t$, our algorithmic expert computes an action by evaluating the current pool of unlabeled data. More concretely, for each $x' \in D^{pool}_{rnd}$ and its correct label $y'$, the underlying model $m_{\phi_t}$ is re-trained to get $m^{x'}_{\phi_t}$, where $D^{pool}_{rnd} \subset D^{unl}_t$ is a small subset of the current large pool of unlabeled data. The expert action is then computed as:

$$\arg\min_{x' \in D^{pool}_{rnd}} \mathrm{loss}(m^{x'}_{\phi_t}, D^{evl}). \qquad (2)$$

In other words, our algorithmic expert tries a subset of actions to roll out one step from the current state, in order to efficiently compute a reasonable action. Searching for the optimal action would be $O(|D^{unl}|^B)$, which is computationally challenging due to (i) the large action set, and (ii) the exponential dependence on the length of the roll-out. We will see in the experiments that our method efficiently learns effective AL policies.
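The one-step roll-out in eqn 2 can be sketched as follows; `retrain`, `evaluation_loss`, and `oracle` are assumed callables standing in for the components described above, not the released ALIL code.

```python
import copy

def expert_action(model, D_lab, D_pool_rnd, D_evl, oracle, retrain, evaluation_loss):
    """Return the candidate whose addition to the labeled set gives the lowest evaluation loss."""
    best_x, best_loss = None, float("inf")
    for x in D_pool_rnd:
        y = oracle(x)                                        # simulated oracle reveals the label
        candidate = retrain(copy.deepcopy(model), D_lab + [(x, y)])
        l = evaluation_loss(candidate, D_evl)                # one-step roll-out from the current state
        if l < best_loss:
            best_x, best_loss = x, l
    return best_x                                            # the expert's action (eqn 2)
```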

Policy Network. Our policy network is a feedforward network with two fully-connected hidden layers. It receives the current AL state, and provides a preference score for a given unlabeled data point, allowing us to select the most beneficial one, corresponding to the highest score. The input to our policy network consists of three parts: (i) a fixed-dimensional representation of the content and the predicted label of the unlabeled data point under consideration, (ii) a fixed-dimensional representation of the content and the labels of the labeled dataset, and (iii) a fixed-dimensional representation of the content of the unlabeled dataset.

Figure 1: The policy network and its inputs.
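As a rough sketch of this architecture, the scorer below maps a state-action feature vector to a scalar preference; the hidden size is an assumption, since the paper only specifies two fully-connected hidden layers.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Feedforward scorer: two fully-connected hidden layers, one scalar score per candidate."""
    def __init__(self, feature_dim, hidden_dim=128):   # hidden_dim is an illustrative choice
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state_action_features):
        # state_action_features: (num_candidates, feature_dim) for one AL state
        return self.net(state_action_features).squeeze(-1)  # (num_candidates,) preference scores
```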

Imitation Learning Algorithm. A typical approach to imitation learning (IL) is to train the policy network so that it mimics the expert's behaviour, given training data of the encountered states (input) and the actions (output) performed by the expert. The policy network's prediction affects future inputs during the execution of the policy. This violates the crucial independent and identically distributed (iid) assumption, inherent to most statistical supervised learning approaches for learning a mapping from states to actions.

We make use of Dataset Aggregation (DAGGER) (Ross et al., 2011), an iterative algorithm for IL which addresses the non-iid nature of the encountered states during the AL process (see Algorithm 1). In round $\tau$ of DAGGER, the learned policy network $\hat{\pi}_\tau$ is applied to the AL problem to collect a sequence of states, which are paired with the expert actions. The collected pairs of states and actions are aggregated into the dataset of such pairs $M$, collected from the previous iterations of the algorithm. The policy network is then re-trained on the aggregated set, resulting in $\hat{\pi}_{\tau+1}$ for the next iteration of the algorithm. The intuition is to build up the set of states that the algorithm is likely to encounter during its execution, in order to increase the generalization of the policy network. To better leverage the training signal from the algorithmic expert, we allow the algorithm to collect state-action pairs according to a modified policy which is a mixture of $\hat{\pi}_\tau$ and the expert policy $\tilde{\pi}^*$, i.e.

$$\pi_\tau = \beta_\tau \tilde{\pi}^* + (1 - \beta_\tau) \hat{\pi}_\tau$$

where $\beta_\tau \in [0, 1]$ is a mixing coefficient. This amounts to tossing a coin with parameter $\beta_\tau$ in each iteration of the algorithm to decide which of these two policies to use for data collection.
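A minimal sketch of this roll-in, under the assumption that `expert_action` and `policy_score` behave like the components sketched above (hypothetical helpers, not the authors' code):

```python
import random

def collect_episode(beta, budget, policy_score, expert_action, sample_state, apply_query):
    """One DAGGER episode: a single coin toss decides who picks the queries; the expert's
    action is always recorded for the aggregated dataset M."""
    use_expert = random.random() < beta          # "head" with probability beta
    aggregated = []                              # (state, expert action) pairs to add to M
    for _ in range(budget):
        state, candidates = sample_state()       # current AL state and the random pool D_pool_rnd
        a_star = expert_action(state, candidates)
        query = a_star if use_expert else max(
            candidates, key=lambda x: policy_score(state, x))
        aggregated.append((state, a_star))
        apply_query(query)                       # reveal label, update D_lab/D_unl, retrain the model
    return aggregated
```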

Re-training the Policy Network. To train our policy network, we turn the preference scores into probabilities, and optimise the parameters such that the probability of the action prescribed by the expert is maximized. More specifically, let $M := \{(s_i, a_i)\}_{i=1}^{I}$ be the collected states paired with their expert-prescribed actions. Let $D^{pool}_i$ be the set of unlabelled datapoints in the pool within the state, and $a_i$ denote the datapoint selected by the expert in that set. Our training objective is $\sum_{i=1}^{I} \log \Pr(a_i | D^{pool}_i)$ where

$$\Pr(a_i | D^{pool}_i) := \frac{\exp \hat{\pi}(a_i; s_i)}{\sum_{x \in D^{pool}_i} \exp \hat{\pi}(x; s_i)}.$$

The above can be interpreted as the probability of $a_i$ being the best action among all possible actions in the state. Following (Mnih et al., 2015), we randomly sample multiple mini-batches from the replay memory $M$ (in our experiments, 10 mini-batches, each of size 100), in addition to the current round's state-action pairs, in order to retrain the policy network. For each mini-batch, we make one SGD step to update the policy, where the gradients of the network parameters are calculated using the backpropagation algorithm.
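One mini-batch update might look like the sketch below, which treats the softmax over candidate scores as a cross-entropy objective against the index of the expert's choice; `policy_net` is the scorer sketched earlier, and the data layout is an assumption.

```python
import torch
import torch.nn.functional as F

def policy_update(policy_net, optimiser, minibatch):
    """One SGD step on a mini-batch of (candidate features, expert index) pairs."""
    optimiser.zero_grad()
    loss = 0.0
    for features, expert_index in minibatch:
        # features: (num_candidates, feature_dim); expert_index: position of the expert's pick
        scores = policy_net(features)                              # (num_candidates,)
        loss = loss + F.cross_entropy(scores.unsqueeze(0),
                                      torch.tensor([expert_index]))
    (loss / len(minibatch)).backward()
    optimiser.step()
```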

Transferring the Policy. We now apply the policy learned on the source task to AL in the target task. We expect the learned policy to be effective for target tasks which are related to the source task in terms of the data distribution and characteristics. Algorithm 2 illustrates the policy transfer. The pool-based AL scenario in Algorithm 2 is cold-start; however, extending it to incorporate initially available labeled data is straightforward.

4 Experiments

We conduct experiments on text classification andnamed entity recognition (NER). The AL sce-narios include cross-domain sentiment classifica-tion, cross-lingual authorship profiling, and cross-lingual named entity recognition (NER), wherebyan AL policy trained on a source domain/languageis transferred to the target domain/language.

We compare our proposed AL method using imitation learning (ALIL) with the following:

• Random sampling: The query datapoint is chosen randomly.


Algorithm 1 Learn active learning policy via imitation learning

Input: large labeled data D, max episodes T, budget B, sample size K, the coin parameter β
Output: the learned policy
1:  M ← ∅                                        ▷ the aggregated dataset
2:  initialise π̂_1 with a random policy
3:  for τ = 1, ..., T do
4:      D^lab, D^unl, D^evl ← dataPartition(D)
5:      φ_1 ← trainModel(D^lab)
6:      c ← coinToss(β)
7:      for t ∈ 1, ..., B do
8:          D^pool_rnd ← sampleUniform(D^unl, K)
9:          s_t ← (D^lab, D^pool_rnd, φ_t)
10:         a_t ← argmin_{x' ∈ D^pool_rnd} loss(m^{x'}_{φ_t}, D^evl)
11:         if c is head then                    ▷ the expert
12:             x_t ← a_t
13:         else                                 ▷ the policy
14:             x_t ← argmax_{x' ∈ D^pool_rnd} π̂_τ(x'; s_t)
15:         end if
16:         D^lab ← D^lab + {(x_t, y_t)}
17:         D^unl ← D^unl − {x_t}
18:         M ← M + {(s_t, a_t)}
19:         φ_{t+1} ← retrainModel(φ_t, D^lab)
20:     end for
21:     π̂_{τ+1} ← retrainPolicy(π̂_τ, M)
22: end for
23: return π̂_{T+1}

Algorithm 2 Active learning by policy transfer

Input: unlabeled pool D^unl, budget B, policy π̂
Output: labeled dataset and trained model
1:  D^lab ← ∅
2:  initialise φ randomly
3:  for t ∈ 1, ..., B do
4:      s_t ← (D^lab, D^unl, φ)
5:      x_t ← argmax_{x' ∈ D^unl} π̂(x'; s_t)
6:      y_t ← askAnnotation(x_t)
7:      D^lab ← D^lab + {(x_t, y_t)}
8:      D^unl ← D^unl − {x_t}
9:      φ ← retrainModel(φ, D^lab)
10: end for
11: return D^lab and φ
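In Python, the transfer loop of Algorithm 2 could be rendered roughly as follows; `featurise`, `retrain`, and `ask_annotation` are hypothetical helpers standing in for the components described earlier, not the released code.

```python
def transfer_policy(policy, featurise, retrain, ask_annotation, D_unl, budget, model):
    """Greedy pool-based AL with a fixed, transferred policy."""
    D_lab = []
    for _ in range(budget):
        state = (D_lab, D_unl, model)
        x = max(D_unl, key=lambda cand: policy(featurise(state, cand)))  # argmax over the whole pool
        y = ask_annotation(x)                     # human annotator at test time
        D_lab.append((x, y))
        D_unl.remove(x)
        model = retrain(model, D_lab)             # update the underlying model
    return D_lab, model
```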

• Diversity sampling: The query datapoint is $\arg\min_{x} \sum_{x' \in D^{lab}} \mathrm{Jaccard}(x, x')$, where the Jaccard coefficient between the unigram features of the two given texts is used as the similarity measure.

• Uncertainty-based sampling: For text classification, we use the datapoint with the highest predictive entropy, $\arg\max_{x} -\sum_{y} p(y|x, D^{lab}) \log p(y|x, D^{lab})$, where $p(y|x, D^{lab})$ comes from the underlying model. We further use a state-of-the-art extension of this method, called uncertainty with rationales (Sharma et al., 2015), which not only considers uncertainty but also looks at whether the unlabelled document contains sentiment words or phrases that were returned as rationales for any of the existing labeled documents. For NER, we use Total Token Entropy (TTE) as the uncertainty sampling method, $\arg\max_{x} -\sum_{i=1}^{|x|} \sum_{y_i} p(y_i|x, D^{lab}) \log p(y_i|x, D^{lab})$, which has been shown to be the best heuristic for this task among 17 different heuristics (Settles and Craven, 2008). A generic sketch of both entropy heuristics is given after this list.

src    tgt         doc. number (src/tgt)   avg. len. (tokens)
elec.  music dev.  27k/1k                  35/20
book   movie       24k/2k                  140/150
en     sp          3.6k/4.2k               1.15k/1.35k
en     pt          3.6k/1.2k               1.15k/1.03k

Table 1: The data sets used in sentiment classification (top part) and gender profiling (bottom part).

• PAL: A reinforcement-learning-based approach (Fang et al., 2017), which makes use of a deep Q-network to make the selection decision for stream-based active learning.
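The entropy-based heuristics above can be written generically as follows, assuming the underlying model exposes class probabilities for classification and per-token label marginals for NER; this is a generic rendering of the formulas, not the baseline implementations used in the experiments.

```python
import math

def predictive_entropy(probs):
    """Entropy of a class distribution p(y|x) for one document."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def total_token_entropy(token_marginals):
    """TTE: sum of per-token label entropies p(y_i|x) over a sentence."""
    return sum(predictive_entropy(p_i) for p_i in token_marginals)

def most_uncertain(pool, prob_fn, entropy_fn=predictive_entropy):
    """Pick the pool item with the highest entropy under the given probability function."""
    return max(pool, key=lambda x: entropy_fn(prob_fn(x)))
```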

4.1 Text Classification

Datasets and Setup. The first task is sentiment classification, in which product reviews express either positive or negative sentiment. The data comes from the Amazon product reviews (McAuley and Yang, 2016); see Table 1 for data statistics.

The second task is authorship profiling, in which we aim to predict the gender of the text author. The data comes from the gender profiling task in PAN 2017 (Rangel et al., 2017), which consists of a large Twitter corpus in multiple languages: English (en), Spanish (es) and Portuguese (pt). For each language, all tweets collected from a user constitute one document; Table 1 shows data statistics. The multilingual embeddings for this task come from off-the-shelf CCA-trained embeddings (Ammar et al., 2016) for twelve languages, including English, Spanish and Portuguese. We fix these word embeddings during training of both the policy and the underlying classification model.

For training, 10% of the source data is used as the evaluation set for computing the best action in imitation learning. We run T = 100 episodes with a budget of B = 100 documents in each episode, set the sample size K = 5, and fix the mixing coefficient βτ = 0.5. For testing, we take 90% of the target data as the unlabeled pool, and the remaining 10% as the test set. We show the test accuracy w.r.t. the number of labelled documents selected in the AL process.

As the underlying model $m_\phi$, we use a fast and efficient text classifier based on convolutional neural networks. More specifically, we apply 50 convolutional filters with ReLU activation on the embeddings of all words in a document $x$, where the width of the filters is 3. The filter outputs are averaged to produce a 50-dimensional document representation $h(x)$, which is then fed into a softmax to predict the class.
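A compact PyTorch sketch of this classifier follows; it takes pre-computed (and fixed) word embeddings as input, and the padding choice is an assumption made to keep the convolution simple.

```python
import torch
import torch.nn as nn

class CNNTextClassifier(nn.Module):
    """50 width-3 convolutional filters with ReLU, average pooling to h(x), then softmax."""
    def __init__(self, emb_dim, num_classes, num_filters=50, width=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=width, padding=1)
        self.out = nn.Linear(num_filters, num_classes)

    def document_vector(self, word_embeddings):
        # word_embeddings: (batch, seq_len, emb_dim), fixed pre-trained embeddings
        feats = torch.relu(self.conv(word_embeddings.transpose(1, 2)))  # (batch, 50, seq_len)
        return feats.mean(dim=2)                                        # h(x): (batch, 50)

    def forward(self, word_embeddings):
        return torch.log_softmax(self.out(self.document_vector(word_embeddings)), dim=-1)
```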

Representing state-action. The input to the policy network, i.e. the feature vector representing a state-action pair, includes: the candidate document represented by the convolutional net $h(x)$, the distribution over the document's class labels $m_\phi(x)$, the sum of all document vector representations in the labeled set $\sum_{x' \in D^{lab}} h(x')$, the sum of all document vectors in the random pool of unlabelled data $\sum_{x' \in D^{pool}_{rnd}} h(x')$, and the empirical distribution of class labels in the labeled dataset.
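Concatenating these parts could look like the sketch below, where `classifier` is the CNN sketched above, documents are pre-embedded tensors of shape (1, seq_len, emb_dim), and labels are integer class indices; the helper is illustrative, not the released feature extractor.

```python
import torch

def text_state_action_features(classifier, candidate, D_lab, D_pool_rnd, num_classes):
    h = lambda doc: classifier.document_vector(doc).squeeze(0)      # h(x): (50,)
    h_x = h(candidate)                                              # candidate content
    p_x = classifier(candidate).exp().squeeze(0)                    # predicted label distribution
    zeros = torch.zeros_like(h_x)
    h_lab = sum((h(x) for x, _ in D_lab), zeros)                    # labeled-set content
    h_pool = sum((h(x) for x in D_pool_rnd), zeros)                 # random-pool content
    counts = torch.zeros(num_classes)
    for _, y in D_lab:
        counts[y] += 1                                              # empirical label distribution
    emp = counts / max(1, len(D_lab))
    return torch.cat([h_x, p_x, h_lab, h_pool, emp])
```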

Results. Fig 2 shows the results on product sentiment prediction and authorship profiling, in the cross-domain and cross-lingual AL scenarios (uncertainty with rationales cannot be used for authorship profiling, as the rationales come from a sentiment dictionary). Our ALIL method consistently outperforms both the heuristic-based and the RL-based (PAL) (Fang et al., 2017) approaches across all tasks. ALIL tends to converge faster than the other methods, which indicates its policy can quickly select the most informative datapoints. Interestingly, the uncertainty and diversity sampling heuristics perform worse than random sampling on sentiment classification. We speculate this may be due to these two heuristics not being able to capture the polarity information during the data selection process. PAL performs on par with uncertainty with rationales on musical device, both of which outperform the traditional diversity and uncertainty sampling heuristics. Interestingly, PAL is outperformed by random sampling on movie reviews, and by the traditional uncertainty sampling heuristic on the authorship profiling tasks. We attribute this to the ineffectiveness of the RL-based approach for learning a reasonable AL query strategy.

Figure 2: The performance of different active learning methods for cross-domain sentiment classification (left two plots) and cross-lingual authorship profiling (right two plots).

We further investigate combining the transfer of the policy network with the transfer of the underlying classifier. That is, we first train a classifier on all of the annotated data from the source domain/language. Then, this classifier is ported to the target domain/language; for cross-language transfer, we make use of multilingual word embeddings. We then start the AL process from the transferred classifier, referred to as warm-start AL. We compare the performance of the directly transferred classifier with those obtained after the AL process in the warm-start and cold-start scenarios. We have run the cold-start and warm-start AL 25 times, and report the average accuracy in Table 2. As seen from the results, both the cold- and warm-start AL settings outperform the direct transfer significantly, and the warm start consistently achieves higher accuracy than the cold start. The differences between the results are statistically significant, with a p-value of .001, according to the McNemar test (Dietterich, 1998); as the contingency table needed for the McNemar test, we use the average counts across the 25 runs.

                 musical   movie   es      pt
direct transfer  0.715     0.640   0.675   0.740
cold-start AL    0.800     0.760   0.728   0.773
warm-start AL    0.825     0.765   0.730   0.780

Table 2: Classifier performance under three different transfer settings.

4.2 Named Entity Recognition

Data and setup. We use NER corpora from the CoNLL 2002/2003 shared tasks, which include annotated text in English (en), German (de), Spanish (es), and Dutch (nl). The original annotation is based on IOB1, which we convert to the IO labelling scheme. Following Fang et al. (2017), we consider two experimental conditions: (i) the bilingual scenario, where English is the source (used for policy training) and the other languages are the target, and (ii) the multilingual scenario, where one of the languages (except English) is the target and the remaining ones are the source used in joint training of the AL policy. The underlying model $m_\phi$ is a conditional random field (CRF) treating NER as a sequence labelling task. The prediction is made using the Viterbi algorithm.

In the existing corpus partitions from CoNLL, each language has three subsets: train, testa and testb. During policy training with the source language(s), we combine these three subsets, shuffle, and re-split them into simulated training, unlabelled pool, and evaluation sets in every episode. We run N = 100 episodes with the budget B = 200, and set the sample size K = 5. When we transfer the policy to the target language, we do one episode and select B datapoints from train (treated as the pool of unlabeled data) and report F1 scores on testa.

Representing state-action. The input to the policy network includes the representation of the candidate sentence using the sum of its words' embeddings $h(x)$, the representation of the labelling marginals using the label-level convolutional network $\mathrm{cnn}_{lab}(\mathbb{E}_{m_\phi(y|x)}[y])$ (Fang et al., 2017), the representation of sentences in the labeled data $\sum_{(x', y') \in D^{lab}} h(x')$, the representation of sentences in the random pool of unlabelled data $\sum_{x' \in D^{pool}_{rnd}} h(x')$, the representation of ground-truth labels in the labeled data $\sum_{(x', y') \in D^{lab}} \mathrm{cnn}_{lab}(y')$ using the empirical distributions, and the confidence of the sequential prediction $\sqrt[|x|]{\max_{y} m_\phi(y|x)}$, where $|x|$ denotes the length of the sentence $x$. For the word embeddings, we use off-the-shelf CCA-trained multilingual embeddings (Ammar et al., 2016) with 40 dimensions; we fix these during policy training.

Figure 3: The performance of active learning methods in the bilingual and multilingual settings for three target languages: German (de), Spanish (es) and Dutch (nl).

Figure 4: The learning curves of agents with different K on Spanish (es) NER.
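The confidence feature in the state-action representation above is simply the length-normalised probability of the most likely (Viterbi) label sequence; a one-line sketch, assuming the CRF's Viterbi probability is available:

```python
def sequence_confidence(viterbi_prob, sentence_length):
    """|x|-th root of the probability of the most likely label sequence."""
    return viterbi_prob ** (1.0 / sentence_length)
```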

Results. Fig. 3 shows the results for the three target languages. In addition to the strong heuristic-based methods, we compare our imitation learning approach (ALIL) with the reinforcement learning approach (PAL) (Fang et al., 2017), in both the bilingual (bi) and multilingual (mul) transfer settings. Across all three languages, ALIL.bi and ALIL.mul outperform the heuristic methods, including uncertainty sampling based on TTE. This is expected, as uncertainty sampling largely relies on a high-quality underlying model, and diversity sampling ignores the labelling information. In the bilingual case, ALIL.bi outperforms PAL.bi on Spanish (es) and Dutch (nl), and performs similarly on German (de). In the multilingual case, ALIL.mul achieves the best performance on Spanish, and performs competitively with PAL.mul on German and Dutch.

4.3 Analysis

Insight on the selected data. We compare the data selected by ALIL to that selected by other methods. This confirms that ALIL learns policies which are suitable for the problem at hand, without resorting to fixed engineered heuristics. For this analysis, we report the mean reciprocal rank (MRR) of the data points selected by the ALIL policy under the rankings of the unlabelled pool generated by uncertainty and diversity sampling. Furthermore, we measure the fraction of times the decisions made by the ALIL policy agree with those which would have been made by the heuristic methods, measured by the accuracy (acc). Table 3 reports these measures. As we can see, for sentiment classification, since uncertainty and diversity sampling perform badly, ALIL has a large disagreement with them on the selected data points, while for gender classification on Portuguese and NER on Spanish, ALIL shows much more agreement with the other heuristics.

          movie sentiment   gender pt   NER es
acc Unc.  0.06              0.58        0.51
MRR Unc.  0.083             0.674       0.551
acc Div.  0.05              0.52        0.45
MRR Div.  0.057             0.593       0.530
acc PAL   0.15              0.56        0.52

Table 3: The first four rows show the MRR and accuracy of instances returned by ALIL under the rankings of uncertainty and diversity sampling; the last row gives the average accuracy of instances under PAL.

Lastly, we compare the queries chosen by ALIL to those chosen by PAL, to investigate the extent of the agreement between these two methods. This is simply measured by the fraction of identical query data points among the total number of queries (i.e. accuracy). Since PAL is stream-based and sensitive to the order in which it receives the data points, we report the average accuracy taken over multiple runs with random input streams. The expected accuracy numbers are reported in Table 3. As seen, ALIL has a higher overlap with PAL than with the heuristic-based methods, in terms of the selected queries.
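The two agreement measures can be sketched as follows; `heuristic_rankings[i]` is assumed to be the pool ranked by the heuristic at the i-th query, aligned with ALIL's selections.

```python
def mean_reciprocal_rank(selected, heuristic_rankings):
    """MRR of ALIL's choices under a heuristic's ranking of the pool at each query."""
    return sum(1.0 / (ranking.index(x) + 1)
               for x, ranking in zip(selected, heuristic_rankings)) / len(selected)

def agreement_accuracy(selected, other_choices):
    """Fraction of queries on which ALIL and the other method pick the same datapoint."""
    return sum(x == o for x, o in zip(selected, other_choices)) / len(selected)
```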

Sensitivity to K. As seen in Algorithm 1, we resort to an approximate algorithmic expert, which selects the best action in a random subset of the pool of unlabelled data of size K, in order to make policy training efficient. Note that, in policy training, setting K to one or to the size of the unlabelled data pool corresponds to the stream-based and pool-based AL scenarios, respectively. By changing K to values between these two extremes, we can analyse the effect of the quality of the algorithmic expert on the trained policy; Figure 4 shows the results. A larger candidate set may correspond to a better learned policy, which needs to be traded off against the training time, which grows linearly with K. Interestingly, even small candidate sets lead to strong AL policies, as increasing K beyond 10 does not change the performance significantly.

Figure 5: The learning curves of agents with different schedules for β before the trajectory on Spanish (es) NER.

Dynamically changing β. In our algorithm, β plays an important role as it trades off exploration versus exploitation. In the above experiments, we fix it to 0.5; however, we can change its value throughout trajectory collection as a function of τ (see Algorithm 1). We investigate schedules which put more emphasis on exploration towards the beginning of data collection and on exploitation towards the end: (i) linear $\beta_\tau = \max(0.5, 1 - 0.01\tau)$, (ii) exponential $\beta_\tau = 0.9^\tau$, and (iii) inverse sigmoid $\beta_\tau = \frac{5}{5 + \exp(\tau/5)}$, as functions of the iteration $\tau$. Fig. 5 compares these schedules. The learned policy seems to perform competitively with either a fixed or an exponential schedule. We have also investigated tossing the coin at each step within the trajectory roll-out, but found that it is more effective to toss it before the full trajectory roll-out (as currently done in Algorithm 1).
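The three schedules, transcribed directly from the formulas above as functions of the DAGGER round τ:

```python
import math

def beta_linear(tau):
    return max(0.5, 1 - 0.01 * tau)

def beta_exponential(tau):
    return 0.9 ** tau

def beta_inverse_sigmoid(tau):
    return 5.0 / (5.0 + math.exp(tau / 5.0))
```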

5 Related Work

Traditional active learning algorithms rely on various heuristics (Settles, 2010), such as uncertainty sampling (Settles and Craven, 2008; Houlsby et al., 2011), query-by-committee (Gilad-Bachrach et al., 2006), and diversity sampling (Brinker, 2003; Joshi et al., 2009; Yang et al., 2015). Apart from these, different heuristics can be combined, creating integrated strategies which consider one or more heuristics at the same time. Combined with transfer learning, pre-existing labeled data from related tasks can help improve the performance of an active learner (Xiao and Guo, 2013; Kale and Liu, 2013; Huang and Chen, 2016; Konyushkova et al., 2017). More recently, deep reinforcement learning has been used as a framework for learning active learning algorithms, where the active learning cycle is considered as a decision process. Woodward and Finn (2017) extended one-shot learning to active learning and combined reinforcement learning with a deep recurrent model to make labeling decisions. Bachman et al. (2017) introduced a policy gradient based method which jointly learns the data representation, the selection heuristic, and the model prediction function. Fang et al. (2017) designed an active learning algorithm based on a deep Q-network, in which the action corresponds to binary annotation decisions applied to a stream of data. The learned policy can then be transferred between languages or domains.

Imitation learning (IL) refers to an agent's acquisition of skills or behaviours by observing an expert's trajectory in a given task. It helps reduce sequential prediction tasks to supervised learning by employing a (near) optimal oracle at training time. Several IL algorithms have been proposed for sequential prediction tasks, including SEARN (Daumé et al., 2009), AggreVaTe (Ross and Bagnell, 2014), DaD (Venkatraman et al., 2015), LOLS (Chang et al., 2015), and Deeply AggreVaTeD (Sun et al., 2017). Our work is closely related to DAGGER (Ross et al., 2011), which is guaranteed to find a good policy by addressing the dependent nature of the encountered states in a trajectory.

6 Conclusion

In this paper, we have proposed a new method for learning active learning algorithms using deep imitation learning. We formalize pool-based active learning as a Markov decision process, in which active learning corresponds to the selection of the most informative data points from the pool. Our efficient algorithmic expert provides state-action pairs from which effective active learning policies can be learned. We show that the algorithmic expert allows direct policy learning, while at the same time the learned policies transfer successfully between domains and languages, demonstrating improvements over previous heuristic and reinforcement learning approaches.

Acknowledgments

We would like to thank the anonymous reviewers for their feedback. GH is grateful to Mohammad Ghavamzadeh and Trevor Cohn for interesting discussions. This work was supported by computational resources from the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE) at Monash University, and partly by an NVIDIA GPU grant.

References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.

Philip Bachman, Alessandro Sordoni, and Adam Trischler. 2017. Learning algorithms for active learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 301–310, International Convention Centre, Sydney, Australia. PMLR.

Klaus Brinker. 2003. Incorporating diversity in active learning with support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 59–66.

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé, and John Langford. 2015. Learning to search better than your teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2058–2066.

Hal Daumé, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.

Thomas G. Dietterich. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923.

Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383.

Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. 2006. Query by committee made real. In Advances in Neural Information Processing Systems, pages 443–450.

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. 2011. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.

Sheng-Jun Huang and Songcan Chen. 2016. Transfer learning with active queries from source domain. In IJCAI, pages 1592–1598.

Ajay J. Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. 2009. Multi-class active learning for image classification. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2372–2379. IEEE.

David Kale and Yan Liu. 2013. Accelerating active learning with transfer learning. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 1085–1090. IEEE.

Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. 2017. Learning active learning from data. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4225–4235. Curran Associates, Inc.

Julian McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web, pages 625–635. International World Wide Web Conferences Steering Committee.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. Working Notes Papers of the CLEF.

Stéphane Ross and J. Andrew Bagnell. 2014. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979.

Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635.

Burr Settles. 2010. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1070–1079. Association for Computational Linguistics.

Manali Sharma, Di Zhuang, and Mustafa Bilgic. 2015. Active learning with rationales for text classification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 441–451.

Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. 2017. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030.

Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. 2015. Improving multi-step prediction of learned time series models. In AAAI, pages 3024–3030.

Mark Woodward and Chelsea Finn. 2017. Active one-shot learning. arXiv preprint arXiv:1702.06559.

Min Xiao and Yuhong Guo. 2013. Online active learning for cost sensitive domain adaptation. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 1–9.

Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G. Hauptmann. 2015. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113–127.