
Active Imitation Learning with Noisy Guidance

Kianté Brantley
University of Maryland
[email protected]

Amr Sharaf
University of Maryland
[email protected]

Hal Daumé III
University of Maryland / Microsoft Research
[email protected]

Abstract

Imitation learning algorithms provide state-of-the-art results on many structured prediction tasks by learning near-optimal search policies. Such algorithms assume training-time access to an expert that can provide the optimal action at any queried state; unfortunately, the number of such queries is often prohibitive, frequently rendering these approaches impractical. To combat this query complexity, we consider an active learning setting in which the learning algorithm has additional access to a much cheaper noisy heuristic that provides noisy guidance. Our algorithm, LEAQI, learns a difference classifier that predicts when the expert is likely to disagree with the heuristic, and queries the expert only when necessary. We apply LEAQI to three sequence labeling tasks, demonstrating significantly fewer queries to the expert and comparable (or better) accuracies over a passive approach.

1 Introduction

Structured prediction methods learn models to map inputs to complex outputs with internal dependencies, typically requiring a substantial amount of expert-labeled data. To minimize annotation cost, we focus on a setting in which an expert provides labels for pieces of the input, rather than the complete input (e.g., labeling at the level of words, not sentences). A natural starting point for this is imitation learning-based “learning to search” approaches to structured prediction (Daumé et al., 2009; Ross et al., 2011; Bengio et al., 2015; Leblond et al., 2018). In imitation learning, training proceeds by incrementally producing structured outputs one piece at a time and, at every step, asking the expert “what would you do here?” and learning to mimic that choice. This interactive model comes at a substantial cost: the expert demonstrator must be continuously available and must be able to answer a potentially large number of queries.

We reduce this annotation cost by only asking an expert for labels that are truly needed; our algorithm, Learning to Query for Imitation (LEAQI, /ˈliːtʃiː/)1 achieves this by capitalizing on two factors. First, as is typical in active learning (see §2), LEAQI only asks the expert for a label when it is uncertain. Second, LEAQI assumes access to a noisy heuristic labeling function (for instance, a rule-based model, dictionary, or inexpert annotator) that can provide low-quality labels. LEAQI operates by always asking this heuristic for a label, and only querying the expert when it thinks the expert is likely to disagree with this label. It trains, simultaneously, a difference classifier (Zhang and Chaudhuri, 2015) that predicts disagreements between the expert and the heuristic (see Figure 1).

The challenge in learning the difference classifier is that it must learn based on one-sided feedback: if it predicts that the expert is likely to agree with the heuristic, the expert is not queried and the classifier cannot learn that it was wrong. We address this one-sided feedback problem using the Apple Tasting framework (Helmbold et al., 2000), in which errors (in predicting which apples are tasty) are only observed when a query is made (an apple is tasted). Learning in this way is particularly important in the general case where the heuristic is likely not just to have high variance with respect to the expert, but is also statistically biased.

Experimentally (§4.5), we consider three structured prediction settings, each using a different type of heuristic feedback. We apply LEAQI to: English named entity recognition, where the heuristic is a rule-based recognizer using gazetteers (Khashabi et al., 2018); English scientific keyphrase extraction, where the heuristic is an unsupervised method (Florescu and Caragea, 2017); and Greek part-of-speech tagging, where the heuristic is a small dictionary

1 Code is available at: https://github.com/xkianteb/leaqi



Figure 1: A named entity recognition example (from the Wikipedia page for Clarence Ellis). x is the input sentence and y is the (unobserved) ground truth. The predictor π operates left-to-right and, in this example, is currently at state s10 to tag the 10th word; the state s10 (highlighted in purple) combines x with y1:9. The heuristic makes two errors, at t = 4 and t = 6. The heuristic label at t = 10 is yh10 = ORG. Under Hamming loss, the cost at t = 10 is minimized for a = ORG, which is therefore the expert action (if it were queried). The label that would be provided for s10 to the difference classifier is 0 because the two policies agree.

compiled from the training data (Zesch et al., 2008; Haghighi and Klein, 2006). In all three settings, the expert is a simulated human annotator. We train LEAQI on all three tasks using fixed BERT (Devlin et al., 2019) features, training only the final layer (because we are in the regime of small labeled data). The goal in all three settings is to minimize the number of words the expert annotator must label. In all settings, we’re able to establish the efficacy of LEAQI, showing that it can indeed provide significant label savings over using the expert alone and over several baselines and ablations that establish the importance of both the difference classifier and the Apple Tasting paradigm.

2 Background and Related Work

We review first the use of imitation learning for structured prediction, then online active learning, and finally applications of active learning to structured prediction and imitation learning problems.

2.1 Learning to Search

The learning to search approach to structured prediction casts the joint prediction problem of producing a complex output as a sequence of smaller classification problems (Ratnaparkhi, 1996; Collins and Roark, 2004; Daumé et al., 2009). For instance, in the named entity recognition example from Figure 1, an input sentence x is labeled one word at a time, left-to-right. At the depicted state (s10), the model has labeled the first nine words and must next label the tenth word. Learning to search approaches assume access to an oracle policy π*, which provides the optimal label at every position.

In (interactive) imitation learning, we aim to imitate the behavior of the expert policy, π*, which provides the true labels. The learning to search view allows us to cast structured prediction as a (degenerate) imitation learning task, where states

Algorithm 1 DAgger(Π, N, ⟨βi⟩_{i=0}^{N}, π*)

1: initialize dataset D = {}
2: initialize policy π1 to any policy in Π
3: for i = 1 ... N do
4:   ▷ stochastic mixture policy
5:   Let π̄i = βi π* + (1 − βi) πi
6:   Generate a T-step trajectory using π̄i
7:   Accumulate data D ← D ∪ {(s, π*(s))} for all s in those trajectories
8:   Train classifier πi+1 ∈ Π on D
9: end for
10: return best (or random) πi

are (input, prefix) pairs, actions are operations on the output, and the horizon T is the length of the sequence. States are denoted s ∈ S, actions are denoted a ∈ [K], where [K] = {1, ..., K}, and the policy class is denoted Π ⊆ [K]^S. The goal in learning is to find a policy π ∈ Π with small loss on the distribution of states that it, itself, visits.

A popular imitation learning algorithm, DAgger (Ross et al., 2011), is summarized in Alg 1. In each iteration, DAgger executes a mixture policy and, at each visited state, queries the expert’s action. This produces a classification example, where the input is the state and the label is the expert’s action. At the end of each iteration, the learned policy is updated by training it on the accumulation of all generated data so far. DAgger is effective in practice and enjoys appealing theoretical properties; for instance, if the number of iterations N is O(T² log(1/δ)), then with probability at least 1 − δ, the generalization error of the learned policy is O(1/T) (Ross et al., 2011, Theorem 4.2).
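To make the loop concrete, the following is a minimal Python sketch of Alg 1. The `policy_class`, `expert`, and `env` interfaces are hypothetical stand-ins (they are not taken from the paper or its released code):

```python
import random

def dagger(policy_class, expert, env, n_iters, betas):
    """Minimal DAgger sketch (Alg 1): roll in with a mixture policy,
    label every visited state with the expert's action, and retrain
    on the aggregated dataset after each iteration."""
    data = []                  # aggregated (state, expert_action) pairs
    policy = policy_class()    # pi_1: any initial policy in the class
    for i in range(n_iters):
        state = env.reset()
        while not env.done():
            # stochastic mixture: follow the expert w.p. beta_i, else the learner
            if random.random() < betas[i]:
                action = expert(state)
            else:
                action = policy.predict(state)
            data.append((state, expert(state)))  # expert labels every visited state
            state = env.step(action)
        policy = policy_class().fit(data)        # train pi_{i+1} on all data so far
    return policy
```

Note that the expert is queried at every visited state, which is exactly the annotation burden LEAQI is designed to reduce.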

2.2 Active Learning

Active learning has been considered since at least the 1980s, often under the name “selective sampling”


(Rendell, 1986; Atlas et al., 1990). In agnostic online active learning for classification, a learner operates in rounds (e.g., Balcan et al., 2006; Beygelzimer et al., 2009, 2010). At each round, the learning algorithm is presented an example x and must predict a label; the learner must decide whether to query the true label. An effective margin-based approach for online active learning is provided by Cesa-Bianchi et al. (2006) for linear models. Their algorithm defines a sampling probability ρ = b/(b + z), where z is the margin on the current example, and b > 0 is a hyperparameter that controls the aggressiveness of sampling. With probability ρ, the algorithm requests the label and performs a perceptron-style update.
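As a sketch, this sampling rule is a one-liner; `z` and `b` follow the notation above, and the function name is illustrative:

```python
import random

def should_query(z: float, b: float) -> bool:
    """Selective sampling in the style of Cesa-Bianchi et al. (2006):
    query the true label with probability b / (b + z), where z >= 0 is
    the margin of the current prediction and b > 0 controls how
    aggressively labels are requested."""
    return random.random() < b / (b + z)
```

A large margin (a confident prediction) drives the query probability toward zero; a margin near zero drives it toward one.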

Our approach is inspired by Zhang and Chaudhuri’s (2015) setting, where two labelers are available: a free weak labeler and an expensive strong labeler. Their algorithm minimizes queries to the strong labeler by learning a difference classifier that predicts, for each example, whether the weak and strong labelers are likely to disagree. Their algorithm trains this difference classifier using an example-weighting strategy to ensure that its Type II error is kept small, establishing statistical consistency and bounding its sample complexity.

This type of learning from one-sided feedback falls in the general framework of partial-monitoring games, a framework for sequential decision making with imperfect feedback. Apple Tasting is a type of partial-monitoring game (Littlestone and Warmuth, 1989), where, at each round, a learner is presented with an example x and must predict a label y ∈ {−1, +1}. After this prediction, the true label is revealed only if the learner predicts +1. This framework has been applied in several settings, such as spam filtering and document classification with minority class distributions (Sculley, 2007). Sculley (2007) also conducts a thorough comparison of two methods that can be used to address the one-sided feedback problem: label-efficient online learning (Cesa-Bianchi et al., 2006) and margin-based learning (Vapnik, 1982).

2.3 Active Imitation & Structured Prediction

In the context of structured prediction for natural language processing, active learning has been considered both for requesting full structured outputs (e.g., Thompson et al., 1999; Culotta and McCallum, 2005; Hachey et al., 2005) and for requesting only pieces of outputs (e.g., Ringger et al.,

2007; Bloodgood and Callison-Burch, 2010). For sequence labeling tasks, Haertel et al. (2008) found that labeling effort depends on both the number of words labeled (which we model) and a fixed cost for reading (which we do not).

In the context of imitation learning, active approaches have also been considered for at least three decades, often called “learning with an external critic” and “learning by watching” (Whitehead, 1991). More recently, Judah et al. (2012) describe RAIL, an active learning-for-imitation-learning algorithm akin to our ACTIVEDAGGER baseline, but which in principle would operate with any underlying i.i.d. active learning algorithm (not just our specific choice of uncertainty sampling).

3 Our Approach: LEAQI

Our goal is to learn a structured prediction model with minimal human expert supervision, effectively by combining human annotation with a noisy heuristic. We present LEAQI to achieve this. As a concrete example, return to Figure 1: at s10, π must predict the label of the tenth word. If π is confident in its own prediction, LEAQI can avoid any query, similar to traditional active learning. If π is not confident, then LEAQI considers the label suggested by a noisy heuristic (here: ORG). LEAQI predicts whether the true expert label is likely to disagree with the noisy heuristic. Here, it predicts no disagreement and avoids querying the expert.

3.1 Learning to Query for Imitation

Our algorithm, LEAQI, is specified in Alg 2. As input, LEAQI takes a policy class Π, a hypothesis class H for the difference classifier (assumed to be symmetric and to contain the “constant one” function), a number of episodes N, an expert policy π*, a heuristic policy πh, and a confidence parameter b > 0. The general structure of LEAQI follows that of DAgger, but with three key differences:

(a) roll-in (line 7) is according to the learned policy (not mixed with the expert, as that would require additional expert queries),

(b) actions are queried only if the current policy is uncertain at s (line 12), and

(c) the expert π* is only queried if it is predicted to disagree with the heuristic πh at s by the difference classifier, or if the apple tasting method flips the difference classifier label (line 15; see §3.2).


Algorithm 2 LEAQI(Π, H, N, π*, πh, b)

1: initialize dataset D = {}
2: initialize policy π1 to any policy in Π
3: initialize difference dataset S = {}
4: initialize difference classifier h1(s) = 1 (∀s)
5: for i = 1 ... N do
6:   Receive input sentence x
7:   ▷ generate a T-step trajectory using πi
8:   Generate output y using πi
9:   for each s in y do
10:    ▷ draw Bernoulli random variable
11:    Zi ∼ Bern(b / (b + certainty(πi, s))); see §3.3
12:    if Zi = 1 then
13:      ▷ set difference classifier prediction
14:      di = hi(s)
15:      if AppleTaste(s, πh(s), di) then
16:        ▷ predict agree; query heuristic
17:        D ← D ∪ {(s, πh(s))}
18:      else
19:        ▷ predict disagree; query expert
20:        D ← D ∪ {(s, π*(s))}
21:        d̄i = 1[π*(s) ≠ πh(s)]
22:        S ← S ∪ {(s, πh(s), di, d̄i)}
23:      end if
24:    end if
25:  end for
26:  Train policy πi+1 ∈ Π on D
27:  Train difference classifier hi+1 ∈ H on S to minimize Type II errors (see §3.2)
28: end for
29: return best (or random) πi

In particular, at each state visited by πi, LEAQI estimates z, the certainty of πi’s prediction at that state (see §3.3). A sampling probability ρ is set to b/(b + z), where z is the certainty, and so if the model is very certain then ρ tends to zero, following Cesa-Bianchi et al. (2006). With probability ρ, LEAQI will collect some label.

When a label is collected (line 12), the difference classifier hi is queried on state s to predict whether π* and πh are likely to disagree on the correct action. (Recall that h1 always predicts disagreement, per line 4.) The difference classifier’s prediction, di, is passed to an apple tasting method in line 15. Intuitively, most apple tasting procedures (including the one we use, STAP; see §3.2) return di, unless the difference classifier is making many Type II errors, in which case it may return ¬di.

A target action is set to πh(s) if the apple tasting

Algorithm 3 AppleTaste_STAP(S, a^h_i, di)

1: ▷ count examples that have action a^h_i
2: let t = Σ_{(_,a,_,_)∈S} 1[a^h_i = a]
3: ▷ count mistakes made on action a^h_i
4: let m = Σ_{(_,a,d,d̄)∈S} 1[d ≠ d̄ ∧ a^h_i = a]
5: w = t/|S|  ▷ percentage of time a^h_i was seen
6: if w < 1 then
7:   ▷ skew distribution
8:   draw r ∼ Beta(1 − w, 1)
9: else
10:  draw r ∼ Uniform(0, 1)
11: end if
12: return (d = 1) ∧ (r ≤ √((m + 1)/t))

algorithm returns “agree” (line 17), and the expert π* is only queried if disagreement is predicted (line 20). The state and target action (either heuristic or expert) are then added to the training data. If the expert was queried, a new item is added to the difference dataset, consisting of the state, the heuristic action on that state, the difference classifier’s prediction, and the ground truth label for the difference classifier: whether the expert and heuristic actually disagree. Finally, πi+1 is trained on the accumulated action data, and hi+1 is trained on the difference dataset (details in §3.3).
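The following Python sketch mirrors lines 10-22 of Alg 2 under the conventions above (label 1 = disagree, matching Figure 1). The names `policy`, `diff_clf`, `taster`, and `certainty` are illustrative stand-ins for the paper's components; `certainty` is the margin measure of §3.3:

```python
import random

def leaqi_label_state(s, expert, heuristic, policy, diff_clf, taster, b, D, S):
    """One state of LEAQI's inner loop (Alg 2, lines 10-22), as a sketch."""
    z = certainty(policy, s)                 # margin-based certainty (Sec 3.3)
    if random.random() >= b / (b + z):       # Z_i ~ Bern(b / (b + z))
        return                               # policy is confident: no query at all
    a_h = heuristic(s)
    d_pred = diff_clf.predict(s)             # predicted (dis)agreement
    if taster(S, a_h, d_pred):               # apple tasting routes to "agree"
        D.append((s, a_h))                   # trust the free heuristic label
    else:
        a_star = expert(s)                   # query the expensive expert
        D.append((s, a_star))
        d_true = int(a_star != a_h)          # 1 iff expert and heuristic disagree
        S.append((s, a_h, d_pred, d_true))   # one-sided feedback for diff_clf
```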

There are several things to note about LEAQI:

• If the current policy is already very certain, an expert annotator is never queried.

• If a label is queried, the expert is queried only if the difference classifier predicts disagreement with the heuristic, or the apple tasting procedure flips the difference classifier prediction.

• Due to apple tasting, most errors the difference classifier makes will cause it to query the expert unnecessarily; this is the “safe” type of error (increasing sample complexity but not harming accuracy), versus a Type II error (which leads to biased labels).

• The difference classifier is only trained on states where the policy is uncertain, which is exactly the distribution on which it is run.

3.2 Apple Tasting for One-Sided Learning

The difference classifier h ∈ H must be trained (line 27) based on one-sided feedback (it only


observes errors when it predicts “disagree”) to minimize Type II errors (it should only very rarely predict “agree” when the truth is “disagree”). This helps keep the labeled data for the learned policies unbiased. The main challenge here is that the feedback to the difference classifier is one-sided: if it predicts “disagree” then it gets to see the truth, but if it predicts “agree” it never finds out if it was wrong. We use one of Helmbold et al.’s (2000) algorithms, STAP (see Alg 3), which works by randomly sampling from apples that are predicted not to be tasted and tasting them anyway (line 12). Formally, STAP tastes apples that are predicted to be bad with probability √((m + 1)/t), where m is the number of mistakes and t is the number of apples tasted so far.

We adapt the apple tasting algorithm STAP to our setting for controlling the number of Type II errors made by the difference classifier as follows. First, because some heuristic actions are much more common than others, we run a separate apple tasting scheme per heuristic action (in the sense that we count the number of errors on this heuristic action rather than globally). Second, when there is significant action imbalance2, we find it necessary to skew the distribution from STAP more in favor of querying. We achieve this by sampling from a Beta distribution (generalizing the uniform), whose mean is shifted toward zero for more frequent heuristic actions. This increases the chance that apple tasting finds bad apples for each action (thereby keeping the false positive rate low for predicting disagreement).
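A sketch of this per-action variant of Alg 3, with the Beta skew, might look as follows. The tuple layout of S and the `max(t, 1)` guard for actions not yet seen are our assumptions, not the paper's code:

```python
import math
import random

def apple_taste_stap(S, a_h, d_pred):
    """Per-action STAP sketch (Alg 3). Returns True to follow the
    "agree" branch of Alg 2 (use the heuristic label); False routes
    the query to the expert. S holds tuples (s, a, d_pred, d_true)."""
    t = sum(1 for (_, a, _, _) in S if a == a_h)               # times a_h was seen
    m = sum(1 for (_, a, d, dt) in S if a == a_h and d != dt)  # mistakes on a_h
    w = t / len(S) if S else 0.0        # fraction of the data with action a_h
    # skew toward querying for frequent heuristic actions
    r = random.betavariate(1 - w, 1) if w < 1 else random.random()
    return d_pred == 1 and r <= math.sqrt((m + 1) / max(t, 1))  # guard t = 0
```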

3.3 Measuring Policy Certainty

In step 11, LEAQI must estimate the certainty of πi on s. Following Cesa-Bianchi et al. (2006), we implement this using a margin-based criterion. To achieve this, we consider π as a function that maps actions to scores and then chooses the action with the largest score. The certainty measure is then the difference in scores between the highest and second highest scoring actions:

certainty(π, s) = max_a π(s, a) − max_{a′ ≠ a} π(s, a′)
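In code, this margin is simply the gap between the two largest action scores; `policy.scores` is an assumed interface returning one score per action:

```python
import numpy as np

def certainty(policy, s) -> float:
    """Margin-based certainty (Sec 3.3): the gap between the highest
    and second-highest action scores under the current policy."""
    scores = np.sort(np.asarray(policy.scores(s)))
    return float(scores[-1] - scores[-2])
```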

2 For instance, in named entity recognition, both the heuristic and expert policies label the majority of words as O (not an entity). As a result, when the heuristic says O, it is very likely that the expert will agree. However, if we aim to optimize for something other than accuracy, like F1, it is precisely these disagreements that we need to find.

3.4 Analysis

Theoretically, the main result for LEAQI is an interpretation of the main DAgger result(s). Formally, let dπ denote the distribution of states visited by π, let C(s, a) ∈ [0, 1] be the immediate cost of performing action a in state s, let Cπ(s) = E_{a∼π(s)} C(s, a), and let the total expected cost of π be J(π) = T E_{s∼dπ} Cπ(s), where T is the length of trajectories. C is not available to a learner in an imitation setting; instead, the algorithm observes an expert and minimizes a surrogate loss ℓ(s, π) (e.g., ℓ may be the zero/one loss between π and π*). We assume ℓ is strongly convex and bounded in [0, 1] over Π.

Given these setup assumptions, let ε_pol-approx = min_{π∈Π} (1/N) Σ_{i=1}^{N} E_{s∼dπi} ℓ(s, π) be the true loss of the best policy in hindsight, let ε_dc-approx = min_{h∈H} (1/N) Σ_{i=1}^{N} E_{s∼dπi} err(s, h, π*(s) ≠ πh(s)) be the true error of the best difference classifier in hindsight, and assume that the regret of the policy learner is bounded by reg_pol(N) after N steps. Ross et al. (2011) show the following:3

Theorem 1 (Thm 4.3 of Ross et al. (2011)). After N episodes each of length T, under the assumptions above, with probability at least 1 − δ there exists a policy π ∈ π_{1:N} such that:

E_{s∼dπ} ℓ(s, π) ≤ ε_pol-approx + reg_pol(N) + √((2/N) log(1/δ))

This holds regardless of how π_{1:N} are trained (line 26). The question of how well LEAQI performs becomes a question of how well the combination of uncertainty-based sampling and the difference classifier learns. So long as those do a good job on their individual classification tasks, DAgger guarantees that the policy will do a good job. This is formalized below, where Q*(s, a) is the best possible cumulative cost (measured by C) starting in state s and taking action a:

Theorem 2 (Theorem 2.2 of Ross et al. (2011)). Let u be such that Q*(s, a) − Q*(s, π*(s)) ≤ u for all a and all s with dπ(s) > 0; then for some π ∈ π_{1:N}, as N → ∞:

J(π) ≤ J(π*) + uT ε_pol-approx

Here, u captures the most long-term impact a single decision can have; for example, for average Hamming loss, it is straightforward to see that u = 1/T

3 Proving a stronger result is challenging: analyzing the sample complexity of an active learning algorithm that uses a difference classifier, even in the non-sequential setting, is quite involved (Zhang and Chaudhuri, 2015).


Named Entity Recognition. Language: English (en). Dataset: CoNLL’03 (Tjong Kim Sang and De Meulder, 2003). # Ex: 14,987. Avg. Len: 14.5. # Actions: 5. Metric: entity F-score. Features: English BERT (Devlin et al., 2019). Heuristic: string matching against an offline gazetteer of entities from Khashabi et al. (2018). Heuristic quality: P 88%, R 27%, F 41%.

Keyphrase Extraction. Language: English (en). Dataset: SemEval 2017 Task 10 (Augenstein et al., 2017). # Ex: 2,809. Avg. Len: 26.3. # Actions: 2. Metric: keyphrase F-score. Features: SciBERT (Beltagy et al., 2019). Heuristic: output from an unsupervised keyphrase extraction model (Florescu and Caragea, 2017). Heuristic quality: P 20%, R 44%, F 27%.

Part of Speech Tagging. Language: Modern Greek (el). Dataset: Universal Dependencies (Nivre, 2018). # Ex: 1,662. Avg. Len: 25.5. # Actions: 17. Metric: per-tag accuracy. Features: M-BERT (Devlin et al., 2019). Heuristic: dictionary from Wiktionary, similar to Zesch et al. (2008) and Haghighi and Klein (2006). Heuristic quality: 10% coverage, 67% acc.

Table 1: An overview of the three tasks considered in experiments.

because any single mistake can increase the number of mistakes by at most 1. For precision, recall, and F-score, u can be as large as one in the (rare) case that a single decision switches from one true positive to no true positives.

4 Experiments

The primary research questions we aim to answer experimentally are:

Q1 Does uncertainty-based active learning achieve lower query complexity than passive learning in the learning to search setting?

Q2 Does learning a difference classifier improve query efficiency over active learning alone?

Q3 Does Apple Tasting successfully handle the problem of learning from one-sided feedback?

Q4 Is the approach robust to cases where the noisy heuristic is uncorrelated with the expert?

Q5 Is casting the heuristic as a policy more effective than using its output as features?

To answer these questions, we conduct experiments on three tasks (see Table 1): English named entity recognition, English scientific keyphrase extraction, and part of speech tagging on Modern Greek (el), selected as a low-resource setting.

4.1 Algorithms and Baselines

In order to address the research questions above, we compare LEAQI to several baselines. The baselines below compare our approach to previous methods:

DAGGER. Passive DAgger (Alg 1).

ACTIVEDAGGER. An active variant of DAgger that asks for labels only when uncertain. (This is equivalent to LEAQI, but with neither the difference classifier nor apple tasting.)

DAGGER+FEAT. DAGGER with the heuristic policy’s output appended as an input feature.

ACTIVEDAGGER+FEAT. ACTIVEDAGGER with the heuristic policy’s output as a feature.

The next set of comparisons is explicit ablations:

LEAQI+NOAT. LEAQI with no apple tasting.

LEAQI+NOISYHEUR. LEAQI, but where the heuristic returns a label uniformly at random.

The baselines and LEAQI form a natural progression. DAGGER is the base algorithm used by all approaches described above, but it is very query inefficient with respect to an expert annotator. ACTIVEDAGGER introduces active learning to make DAGGER more query efficient; the delta to the previous addresses Q1. LEAQI+NOAT introduces the difference classifier; the delta addresses Q2. LEAQI adds apple tasting to deal with one-sided learning; the delta addresses Q3. Finally, LEAQI+NOISYHEUR. (vs LEAQI) addresses Q4, and the +FEAT variants address Q5.


4.2 Data and Representation

For named entity recognition, we use training, validation, and test data from CoNLL’03 (Tjong Kim Sang and De Meulder, 2003), consisting of IO tags instead of BIO tags (the “B” tag is almost never used in this dataset, so we never attempt to predict it) over four entity types: Person, Organization, Location, and Miscellaneous. For part of speech tagging, we use training and test data from the Modern Greek portion of the Universal Dependencies (UD) treebanks (Nivre, 2018), consisting of 17 universal tags4. For keyphrase extraction, we use training, validation, and test data from SemEval 2017 Task 10 (Augenstein et al., 2017), consisting of IO tags (we use one “I” tag for all three keyphrase types).

In all tasks, we implement both the policy and difference classifier by fine-tuning the last layer of a BERT embedding representation (Devlin et al., 2019). More specifically, for a sentence of length T, w1, ..., wT, we first compute BERT embeddings for each word, x1, ..., xT, using the appropriate BERT model: English BERT and M-BERT5 for named entity and part-of-speech, respectively, and SciBERT (Beltagy et al., 2019) for keyphrase extraction. We then represent the state at position t by concatenating the word embedding at that position with a one-hot representation of the previous action: st = [xt; onehot(at−1)]. This feature representation is used both for learning the labeling policy and for learning the difference classifier.
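A minimal sketch of this feature construction, assuming the per-sentence BERT embeddings have already been computed with the appropriate frozen model:

```python
import numpy as np

def state_features(embeddings, t, prev_action, n_actions):
    """State representation s_t = [x_t ; onehot(a_{t-1})]: the BERT
    embedding of word t concatenated with a one-hot encoding of the
    previous action. `embeddings` is a (T, d) array computed once
    per sentence."""
    onehot = np.zeros(n_actions)
    if prev_action is not None:          # no previous action at t = 0
        onehot[prev_action] = 1.0
    return np.concatenate([embeddings[t], onehot])
```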

4.3 Expert Policy and Heuristics

In all experiments, the expert π* is a simulated human annotator who annotates one word at a time. The expert returns the optimal action for the relevant evaluation metric (F-score for named entity recognition and keyphrase extraction, and accuracy for part-of-speech tagging). We take the annotation cost to be the total number of words labeled.

The heuristic we implement for named entity recognition is a high-precision gazetteer-based string matching approach. We construct this by taking a gazetteer from Wikipedia using the CogComp framework (Khashabi et al., 2018), and use FlashText (Singh, 2017) to label the dataset. This heuristic achieves a precision of 0.88, recall of 0.27, and F-score of 0.41 on the training data.

The keyphrase extraction heuristic is the output of an “unsupervised keyphrase extraction”

4 ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.

5 Multilingual BERT (Devlin et al., 2019).

approach (Florescu and Caragea, 2017). This system is a graph-based approach that constructs word-level graphs incorporating position information for all word occurrences, and then uses PageRank to score the words and phrases. This heuristic achieves a precision of 0.20, recall of 0.44, and F-score of 0.27 on the training data.

The part of speech tagging heuristic is based on a small dictionary compiled from Wiktionary. Following Haghighi and Klein (2006) and Zesch et al. (2008), we extract this dictionary as follows: for each word w in our training data, we find the part of speech y by querying Wiktionary. If w is in Wiktionary, we convert the Wiktionary part of speech tag to a Universal Dependencies tag (see §A.1); if w is not in Wiktionary, we use a default label of “X”. Furthermore, if word w has multiple parts of speech, we select the first part of speech tag in the list. The label “X” is chosen 90% of the time. For the remaining 10%, the heuristic achieves an accuracy of 0.67 on the training data.
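A sketch of this lookup heuristic, where `wiktionary_pos` (a map from word to its list of Wiktionary POS tags) and `to_ud` (the Table 2 mapping) are illustrative stand-ins for the compiled resources:

```python
def dictionary_heuristic(word, wiktionary_pos, to_ud):
    """POS-tagging heuristic sketch: look the word up in a dictionary
    compiled from Wiktionary, map its first listed POS to the Universal
    Dependencies tagset, and default to "X" for unknown words."""
    tags = wiktionary_pos.get(word)
    if not tags:
        return "X"                      # out-of-dictionary default
    return to_ud.get(tags[0], "X")      # first listed POS, mapped to UD
```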

4.4 Experimental Setup

Our experimental setup is online active learning. We make a single pass over a dataset, and the goal is to achieve an accurate system as quickly as possible. We measure performance (accuracy or F-score) after every 1000 words (≈ 50 sentences) on held-out test data, and produce error bars by averaging across three runs and reporting standard deviations.

Hyperparameters for DAGGER are optimized using grid search on the named entity recognition training data and evaluated on development data. We then fix the DAGGER hyperparameters for all other experiments and models. The difference classifier hyperparameters are subsequently optimized in the same manner, and we fix them for all other experiments.6

4.5 Experimental Results

The main results are shown in the top two rows of Figure 2; ablations of LEAQI are shown in Figure 3. In Figure 2, the top row shows traditional learning curves (performance vs. number of queries), and the bottom row shows the number of queries made to the expert as a function of the total number of words seen.

6 We note that this is a somewhat optimistic hyperparameter setting: in the real world, model selection for active learning is extremely challenging. Details on hyperparameter selection and LEAQI’s robustness across a rather wide range of choices are presented in §A.2, §A.3 and §A.4 for keyphrase extraction and part of speech tagging.


Figure 2: Empirical evaluation on three tasks: (left) named entity recognition, (middle) keyphrase extraction and (right) part of speech tagging. The top row shows performance (F-score or accuracy) with respect to the number of queries to the expert. The bottom row shows the number of queries as a function of the number of words seen.

Active vs Passive (Q1). In all cases, we see that the active strategies improve on the passive strategies; this difference is largest in keyphrase extraction, middling for part of speech tagging, and small for NER. While not surprising given previous successes of active learning, this confirms that it is also a useful approach in our setting. As expected, the active algorithms query far less than the passive approaches, and LEAQI queries the least.

Heuristic as Features vs Policy (Q5). We see that while adding the heuristic’s output as a feature can be modestly useful, it is not uniformly useful and, at least for keyphrase extraction and part of speech tagging, it is not as effective as LEAQI. For named entity recognition, it is not effective at all, but this is also a case where all algorithms perform essentially the same. Indeed, here, LEAQI learns quickly with few queries, but never quite reaches the performance of ACTIVEDAGGER. This is likely due to the difference classifier becoming overly confident too quickly, especially on the “O” label, given the (relatively well known) mismatch between development and test data on this dataset.

Difference Classifier Efficacy (Q2). Turning to the ablations (Figure 3), we can address Q2 by comparing the ACTIVEDAGGER curve to the LEAQI+NOAT curve. Here, we see that on NER and keyphrase extraction, adding the difference classifier without adding apple tasting results in a far worse model: it learns very quickly but plateaus much lower than the best results. The exception is part of speech tagging, where apple tasting does not seem necessary (but also does not hurt). Overall, this essentially shows that without controlling Type II errors, the difference classifier on its own does not fulfill its goals.

Apple Tasting Efficacy (Q3). Also considering the ablation study, we can compare LEAQI+NOAT with LEAQI. In the case of part of speech tagging, there is little difference: using apple tasting to combat issues of learning from one-sided feedback neither helps nor hurts performance. However, for both named entity recognition and keyphrase extraction, removing apple tasting leads to faster learning, but substantially lower final performance (accuracy or F-score). This is somewhat expected: without apple tasting, the training data that the policy sees is likely to be highly biased, and so it gets stuck in a low accuracy regime.

Robustness to Poor Heuristic (Q4). We compare LEAQI+NOISYHEUR. to ACTIVEDAGGER.


Figure 3: Ablation results on (left) named entity recognition, (middle) keyphrase extraction and (right) part of speech tagging. In addition to LEAQI and ACTIVEDAGGER (copied from Figure 2), these graphs also show LEAQI+NOAT (apple tasting disabled) and LEAQI+NOISYHEUR. (a heuristic that produces labels uniformly at random).

Because the heuristic here is useless, the main hope is that it does not degrade performance below ACTIVEDAGGER. Indeed, that is what we see in all three cases: the difference classifier is able to learn quite quickly to essentially ignore the heuristic and rely only on the expert.

5 Discussion and Limitations

In this paper, we considered the problem of reducing the number of queries to an expert labeler for structured prediction problems. We took an imitation learning approach and developed an algorithm, LEAQI, which leverages a source of low-quality labels: a heuristic policy that is suboptimal but free. To use this heuristic as a policy, we learn a difference classifier that effectively tells LEAQI when it is safe to treat the heuristic’s action as if it were optimal. We showed empirically, across named entity recognition, keyphrase extraction, and part of speech tagging tasks, that the active learning approach improves significantly on passive learning, and that leveraging a difference classifier improves on that. That said, our approach has at least two limitations:

1. In some settings, learning a difference classifier may be as hard as or harder than learning the structured predictor, for instance if the task is binary sequence labeling (e.g., word segmentation), limiting its usefulness.

2. The true labeling cost is likely more complicated than simply the number of individual actions queried to the expert.

Despite these limitations, we hope that LEAQI provides a useful (and relatively simple) bridge that can enable using rule-based systems, heuristics, and unsupervised models as building blocks for

more complex supervised learning systems. This is particularly attractive in settings where we have very strong rule-based systems, ones which often outperform the best statistical systems, as in coreference resolution (Lee et al., 2011), information extraction (Riloff and Wiebe, 2003), and morphological segmentation and analysis (Smit et al., 2014).

Acknowledgements

We thank Rob Schapire, Chicheng Zhang, and the anonymous ACL reviewers for very helpful comments and insights. This material is based upon work supported by the National Science Foundation under Grant No. 1618193 and an ACM SIGHPC/Intel Computational and Data Science Fellowship to KB. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation nor of the ACM.

References

Les E. Atlas, David A. Cohn, and Richard E. Ladner. 1990. Training connectionist networks with queries and selective sampling. In NeurIPS.

Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew McCallum. 2017. SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).

Nina Balcan, Alina Beygelzimer, and John Langford. 2006. Agnostic active learning. In ICML.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: Pretrained language model for scientific text. In EMNLP.


Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS.

Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. 2009. Importance weighted active learning. In ICML.

Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. 2010. Agnostic active learning without constraints. In NeurIPS.

Michael Bloodgood and Chris Callison-Burch. 2010. Bucking the trend: Large-scale cost-focused active learning for statistical machine translation. In ACL.

Nicolò Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. 2006. Worst-case analysis of selective sampling for linear classification. JMLR.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL.

Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In AAAI.

Hal Daumé, III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning Journal.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Corina Florescu and Cornelia Caragea. 2017. PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. In ACL.

Ben Hachey, Beatrice Alex, and Markus Becker. 2005. Investigating the effects of selective sampling on the annotation task. In CoNLL.

Robbie Haertel, Eric K. Ringger, Kevin D. Seppi, James L. Carroll, and Peter McClanahan. 2008. Assessing the costs of sampling methods in active learning for annotation. In ACL.

Aria Haghighi and Dan Klein. 2006. Prototype-driven learning for sequence models. In NAACL.

David P. Helmbold, Nicholas Littlestone, and Philip M. Long. 2000. Apple tasting. Information and Computation.

Kshitij Judah, Alan Paul Fern, and Thomas Glenn Dietterich. 2012. Active imitation learning via reduction to iid active learning. In AAAI.

Daniel Khashabi, Mark Sammons, Ben Zhou, Tom Redman, Christos Christodoulopoulos, Vivek Srikumar, Nicholas Rizzolo, Lev Ratinov, Guanheng Luo, Quang Do, Chen-Tse Tsai, Subhro Roy, Stephen Mayhew, Zhili Feng, John Wieting, Xiaodong Yu, Yangqiu Song, Shashank Gupta, Shyam Upadhyay, Naveen Arivazhagan, Qiang Ning, Shaoshi Ling, and Dan Roth. 2018. CogCompNLP: Your swiss army knife for NLP. In LREC.

Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, and Simon Lacoste-Julien. 2018. SEARNN: Training RNNs with global-local losses. In ICLR.

Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford’s multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task.

N. Littlestone and M. K. Warmuth. 1989. The weighted majority algorithm. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science.

Joakim Nivre et al. 2018. Universal dependencies v2.5. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In EMNLP.

Larry Rendell. 1986. A general framework for induction and a study of selective induction. Machine Learning Journal.

Ellen Riloff and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. In EMNLP.

Eric Ringger, Peter McClanahan, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi, and Deryle Lonsdale. 2007. Active learning for part-of-speech tagging: Accelerating corpus annotation. In Proceedings of the Linguistic Annotation Workshop.

Stéphane Ross, Geoff J. Gordon, and J. Andrew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS.

David Sculley. 2007. Practical learning from one-sided feedback. In KDD.

Vikash Singh. 2017. Replace or retrieve keywords in documents at scale. CoRR, abs/1711.00046.

Peter Smit, Sami Virpioja, Stig-Arne Grönroos, and Mikko Kurimo. 2014. Morfessor 2.0: Toolkit for statistical morphological segmentation. In EACL.

Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney. 1999. Active learning for natural language parsing and information extraction. In ICML.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In NAACL/HLT.


Vladimir Vapnik. 1982. Estimation of Dependences Based on Empirical Data: Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg.

Steven Whitehead. 1991. A study of cooperative mechanisms for faster reinforcement learning. Technical report, University of Rochester.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In LREC.

Chicheng Zhang and Kamalika Chaudhuri. 2015. Active learning from weak and strong labelers. In NeurIPS.


Supplementary Material For: Active Imitation Learning with Noisy Guidance

A Experimental Details

A.1 Wiktionary to Universal Dependencies

Modern Greek (el) Wiktionary POS tag → Universal Dependencies POS tag

adjective → ADJ
adposition → ADP
preposition → ADP
adverb → ADV
auxiliary → AUX
coordinating conjunction → CCONJ
determiner → DET
interjection → INTJ
noun → NOUN
numeral → NUM
particle → PART
pronoun → PRON
proper noun → PROPN
punctuation → PUNCT
subordinating conjunction → SCONJ
symbol → SYM
verb → VERB
other → X
article → DET
conjunction → PART

Table 2: Conversion between Modern Greek (el) Wiktionary POS tags and Universal Dependencies POS tags.

A.2 Hyperparameters

Here we provide a table of all hyperparameters we considered for LEAQI and the baseline models (see §4.4).

Hyperparameter | Values Considered | Final Value
Policy learning rate | 10^-3, 10^-4, 10^-5, 10^-6, 5.5 × 10^-6 | 10^-6
Difference classifier learning rate (h) | 10^-1, 10^-2, 10^-3, 10^-4 | 10^-2
Confidence parameter (b) | 5.0 × 10^-1, 10 × 10^-1, 15 × 10^-1 | 5.0 × 10^-1

Table 3: Hyperparameters.


A.3 Ablation Study: Difference Classifier Learning Rate

Figure 4: (top row) English keyphrase extraction and (bottom row) part of speech tagging on Modern Greek (el), for difference classifier learning rates of 1e-2, 1e-3, and 1e-4. Panels show difference classifier F-score, number of words queried, and task performance. These plots indicate that there is little difference in performance across difference classifier learning rates.


A.4 Ablation Study: Confidence Parameter b

Figure 5: (top row) English keyphrase extraction and (bottom row) part of speech tagging on Modern Greek (el), for confidence parameters b of 5e-1, 10e-1, and 15e-1. Panels show task performance against words queried and number of words queried against words seen. These plots indicate that our model is robust to different confidence parameters.