The Power of Localization
for Efficiently Learning Linear Separators with Noise
Pranjal Awasthi
Maria Florina Balcan
Philip M. Long
January 3, 2014
Abstract
We introduce a new approach for designing computationally efficient learning algorithms that are
tolerant to noise, one of the most fundamental problems in learning theory. We demonstrate the effec-
tiveness of our approach by designing algorithms with improved noise tolerance guarantees for learning
linear separators, the most widely studied and used concept class in machine learning.
We consider two of the most challenging noise models studied in learning theory, the malicious
noise model of Valiant [Val85, KL88] and the adversarial label noise model of Kearns, Schapire, and
Sellie [KSS94]. For malicious noise, where the adversary can corrupt an η fraction of both the label part
and the feature part of the observations, we provide a polynomial-time algorithm for learning linear separators
in R^d under the uniform distribution with near information-theoretically optimal noise tolerance of
η = Ω(ε). This improves significantly over the previously best known results of [KKMS05, KLS09]. For the
adversarial label noise model, where the distribution over the feature vectors is unchanged and the overall
probability of a noisy label is constrained to be at most η, we give a polynomial-time algorithm for learning
linear separators in R^d under the uniform distribution that can handle a noise rate of η = Ω(ε). This improves
significantly over the results of [KKMS05], which either required runtime super-exponential in 1/ε (ours
is polynomial in 1/ε) or tolerated less noise.
In the case that the distribution is isotropic log-concave, we present a polynomial-time algorithm
for the malicious noise model that tolerates Ω(ε/log²(1/ε)) noise, and a polynomial-time algorithm for
the adversarial label noise model that also handles Ω(ε/log²(1/ε)) noise. Both of these also improve on
results from [KLS09]. In particular, in the case of malicious noise, unlike previous results, our noise
tolerance has no dependence on the dimension d of the space.
A particularly nice feature of our algorithms is that they can naturally exploit the power of active
learning, a widely studied modern learning paradigm in which the learning algorithm receives the labels
of examples only when it asks for them. We show that in this model, our algorithms achieve
a label complexity whose dependence on the error parameter ε is exponentially better than that of any
passive algorithm. This provides the first polynomial-time active learning algorithm for learning linear
separators in the presence of adversarial label noise, as well as the first analysis of active learning under
the challenging malicious noise model.
Our algorithms and analysis combine several ingredients including aggressive localization, hinge
loss minimization, and a novel localized and soft outlier removal procedure. Our work illustrates an un-
expected use of localization techniques (previously used for obtaining better sample complexity results)
in order to obtain better noise-tolerant polynomial-time algorithms.
1 Introduction
Overview. Dealing with noisy data is one of the main challenges in machine learning and is a highly
active area of research. In this work we study the noisy learnability of linear separators, arguably the
most popular class of functions used in practice [CST00]. Linear separators are at the heart of methods
ranging from support vector machines (SVMs) to logistic regression to deep networks, and their learnability
has been the subject of intense study for over 50 years. Learning linear separators from correctly labeled
(non-noisy) examples is a very well understood problem with simple efficient algorithms like Perceptron
being effective both in the classic passive learning setting [KV94, Vap98] and in the more modern active
learning framework [Das11]. However, for noisy settings, except for the special case of uniform random
noise, very few positive algorithmic results exist even for passive learning. In the context of theoretical
computer science more broadly, problems of noisy learning are related to seminal results in approximation-hardness
[ABSS93, GR06] and cryptographic assumptions [BFKL94, Reg05], are connected to other classic
questions in learning theory (e.g., learning DNF formulas [KSS94]), and appear as barriers in differential
privacy [GHRU11]. Hence, not surprisingly, designing efficient algorithms for learning linear separators in
the presence of adversarial noise (see definitions below) is of great importance.
In this paper we present new techniques for designing efficient algorithms for learning linear separators
in the presence of malicious and adversarial noise. These are two of the most challenging noise models
studied in learning theory. The models were originally proposed for a setting in which the algorithm must
work for an arbitrary, unknown distribution. As we will see, however, the bounds on the amount of noise
tolerated in this setting were very weak, and no significant progress was made for many years. This gave
rise to the question of the role that the distribution played in determining the limits of noise tolerance.
A breakthrough result of [KKMS05] and subsequent work of [KLS09] showed that indeed better bounds
on the level of noise tolerance can be obtained for the uniform and more generally isotropic log-concave
distributions. In this paper, we significantly improve these results. For the malicious noise case, where
the adversary can corrupt both the label part and the feature part of the observation (and has unbounded
computational power and access to the entire history of the learning algorithm’s computation), we design an
efficient algorithm that can tolerate a near-optimal amount of malicious noise (within a constant factor of the
statistical limit) for the uniform distribution, and that also significantly improves over the previously known
results for log-concave distributions. In particular, unlike previous works, our noise tolerance limit has no
dependence on the dimension d of the space. We also show similar improvements for adversarial label noise,
and furthermore show that our algorithms can naturally exploit the power of active learning. Active learning
is a widely studied modern learning paradigm, where the learning algorithm only receives the classifications
of examples when it asks for them. We show that in this model, our algorithms achieve a label complexity
whose dependence on the error parameter ε is exponentially better than that of any passive algorithm. This
provides the first polynomial-time active learning algorithm for learning linear separators in the presence of
adversarial label noise, solving an open problem posed in [BBL06, Mon06]. It also provides the
first analysis showing the benefits of active learning over passive learning under the challenging malicious
noise model.
Overall, our work illustrates an unexpected use of localization techniques (previously used for obtaining
better sample complexity results) in order to obtain better noise-tolerant polynomial-time algorithms. Our
work brings a new set of algorithmic and analysis techniques, including localization and soft outlier removal,
that we believe will have applications in learning theory and optimization more broadly.
In the following, we start by formally defining the learning models we consider; we then present the most
relevant prior work, and then our main results and techniques.
Passive and Active Learning. Noise Models. In this work we consider the problem of learning linear
separators in two important learning paradigms: the classic passive learning setting and the more modern
active learning scenario. As is typical [KV94, Vap98], we assume that there exists a distribution D over R^d
and a fixed unknown target function w∗. In the noise-free setting, in the passive supervised learning model
the algorithm is given access to a distribution oracle EX(D, w∗) from which it can get training samples
(x, sign(w∗ · x)) where x ∼ D. The goal of the algorithm is to output a hypothesis w such that
err_D(w) = Pr_{x∼D}[sign(w∗ · x) ≠ sign(w · x)] ≤ ε. In the active learning model [CAL94, Das11] the learning algorithm
is given as input a pool of unlabeled examples drawn from the distribution oracle. The algorithm can then
query for the labels of examples of its choice from the pool. The goal is to produce a hypothesis of low error
while also optimizing for the number of label queries (also known as label complexity). The hope is that in
the active learning setting we can output a classifier of small error by using many fewer label requests than
in the passive learning setting by actively directing the queries to informative examples (while keeping the
number of unlabeled examples polynomial).
In this work we focus on two important and realistic noise models. The first one is the malicious noise
model of [Val85, KL88] where samples are generated as follows: with probability (1 − η) a random pair
(x, y) is output where x ∼ D and y = sign(w∗ · x); with probability η the adversary can output an arbitrary
pair (x, y) ∈ R^d × {−1, 1}. We will call η the noise rate. Each of the adversary’s examples can depend
on the state of the learning algorithm and also on the previous draws of the adversary. We will denote the
malicious oracle by EX_η(D, w∗). The goal, however, remains that of achieving an arbitrarily good predictive
approximation to the underlying target function with respect to the underlying distribution, that is, to output
a hypothesis w such that Pr_{x∼D}[sign(w∗ · x) ≠ sign(w · x)] ≤ ε.
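For concreteness, the malicious example oracle is easy to simulate. The sketch below is a rough, illustrative analogue of EX_η(D, w∗) with D taken to be uniform over the unit sphere; the `adversary` callback is a hypothetical stand-in for the adversary's arbitrary (and, in the model, history-dependent) choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def malicious_oracle(w_star, eta, d, adversary):
    """One draw from a simulated EX_eta(D, w*): with probability 1 - eta,
    a clean pair (x, sign(w* . x)) with x uniform on the unit sphere;
    with probability eta, whatever pair the adversary chooses."""
    if rng.random() < eta:
        return adversary()  # arbitrary pair in R^d x {-1, 1}
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)  # normalizing a Gaussian gives a uniform sphere point
    return x, int(np.sign(w_star @ x))
```

A worst-case adversary in the sense discussed later in the paper might, for instance, always return `(-w_star, 1)`, concentrating noise directly opposite the target.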
In this paper, we consider an extension of the malicious noise model [Val85, KL88] to the active
learning model, as follows. There are two oracles, an example generation oracle and a label revealing oracle.
The example generation oracle works as usual in the malicious noise model: with probability (1 − η) a
random pair (x, y) is generated where x ∼ D and y = sign(w∗ · x); with probability η the adversary can
output an arbitrary pair (x, y) ∈ R^d × {−1, 1}. In the active learning setting, unlike the standard malicious
noise model, when an example (x, y) is generated, the algorithm only receives x, and must make a separate
call to the label revealing oracle to get y. The goal of the algorithm is still to output a hypothesis w such that
Pr_{x∼D}[sign(w∗ · x) ≠ sign(w · x)] ≤ ε.
In the adversarial label noise model, before any examples are generated, the adversary may choose a joint
distribution P over R^d × {−1, 1} whose marginal distribution over R^d is D and such that
Pr_{(x,y)∼P}[sign(w∗ · x) ≠ y] ≤ η. In the active learning model, we again have two oracles, an example generation oracle and a
label revealing oracle. We note that the results from our theorems in this model translate immediately into
similar guarantees for the agnostic model of [KSS94] (used routinely in both passive and active learning,
e.g., [KKMS05, BBL06, Han07]) – see Appendix G for details.
We will be interested in algorithms that run in time poly(d, 1/ε) and use poly(d, 1/ε) samples. In
addition, for the active learning scenario we want our algorithms to also optimize the number of label
requests. In particular, we want the number of labeled examples to depend only polylogarithmically on 1/ε.
The goal then is to quantify, for a given value of ε, the tolerable noise rate η(ε) which would still allow us to
design an efficient (passive or active) learning algorithm.
Previous Work. In the context of passive learning, Kearns and Li’s analysis [KL88] implies that halfspaces
can be learned with respect to arbitrary distributions in polynomial time while tolerating a
malicious noise rate of Ω(ε/d). A slight variant of a construction due to Kearns and Li [KL88] shows that
malicious noise at a rate greater than ε/(1 + ε) cannot be tolerated by algorithms learning halfspaces when the
distribution is uniform over the unit sphere. The Ω(ε/d) bound for the distribution-free case was not improved
for many years. Kalai et al. [KKMS05] showed that, when the distribution is uniform, the poly(d, 1/ε)-time
averaging algorithm tolerates malicious noise at a rate Ω(ε/√d). They also described an improvement to
Ω(ε/d^{1/4}) based on the observation that uniform examples will tend to be well separated, so that pairs of
examples that are too close to one another can be removed, which limits an adversary’s ability to coordinate
the effects of its noisy examples. [KLS09] analyzed another approach to limiting the coordination
of the noisy examples: an outlier removal procedure that uses PCA to find any direction u onto which
projecting the training data leads to suspiciously high variance, and removes the examples with the
most extreme values after projecting onto any such u. Their algorithm tolerates malicious noise at a rate
Ω(ε²/log(d/ε)) under the uniform distribution.
Motivated by the fact that many modern learning applications have massive amounts of unannotated or
unlabeled data, there has been significant interest in machine learning in designing active learning algorithms
that most efficiently utilize the available data, while minimizing the need for human intervention. Over
the past decade there has been substantial progress on understanding the underlying statistical
principles of active learning, and several general characterizations have been developed for describing when
active learning could have an advantage over the classic passive supervised learning paradigm both in the
noise free settings and in the agnostic case [FSST97, Das05, BBL06, BBZ07, Han07, DHM07, CN07,
BHW08, Kol10, BHLZ10, Wan11, Das11, RR11, BH12]. However, despite many efforts, except for very
simple noise models (random classification noise [BF13] and linear noise [DGS12]), to date there are no
known computationally efficient algorithms with provable guarantees in the presence of noise. In particular,
there are no computationally efficient algorithms for the agnostic case, and furthermore no result exists
showing the benefits of active learning over passive learning in the malicious noise model, where the feature
part of the examples can be corrupted as well. We discuss additional related work in Appendix A.
1.1 Our Results
1. We give a poly(d, 1/ε)-time algorithm for learning linear separators in R^d under the uniform distribution
that can handle a noise rate of η = Ω(ε), where ε is the desired error parameter. Our algorithm
(outlined in Section 3) is quite different from those in [KKMS05] and [KLS09] and improves significantly
on the noise robustness of [KKMS05], by roughly a factor of d^{1/4}, and on the noise robustness
of [KLS09], by a factor of log(d/ε)/ε. Our noise tolerance is near-optimal and is within a constant factor of the
statistical lower bound of ε/(1 + ε). In particular we show the following.
Theorem 1.1. There is a polynomial-time algorithm A_um for learning linear separators with respect
to the uniform distribution over the unit ball in R^d in the presence of malicious noise such that
an Ω(ε) upper bound on η suffices to imply that for any ε, δ > 0, the output w of A_um satisfies
Pr_{(x,y)∼D}[sign(w · x) ≠ sign(w∗ · x)] ≤ ε with probability at least 1 − δ.

2. For the adversarial label noise model, we give a poly(d, 1/ε)-time algorithm for learning with respect to
the uniform distribution that tolerates a noise rate of Ω(ε).
Theorem 1.2. There is a polynomial-time algorithm A_ul for learning linear separators with respect
to the uniform distribution over the unit ball in R^d in the presence of adversarial label noise such
that an Ω(ε) upper bound on η suffices to imply that for any ε, δ > 0, the output w of A_ul satisfies
Pr_{(x,y)∼D}[sign(w · x) ≠ sign(w∗ · x)] ≤ ε with probability at least 1 − δ.
As a restatement of the above theorem, in the agnostic setting considered in [KKMS05], we can output
a halfspace of error at most O(η + α) in time poly(d, 1/α). The previous best result of [KKMS05]
achieves this by learning a low-degree polynomial, in time whose dependence on ε is exponential.
3. We obtain similar results for the case of isotropic log-concave distributions.
Theorem 1.3. There is a polynomial-time algorithm A_ilcm for learning linear separators with respect
to any isotropic log-concave distribution in R^d in the presence of malicious noise such that an
Ω(ε/log²(1/ε)) upper bound on η suffices to imply that for any ε, δ > 0, the output w of A_ilcm satisfies
Pr_{(x,y)∼D}[sign(w · x) ≠ sign(w∗ · x)] ≤ ε with probability at least 1 − δ.

This improves on the best previous bound of Ω(ε³/log²(d/ε)) on the noise rate [KLS09]. Notice that our
noise tolerance bound has no dependence on d.
Theorem 1.4. There is a polynomial-time algorithm A_ilcl for learning linear separators with respect
to any isotropic log-concave distribution in R^d in the presence of adversarial label noise such that an
Ω(ε/log²(1/ε)) upper bound on η suffices to imply that for any ε, δ > 0, the output w of A_ilcl
satisfies Pr_{(x,y)∼D}[sign(w · x) ≠ sign(w∗ · x)] ≤ ε with probability at least 1 − δ.

This improves on the best previous bound of Ω(ε³/log(1/ε)) on the noise rate [KLS09].
4. A particularly nice feature of our algorithms is that they can naturally exploit the power of active
learning. We show that in this model, the label complexity of both algorithms depends only polylogarithmically
on 1/ε, where ε is the desired error rate, while still using only a polynomial number
of unlabeled samples (for the uniform distribution, the dependence of the number of labels on ε is
O(log(1/ε))). Our efficient algorithm that tolerates adversarial label noise solves an open problem
posed in [BBL06, Mon06]. Furthermore, our paper provides the first active learning algorithm for
learning linear separators in the presence of a non-trivial amount of adversarial noise that can affect not
only the label part, but also the feature part.
Our work exploits the power of localization for designing noise-tolerant polynomial-time algorithms.
Such localization techniques have been used for analyzing sample complexity for passive learning (see
[BBM05, BBL05, Zha06, BLL09, BL13]) or for designing active learning algorithms (see [BBZ07, Kol10,
Han11, BL13]). In order to make such a localization strategy computationally efficient and tolerate mali-
cious noise we introduce several key ingredients described in Section 1.2.
We note that all our algorithms are proper in that they return a linear separator. (Linear models can be
evaluated efficiently, and are otherwise easy to work with.) We summarize our results in Tables 1 and 2.
Table 1: Comparison with previous poly(d, 1/ε)-time algs. for uniform distribution

Passive Learning   Prior work                        Our work
malicious          η = Ω(ε/d^{1/4}) [KKMS05]         η = Ω(ε)
                   η = Ω(ε²/log(d/ε)) [KLS09]
adversarial        η = Ω(ε/√(log(1/ε))) [KKMS05]     η = Ω(ε)

Active Learning (malicious and adversarial)    NA    η = Ω(ε)
1.2 Techniques
Hinge Loss Minimization As minimizing the 0-1 loss in the presence of noise is NP-hard [JP78, GJ90],
a natural approach is to minimize a surrogate convex loss that acts as a proxy for the 0-1 loss. A common
Table 2: Comparison with previous poly(d, 1/ε)-time algorithms for isotropic log-concave distributions

Passive Learning   Prior work                        Our work
malicious          η = Ω(ε³/log²(d/ε)) [KLS09]       η = Ω(ε/log²(1/ε))
adversarial        η = Ω(ε³/log(1/ε)) [KLS09]        η = Ω(ε/log²(1/ε))

Active Learning (malicious and adversarial)    NA    η = Ω(ε/log²(1/ε))
choice in machine learning is to use the hinge loss, defined as ℓ_τ(w, x, y) = max(0, 1 − y(w · x)/τ), and, for
a set T of examples, we let ℓ_τ(w, T) = (1/|T|) Σ_{(x,y)∈T} ℓ_τ(w, x, y). Here τ is a parameter that changes during
training. It can be shown that minimizing hinge loss with an appropriate normalization factor can tolerate a
noise rate of Ω(ε²/√d) under the uniform distribution over the unit ball in R^d. This is also the limit for such
a strategy, since a more powerful malicious adversary can concentrate all the noise directly opposite to
the target vector w∗ and make sure that the hinge loss is no longer a faithful proxy for the 0-1 loss.
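The rescaled hinge loss and its empirical average over a set T translate directly into code; a minimal sketch (function names are ours):

```python
import numpy as np

def hinge_loss(w, x, y, tau):
    """Rescaled hinge loss l_tau(w, x, y) = max(0, 1 - y (w . x) / tau)."""
    return max(0.0, 1.0 - y * np.dot(w, x) / tau)

def avg_hinge_loss(w, T, tau):
    """Empirical average (1/|T|) sum of l_tau over a set T of (x, y) pairs."""
    return sum(hinge_loss(w, x, y, tau) for x, y in T) / len(T)
```

Note that shrinking τ magnifies the loss of borderline points, which is exactly the lever the localization argument below exploits.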
Localization in the instance and concept space Our first key insight is that by using an iterative lo-
calization technique, we can limit the harm caused by an adversary at each stage and hence can still do
hinge-loss minimization despite significantly more noise. In particular, the iterative style algorithm we pro-
pose proceeds in stages and at stage k, we have a hypothesis vector wk of a certain error rate. The goal in
stage k is to produce a new vector wk+1 of error rate half of wk. In order to halve the error rate, we focus
on a band of size bk = Θ(2−k√d) around the boundary of the linear classifier whose normal vector is wk,
i.e. Swk,bk = x : |wk · x| < bk. For the rest of the paper, we will repeatedly refer to this key region
of borderline examples as “the band”. The key observation made in [BBZ07] is that outside the band, all
the classifiers still under consideration (namely those hypotheses within radius rk of the previous weight
vector wk) will have very small error. Furthermore, the probability mass of this band under the original
distributions is small enough, so that in order to make the desired progress we only need to find a hypothesis
of constant error rate over the data distribution conditioned on being within margin bk of wk. This insight
has been crucially used in the [BBZ07] in order to obtain active learning algorithms with improved label
complexity ignoring computational complexity considerations1 .
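The band and sampling conditioned on it can be sketched as follows. This is plain rejection sampling and purely illustrative (the algorithm itself filters the oracle's stream rather than resampling); names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def in_band(x, w, b):
    """Membership in the band S_{w,b} = {x : |w . x| < b}."""
    return abs(np.dot(w, x)) < b

def sample_band(w, b, d, n):
    """Rejection-sample n points uniform on the unit sphere in R^d,
    conditioned on lying in the band around w."""
    out = []
    while len(out) < n:
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)  # uniform direction on the sphere
        if in_band(x, w, b):
            out.append(x)
    return out
```

Since the band has probability mass roughly proportional to b√d under the uniform distribution, a polynomial number of unlabeled draws suffices to fill it at every stage.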
In this work, we show the surprising fact that this idea can be extended and adapted to produce polyno-
mial time algorithms with improved noise tolerance as well! Not only do we use this localization idea for
different purposes, but our analysis significantly departs from [BBZ07]. To obtain our results, we exploit
several new ideas: (1) the performance of the rescaled hinge loss minimization in smaller and smaller bands,
(2) a careful variance analysis, and (3) another type of localization — we develop and analyze a novel soft
and localized outlier removal procedure. In particular, we first show that if we minimize a variant of the
hinge loss that is rescaled depending on the width of the band, it remains a faithful enough proxy for the 0-1
error even when there is significantly more noise. As a first step towards this goal, consider the setting where
we pick τ_k proportional to b_k, the size of the band, and r_k proportional to the error rate of w_k, and then
minimize a normalized hinge loss function ℓ_{τ_k}(w, x, y) = max(0, 1 − y(w · x)/τ_k) over vectors
w ∈ B(w_k, r_k). We first show that w∗ has small hinge loss within the band. Furthermore, within the band the adversarial
examples cannot hurt the hinge loss of w∗ by a lot. To see this, notice that if the malicious noise rate is η,
then within S_{w_{k−1},b_k} the effective noise rate is Θ(η 2^k). Also, the maximum value of the hinge loss for vectors
w ∈ B(w_k, 2^{−k}) is O(√d). Hence the maximum amount by which the adversary can affect the hinge loss
1 We note that the localization considered by [BBZ07] is more aggressive than those considered in the disagreement-based
active learning literature [BBL06, Han07, Kol10, Han11, Wan11] and earlier in passive learning [BBM05, BBL05, Zha06].
is O(η 2^k √d). Using this approach we get a noise tolerance of Ω(ε/√d).
In order to get a much better noise tolerance in the adversarial or agnostic setting, we crucially exploit
a careful analysis of the variance of w · x for vectors w close to the current vector w_{k−1}; this yields a
much tighter bound on the amount by which an adversary can “hurt” the hinge loss, which in turn leads to an
improved noise tolerance of Ω(ε).

For the case of malicious noise, we additionally need to deal with the presence of outliers, i.e., points
not generated from the uniform distribution. We do this by introducing a soft localized outlier removal
procedure at each stage (described next). This procedure assigns a weight to each data point indicating how
“noisy” the point is. We then minimize the weighted hinge loss. This, combined with the variance analysis
mentioned above, leads to a noise tolerance of Ω(ε) in the malicious case as well.
Soft Localized Outlier Removal Outlier removal techniques have been studied before in the context of
learning problems [BFKV97, KLS09]. The goal of outlier removal is to limit the ability of the adversary to
coordinate the effects of noisy examples – excessive such coordination is detected and removed. Our outlier
removal procedure (see Figure 2) is similar in spirit to that of [KLS09] with two key differences. First, as
in [KLS09], we will use the variance of the examples in a particular direction to measure their coordination.
However, due to the fact that in round k we are minimizing the hinge loss only with respect to vectors that
are close to w_{k−1}, we only need to limit the variance in these directions. This variance is Θ(b_k²), which is
much smaller than 1/d. This allows us to limit the harm of the adversary to a greater extent than was possible
in the analysis of [KLS09]. The second difference is that, unlike previous outlier removal techniques, we
do not remove any examples but instead weigh them appropriately and then minimize the weighted hinge
loss. The weights indicate how noisy a given example is. We show that these weights can be computed
by solving a linear program with infinitely many constraints. We then show how to design an efficient
separation oracle for the linear program using recent general-purpose techniques from the optimization
community [SZ03, BM13].
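The paper's procedure solves the linear program of Figure 2 exactly, via a separation oracle. The sketch below is only a crude heuristic analogue of the same idea: sample directions w near u, and whenever the weighted second moment along some w exceeds the variance budget, down-weight the points most extreme along that direction. The constants, the sampling, and the update rule are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_outlier_weights(X, u, r, sigma2, c=4.0, n_dirs=200, step=0.2):
    """Heuristic soft outlier removal: keep weights q in [0, 1] such that the
    q-weighted second moment along sampled directions near u stays below
    c * sigma2, down-weighting the most extreme points when it does not."""
    n, d = X.shape
    q = np.ones(n)
    for _ in range(n_dirs):
        delta = rng.standard_normal(d)
        w = u + r * delta / np.linalg.norm(delta)  # a direction near u
        w /= np.linalg.norm(w)
        proj2 = (X @ w) ** 2
        if q @ proj2 / n <= c * sigma2:
            continue  # this direction already satisfies the variance bound
        worst = np.argsort(proj2)[-max(1, n // 20):]
        q[worst] = np.maximum(0.0, q[worst] - step)
    return q
```

The crucial difference from hard outlier removal is visible here: no point is ever discarded outright; its influence on the subsequent weighted hinge loss minimization is merely reduced.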
In Section 4 we show that our results hold for a more general class of distributions which we call
admissible distributions. From Section 4 it also follows that our results can be extended to β-log-concave
distributions (for small enough β). Such distributions, for instance, can capture mixtures of log-concave
distributions [BL13].
2 Preliminaries
Our algorithms and analysis will use the hinge loss defined as ℓ_τ(w, x, y) = max(0, 1 − y(w · x)/τ), and, for
a set T of examples, we let ℓ_τ(w, T) = (1/|T|) Σ_{(x,y)∈T} ℓ_τ(w, x, y). Here τ is a parameter that changes during
training. Similarly, the expected hinge loss w.r.t. D is defined as L_τ(w, D) = E_{x∼D}(ℓ_τ(w, x, sign(w∗ · x))).
Our analysis will also consider the distribution D_{w,γ} obtained by conditioning D on membership in the band,
i.e., the set {x : ‖x‖_2 = 1, |w · x| ≤ γ}.

Since it is very natural, for clarity of exposition we present our algorithms directly in the active learning
model. We will prove that our active algorithm only uses a polynomial number of unlabeled samples, which
then immediately implies a guarantee for the passive learning setting as well. At a high level, our algorithms
are iterative learning algorithms that operate in rounds. In each round k, we focus our attention on
points that fall near the current hypothesized decision boundary w_{k−1} and use them to obtain a new
vector w_k of lower error. In the malicious noise case, in round k we first do a soft outlier removal and then
minimize the hinge loss normalized appropriately by τ_k. A formal description appears in Figure 1, and a formal
description of the outlier removal procedure appears in Figure 2. We will present specific choices of the
Figure 1 COMPUTATIONALLY EFFICIENT ALGORITHM TOLERATING MALICIOUS NOISE
Input: allowed error rate ε; probability of failure δ; an oracle that returns x, for (x, y) sampled from
EX_η(f, D), and an oracle for getting the label of an example; a sequence of unlabeled sample sizes
n_k > 0, k ∈ Z+; a sequence of labeled sample sizes m_k > 0; a sequence of cut-off values b_k > 0; a
sequence of hypothesis space radii r_k > 0; a sequence of removal rates ξ_k; a sequence of variance bounds
σ_k²; precision value κ; weight vector w_0.
1. Draw n_1 examples and put them into a working set W.
2. For k = 1, . . . , s = ⌈log2(1/ε)⌉:
(a) Apply the algorithm from Figure 2 to W with parameters u ← w_{k−1}, γ ← b_{k−1}, r ← r_k, ξ ← ξ_k,
σ² ← σ_k², and let q : W → [0, 1] be the output function. Normalize q to form a probability distribution
p over W.
(b) Choose m_k examples from W according to p and reveal their labels. Call this set T.
(c) Find v_k ∈ B(w_{k−1}, r_k) with ‖v_k‖_2 ≤ 1 that approximately minimizes the training hinge loss over T:
ℓ_{τ_k}(v_k, T) ≤ min_{w ∈ B(w_{k−1},r_k) ∩ B(0,1)} ℓ_{τ_k}(w, T) + κ/8.
Normalize v_k to have unit length, yielding w_k = v_k/‖v_k‖_2.
(d) Clear the working set W.
(e) Until n_{k+1} additional data points are put in W: given x for (x, f(x)) obtained from EX_η(f, D), if
|w_k · x| ≥ b_k then reject x, else put it into W.
Output: weight vector w_s of error at most ε with probability 1 − δ.
parameters of the algorithms in the following sections.
The description of the algorithm and its analysis are simplified if we assume that it starts with a preliminary
weight vector w_0 whose angle with the target w∗ is acute, i.e., that satisfies θ(w_0, w∗) < π/2. We show
in Appendix B that this is without loss of generality for the types of problems we consider.
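Putting the pieces together, a toy, noise-free rendering of the outer loop of Figure 1 might look as follows. All constants, the learning rate, and the subgradient solver are illustrative stand-ins: the paper's guarantees rely on the specific parameter choices of Theorem 3.1 and on the soft outlier removal step, both omitted here since the simulated data is clean:

```python
import numpy as np

rng = np.random.default_rng(2)

def toy_localized_learner(d=5, eps=0.25, n_band=1000, lr=0.05, steps=200):
    """Toy sketch of Figure 1's loop on clean data: at round k, collect
    points in the band around w_{k-1}, query their labels, and take
    projected subgradient steps on the hinge loss rescaled by tau_k."""
    w_star = np.zeros(d); w_star[0] = 1.0          # hidden target
    w = np.zeros(d); w[0] = 1.0; w[1] = 1.0
    w /= np.linalg.norm(w)                         # acute initial angle with w*
    s = int(np.ceil(np.log2(1.0 / eps)))
    for k in range(1, s + 1):
        b = 2.0 ** (-k) / np.sqrt(d)               # band width b_k
        tau = b                                    # rescaled hinge parameter
        X = np.empty((0, d))
        while len(X) < n_band:                     # rejection-sample the band
            cand = rng.standard_normal((8 * n_band, d))
            cand /= np.linalg.norm(cand, axis=1, keepdims=True)
            X = np.vstack([X, cand[np.abs(cand @ w) < b]])
        X = X[:n_band]
        y = np.sign(X @ w_star)                    # label queries
        v = w.copy()
        for _ in range(steps):                     # minimize hinge loss in band
            active = y * (X @ v) / tau < 1.0
            grad = -(y[active, None] * X[active]).sum(0) / (len(X) * tau)
            v = v - lr * grad
            v /= max(1.0, np.linalg.norm(v))       # stay inside the unit ball
        w = v / np.linalg.norm(v)
    return w, w_star
```

The point of the sketch is the structure: labels are requested only inside ever-narrower bands, so the label complexity scales with the number of rounds, log(1/ε), rather than with 1/ε.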
3 Learning with respect to uniform distribution with malicious noise
Let S^{d−1} denote the unit sphere in R^d. In this section we focus on the case where the marginal distribution D
is the uniform distribution over S^{d−1}, and present our results for malicious noise. We present the analysis
of our algorithm directly in the active learning model, and give a proof sketch of its correctness in
Theorem 3.1 below. The proof of Theorem 1.1 follows immediately as a corollary. Complete proof details
are in Appendix C.
Theorem 3.1. Let w∗ be the (unit length) target weight vector. There are absolute positive constants
c_1, . . . , c_4 and a polynomial p such that an Ω(ε) upper bound on η suffices to imply that for any ε, δ > 0,
using the algorithm from Figure 1 with ε_0 = 1/8, cut-off values b_k = c_1 2^{−k} d^{−1/2}, radii r_k = c_2 2^{−k} π,
κ = c_3, τ_k = c_4 2^{−k} d^{−1/2} for k ≥ 1, ξ_k = cκ², σ_k² = r_k²/(d − 1) + b_{k−1}², a number n_k = p(d, 2^k, log(1/δ)) of
unlabeled examples in round k and a number m_k = O(d(d + log(k/δ))) of labeled examples in round k,
after s = ⌈log2(1/ε)⌉ iterations, we find w_s satisfying err(w_s) = Pr_{(x,y)∼D}[sign(w_s · x) ≠ sign(w∗ · x)] ≤ ε
with probability ≥ 1 − δ.
Figure 2 LOCALIZED SOFT OUTLIER REMOVAL PROCEDURE
Input: a set S = (x_1, x_2, . . . , x_n) of samples; the reference unit vector u; desired radius r; a parameter ξ
specifying the desired bound on the fraction of clean examples removed; a variance bound σ².
1. Find q : S → [0, 1] satisfying the following constraints:
(a) for all x ∈ S, 0 ≤ q(x) ≤ 1
(b) (1/|S|) Σ_{x∈S} q(x) ≥ 1 − ξ
(c) for all w ∈ B(u, r) ∩ B(0, 1), (1/|S|) Σ_{x∈S} q(x)(w · x)² ≤ cσ²
Output: A function q : S → [0, 1].
3.1 Proof Sketch of Theorem 3.1
We may assume without loss of generality that all examples, including noisy examples, fall in S^{d−1}. This is
because any example that falls outside S^{d−1} can be easily identified by the algorithm as noisy and removed,
effectively lowering the noise rate.
A first key insight is that using techniques from [BBZ07], we may reduce our problem to a subproblem
concerning learning with respect to a distribution obtained by conditioning on membership in the band. In
particular, in Appendix C.1, we prove that, for a sufficiently small absolute constant κ, Theorem 3.2 stated
below, together with proofs of its computational, sample and label complexity bounds, suffices to prove
Theorem 3.1.
Theorem 3.2. After round k of the algorithm in Figure 1, with probability at least 1 − δ/(k + k²), we have
err_{D_{w_{k−1},b_{k−1}}}(w_k) ≤ κ.
The proof of Theorem 3.2 follows from a series of steps summarized in the lemmas below. First, we
bound the hinge loss of the target w∗ within the band S_{w_{k−1},b_{k−1}}. Since we are analyzing a particular
round k, to reduce clutter in the formulas, for the rest of this section let us refer to ℓ_{τ_k} simply as ℓ and
to L_{τ_k}(·, D_{w_{k−1},b_{k−1}}) as L(·).
Lemma 3.3. L(w*) ≤ κ/12.
Proof Sketch: Notice that y(w* · x) is never negative, so, on any clean example (x, y), we have ℓ(w*, x, y) = max{0, 1 − y(w* · x)/τ_k} ≤ 1, and, furthermore, w* pays a non-zero hinge loss only inside the region where |w* · x| < τ_k. Hence,
L(w*) ≤ Pr_{D_{w_{k−1},b_{k−1}}}(|w* · x| ≤ τ_k) = Pr_{x∼D}(|w* · x| ≤ τ_k and |w_{k−1} · x| ≤ b_{k−1}) / Pr_{x∼D}(|w_{k−1} · x| ≤ b_{k−1}).
Using standard tail bounds (see Eq. 1 in Appendix C), we can lower bound the denominator: Pr_{x∼D}(|w_{k−1} · x| < b_{k−1}) ≥ c′1 b_{k−1} √d for a constant c′1. Also, the numerator is at most Pr_{x∼D}(|w* · x| ≤ τ_k) ≤ c′2 τ_k √d for another constant c′2. Hence, we have
L(w*) ≤ (c′2 τ_k √d) / (c′1 b_{k−1} √d) ≤ κ/12,
for the appropriate choice of constants c′1 and c′2 and making κ small enough.
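The ratio of band probabilities used in this proof sketch is easy to sanity-check numerically. The following sketch (our own illustration, not from the paper) estimates the conditional probability Pr(|w* · x| ≤ τ given |w_{k−1} · x| ≤ b) for the uniform distribution on the sphere by rejection sampling:

```python
import math
import random

def sphere_point(d, rng):
    """Uniform point on S^{d-1} via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in g))
    return [t / norm for t in g]

def margin_fraction_in_band(w_star, w_prev, tau, b, n=20000, seed=1):
    """Estimate Pr(|w_star.x| <= tau | |w_prev.x| <= b) for x uniform on
    the sphere: the conditional probability bounding L(w*) in Lemma 3.3."""
    rng = random.Random(seed)
    d = len(w_star)
    def dot(a, c):
        return sum(s * t for s, t in zip(a, c))
    in_band = hits = 0
    for _ in range(n):
        x = sphere_point(d, rng)
        if abs(dot(w_prev, x)) <= b:
            in_band += 1
            if abs(dot(w_star, x)) <= tau:
                hits += 1
    return hits / max(in_band, 1)
```

With τ a small constant fraction of b and w* close to w_{k−1}, the estimate is a small constant, as the lemma requires.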
During round k we can decompose the working set W into the set of “clean” examples W_C, which are drawn from D_{w_{k−1},b_{k−1}}, and the set of “dirty” or malicious examples W_D, which are output by the adversary. Next, we will relate the hinge loss of vectors over the weighted set W to the hinge loss over the clean examples W_C. In order to do this we will need the following guarantee from the outlier removal subroutine of Figure 2.
Theorem 3.4. There is a constant c and a polynomial p such that, if n ≥ p(1/η, d, 1/ξ, 1/δ, 1/γ) examples are drawn from the distribution D_{u,γ} (each replaced with an arbitrary unit-length vector with probability η < 1/4), then by using the algorithm in Figure 2 with σ² = r²/(d−1) + γ², we have that with probability 1 − δ the output q satisfies the following: (a) Σ_{x∈S} q(x) ≥ (1 − ξ)|S|, and (b) for all unit length w such that ‖w − u‖_2 ≤ r, (1/|S|) Σ_{x∈S} q(x)(w · x)² ≤ cσ². Furthermore, the algorithm can be implemented in polynomial time.
The key points in proving this theorem are the following. We show that the vector q* which assigns weight 1 to examples in W_C and weight 0 to examples in W_D is a feasible solution to the linear program in Figure 2. In order to do this, we first show that the fraction of dirty examples in round k is not too large, i.e., w.h.p. we have |W_D| = O(η|S|). Next, we use the improved variance bound from Lemma C.2 regarding E[(w · x)²] for all w close to u; this bound is r²/(d−1) + γ². The proof of feasibility follows easily by combining the variance bound with standard VC tools. In the appendix we also show how to solve the linear program in polynomial time. The complete proof of Theorem 3.4 is in Appendix C.
As explained in the introduction, the soft outlier removal procedure allows us to get a much more refined bound on the hinge loss over the clean set W_C, i.e., ℓ(w, W_C), as compared to the hinge loss over the weighted set W, i.e., ℓ(w, p). This is formalized in the following lemma. Here ℓ(w, W_C) and ℓ(w, p) are defined with respect to the true unrevealed labels that the adversary has committed to.
Lemma 3.5. There are absolute constants c1, c2 and c3 such that, for large enough d, with probability 1 − δ/(2(k + k²)), if we define z_k = √(r_k²/(d−1) + b_{k−1}²), then for any w ∈ B(w_{k−1}, r_k), we have
ℓ(w, W_C) ≤ ℓ(w, p) + c1 (η/ǫ)(1 + z_k/τ_k) + κ/32
and
ℓ(w, p) ≤ 2ℓ(w, W_C) + κ/32 + c2 (η/ǫ) + c3 √(η/ǫ) (z_k/τ_k).
A detailed proof of Lemma 3.5 is given in Appendix C; here we give a few of the ideas. The loss ℓ(w, x, y) on a particular example can be upper bounded by 1 + |w · x|/τ. One source of difference between ℓ(w, W_C), the loss on the clean examples, and ℓ(w, p), the loss minimized by the algorithm, is the loss on the (total fractional) dirty examples that were not deleted by the soft outlier removal. By using the Cauchy-Schwarz inequality, the (weighted) sum of 1 + |w · x|/τ over those surviving noisy examples can be bounded in terms of the variance in the direction w and the (total fractional) number of surviving dirty examples. Our soft outlier detection allows us to bound the variance of the surviving noisy examples by Θ(z_k²). Another way that ℓ(w, W_C) can differ from ℓ(w, p) is the effect of deleting clean examples. We can similarly use the variance on the clean examples to bound this in terms of z_k. Finally, we can flesh out the detailed bound by exploiting the (soft counterparts of) the facts that most examples are clean and few examples are excluded.
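The Cauchy-Schwarz step described above can be checked numerically on any fixed weighting. The helper below is our own illustration with hypothetical data: it compares the p-weighted loss bound 1 + |w · x|/τ summed over the (surviving) dirty examples against the bound in terms of the dirty mass and the weighted second moment.

```python
import math

def dirty_loss_and_bound(xs, q, w, tau, dirty):
    """Return (lhs, rhs) where lhs is the p-weighted sum of 1 + |w.x|/tau
    over the dirty indices, and rhs is the Cauchy-Schwarz bound:
    (dirty p-mass) + sqrt(dirty p-mass) * sqrt(weighted 2nd moment) / tau."""
    total = sum(q)
    p = [qi / total for qi in q]           # normalize q into a distribution
    def dot(a, b):
        return sum(s * t for s, t in zip(a, b))
    lhs = sum(p[i] * (1.0 + abs(dot(w, xs[i])) / tau) for i in dirty)
    mass = sum(p[i] for i in dirty)
    second_moment = sum(p[i] * dot(w, xs[i]) ** 2 for i in range(len(xs)))
    rhs = mass + math.sqrt(mass) * math.sqrt(second_moment) / tau
    return lhs, rhs
```

By Cauchy-Schwarz, `lhs <= rhs` holds for every choice of weights, direction, and dirty set.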
Given these, the proof of Theorem 3.2 can be summarized as follows.
Let E = err_{D_{w_{k−1},b_{k−1}}}(w_k) = err_{D_{w_{k−1},b_{k−1}}}(v_k) be the probability that we want to bound. Applying VC theory, w.h.p. all sampling estimates of expected loss are accurate to within κ/32, so we may assume w.l.o.g. that this is the case. Since, for each error, the hinge loss is at least 1, we have E ≤ L(v_k). Applying Lemma 3.5 and VC theory, we get
E ≤ ℓ(v_k, T) + c1 (η/ǫ)(1 + z_k/τ_k) + κ/8.
Since v_k approximately minimizes the hinge loss, VC theory implies
E ≤ ℓ(w*, p) + c1 (η/ǫ)(1 + z_k/τ_k) + κ/3.
Once again applying Lemma 3.5 and VC theory yields
E ≤ 2L(w*) + c1 (η/ǫ)(1 + z_k/τ_k) + c2 (η/ǫ) + c3 √(η/ǫ) (z_k/τ_k) + κ/2.
Since L(w*) ≤ κ/12, we get
E ≤ κ/6 + c1 (η/ǫ)(1 + z_k/τ_k) + c2 (η/ǫ) + c3 √(η/ǫ) (z_k/τ_k) + κ/2.
Now notice that z_k/τ_k is Θ(1). Hence an Ω(ǫ) bound on η suffices to imply, w.h.p., that err_{D_{w_{k−1},b_{k−1}}}(w_k) ≤ κ.
4 Learning with respect to admissible distributions with malicious noise
One of our main results (Theorem 1.3) concerns isotropic log-concave distributions. A probability distribution is isotropic log-concave if its density can be written as exp(−ψ(x)) for a convex function ψ, its mean is 0, and its covariance matrix is I.
In this section, we extend the analysis of the previous section and show that it applies to isotropic log-concave distributions, and in fact to an even more general class of distributions, which we call admissible distributions. In particular, this class includes the isotropic log-concave distributions in R^d and the uniform distributions over the unit ball in R^d.
Definition 4.1. A sequence D_4, D_5, ... of probability distributions over R^4, R^5, ... respectively is λ-admissible if it satisfies the following conditions.
(1) There are c1, c2, c3 > 0 such that, for all d ≥ 4, for x drawn from D_d and any unit length u ∈ R^d: (a) for all a, b ∈ [−c1, c1] for which a ≤ b, we have Pr(u · x ∈ [a, b]) ≥ c2|b − a|, and for all a, b ∈ R for which a ≤ b, Pr(u · x ∈ [a, b]) ≤ c3|b − a|.
(2) For any c4 > 0, there is a c5 > 0 such that, for all d ≥ 4, the following holds: if u and v are two unit vectors in R^d with θ(u, v) = α ≤ π/2, then Pr_{x∼D_d}[sign(u · x) ≠ sign(v · x) and |v · x| ≥ c5 α] ≤ c4 α.
(3) There is an absolute constant c6 such that, for any d ≥ 4 and any two unit vectors u and v in R^d, we have c6 θ(v, u) ≤ Pr_{x∼D_d}(sign(u · x) ≠ sign(v · x)).
(4) There is a constant c8 such that, for all constants c7, for all d ≥ 4, for any a such that ‖a‖_2 ≤ 1 and ‖u − a‖ ≤ r, and for any 0 < γ < c7, we have E_{x∼D_{d,u,γ}}[(a · x)²] ≤ c8 log^λ(1 + 1/γ)(r² + γ²).
(5) There is a constant c9 such that Pr_{x∼D}(‖x‖ > α) ≤ c9 exp(−α/√d).
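Condition (1) is easy to probe by simulation for a concrete admissible family. The sketch below is our own illustration: it estimates Pr(u · x ∈ [a, b]) for the standard Gaussian N(0, I_d), which is isotropic log-concave; since the marginal u · x is a standard one-dimensional Gaussian, d plays no role, and on a constant-length interval the probability is sandwiched between c2|b − a| and c3|b − a| for suitable constants.

```python
import random

def gaussian_interval_prob(a, b, n=50000, seed=2):
    """Monte Carlo estimate of Pr(u.x in [a, b]) for x ~ N(0, I_d) and a
    unit vector u; the marginal u.x is a standard Gaussian, so we sample
    it directly.  Used to sanity-check condition (1) of Definition 4.1."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if a <= rng.gauss(0.0, 1.0) <= b)
    return hits / n
```

For the interval [−0.5, 0.5] the true probability is about 0.383, i.e. roughly 0.38 per unit of interval length, comfortably between constants such as c2 = 0.2 and c3 = 0.4 on that interval.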
For the case of admissible distributions we have the following theorem, which is proved in Appendix D.
Theorem 4.2. Let a distribution D over R^d be chosen from a λ-admissible sequence of distributions, and let w* be the (unit length) target weight vector. There are settings of the parameters of the algorithm A from Figure 1 such that an Ω(ǫ / log^λ(1/ǫ)) upper bound on the rate η of malicious noise suffices to imply that, for any ǫ, δ > 0, using a number n_k = poly(d, M_k, log(1/δ)) of unlabeled examples in round k, a number m_k = O(d log(d/(ǫδ))(d + log(k/δ))) of labeled examples in round k ≥ 1, and a w_0 such that θ(w_0, w*) < π/2, after s = O(log(1/ǫ)) iterations the algorithm finds w_s satisfying err(w_s) ≤ ǫ with probability ≥ 1 − δ. If the support of D is bounded in a ball of radius R(d), then m_k = O(R(d)² (d + log(k/δ))) label requests suffice.
The above theorem contains Theorem 1.3 as a special case, since any isotropic log-concave distribution is 2-admissible (see Appendix F.2 for a proof).
5 Adversarial label noise
The intuition in the case of adversarial label noise is the same as for malicious noise, except that, because the
adversary cannot change the marginal distribution over the instances, it is not necessary to perform outlier
removal. Bounds for learning with adversarial label noise are not corollaries of bounds for learning with
malicious noise, however, because, while the marginal distribution over the instances for all the examples,
clean and noisy, is not affected by the adversary, the marginal distribution over the clean examples is changed
(because the examples whose labels are flipped are removed from the distribution over clean examples).
Theorem 1.2 and Theorem 1.4, which concern adversarial label noise, can be proved by combining the
analysis in Appendix E with the facts that the uniform distribution and i.l.c. distributions are 0-admissible
and 2-admissible respectively.
6 Discussion
Localization in this paper refers to the practice of narrowing the focus of a learning algorithm to a restricted
range of possibilities (which we know to be safe given the information so far), thereby reducing sensitivity
of estimates of the quality of these possibilities based on random data; this in turn leads to better noise
tolerance in our work. (Note that, while the examples in the band in round k do not occupy a neighborhood
in feature space, they concern differences between hypotheses in a neighborhood around wk−1.) We note
that the idea of localization in the concept space is traditionally used in statistical learning theory both in
supervised and active learning for getting sharper rates [BBL05, BLL09, Kol10]. Furthermore, the idea of
localization in the instance space has been used in margin-based analysis of active learning [BBZ07, BL13].
In this work we used localization in both senses in order to get polynomial-time algorithms with better noise
tolerance. It would be interesting to further exploit this idea for other concept spaces.
References
[AB99] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge
University Press, 1999.
[ABS10] P. Awasthi, A. Blum, and O. Sheffet. Improved guarantees for agnostic learning of disjunctions.
COLT, 2010.
[ABSS93] S. Arora, L. Babai, J. Stern, and Z. Sweedyk. The hardness of approximate optima in lat-
tices, codes, and systems of linear equations. In Proceedings of the 1993 IEEE 34th Annual
Foundations of Computer Science, 1993.
[Bau90] E. B. Baum. The perceptron algorithm is fast for nonmalicious distributions. Neural Compu-
tation, 2:248–260, 1990.
[BBL05] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[BBL06] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
[BBM05] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of
Statistics, 33(4):1497–1537, 2005.
[BBZ07] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.
[BF13] M.-F. Balcan and V. Feldman. Statistical active learning algorithms. NIPS, 2013.
[BFKL94] Avrim Blum, Merrick L. Furst, Michael J. Kearns, and Richard J. Lipton. Cryptographic prim-
itives based on hard learning problems. In Proceedings of the 13th Annual International Cryp-
tology Conference on Advances in Cryptology, 1994.
[BFKV97] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning
noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
[BGMN05] F. Barthe, O. Guedon, S. Mendelson, and A. Naor. A probabilistic approach to the geometry of
the pn-ball. The Annals of Probability, 33(2):480–513, 2005.
[BH12] M.-F. Balcan and S. Hanneke. Robust interactive learning. In COLT, 2012.
[BHLZ10] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without con-
straints. In NIPS, 2010.
[BHW08] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In
COLT, 2008.
[BL13] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-
concave distributions. In Conference on Learning Theory, 2013.
[BLL09] N. H. Bshouty, Y. Li, and P. M. Long. Using the doubling dimension to analyze the generaliza-
tion of learning algorithms. JCSS, 2009.
[BM13] D. Bienstock and A. Michalka. Polynomial solvability of variants of the trust-region subprob-
lem, 2013. Optimization Online.
[BSS12] A. Birnbaum and S. Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-
accuracy tradeoffs. NIPS, 2012.
[Byl94] T. Bylander. Learning linear threshold functions in the presence of classification noise. In
Conference on Computational Learning Theory, 1994.
[CAL94] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine
Learning, 15(2), 1994.
[CGZ10] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Learning noisy linear classifiers via adaptive
and selective sampling. Machine Learning, 2010.
[CN07] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.
[CST00] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other
kernel-based learning methods. Cambridge University Press, 2000.
[Das05] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, volume 18, 2005.
[Das11] S. Dasgupta. Active learning. Encyclopedia of Machine Learning, 2011.
[DGS12] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and
multiple teachers. JMLR, 2012.
[DHM07] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. NIPS,
20, 2007.
[FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities
and halfspaces. In FOCS, pages 563–576, 2006.
[FSST97] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by
committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
[GHRU11] Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunc-
tions and the statistical query barrier. In Proceedings of the 43rd annual ACM symposium on
Theory of computing, 2011.
[GJ90] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory
of NP-Completeness. 1990.
[GR06] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise.
In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science,
2006.
[GR09] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal
on Computing, 39(2):742–765, 2009.
[GSSS13] A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient pool-based active learning of halfspaces.
In ICML, 2013.
[Han07] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
[Han11] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361,
2011.
[JP78] D. S. Johnson and F. Preparata. The densest hemisphere problem. Theoretical Computer
Science, 6(1):93 – 107, 1978.
[KKMS05] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically
learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of
Computer Science, 2005.
[KL88] Michael Kearns and Ming Li. Learning in the presence of malicious errors. In Proceedings of
the twentieth annual ACM symposium on Theory of computing, 1988.
[KLS09] A. R. Klivans, P. M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise.
Journal of Machine Learning Research, 10, 2009.
[Kol10] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning.
Journal of Machine Learning Research, 11:2457–2485, 2010.
[KSS94] Michael J. Kearns, Robert E. Schapire, and Linda M. Sellie. Toward efficient agnostic learning.
Mach. Learn., 17(2-3), November 1994.
[KV94] M. Kearns and U. Vazirani. An introduction to computational learning theory. MIT Press,
Cambridge, MA, 1994.
[LS06] P. M. Long and R. A. Servedio. Attribute-efficient learning of decision lists and linear threshold
functions under unconcentrated distributions. NIPS, 2006.
[LS11] P. M. Long and R. A. Servedio. Learning large-margin halfspaces with more malicious noise.
NIPS, 2011.
[LV07] L. Lovasz and S. Vempala. The geometry of logconcave functions and sampling algorithms.
Random Structures and Algorithms, 30(3):307–358, 2007.
[Mon06] Claire Monteleoni. Efficient algorithms for general active learning. In Proceedings of the 19th
annual conference on Learning Theory, 2006.
[Pol11] D. Pollard. Convergence of Stochastic Processes. Springer Series in Statistics. 2011.
[Reg05] Oded Regev. On lattices, learning with errors, random linear codes, and cryptography. In
Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, 2005.
[RR11] M. Raginsky and A. Rakhlin. Lower bounds for passive and active learning. In NIPS, 2011.
[Ser01] Rocco A. Servedio. Smooth boosting and learning with malicious noise. In 14th Annual Con-
ference on Computational Learning Theory and 5th European Conference on Computational
Learning Theory, 2001.
[SZ03] J. Sturm and S. Zhang. On cones of nonnegative quadratic functions. Mathematics of Opera-
tions Research, 28:246–267, 2003.
[Val85] L. G. Valiant. Learning disjunctions of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, 1985.
[Vap98] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[Vem10] S. Vempala. A random-sampling-based algorithm for learning intersections of halfspaces.
JACM, 57(6), 2010.
[Wan11] L. Wang. Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active
Learning. JMLR, 2011.
[Zha06] T. Zhang. Information theoretical upper and lower bounds for statistical estimation. IEEE
Transactions on Information Theory, 52(4):1307–1321, 2006.
A Additional Related Work
Passive Learning Blum et al. [BFKV97] considered noise-tolerant learning of halfspaces under a more
idealized noise model, known as the random noise model, in which the label of each example is flipped
with a certain probability, independently of the feature vector. Some other, less closely related, work on
efficient noise-tolerant learning of halfspaces includes [Byl94, BFKV97, FGKP06, GR09, Ser01, ABS10,
LS11, BSS12].
Active Learning As we have mentioned, most prior theoretical work on active learning focuses on either
sample complexity bounds (without regard for efficiency) or on providing polynomial time algorithms in the
noiseless case or under simple noise models (random classification noise [BF13] or linear noise [CGZ10, DGS12]).
In [CGZ10, DGS12] online learning algorithms in the selective sampling framework are presented,
where labels must be actively queried before they are revealed. Under the assumption that the label condi-
tional distribution is a linear function determined by a fixed target vector, they provide bounds on the regret
of the algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of
generating the instances. As pointed out in [DGS12], these results can also be converted to a distributional
PAC setting where instances xt are drawn i.i.d. In this setting they obtain exponential improvement in label
complexity over passive learning. These interesting results and techniques are not directly comparable to
ours. Our framework is not restricted to halfspaces. Another important difference is that (as pointed out
in [GSSS13]) the exponential improvement they give is not possible in the noiseless version of their set-
ting. In other words, the addition of linear noise defined by the target makes the problem easier for active
sampling. By contrast RCN can only make the classification task harder than in the realizable case.
Recently, [BF13] gave the first polynomial-time algorithms for actively learning thresholds, balanced
rectangles, and homogeneous linear separators under log-concave distributions in the presence of random
classification noise. Active learning with respect to isotropic log-concave distributions in the absence of
noise was studied in [BL13].
B Initializing with vector w0
Suppose we have an algorithm B as a subroutine that works given access to such a w_0. Then we can obtain an algorithm A that works without it, as follows; we describe the procedure for general admissible distributions. With probability 1, for a random u, either u or −u has an acute angle with w*. We may then run B with both choices, with ǫ set to πc6/4, where c6 is the constant in Definition 4.1. Then we can use hypothesis testing on O(log(1/δ)) examples to find, with high probability, a hypothesis w′ with error less than πc6/4. Part 3 of Definition 4.1 then implies that θ(w′, w*) < π/4, so A may set w_0 = w′ and call B again.
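The hypothesis-testing step can be sketched as follows. This is our own illustration (the candidate set and labeled sample are hypothetical): among the candidate hypotheses, return the one with the lowest empirical error on the labeled examples.

```python
def pick_initial_direction(candidates, labeled_examples):
    """Hypothesis-testing step sketched above: return the candidate
    halfspace with the lowest empirical error on the labeled sample.
    With O(log(1/delta)) examples, the survivor w' has error < pi*c6/4
    with high probability, hence an acute angle with w*."""
    def dot(a, b):
        return sum(s * t for s, t in zip(a, b))
    def empirical_error(w):
        wrong = sum(1 for x, y in labeled_examples
                    if (1 if dot(w, x) >= 0 else -1) != y)
        return wrong / len(labeled_examples)
    return min(candidates, key=empirical_error)
```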
C Proof of Theorem 3.1
We start by stating properties of the distribution D which will be useful in the analysis below.
1. [Bau90, BBZ07, KKMS05] For any C > 0, there are c1, c2 > 0 such that, for x drawn from the uniform distribution over S^{d−1} and any unit length u ∈ R^d:
• for all a, b ∈ [−C/√d, C/√d] for which a ≤ b, we have
c1|b − a|√d ≤ Pr(u · x ∈ [a, b]) ≤ c2|b − a|√d, (1)
• and if b ≥ 0, we have
Pr(u · x > b) ≤ (1/2) e^{−db²/2}. (2)
2. [BBZ07, BL13] For any c6 > 0, there is a c7 > 0 such that, for all d ≥ 4, the following holds. Let u and v be two unit vectors in R^d, and assume that θ(u, v) = α ≤ π/2. Then
Pr_{x∼D_d}[sign(u · x) ≠ sign(v · x) and |v · x| ≥ c7 α/√d] ≤ c6 α. (3)
C.1 Margin based analysis
The proof of Theorem 3.1 follows the high-level structure of the proof of [BBZ07]; the new element is the application of Theorem C.4, which analyzes the performance of the hinge loss minimization algorithm for learning inside the band, which in turn applies Theorem C.1, which analyzes the benefits of our new localized outlier removal procedure.
Proof (of Theorem 1.1): We prove by induction on k that after k ≤ s iterations, we have err_D(w_k) ≤ 2^{−(k+1)} with probability at least 1 − δ(1 − 1/(k + 1))/2.
When k = 0, all that is required is err_D(w_0) ≤ 1/2.
Assume now the claim is true for k − 1 (k ≥ 1). By the induction hypothesis, with probability at least 1 − δ(1 − 1/k)/2, w_{k−1} has error at most 2^{−k}. This implies θ(w_{k−1}, w*) ≤ π 2^{−k}.
Let us define S_{w_{k−1},b_{k−1}} = {x : |w_{k−1} · x| ≤ b_{k−1}} and its complement S̄_{w_{k−1},b_{k−1}} = {x : |w_{k−1} · x| > b_{k−1}}. Since w_{k−1} has unit length and v_k ∈ B(w_{k−1}, r_k), we have θ(w_{k−1}, v_k) ≤ r_k, which in turn implies θ(w_{k−1}, w_k) ≤ r_k.
Applying Equation 3 to bound the error rate outside the band, we have both
Pr_x[(w_{k−1} · x)(w_k · x) < 0, x ∈ S̄_{w_{k−1},b_{k−1}}] ≤ 2^{−(k+4)} and
Pr_x[(w_{k−1} · x)(w* · x) < 0, x ∈ S̄_{w_{k−1},b_{k−1}}] ≤ 2^{−(k+4)}.
Taking the sum, we obtain Pr_x[(w_k · x)(w* · x) < 0, x ∈ S̄_{w_{k−1},b_{k−1}}] ≤ 2^{−(k+3)}. Therefore, we have
err(w_k) ≤ (err_{D_{w_{k−1},b_{k−1}}}(w_k)) Pr(S_{w_{k−1},b_{k−1}}) + 2^{−(k+3)}.
Let c′2 be the constant from Equation 1. We have Pr(S_{w_{k−1},b_{k−1}}) ≤ 2c′2 b_{k−1} √d, which implies
err(w_k) ≤ (err_{D_{w_{k−1},b_{k−1}}}(w_k)) 2c′2 b_{k−1} √d + 2^{−(k+3)} ≤ 2^{−(k+1)} ((err_{D_{w_{k−1},b_{k−1}}}(w_k)) 4c1c′2 + 1/2).
Recall that D_{w_{k−1},b_{k−1}} is the distribution obtained by conditioning D on the event that x ∈ S_{w_{k−1},b_{k−1}}. Applying Theorem C.4, with probability 1 − δ/(2(k + k²)), w_k has error at most κ = 1/(8c1c′2) within S_{w_{k−1},b_{k−1}}, implying that err(w_k) ≤ 2^{−(k+1)}, completing the induction, and therefore showing that, with probability at least 1 − δ, O(log(1/ǫ)) iterations suffice to achieve err(w_k) ≤ ǫ.
A polynomial number of unlabeled samples is required by the algorithm, and the number of labeled examples required is Σ_k m_k = O(d(d + log log(1/ǫ) + log(1/δ)) log(1/ǫ)).
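The induction above is driven by the iterative loop of Figure 1. The following sketch is our own simplified illustration of that loop's structure: outlier removal is omitted, the hinge minimization is a plain projected subgradient method that returns the best iterate seen (so it never does worse than its starting point), and all constants are placeholders rather than the values fixed in the theorem.

```python
import math
import random

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [t / n for t in v]

def hinge_min_in_ball(examples, center, radius, tau, steps=200, lr=0.05):
    """Approximately minimize the tau-hinge loss over B(center, radius)
    intersected with the unit ball, by projected subgradient descent;
    a simplified stand-in for the paper's convex optimization step."""
    def loss(w):
        return sum(max(0.0, 1.0 - y * dot(w, x) / tau)
                   for x, y in examples) / len(examples)
    w, best_w, best_loss = list(center), list(center), loss(center)
    d, m = len(center), len(examples)
    for _ in range(steps):
        g = [0.0] * d
        for x, y in examples:
            if y * dot(w, x) < tau:          # active hinge term
                for j in range(d):
                    g[j] -= y * x[j] / (tau * m)
        w = [w[j] - lr * g[j] for j in range(d)]
        diff = [w[j] - center[j] for j in range(d)]
        nd = math.sqrt(dot(diff, diff))
        if nd > radius:                       # project onto B(center, radius)
            w = [center[j] + radius * diff[j] / nd for j in range(d)]
        nw = math.sqrt(dot(w, w))
        if nw > 1.0:                          # project onto the unit ball
            w = [t / nw for t in w]
        if loss(w) < best_loss:
            best_w, best_loss = list(w), loss(w)
    return best_w

def localized_learner(w0, label, d, s, m=100, c1=1.0, c2=0.3, c4=1.0, seed=4):
    """Skeleton of the Figure 1 loop: in round k, rejection-sample the band
    |w_{k-1}.x| <= b_{k-1} from the uniform sphere, query labels, minimize
    the tau_k-hinge loss over B(w_{k-1}, r_k), and normalize."""
    rng = random.Random(seed)
    w = list(w0)
    for k in range(1, s + 1):
        b = c1 * 2.0 ** (-(k - 1)) / math.sqrt(d)
        r = c2 * 2.0 ** (-k) * math.pi
        tau = c4 * 2.0 ** (-k) / math.sqrt(d)
        band = []
        while len(band) < m:
            x = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
            if abs(dot(w, x)) <= b:
                band.append((x, label(x)))
        w = normalize(hinge_min_in_ball(band, w, r, tau))
    return w
```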
C.2 Analysis of the outlier removal subroutine
The analysis of the learning algorithm uses the following theorem (same as Theorem 3.4 in the main body)
about the outlier removal subroutine of Figure 2.
Theorem C.1. There is a polynomial p such that, if n ≥ p(1/η, d, 1/ξ, 1/δ, 1/γ) examples are drawn from the distribution D_{u,γ} (each replaced with an arbitrary unit-length vector with probability η < 1/4), then, with probability 1 − δ, the output q of the algorithm in Figure 2 satisfies the following:
• Σ_{x∈S} q(x) ≥ (1 − ξ)|S| (a fraction 1 − ξ of the weight is retained);
• for all unit length w such that ‖w − u‖_2 ≤ r,
(1/|S|) Σ_{x∈S} q(x)(w · x)² ≤ 2(r²/(d−1) + γ²). (4)
Furthermore, the algorithm can be implemented in polynomial time.
Our proof of Theorem 3.4 proceeds through a series of lemmas. We point out that in the analysis below we treat each element x_i ∈ S as distinct (even if x_i = x_j for some j ≠ i). Obviously, a feasible q satisfies the requirements of the theorem, so all we need to show is that
• there is a feasible solution q, and
• we can simulate a separation oracle: given a provisional solution q, we can find a linear constraint violated by q in polynomial time.
We start by proving that there is a feasible q. First of all, a Chernoff bound implies that n ≥ poly(1/η, 1/δ) suffices for it to be the case that, with probability 1 − δ, at most a 2η fraction of the members of S are noisy. Let us assume from now on that this is the case.
We will show that q*, which sets q*(x, y) = 0 for each noisy point and q*(x, y) = 1 for each non-noisy point, is feasible. First we bound E[(a · x)²] for all vectors a close to u. This is formalized in the following lemma.
Lemma C.2. For all a such that ‖u − a‖_2 ≤ r and ‖a‖_2 ≤ 1,
E_{x∼U_{u,γ}}[(a · x)²] ≤ r²/(d−1) + γ².
Proof. W.l.o.g. we may assume that u = (1, 0, 0, ..., 0). We can write x = (x_1, x_2, ..., x_d) as x = (x_1, x′), so that x′ is chosen uniformly over all vectors in R^{d−1} of length at most √(1 − x_1²). Let us decompose E_{x∼U_{u,γ}}[(a · x)²] into parts that we can analyze separately, as follows:
E_{x∼U_{u,γ}}[(a · x)²] = a_1² E_{x∼U_{u,γ}}[x_1²] + 2a_1 Σ_{i=2}^{d} a_i E_{x∼U_{u,γ}}[x_1 x_i] + E_{x∼U_{u,γ}}[(x′ · a)²]. (5)
The quantity E_{x∼U_{u,γ}}[(x′ · a)²] is at most the expectation of (x′ · a)² when x′ = (0, x_2, ..., x_d) is sampled uniformly from the unit ball in R^{d−1}. Thus, since ‖u − a‖_2 ≤ r implies Σ_{i=2}^{d} a_i² ≤ r²,
E_{x∼U_{u,γ}}[(x′ · a)²] ≤ (1/(d−1)) Σ_{i=2}^{d} a_i² ≤ r²/(d−1). (6)
Furthermore, since |x_1| ≤ γ when x is drawn from U_{u,γ}, we have
E_{x∼U_{u,γ}}[x_1²] ≤ γ². (7)
Finally, by symmetry, E_{x∼U_{u,γ}}[x_1 x_i] = 0 for all i ≥ 2. Putting this together with (7), (6) and (5) completes the proof.
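The bound of Lemma C.2 can be checked by simulation. The sketch below is our own illustration: it rejection-samples the band |x_1| ≤ γ of the uniform distribution over the unit ball and estimates the second moment in a direction a close to u = e_1.

```python
import math
import random

def band_second_moment(a, gamma, n=20000, seed=5):
    """Monte Carlo estimate of E[(a.x)^2] for x uniform over the unit ball
    in R^d, conditioned on |x_1| <= gamma (the band around u = e_1).
    Lemma C.2 bounds this by r^2/(d-1) + gamma^2 when ||e_1 - a|| <= r
    and ||a|| <= 1."""
    rng = random.Random(seed)
    d = len(a)
    total, kept = 0.0, 0
    while kept < n:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in g))
        radius = rng.random() ** (1.0 / d)     # uniform radius for the ball
        x = [radius * t / norm for t in g]
        if abs(x[0]) <= gamma:                 # rejection-sample the band
            kept += 1
            total += sum(ai * xi for ai, xi in zip(a, x)) ** 2
    return total / n
```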
Next, we use VC tools to show the following bound for clean examples.
Lemma C.3. If we draw ℓ times i.i.d. from D to form C, then with probability 1 − δ, we have that for any unit length a,
(1/ℓ) Σ_{x∈C} (a · x)² ≤ E[(a · x)²] + √(O(d log(ℓ/δ)(d + log(1/δ))) / ℓ).
Proof: See Appendix H.
The above two lemmas imply that n = poly(d, 1/η, 1/δ, 1/γ) suffices for it to be the case that, for all a ∈ B(u, r),
(1/|S|) Σ_x q*(x)(a · x)² ≤ 2E[(a · x)²] ≤ 2(r²/(d−1) + γ²),
so that q∗ is feasible.
So what is left is to prove that the convex program has a separation oracle. First, it is easy to check whether, for all x ∈ S, 0 ≤ q(x) ≤ 1, and whether Σ_{x∈S} q(x) ≥ (1 − ξ)|S|; an algorithm can first do that. If these checks pass, then it needs to check whether there is a w ∈ B(u, r) with ‖w‖_2 ≤ 1 such that
(1/|S|) Σ_{x∈S} q(x)(w · x)² > c(r²/(d−1) + γ²).
This can be done by finding a w ∈ B(u, r) with ‖w‖_2 ≤ 1 that maximizes Σ_{x∈S} q(x)(w · x)², and checking it.
Suppose X is a matrix with a row for each x ∈ S, where the row is √(q(x)) x. Then Σ_{x∈S} q(x)(w · x)² = wᵀXᵀXw, and maximizing this over w is equivalent to minimizing wᵀ(−XᵀX)w subject to ‖w − u‖_2 ≤ r and ‖w‖ ≤ 1. Since −XᵀX is symmetric, problems of this form are known to be solvable in polynomial time [SZ03] (see [BM13]).
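If the constraint ‖w − u‖ ≤ r is dropped, the maximization reduces to a top-eigenvector computation. The sketch below is our own simplified stand-in for the trust-region solver of [SZ03, BM13]: it maximizes the weighted variance over all unit w by power iteration on XᵀX.

```python
import math
import random

def max_weighted_variance(S, q, iters=200, seed=6):
    """Maximize sum_x q(x) (w.x)^2 over ALL unit-length w (the B(u, r)
    constraint of the actual oracle is dropped) by power iteration on
    X^T X, where X has rows sqrt(q(x)) x.  Returns (w, value)."""
    d = len(S[0])
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        # apply the weighted covariance operator: v = sum_i q_i (w.x_i) x_i
        v = [0.0] * d
        for x, qi in zip(S, q):
            coef = qi * sum(wj * xj for wj, xj in zip(w, x))
            for j in range(d):
                v[j] += coef * x[j]
        nv = math.sqrt(sum(t * t for t in v)) or 1.0
        w = [t / nv for t in v]
    val = sum(qi * sum(wj * xj for wj, xj in zip(w, x)) ** 2
              for x, qi in zip(S, q))
    return w, val
```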
C.3 The error within a band in each iteration
At each iteration, the algorithm of Figure 1 concentrates its attention on examples in the band. Our next
theorem (same as Theorem 3.2 in the main body) analyzes its error on these examples.
Theorem C.4. After round k of the algorithm in Figure 1, with probability 1 − δ/(k + k²), we have
err_{D_{w_{k−1},b_{k−1}}}(w_k) ≤ κ.
We will prove Theorem C.4 using the series of lemmas below. First, we bound the hinge loss of the target w* within the band S_{w_{k−1},b_{k−1}}. Since we are analyzing a particular round k, to reduce clutter in the formulas, for the rest of this section, let us refer to
• ℓ_{τ_k} simply as ℓ,
• L_{τ_k}(·, D_{w_{k−1},b_{k−1}}) as L(·).
Lemma C.5. L(w*) ≤ κ/12.
Proof. Notice that y(w* · x) is never negative, so, on any clean example (x, y), we have
ℓ(w*, x, y) = max{0, 1 − y(w* · x)/τ_k} ≤ 1,
and, furthermore, w* pays a non-zero hinge loss only inside the region where |w* · x| < τ_k. Hence,
L(w*) ≤ Pr_{D_{w_{k−1},b_{k−1}}}(|w* · x| ≤ τ_k) = Pr_{x∼D}(|w* · x| ≤ τ_k and |w_{k−1} · x| ≤ b_{k−1}) / Pr_{x∼D}(|w_{k−1} · x| ≤ b_{k−1}).
Let c′1 and c′2 be the constants from Equation (1). We can lower bound the denominator: Pr_{x∼D}(|w_{k−1} · x| < b_{k−1}) ≥ 2c′1 b_{k−1} √d. Also, the numerator is at most Pr_{x∼D}(|w* · x| ≤ τ_k) ≤ 2c′2 τ_k √d. Hence, we have
L(w*) ≤ (2c′2 τ_k) / (2c′1 b_{k−1}) = κ/12 (by setting c4 = c′1 and c1 = c′2/2).
During round k we can decompose the working set W into the set of “clean” examples W_C, which are drawn from D_{w_{k−1},b_{k−1}}, and the set of “dirty” or malicious examples W_D, which are output by the adversary. We next show that the fraction of dirty examples in round k is not too large.
Lemma C.6. With probability 1 − δ/(6(k + k²)),
|W_D| ≤ 8c1c4 η n_k 2^k. (8)
Proof. From Equation 1 and the setting of our parameters, the probability that an example falls in S_{w_{k−1},b_{k−1}} is at least 2^{−k}/(2c1c4). Therefore, with probability 1 − δ/(12(k + k²)), the number of examples we must draw before we encounter n_k examples that fall within S_{w_{k−1},b_{k−1}} is at most 4c1c4 n_k 2^k. The probability that each unlabeled example we draw is noisy is at most η. Applying a Chernoff bound, with probability at least 1 − δ/(12(k + k²)),
|W_D| ≤ 8c1c4 η n_k 2^k,
completing the proof.
Next, we bound the loss on an example in terms of the norm of x.
Lemma C.7. For any w ∈ B(w_{k−1}, r_k) and all x,
ℓ(w, x, y) ≤ (4c2π/c4) √d.
Proof. A simple calculation shows
ℓ(w, x, y) ≤ 1 + |w · x|/τ_k ≤ 1 + (|w_{k−1} · x| + ‖w − w_{k−1}‖_2 ‖x‖_2)/τ_k ≤ 1 + (b_{k−1} + r_k)/τ_k ≤ (4c2π/c4) √d.
Recall that the total variation distance between two probability distributions is the maximum difference between the probabilities that they assign to any event. We can think of q as a soft indicator function for “keeping” examples, and so interpret the inequality Σ_{x∈W} q(x) ≥ (1 − ξ)|W| as roughly saying that most examples are kept. This means that the distribution p obtained by normalizing q is close to the uniform distribution over W. We make this precise in the following lemma.
Lemma C.8. The total variation distance between p and the uniform distribution over W is at most ξ.
Proof. Lemma 1 of [LS06] implies that the total variation distance ρ between p and the uniform distribution over W satisfies
ρ = 1 − Σ_{x∈W} min{q(x), 1/|W|}.
Since q(x) ≤ 1 for all x, we have Σ_{x∈W} q(x) ≤ |W|, so that
ρ ≤ 1 − (1/|W|) Σ_{x∈W} min{q(x), 1}.
Again, since q(x) ≤ 1, we have
ρ ≤ 1 − (1 − ξ)|W| / |W| = ξ.
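Lemma C.8 can be verified directly on any weighting q. The helper below is our own illustration; it computes the total variation distance as half the L1 distance between the normalized weighting p and the uniform distribution over W.

```python
def tv_to_uniform(q):
    """Total variation distance between p = q / sum(q) and the uniform
    distribution over W, i.e. half the L1 distance.  When q <= 1 and
    sum(q) >= (1 - xi)|W|, Lemma C.8 says this is at most xi."""
    n = len(q)
    z = sum(q)
    return sum(abs(qi / z - 1.0 / n) for qi in q) / 2.0
```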
Next, we relate the average hinge loss when examples are weighted according to p, i.e., ℓ(w, p), to the hinge loss averaged over the clean examples W_C, i.e., ℓ(w, W_C). This relationship is tighter than what a uniform bound on the variance would give, since, within the band, projecting the data onto directions close to w_{k−1} leads to much smaller variance. Specifically, we prove the following lemma (the same as Lemma 3.5 in the main body, but with precise constants). Here ℓ(w, W_C) and ℓ(w, p) are defined with respect to the true unrevealed labels that the adversary has committed to.
Lemma C.9. Define z_k = √(r_k²/(d−1) + b_{k−1}²). For large enough d, with probability 1 − δ/(2(k + k²)), for any w ∈ B(w_{k−1}, r_k), we have
ℓ(w, W_C) ≤ ℓ(w, p) + (32c1c4 η/ǫ)(1 + z_k/τ_k) + κ/32 (9)
and
ℓ(w, p) ≤ 2ℓ(w, W_C) + κ/32 + 8c1c4 η/ǫ + √(32c1c4 η/ǫ) (z_k/τ_k). (10)
Proof. As in the analysis of the outlier removal procedure, we treat each element (x, y) ∈ W as distinct. Fix an arbitrary w ∈ B(w_{k−1}, r_k). By the guarantee of Theorem C.1, Lemma C.6, and Lemmas C.2 and C.3, we know that, with probability 1 − δ/(2(k + k²)),
(1/|W|) Σ_{x∈W} q(x)(w · x)² ≤ 4z_k², (11)
together with
|W_D| ≤ 8c1c4 η n_k 2^k (12)
and
(1/|W_C|) Σ_{(x,y)∈W_C} (w · x)² ≤ 2z_k². (13)
Assume that (11), (12) and (13) all hold.
Since Σ_{x∈W} q(x) ≥ (1 − ξ_k)|W| ≥ |W|/2, (11) implies
Σ_{x∈W} p(x)(w · x)² ≤ 8z_k². (14)
First, let us bound the weighted loss on the noisy examples in the training set. In particular, we will show that
Σ_{(x,y)∈W_D} p(x) ℓ(w, x, y) ≤ C_0 η 2^k + ξ_k + √(2c′C_0 η 2^k) (z_k/τ_k). (15)
To see this, notice that
\[
\begin{aligned}
\sum_{(x,y) \in W_D} p(x) \ell(w, x, y) &= \sum_{(x,y) \in W_D} p(x) \max\left\{ 0, 1 - \frac{y (w \cdot x)}{\tau_k} \right\} \\
&\leq \Pr_p(W_D) + \frac{1}{\tau_k} \sum_{(x,y) \in W_D} p(x) |w \cdot x| = \Pr_p(W_D) + \frac{1}{\tau_k} \sum_{(x,y) \in W} p(x) \mathbf{1}_{W_D}(x, y) |w \cdot x| \\
&\leq \Pr_p(W_D) + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W} p(x) \mathbf{1}_{W_D}(x, y)} \sqrt{\sum_{(x,y) \in W} p(x) (w \cdot x)^2} \quad \text{(by the Cauchy-Schwarz inequality)} \\
&\leq \Pr_p(W_D) + \sqrt{8 \Pr_p(W_D)} \, \frac{z_k}{\tau_k} \leq 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k},
\end{aligned}
\]
where the second-to-last inequality follows by (14) and the last one follows by Lemma C.8 and (12).
Similarly, we will show that
\[
\sum_{(x,y) \in W} p(x) \ell(w, x, y) \leq 1 + \frac{4 z_k}{\tau_k}. \tag{16}
\]
To see this, notice that
\[
\sum_{(x,y) \in W} p(x) \ell(w, x, y) = \sum_{(x,y) \in W} p(x) \max\left\{ 0, 1 - \frac{y (w \cdot x)}{\tau_k} \right\} \leq 1 + \frac{1}{\tau_k} \sum_{x \in W} p(x) |w \cdot x| \leq 1 + \frac{1}{\tau_k} \sqrt{\sum_{x \in W} p(x) (w \cdot x)^2} \leq 1 + \frac{4 z_k}{\tau_k},
\]
where the last step follows by (14). Next, we have
\[
\begin{aligned}
\ell(w, W_C) &= \frac{1}{|W_C|} \sum_{(x,y) \in W} \Big[ q(x) \ell(w, x, y) + \big( \mathbf{1}_{W_C}(x, y) - q(x) \big) \ell(w, x, y) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \sum_{(x,y) \in W_C} (1 - q(x)) \ell(w, x, y) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \sum_{(x,y) \in W_C} (1 - q(x)) \Big( 1 + \frac{|w \cdot x|}{\tau_k} \Big) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sum_{(x,y) \in W_C} (1 - q(x)) |w \cdot x| \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W_C} (1 - q(x))^2} \sqrt{\sum_{(x,y) \in W_C} (w \cdot x)^2} \Big]
\end{aligned}
\]
by the Cauchy-Schwarz inequality. Recall that 0 ≤ q(x) ≤ 1 and $\sum_{x \in W} q(x) \geq (1 - \xi_k)|W|$. Thus,
\[
\begin{aligned}
\ell(w, W_C) &\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sqrt{\xi_k |W|} \sqrt{\sum_{x \in W_C} (w \cdot x)^2} \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| \Big] + \frac{\sqrt{\xi_k |W| \, |W_C| \, 2 z_k^2}}{|W_C| \tau_k}
\end{aligned}
\]
by (13). Since |WC | ≥ |W |/2, we have
\[
\ell(w, W_C) \leq \frac{1}{|W_C|} \sum_{(x,y) \in W} q(x) \ell(w, x, y) + 2 \xi_k + \frac{\sqrt{4 \xi_k z_k^2}}{\tau_k}.
\]
We have chosen ξk small enough that
\[
\begin{aligned}
\ell(w, W_C) &\leq \frac{1}{|W_C|} \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \kappa/32 \\
&= \frac{\sum_{(x,y) \in W} q(x)}{|W_C|} \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&= \ell(w, p) + \left( \frac{\sum_{(x,y) \in W} q(x)}{|W_C|} - 1 \right) \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&\leq \ell(w, p) + \left( \frac{|W|}{|W_C|} - 1 \right) \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&\leq \ell(w, p) + \left( \frac{|W|}{|W_C|} - 1 \right) \left( 1 + \frac{4 z_k}{\tau_k} \right) + \kappa/32.
\end{aligned}
\]
Applying (12) yields (9).
Also,
\[
\begin{aligned}
\ell(w, p) &= \sum_{(x,y) \in W} p(x) \ell(w, x, y) = \sum_{(x,y) \in W_C} p(x) \ell(w, x, y) + \sum_{(x,y) \in W_D} p(x) \ell(w, x, y) \\
&\leq \sum_{(x,y) \in W_C} p(x) \ell(w, x, y) + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k} \quad \text{(by (15))} \\
&\leq \frac{\sum_{(x,y) \in W_C} q(x) \ell(w, x, y)}{\sum_{(x,y) \in W_C} q(x)} + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k} \\
&\leq \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{\sum_{(x,y) \in W_C} q(x)} + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k} \quad \text{(since } q(x) \leq 1 \text{ for all } x \text{)} \\
&\leq \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{|W_C| - \xi_k |W|} + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k} \\
&\leq 2 \ell(w, W_C) + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k},
\end{aligned}
\]
by (8), which in turn implies (10).
Finally, we need some bounds about estimates of the hinge loss.

Lemma C.10. With probability $1 - \frac{\delta}{2(k+k^2)}$, for all w ∈ B(wk−1, rk),
\[
|L(w) - \ell(w, W_C)| \leq \kappa/32 \tag{17}
\]
and
\[
|\ell(w, p) - \ell(w, T)| \leq \kappa/32. \tag{18}
\]
Proof. See Appendix H.
Proof of Theorem C.4. By Lemma C.10, with probability $1 - \frac{\delta}{2(k+k^2)}$, for all w ∈ B(wk−1, rk), (17) and (18) hold. Also with probability $1 - \frac{\delta}{2(k+k^2)}$, both (9) and (10) hold. Let us assume from here on that all of these hold.
Then we have
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &= \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(v_k) \\
&\leq L(v_k) \quad \text{(since for each error, the hinge loss is at least 1)} \\
&\leq \ell(v_k, W_C) + \kappa/32 \quad \text{(by (17))} \\
&\leq \ell(v_k, p) + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/16 \quad \text{(by (9))} \\
&\leq \ell(v_k, T) + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/8 \quad \text{(by (18))} \\
&\leq \ell(w^*, T) + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/4 \quad \text{(since } w^* \in B(w_{k-1}, r_k) \text{)} \\
&\leq \ell(w^*, p) + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/3 \quad \text{(by (18))}.
\end{aligned}
\]
This, together with (10) and (17), gives
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &\leq 2 \ell(w^*, W_C) + \frac{8 c_1 c_4 \eta}{\epsilon} + \sqrt{32 c_1 c_4 \eta / \epsilon} \, \frac{z_k}{\tau_k} + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + 2\kappa/5 \\
&\leq 2 L(w^*) + \frac{8 c_1 c_4 \eta}{\epsilon} + \sqrt{32 c_1 c_4 \eta / \epsilon} \, \frac{z_k}{\tau_k} + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/2 \\
&\leq \kappa/3 + \frac{8 c_1 c_4 \eta}{\epsilon} + \sqrt{32 c_1 c_4 \eta / \epsilon} \, \frac{z_k}{\tau_k} + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/2,
\end{aligned}
\]
by Lemma C.5.
Now notice that zk/τk is Θ(1). Hence an Ω(ǫ) bound on η suffices to imply that $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$ with probability $1 - \frac{\delta}{k+k^2}$.
D Proof of Theorem 4.2
Throughout this section, assume that the clean training examples are obtained by labeling data drawn according to a distribution D over Rd chosen from a λ-admissible sequence. The main algorithm and the outlier removal procedure remain the same with the following parameters.
D.1 Parameters for the algorithm
The parameters of the algorithm are set as follows. Let $M = \max\left\{\frac{2}{c_6 \pi}, 2\right\}$, where c6 is from Definition 4.1. Let c′1 be the value of c5 in part 2 of Definition 4.1 corresponding to the case where c4 is $\frac{c_6}{4M}$; then let $b_k = c'_1 M^{-k}$.
Let c′6 and c′7 be c2 and c3 respectively, from part 1 of Definition 4.1. Let $r_k = \min\{M^{-(k-1)}/c_6, \pi/2\}$, where c6 is from Definition 4.1, and $\kappa = \frac{1}{4 c'_1 c'_7 M}$. Finally, let $\tau_k = \frac{c_2 \min\{b_{k-1}, c_1\} \kappa}{6 c_3}$, where c1, c2 and c3 are the values from Definition 4.1. Let $z_k^2 = r_k^2 + b_{k-1}^2$ and $\xi_k = c \min\left\{\kappa, \frac{\kappa^2 \tau_k^2}{z_k^2}\right\}$. The value of $\sigma_k^2$ for the outlier removal procedure is $\ln^\lambda\left(1 + \frac{1}{b_{k-1}}\right)(r_k^2 + b_{k-1}^2)$.
D.2 Analysis of the outlier removal subroutine
The analysis of the learning algorithm uses the following lemma about the outlier removal subroutine of
Figure 2.
Theorem D.1. For any C > 0, there is a constant c and a polynomial p such that, for all ξ > 2η and all 0 < γ < C , if n ≥ p(1/η, d, 1/ξ, 1/δ, 1/γ), then, with probability 1 − δ, the output q of the algorithm in Figure 2 satisfies the following:
• $\sum_{x \in S} q(x) \geq (1 - \xi)|S|$ (a fraction 1 − ξ of the weight is retained)
• For all unit length w such that ‖w − u‖2 ≤ r,
\[
\frac{1}{|S|} \sum_{x \in S} q(x) (w \cdot x)^2 \leq c \ln^\lambda\left( 1 + \frac{1}{\gamma} \right) (r^2 + \gamma^2). \tag{19}
\]
Furthermore, the algorithm can be implemented in polynomial time.
Almost identically to the previous section, our proof of Theorem D.1 proceeds through a series of lemmas. Again, we would like to point out that in the analysis below we will treat each element xi ∈ S as distinct (even if xi = xj for some j ≠ i). Obviously, a feasible q satisfies the requirements of the lemma. So all we need to show is
• there is a feasible solution q, and
• we can simulate a separation oracle: given a provisional solution q, we can find a linear constraint
violated by q in polynomial time.
We will start by working on proving that there is a feasible q. First of all, a Chernoff bound implies that
n ≥ poly(1/η, 1/δ) suffices for it to be the case that, with probability 1 − δ, at most 2η members of S are
noisy. Let us assume from now on that this is the case.
We will show that q∗ which sets q∗(x) = 0 for each noisy point, and q∗(x) = 1 for each non-noisy
point, is feasible.
First, we use VC tools to show that, if enough examples are chosen, a bound like part 4 of Definition 4.1,
but averaged over the clean examples, likely holds for all relevant directions.
Lemma D.2. If we draw ℓ times i.i.d. from D to form C , with probability 1 − δ, we have that for any unit length a,
\[
\frac{1}{\ell} \sum_{x \in C} (a \cdot x)^2 \leq \mathbf{E}[(a \cdot x)^2] + \sqrt{\frac{O(d \log(\ell/\delta)(d + \log(1/\delta)))}{\ell}}.
\]
Proof: See Appendix H.
Lemma D.2 and part 4 of Definition 4.1 together directly imply that
\[
n = \mathrm{poly}\left( d, 1/\eta, 1/\delta, \frac{1}{c (r^2 + \gamma^2) \ln^\lambda(1 + 1/\gamma)} \right) = \mathrm{poly}(d, 1/\eta, 1/\delta, 1/\gamma)
\]
suffices for it to be the case that, for all w ∈ B(u, r),
\[
\frac{1}{|S|} \sum_{(x,y)} q^*(x) (w \cdot x)^2 \leq 2 \mathbf{E}[(w \cdot x)^2] \leq 2 c_8 (r^2 + \gamma^2) \ln^\lambda(1 + 1/\gamma),
\]
so that, if c = 2c8, we have that q∗ is feasible.
So what is left is to prove that the convex program has a separation oracle. First, it is easy to check whether, for all x, 0 ≤ q(x) ≤ 1, and whether $\sum_{x \in S} q(x) \geq (1 - \xi)|S|$. An algorithm can first do that. If these checks pass, then it needs to check whether there is a w ∈ B(u, r) with ||w||2 ≤ 1 such that
\[
\frac{1}{|S|} \sum_{x \in S} q(x) (w \cdot x)^2 > c \ln^\lambda\left( 1 + \frac{1}{\gamma} \right) (r^2 + \gamma^2).
\]
This can be done by finding w ∈ B(u, r) with ||w||2 ≤ 1 that maximizes $\sum_{x \in S} q(x)(w \cdot x)^2$, and checking it.
Suppose X is a matrix with a row for each x ∈ S, where the row is $\sqrt{q(x)}\, x$. Then $\sum_{x \in S} q(x)(w \cdot x)^2 = w^T X^T X w$, and maximizing this over w is an equivalent problem to minimizing $w^T (-X^T X) w$ subject to ‖w − u‖2 ≤ r and ||w|| ≤ 1. Since −XTX is symmetric, problems of this form are known to be solvable in polynomial time [SZ03] (see [BM13]).
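For the special case where only the ||w|| ≤ 1 constraint is active, the maximization reduces to a top-eigenvector computation; below is a minimal numpy sketch of that case (illustrative only — the full oracle must also handle the ‖w − u‖2 ≤ r ball, via the trust-region techniques cited above, and all variable names here are ours):

```python
import numpy as np

def max_weighted_second_moment(X, q):
    """Maximize (1/|S|) * sum_i q_i (w . x_i)^2 over unit vectors w only
    (the ||w - u|| <= r constraint of the full oracle is omitted here).
    With A having rows sqrt(q_i) x_i, the objective is the Rayleigh
    quotient of A^T A, so the maximizer is its top eigenvector."""
    A = X * np.sqrt(q)[:, None]
    M = A.T @ A                      # equals sum_i q_i x_i x_i^T
    vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return vecs[:, -1], vals[-1] / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
q = rng.uniform(0.0, 1.0, size=200)
w_top, best = max_weighted_second_moment(X, q)
```

Comparing `best` against the right-hand side of the displayed inequality is then exactly the separation check for this restricted case.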
D.3 The error within a band in each iteration
At each iteration, the algorithm of Figure 1 concentrates its attention on examples in the band. Our next theorem analyzes its error on these examples.

Theorem D.3. After round k of the algorithm in Figure 1, with probability $1 - \frac{\delta}{k+k^2}$, we have $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$.
We will prove Theorem D.3 using a series of lemmas below. First, we bound the hinge loss of the target w∗ within the band $S_{w_{k-1}, b_{k-1}}$. Since we are analyzing a particular round k, to reduce clutter in the formulas, for the rest of this section, let us refer to
• ℓτk simply as ℓ,
• Lτk(·,Dwk−1,bk−1) as L(·).
Lemma D.4. L(w∗) ≤ κ/6.
Proof. Notice that y(w∗ · x) is never negative, so, on any clean example (x, y), we have
\[
\ell(w^*, x, y) = \max\left\{ 0, 1 - \frac{y (w^* \cdot x)}{\tau_k} \right\} \leq 1,
\]
and, furthermore, w∗ pays a non-zero hinge loss only inside the region where |w∗ · x| < τk. Hence,
\[
L(w^*) \leq \Pr_{D_{w_{k-1}, b_{k-1}}}(|w^* \cdot x| \leq \tau_k) = \frac{\Pr_{x \sim D}(|w^* \cdot x| \leq \tau_k \text{ and } |w_{k-1} \cdot x| \leq b_{k-1})}{\Pr_{x \sim D}(|w_{k-1} \cdot x| \leq b_{k-1})}.
\]
Using part 1 of Definition 4.1, for the values of c1 and c2 in that definition, we can lower bound the denominator:
\[
\Pr_{x \sim D}(|w_{k-1} \cdot x| < b_{k-1}) \geq c_2 \min\{b_{k-1}, c_1\}.
\]
Part 1 of Definition 4.1 also implies that the numerator is at most
\[
\Pr_{x \sim D}(|w^* \cdot x| \leq \tau_k) \leq c_3 \tau_k.
\]
Hence, we have
\[
L(w^*) \leq \frac{c_3 \tau_k}{c_2 \min\{b_{k-1}, c_1\}} = \kappa/6.
\]
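The τ-rescaled hinge loss used throughout these proofs is simple to state in code. A small sketch (the data and variable names are our own construction) that also checks the two facts used above — on correctly labeled points the per-example loss is at most 1, and it vanishes once |w∗ · x| ≥ τ:

```python
import numpy as np

def hinge_loss(w, X, y, tau):
    """Per-example tau-rescaled hinge loss: max(0, 1 - y (w . x) / tau)."""
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau)

rng = np.random.default_rng(1)
w_star = np.array([1.0, 0.0, 0.0])
X = rng.normal(size=(500, 3))
y = np.sign(X @ w_star)          # clean labels, so y (w* . x) = |w* . x| >= 0
tau = 0.3
losses = hinge_loss(w_star, X, y, tau)
```

On clean examples the loss of w∗ is thus supported only on the margin region |w∗ · x| < τ, which is what the probability ratio above bounds.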
During round k we can decompose the working set W into the set of “clean” examples WC which are drawn from Dwk−1,bk−1 and the set of “dirty” or malicious examples WD which are output by the adversary. We will next show that the fraction of dirty examples in round k is not too large.

Lemma D.5. There is an absolute positive constant C0 such that, with probability $1 - \frac{\delta}{6(k+k^2)}$,
\[
|W_D| \leq C_0 \eta n_k M^k. \tag{20}
\]
Proof. From Equation 1 and the setting of our parameters, the probability that an example falls in $S_{w_{k-1}, b_{k-1}}$ is at least Ω(M−k). Therefore, with probability $1 - \frac{\delta}{12(k+k^2)}$, the number of examples we must draw before we encounter nk examples that fall within $S_{w_{k-1}, b_{k-1}}$ is at most O(nkMk). The probability that each unlabeled example we draw is noisy is at most η. Applying a Chernoff bound, with probability at least $1 - \frac{\delta}{12(k+k^2)}$,
\[
|W_D| \leq C_0 \eta n_k M^k,
\]
completing the proof.
Next, we bound the loss on an example in terms of the norm of x.

Lemma D.6. There is a constant c such that, for any w ∈ B(wk−1, rk), and all x in the band $S_{w_{k-1}, b_{k-1}}$,
\[
\ell(w, x, y) \leq c (1 + ||x||_2).
\]
Proof.
\[
\ell(w, x, y) \leq 1 + \frac{|w \cdot x|}{\tau_k} \leq 1 + \frac{|w_{k-1} \cdot x| + \|w - w_{k-1}\|_2 ||x||_2}{\tau_k} \leq 1 + \frac{b_{k-1} + r_k ||x||_2}{\tau_k},
\]
and substituting the parameter settings of Section D.1 for bk−1, rk and τk gives the claimed bound.
If the support of D is bounded, Lemma D.6 gives a useful worst-case bound on the loss. Next, we give a high-probability bound that holds for all λ-admissible distributions.

Lemma D.7. For an absolute constant c, with probability $1 - \frac{\delta}{6(k+k^2)}$,
\[
\max_{x \in W_C} ||x||_2 \leq c \sqrt{d} \ln\left( \frac{|W_C| k}{\delta} \right).
\]
Proof. Applying part 5 of Definition 4.1, together with a union bound, we have
\[
\Pr(\exists x \in W_C, ||x|| > \alpha) \leq c_9 |W_C| \exp(-\alpha/\sqrt{d}),
\]
and $\alpha = \sqrt{d} \ln\left( \frac{12 c_9 |W_C| k^2}{\delta} \right)$ makes the RHS at most $\frac{\delta}{6(k+k^2)}$.
Recall that the total variation distance between two probability distributions is the maximum difference between the probabilities that they assign to any event.
We can think of q as a soft indicator function for “keeping” examples, and so interpret the inequality $\sum_{x \in W} q(x) \geq (1 - \xi)|W|$ as roughly akin to saying that most examples are kept. This means that the distribution p obtained by normalizing q is close to the uniform distribution over W . We make this precise in the following lemma.

Lemma D.8. The total variation distance between p and the uniform distribution over W is at most ξ.
Proof. Lemma 1 of [LS06] implies that the total variation distance ρ between p and the uniform distribution over W satisfies
\[
\rho = 1 - \sum_{x \in W} \min\left\{ p(x), \frac{1}{|W|} \right\}.
\]
Since $\sum_{x \in W} q(x) \leq |W|$, we have $p(x) \geq q(x)/|W|$ for all x, so that
\[
\rho \leq 1 - \frac{1}{|W|} \sum_{x \in W} \min\{ q(x), 1 \}.
\]
Again, since q(x) ≤ 1, we have
\[
\rho \leq 1 - \frac{(1 - \xi)|W|}{|W|} = \xi.
\]
Next, we will relate the average hinge loss when examples are weighted according to p, i.e., ℓ(w, p), to the hinge loss averaged over clean examples WC , i.e., ℓ(w,WC ). Here ℓ(w,WC ) and ℓ(w, p) are defined with respect to the true unrevealed labels that the adversary has committed to.

Lemma D.9. There are absolute constants c1, c2 and c3 such that, for large enough d, with probability $1 - \frac{\delta}{2(k+k^2)}$, if we define $z_k = \sqrt{r_k^2 + b_{k-1}^2}$, then for any w ∈ B(wk−1, rk), we have
\[
\ell(w, W_C) \leq \ell(w, p) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/32 \tag{21}
\]
and
\[
\ell(w, p) \leq 2 \ell(w, W_C) + \kappa/32 + \frac{c_2 \eta}{\epsilon} + c_3 \sqrt{\eta/\epsilon} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k}. \tag{22}
\]
Proof. As in the analysis of the outlier removal procedure, we will treat each element (x, y) ∈ W as distinct. Fix an arbitrary w ∈ B(wk−1, rk). By the guarantee of Theorem D.1, Lemma D.5, part 5 of Definition 4.1, part 4 of Definition 4.1, and Lemma D.2, we know that, with probability $1 - \frac{\delta}{2(k+k^2)}$,
\[
\frac{1}{|W|} \sum_{x \in W} q(x) (w \cdot x)^2 \leq c' \ln^\lambda(1 + 1/b_k) z_k^2, \tag{23}
\]
together with
\[
|W_D| \leq C_0 \eta n_k M^k \tag{24}
\]
(for an absolute constant C0) and
\[
\frac{1}{|W_C|} \sum_{(x,y) \in W_C} (w \cdot x)^2 \leq c'' \ln^\lambda(1 + 1/b_k) z_k^2, \tag{25}
\]
for an absolute constant c′′. Assume that (23), (24) and (25) all hold.
Since $\sum_{x \in W} q(x) \geq (1 - \xi_k)|W| \geq |W|/2$, we have that (23) implies
\[
\sum_{x \in W} p(x) (w \cdot x)^2 \leq 2 c' \ln^\lambda(1 + 1/b_k) z_k^2. \tag{26}
\]
First, let us bound the weighted loss on noisy examples in the training set. In particular, we will show that
\[
\sum_{(x,y) \in W_D} p(x) \ell(w, x, y) \leq C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k}. \tag{27}
\]
To see this, notice that
\[
\begin{aligned}
\sum_{(x,y) \in W_D} p(x) \ell(w, x, y) &= \sum_{(x,y) \in W_D} p(x) \max\left\{ 0, 1 - \frac{y (w \cdot x)}{\tau_k} \right\} \\
&\leq \Pr_p(W_D) + \frac{1}{\tau_k} \sum_{(x,y) \in W_D} p(x) |w \cdot x| = \Pr_p(W_D) + \frac{1}{\tau_k} \sum_{(x,y) \in W} p(x) \mathbf{1}_{W_D}(x, y) |w \cdot x| \\
&\leq \Pr_p(W_D) + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W} p(x) \mathbf{1}_{W_D}(x, y)} \sqrt{\sum_{(x,y) \in W} p(x) (w \cdot x)^2} \quad \text{(by the Cauchy-Schwarz inequality)} \\
&\leq \Pr_p(W_D) + \sqrt{2 c' \Pr_p(W_D)} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \leq C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k},
\end{aligned}
\]
where the second-to-last inequality follows by (26) and the last one by Lemma D.8 and (24).
Similarly, we will show that
\[
\sum_{(x,y) \in W} p(x) \ell(w, x, y) \leq 1 + \frac{\sqrt{2 c'} \ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k}. \tag{28}
\]
To see this, notice that
\[
\sum_{(x,y) \in W} p(x) \ell(w, x, y) = \sum_{(x,y) \in W} p(x) \max\left\{ 0, 1 - \frac{y (w \cdot x)}{\tau_k} \right\} \leq 1 + \frac{1}{\tau_k} \sum_{(x,y) \in W} p(x) |w \cdot x| \leq 1 + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W} p(x) (w \cdot x)^2} \leq 1 + \frac{\sqrt{2 c'} \ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k},
\]
by (26).
Next, we have
\[
\begin{aligned}
\ell(w, W_C) &= \frac{1}{|W_C|} \sum_{(x,y) \in W} \Big[ q(x) \ell(w, x, y) + \big( \mathbf{1}_{W_C}(x, y) - q(x) \big) \ell(w, x, y) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \sum_{(x,y) \in W_C} (1 - q(x)) \ell(w, x, y) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \sum_{(x,y) \in W_C} (1 - q(x)) \Big( 1 + \frac{|w \cdot x|}{\tau_k} \Big) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sum_{(x,y) \in W_C} (1 - q(x)) |w \cdot x| \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W_C} (1 - q(x))^2} \sqrt{\sum_{(x,y) \in W_C} (w \cdot x)^2} \Big]
\end{aligned}
\]
by the Cauchy-Schwarz inequality. Recall that 0 ≤ q(x) ≤ 1 and $\sum_{(x,y) \in W} q(x) \geq (1 - \xi_k)|W|$. Thus,
\[
\begin{aligned}
\ell(w, W_C) &\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sqrt{\xi_k |W|} \sqrt{\sum_{(x,y) \in W_C} (w \cdot x)^2} \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| \Big] + \frac{\sqrt{\xi_k |W| \, |W_C| \, c'' \ln^\lambda(1 + 1/b_k) z_k^2}}{|W_C| \tau_k}
\end{aligned}
\]
by (25). Since |WC | ≥ |W |/2, we have
\[
\ell(w, W_C) \leq \frac{1}{|W_C|} \sum_{(x,y) \in W} q(x) \ell(w, x, y) + 2 \xi_k + \frac{\sqrt{2 \xi_k c'' \ln^\lambda(1 + 1/b_k) z_k^2}}{\tau_k}.
\]
We have chosen ξk small enough that
\[
\begin{aligned}
\ell(w, W_C) &\leq \frac{1}{|W_C|} \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \kappa/32 \\
&= \frac{\sum_{(x,y) \in W} q(x)}{|W_C|} \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&= \ell(w, p) + \left( \frac{\sum_{(x,y) \in W} q(x)}{|W_C|} - 1 \right) \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&\leq \ell(w, p) + \left( \frac{|W|}{|W_C|} - 1 \right) \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&\leq \ell(w, p) + \left( \frac{|W|}{|W_C|} - 1 \right) \left( 1 + \frac{\sqrt{2 c'} \ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/32.
\end{aligned}
\]
Applying (24) yields (21).
Also,
\[
\begin{aligned}
\ell(w, p) &= \sum_{(x,y) \in W} p(x) \ell(w, x, y) = \sum_{(x,y) \in W_C} p(x) \ell(w, x, y) + \sum_{(x,y) \in W_D} p(x) \ell(w, x, y) \\
&\leq \sum_{(x,y) \in W_C} p(x) \ell(w, x, y) + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \quad \text{(by (27))} \\
&\leq \frac{\sum_{(x,y) \in W_C} q(x) \ell(w, x, y)}{\sum_{(x,y) \in W_C} q(x)} + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \\
&\leq \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{\sum_{(x,y) \in W_C} q(x)} + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \quad \text{(since } q(x) \leq 1 \text{ for all } x \text{)} \\
&\leq \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{|W_C| - \xi_k |W|} + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \\
&\leq 2 \ell(w, W_C) + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k},
\end{aligned}
\]
by (24), which in turn implies (22).
Finally, we need some bounds about estimates of the hinge loss.

Lemma D.10. With probability $1 - \frac{\delta}{2(k+k^2)}$, for all w ∈ B(wk−1, rk),
\[
|L(w) - \ell(w, W_C)| \leq \kappa/32 \tag{29}
\]
and
\[
|\ell(w, p) - \ell(w, T)| \leq \kappa/32. \tag{30}
\]
Proof. See Appendix H.
Proof of Theorem D.3. By Lemma D.10, with probability $1 - \frac{\delta}{2(k+k^2)}$, for all w ∈ B(wk−1, rk), (29) and (30) hold. Also with probability $1 - \frac{\delta}{2(k+k^2)}$, both (21) and (22) hold. Let us assume from here on that all of these hold.
Then we have
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &= \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(v_k) \\
&\leq L(v_k) \quad \text{(since for each error, the hinge loss is at least 1)} \\
&\leq \ell(v_k, W_C) + \kappa/16 \quad \text{(by (29))} \\
&\leq \ell(v_k, p) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/8 \quad \text{(by (21))} \\
&\leq \ell(v_k, T) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/4 \quad \text{(by (30))} \\
&\leq \ell(w^*, T) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/4 \quad \text{(since } w^* \in B(w_{k-1}, r_k) \text{)} \\
&\leq \ell(w^*, p) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/3 \quad \text{(by (30))}.
\end{aligned}
\]
This, together with (22) and (29), gives
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &\leq 2 \ell(w^*, W_C) + \frac{c_2 \eta}{\epsilon} + c_3 \sqrt{\eta/\epsilon} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + 2\kappa/5 \\
&\leq 2 L(w^*) + \frac{c_2 \eta}{\epsilon} + c_3 \sqrt{\eta/\epsilon} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/2 \\
&\leq \kappa/3 + \frac{c_2 \eta}{\epsilon} + c_3 \sqrt{\eta/\epsilon} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/2,
\end{aligned}
\]
by Lemma D.4.
Now notice that zk/τk is Θ(1). Hence an $\Omega\!\left(\frac{\epsilon}{\log^\lambda(1/\epsilon)}\right)$ bound on η suffices to imply that $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$ with probability $1 - \frac{\delta}{k+k^2}$.
D.4 Putting it together
Now we are ready to put everything together. The proof of Theorem 4.2 follows the high level structure of
the proof of [BBZ07]; the new element is the application of Theorem D.3 which analyzes the performance
of the hinge loss minimization algorithm for learning inside the band, which in turn applies Theorem D.1,
which analyzes the benefits of our new localized outlier removal procedure.
Proof (of Theorem 4.2): We will prove by induction on k that after k ≤ s iterations, we have errD(wk) ≤ M−k with probability 1 − δ(1 − 1/(k + 1))/2.
When k = 0, all that is required is errD(w0) ≤ 1.
Assume now the claim is true for k − 1 (k ≥ 1). Then by the induction hypothesis, we know that with probability at least 1 − δ(1 − 1/k)/2, wk−1 has error at most M−(k−1). Using part 3 of Definition 4.1, this implies that θ(wk−1, w∗) ≤ M−(k−1)/c6. This in turn implies θ(wk−1, w∗) ≤ π/2. (When k = 1, this is by assumption, and otherwise it is implied by part 3 of Definition 4.1.)
Let us define $S_{w_{k-1}, b_{k-1}} = \{ x : |w_{k-1} \cdot x| \leq b_{k-1} \}$ and $\bar{S}_{w_{k-1}, b_{k-1}} = \{ x : |w_{k-1} \cdot x| > b_{k-1} \}$. Since wk−1 has unit length, and vk ∈ B(wk−1, rk), we have θ(wk−1, vk) ≤ rk, which in turn implies θ(wk−1, wk) ≤ min{M−(k−1)/c6, π/2}.
Applying part 2 of Definition 4.1 to bound the error rate outside the band, we have both
\[
\Pr_x\left[ (w_{k-1} \cdot x)(w_k \cdot x) < 0, \, x \in \bar{S}_{w_{k-1}, b_{k-1}} \right] \leq \frac{M^{-k}}{4}
\]
and
\[
\Pr_x\left[ (w_{k-1} \cdot x)(w^* \cdot x) < 0, \, x \in \bar{S}_{w_{k-1}, b_{k-1}} \right] \leq \frac{M^{-k}}{4}.
\]
Taking the sum, we obtain $\Pr_x\left[ (w_k \cdot x)(w^* \cdot x) < 0, \, x \in \bar{S}_{w_{k-1}, b_{k-1}} \right] \leq \frac{M^{-k}}{2}$. Therefore, we have
\[
\mathrm{err}(w_k) \leq \left( \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \right) \Pr(S_{w_{k-1}, b_{k-1}}) + \frac{M^{-k}}{2}.
\]
Since $\Pr(S_{w_{k-1}, b_{k-1}}) \leq 2 c'_7 b_{k-1}$, this implies
\[
\mathrm{err}(w_k) \leq \left( \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \right) 2 c'_7 b_{k-1} + \frac{M^{-k}}{2} \leq M^{-k} \left( \left( \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \right) 2 c'_1 c'_7 M + 1/2 \right).
\]
Recall that Dwk−1,bk−1 is the distribution obtained by conditioning D on the event that x ∈ Swk−1,bk−1.
Applying Theorem D.3, with probability $1 - \frac{\delta}{2(k+k^2)}$, wk has error at most $\kappa = \frac{1}{4 c'_1 c'_7 M}$ within Swk−1,bk−1, implying that err(wk) ≤ M−k, completing the proof of the induction, and therefore showing that, with probability at least 1 − δ, O(log(1/ǫ)) iterations suffice to achieve err(wk) ≤ ǫ.
A polynomial number of unlabeled samples are required by the algorithm, and the number of labeled examples required by the algorithm is $\sum_k m_k = O(d(d + \log \log(1/\epsilon) + \log(1/\delta)) \log(1/\epsilon))$.
E Proof of Theorem E.1
In this section, we describe an algorithm for learning a λ-admissible distribution in the presence of adversarial label noise. As before, we assume that the algorithm has access to w0 such that θ(w0, w∗) < π/2. This can be shown to be without loss of generality exactly as in the case of malicious noise.

Theorem E.1. Let D be a distribution over Rd chosen from a λ-admissible sequence of distributions. Let w∗ be the (unit length) target weight vector. There are absolute positive constants c′1, ..., c′4 and M > 1 and a polynomial p such that an $\Omega\!\left(\frac{\epsilon}{\log^\lambda(1/\epsilon)}\right)$ upper bound on the rate η of adversarial label noise suffices to imply that for any ǫ, δ > 0, using the algorithm from Figure 3 with cut-off values $b_k = c'_1 M^{-k}$, radii $r_k = c'_2 M^{-k}$, $\kappa = c'_3$, $\tau_k = c'_4 M^{-k}$ for k ≥ 1, a number $n_k = p(d, M^k, \log(1/\delta))$ of unlabeled examples
Figure 3 COMPUTATIONALLY EFFICIENT ALGORITHM TOLERATING ADVERSARIAL LABEL NOISE
Input: allowed error rate ǫ, probability of failure δ, an oracle that returns x, for (x, y) sampled from EXη(f,D), and an oracle for getting the label from an example; a sequence of sample sizes mk > 0; a sequence of cut-off values bk > 0; a sequence of hypothesis space radii rk > 0; a precision value κ > 0
1. Draw m1 examples and put them into a working set W .
2. For k = 1, . . . , s = ⌈log2(1/ǫ)⌉
(a) Find vk ∈ B(wk−1, rk) to approximately minimize training hinge loss over W s.t. ‖vk‖2 ≤ 1: $\ell_{\tau_k}(v_k, W) \leq \min_{w \in B(w_{k-1}, r_k) \cap B(0,1)} \ell_{\tau_k}(w, W) + \kappa/8$.
(b) Normalize vk to have unit length, yielding $w_k = \frac{v_k}{\|v_k\|_2}$.
(c) Clear the working set W .
(d) Until mk+1 additional data points are put in W , given x for (x, f(x)) obtained from EXη(f,D), if |wk · x| ≥ bk, then reject x, else put it into W .
Output: Weight vector ws of error at most ǫ with probability 1 − δ.
in round k and a number $m_k = O\left( d \log\left( \frac{d}{\epsilon \delta} \right) (d + \log(k/\delta)) \right)$ of labeled examples in round k ≥ 1, and w0 such that θ(w0, w∗) < π/2, after s = ⌈log2(1/ǫ)⌉ iterations, we find a separator ws satisfying $\mathrm{err}(w_s) = \Pr_{(x,y) \sim D}[\mathrm{sign}(w_s \cdot x) \neq \mathrm{sign}(w^* \cdot x)] \leq \epsilon$ with probability at least 1 − δ.
If the support of D is bounded in a ball of radius R(d), then $m_k = O\left( R(d)^2 (d + \log(k/\delta)) \right)$ label requests suffice.
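To make step 2(a) of Figure 3 concrete, here is a small self-contained sketch of the approximate hinge-loss minimization over B(wk−1, rk) ∩ B(0, 1). It is a heuristic projected-subgradient stand-in with invented parameter names; the theorem only requires that some polynomial-time approximate minimizer be used, and sequential projection onto the two balls is itself only a heuristic feasibility step:

```python
import numpy as np

def project(v, center, radius):
    """Euclidean projection of v onto the ball B(center, radius)."""
    d = v - center
    n = np.linalg.norm(d)
    return v if n <= radius else center + (radius / n) * d

def min_hinge_in_ball(X, y, center, radius, tau, steps=300, lr=0.05):
    """Approximately minimize the tau-rescaled hinge loss over
    B(center, radius) intersected with B(0, 1), returning the best
    iterate seen (so the result is never worse than the start)."""
    def loss(w):
        return np.maximum(0.0, 1.0 - y * (X @ w) / tau).mean()
    w = project(center, np.zeros_like(center), 1.0)  # feasible start if ||center|| <= 1
    best_w, best_l = w.copy(), loss(w)
    for _ in range(steps):
        active = (1.0 - y * (X @ w) / tau) > 0       # examples with positive hinge loss
        grad = -((active * y)[:, None] * X).mean(axis=0) / tau
        w = project(w - lr * grad, center, radius)
        w = project(w, np.zeros_like(center), 1.0)
        if loss(w) < best_l:
            best_w, best_l = w.copy(), loss(w)
    return best_w, best_l

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
w_true = np.array([1.0, 0.0, 0.0, 0.0])
y = np.sign(X @ w_true)
center = 0.9 * w_true                    # stand-in for w_{k-1}
v_k, loss_k = min_hinge_in_ball(X, y, center, radius=0.5, tau=0.5)
```

The outer loop of Figure 3 then normalizes `v_k` and uses the result to filter the next working set by the band test |wk · x| < bk.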
To prove Theorem E.1, all we need is Theorem E.2 below, which bounds the error inside the band in the case of adversarial label noise. Substituting this lemma for Theorem D.3 in the proof of Theorem 4.2 suffices to prove Theorem E.1. (In particular, for the rest of this subsection, rk, bk, κ and τk are set as in the proof of Theorem 4.2.)

Theorem E.2. During round k of the algorithm in Figure 3, with probability $1 - \frac{\delta}{k+k^2}$, we have $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$.

We will prove Theorem E.2 using a series of lemmas.
Define ℓ and L as in the proof of Theorem D.3.
First, Lemma D.4, that L(w∗) ≤ κ/6, also applies here, using exactly the same proof.
From here, the proof is organized a little differently than before. There are two main structural differences. First, previously we analyzed a relatively large set of unlabeled examples on which the algorithm performed soft outlier removal, before subsampling and training. Here, since the algorithm will not perform outlier removal, we may analyze the underlying distribution in place of the large unlabeled sample. The second difference is that, whereas before we separately analyzed the clean examples and the dirty examples, here we will analyze properties of the noisy portion of the underlying distribution, and, instead of comparing it with the clean portion, we will compare it with the distribution that would be obtained by fixing the incorrect labels. One reason that this is more convenient is that the marginal over the instances of this “fixed” distribution is D (whereas the marginal of the clean examples, in general, is not).
Let P be the joint distribution used by the algorithm, which includes the noisy labels chosen by the adversary. Let N = {(x, y) : sign(w∗ · x) ≠ y} consist of the noisy examples, so that P(N) ≤ η. Let P̄ be the joint distribution obtained by applying the correct labels. Let Pk be the distribution on the examples given
to the algorithm in round k (obtained by conditioning P to examples that fall within the band), and let P̄k be the corresponding joint distribution with clean labels.
The key lemma here relates the expected loss with respect to Pk to the expected loss with respect to P̄k.

Lemma E.3. There is an absolute positive constant c such that, if we define $z_k = \sqrt{r_k^2 + b_{k-1}^2}$, then for any w ∈ B(wk−1, rk), we have
\[
\left| \mathbf{E}_{(x,y) \sim P_k}(\ell(w, x, y)) - \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w, x, y)) \right| \leq c \sqrt{M^k \eta} \, \frac{z_k \log^{\lambda/2}(1 + 1/\gamma)}{\tau_k}. \tag{31}
\]
Proof. Fix an arbitrary w ∈ B(wk−1, rk). Recalling that N is the set of noisy examples, and that the marginals of Pk and P̄k on the instances are the same, we have
\[
\begin{aligned}
\left| \mathbf{E}_{(x,y) \sim P_k}(\ell(w, x, y)) - \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w, x, y)) \right| &= \left| \mathbf{E}_{(x,y) \sim P_k}\big( \ell(w, x, y) - \ell(w, x, \mathrm{sign}(w^* \cdot x)) \big) \right| \\
&= \left| \mathbf{E}_{(x,y) \sim P_k}\big( \mathbf{1}_{(x,y) \in N} (\ell(w, x, y) - \ell(w, x, -y)) \big) \right| \\
&\leq \mathbf{E}_{(x,y) \sim P_k}\big( \mathbf{1}_{(x,y) \in N} |\ell(w, x, y) - \ell(w, x, -y)| \big) \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim P_k}\left( \mathbf{1}_{(x,y) \in N} \frac{|w \cdot x|}{\tau_k} \right) = \frac{2}{\tau_k} \mathbf{E}_{(x,y) \sim P_k}\big( \mathbf{1}_{(x,y) \in N} |w \cdot x| \big) \\
&\leq \frac{2}{\tau_k} \sqrt{\Pr_{(x,y) \sim P_k}(N)} \times \sqrt{\mathbf{E}_{(x,y) \sim P_k}((w \cdot x)^2)}
\end{aligned}
\]
by the Cauchy-Schwarz inequality. Part 1 of Definition 4.1 implies that
\[
\Pr_{(x,y) \sim P_k}(N) \leq \frac{\Pr_{(x,y) \sim P}(N)}{\Pr_{(x,y) \sim P}(S_{w_{k-1}, b_{k-1}})} \leq c M^k \eta,
\]
for an absolute constant c, and part 4 of Definition 4.1 implies $\mathbf{E}_{(x,y) \sim P_k}((w \cdot x)^2) \leq c z_k^2 \log^\lambda(1 + 1/\gamma)$.
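The empirical analogue of this Cauchy-Schwarz argument holds sample-by-sample and is easy to sanity-check. A small sketch (our own construction: Gaussian inputs stand in for the band distribution, and a random subset of flipped labels plays the role of N):

```python
import numpy as np

def tau_hinge(w, X, y, tau):
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau)

rng = np.random.default_rng(2)
n, d, tau = 1000, 5, 0.4
X = rng.normal(size=(n, d))
w_star = np.eye(d)[0]
y_clean = np.sign(X @ w_star)
noisy = rng.random(n) < 0.05            # indicator of the "noisy" set N
y = np.where(noisy, -y_clean, y_clean)

w = w_star + 0.1 * rng.normal(size=d)   # any comparison direction
lhs = abs(tau_hinge(w, X, y, tau).mean() - tau_hinge(w, X, y_clean, tau).mean())
rhs = (2 / tau) * np.sqrt(noisy.mean()) * np.sqrt(((X @ w) ** 2).mean())
```

Here `lhs` is the loss gap between the noisy and corrected labelings and `rhs` is the Cauchy-Schwarz bound; the inequality lhs ≤ rhs holds for any realization, since the hinge loss changes by at most 2|w · x|/τ when a label flips.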
Proof of Theorem E.2. Let
\[
\mathrm{cleaned}(W) = \{ (x, \mathrm{sign}(w^* \cdot x)) : (x, y) \in W \}.
\]
Exploiting the fact that $\ell(w, x, y) = O\left( \sqrt{d \log\left( \frac{d}{\epsilon \delta} \right)} \right)$ for all (x, y) ∈ Swk−1,bk−1 and w ∈ B(wk−1, rk), as in the proof of Lemma D.10, with probability $1 - \frac{\delta}{k+k^2}$, for all w ∈ B(wk−1, rk), we have
\[
|\mathbf{E}_{(x,y) \sim P_k}(\ell(w, x, y)) - \ell(w, W)| \leq \kappa/32, \quad \text{and} \quad |\mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w, x, y)) - \ell(w, \mathrm{cleaned}(W))| \leq \kappa/32. \tag{32}
\]
Then we have, for absolute constants c1 and c2, the following:
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &\leq \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w_k, x, y)) \quad \text{(since for each error, the hinge loss is at least 1)} \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(v_k, x, y)) \quad \text{(since } \|v_k\|_2 \geq 1/2 \text{)} \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim P_k}(\ell(v_k, x, y)) + c_1 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} \quad \text{(by Lemma E.3)} \\
&\leq 2 \ell(v_k, W) + c_1 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/16 \quad \text{(by (32))} \\
&\leq 2 \ell(w^*, W) + c_1 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/8 \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim P_k}(\ell(w^*, x, y)) + c_1 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/4 \quad \text{(by (32))} \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w^*, x, y)) + c_2 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/4 \quad \text{(by Lemma E.3)} \\
&\leq c_2 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/2,
\end{aligned}
\]
since L(w∗) ≤ κ/6. Since zk/τk = Θ(1), there is a constant c3 such that η ≤ c3ǫ/ logλ(1 + 1/bk) suffices for $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$, completing the proof.
F Admissibility
F.1 Uniform distribution is 0-admissible
We will show the properties in Definition 4.1 hold for the uniform distribution with λ = 0. Part 1 is an easy
consequence of the corresponding known lemmas about the uniform distribution on the unit ball.
Lemma F.1 (see [Bau90, BBZ07, KKMS05]). For any C > 0, there are c1, c2 > 0 such that, for x drawn from the uniform distribution over $\sqrt{d}\, S^{d-1}$ and any unit length u ∈ Rd, (a) for all a, b ∈ [−C,C] for which a ≤ b, we have c1|b − a| ≤ Pr(u · x ∈ [a, b]) ≤ c2|b − a|, and (b) if b ≥ 0, we have $\Pr(u \cdot x > b) \leq \frac{1}{2} e^{-b^2/2}$.
To prove part 2, we will use a lemma from [BL13] that generalizes and strengthens a key lemma from [BBZ07].

Lemma F.2 (Theorem 4 of [BL13]). For any c1 > 0, there is a c2 > 0 such that the following holds. Let u and v be two unit vectors in Rd, and assume that θ(u, v) = α < π/2. If D is isotropic log-concave in Rd, then Prx∼D[sign(u · x) ≠ sign(v · x) and |v · x| ≥ c2α] ≤ c1α.

This has the following corollary, which proves part 2.

Lemma F.3. For any c1 > 0, there is a c2 > 0 such that the following holds for all d ≥ 4. Let u and v be two unit vectors in Rd, and assume that θ(u, v) = α < π/2. If D is uniform over Sd−1, then $\Pr_{x \sim D}[\mathrm{sign}(u \cdot x) \neq \mathrm{sign}(v \cdot x) \text{ and } |v \cdot x| \geq c_2 \alpha / \sqrt{d}] \leq c_1 \alpha$.
Proof. Consider the distribution D′ obtained by sampling from D and scaling the result up by a factor of √d.
We claim that the projection D′′ of D′ onto the space spanned by u and v is isotropic log-concave. This will imply Lemma F.3 by applying Lemma F.2, since the event in question only concerns u · x and v · x.
Assume without loss of generality that the span of u and v is
\[
T = \{ (x_1, x_2, 0, 0, ..., 0) : x_1, x_2 \in \mathbb{R} \}.
\]
The fact that D′′ is isotropic follows from the fact that D′ is isotropic, and the fact that it is log-concave follows from the known fact that, if (x1, ..., xd) is sampled uniformly from Sd−1, then the distribution of (x1, x2) is log-concave (see Corollary 4 of [BGMN05]).
Part 3 of Definition 4.1 holds trivially in the case of the uniform distribution.
The fact that part 4 of Definition 4.1 holds in the case of the isotropic rescaling of the uniform distribution
U over the surface of the unit ball follows immediately from Lemma C.2.
Part 5 follows from the fact that D is isotropic log-concave (see Lemma F.4 below).
F.2 Isotropic log-concave is 2-admissible
Part 1 of Definition 4.1 is part of the following lemma.

Lemma F.4 ([LV07]). Assume that D is isotropic log-concave in Rd and let f be its density function.
(a) We have $\Pr_{x \sim D}[\,||x||_2 \geq \alpha \sqrt{d}\,] \leq e^{-\alpha + 1}$. If d = 1 then $\Pr_{x \sim D}[x \in [a, b]] \leq |b - a|$.
(b) All marginals of D are isotropic log-concave.
(c) If d = 1 we have f(0) ≥ 1/8 and f(x) ≤ 1 for all x.
(d) There is an absolute constant c such that, if d = 1, f(x) > c for all x ∈ [−1/9, 1/9].
Part 2 is Lemma F.2.
Part 3 is implicit in [Vem10] (see Lemma 3 of [BL13]).
In order to prove part 4, we will use the following lemma.
Lemma F.5. For any C > 0, there exists a constant c such that, for any isotropic log-concave distribution D, for any a such that ‖a‖2 ≤ 1 and ||u − a||2 ≤ r, for any 0 < γ < C , and for any K ≥ 4, we have
\[
\Pr_{x \sim D_{u,\gamma}}\left( |a \cdot x| > K \sqrt{r^2 + \gamma^2} \right) \leq \frac{c}{\gamma} e^{-K}.
\]
Proof. W.l.o.g. we may assume that u = (1, 0, 0, · · · , 0). Let a′ = (a2, ..., ad), and, for a random x = (x1, x2, ..., xd) drawn from Du,γ , let x′ = (x2, ..., xd). Let $p = \Pr_{x \sim D_{u,\gamma}}\left( |a \cdot x| > K \sqrt{r^2 + \gamma^2} \right)$ be the probability that we want to bound. We may rewrite p as
\[
p = \frac{\Pr_{x \sim D}\left( |a \cdot x| > K \sqrt{r^2 + \gamma^2} \text{ and } |x_1| \leq \gamma \right)}{\Pr_{x \sim D}(|x_1| \leq \gamma)}. \tag{33}
\]
Lemma F.4 implies that there is a positive constant c1 such that the denominator satisfies the following lower bound:
\[
\Pr_{x \sim D}(|x_1| \leq \gamma) \geq c_1 \min\{\gamma, 1/9\} \geq \frac{c_1 \gamma}{9C}. \tag{34}
\]
So now, we just need an upper bound on the numerator. We have
\[
\begin{aligned}
\Pr_{x \sim D}\left( |a \cdot x| > K \sqrt{r^2 + \gamma^2} \text{ and } |x_1| \leq \gamma \right) &\leq \Pr_{x \sim D}\left( |a' \cdot x'| > K \sqrt{r^2 + \gamma^2} - \gamma \right) \\
&\leq \Pr_{x \sim D}\left( |a' \cdot x'| > (K-1) \sqrt{r^2 + \gamma^2} \right) \\
&\leq \Pr_{x \sim D}\left( |a' \cdot x'| > (K-1) r \right) \\
&\leq \Pr_{x \sim D}\left( \left| \left( \frac{a'}{||a'||_2} \right) \cdot x' \right| > K - 1 \right) \leq e^{-(K-1)},
\end{aligned}
\]
by Lemma F.4, since the marginal distribution over x′ is isotropic log-concave. Combining with (33) and (34) completes the proof.
Now we’re ready to prove Part 4.

Lemma F.6. For any C , there is a constant c such that, for all 0 < γ ≤ C , for all a such that ‖u − a‖2 ≤ r and ‖a‖2 ≤ 1,
\[
\mathbf{E}_{x \sim D_{u,\gamma}}((a \cdot x)^2) \leq c (r^2 + \gamma^2) \ln^2(1 + 1/\gamma).
\]
Proof: Let $z = \sqrt{r^2 + \gamma^2}$. Setting, with foresight, $t = 9 z^2 \ln^2(1 + 1/\gamma)$, we have
\[
\mathbf{E}_{x \sim D_{u,\gamma}}((a \cdot x)^2) = \int_0^\infty \Pr_{x \sim D_{u,\gamma}}((a \cdot x)^2 \geq \alpha) \, d\alpha \leq t + \int_t^\infty \Pr_{x \sim D_{u,\gamma}}((a \cdot x)^2 \geq \alpha) \, d\alpha. \tag{35}
\]
Since $t \geq 4\sqrt{r^2 + \gamma^2}$, Lemma F.5 implies that, for an absolute constant c, we have
\[
\mathbf{E}_{x \sim D_{u,\gamma}}((a \cdot x)^2) \leq t + \frac{c}{\gamma} \int_t^\infty \exp\left( -\left( \frac{\alpha}{r^2 + \gamma^2} \right)^{1/2} \right) d\alpha.
\]
Now, we want to evaluate the integral. Since $z = \sqrt{r^2 + \gamma^2}$, we have
\[
\int_t^\infty \exp\left( -\sqrt{\frac{\alpha}{r^2 + \gamma^2}} \right) d\alpha = \int_t^\infty \exp\left( -\sqrt{\alpha}/z \right) d\alpha.
\]
Using the change of variables u2 = α, we get
\[
\int_t^\infty \exp\left( -\sqrt{\alpha}/z \right) d\alpha = 2 \int_{\sqrt{t}}^\infty u \exp(-u/z) \, du = 2 z (\sqrt{t} + z) \exp\left( -\sqrt{t}/z \right).
\]
Putting it together, we get
\[
\mathbf{E}_{x \sim D_{u,\gamma}}((a \cdot x)^2) \leq t + \frac{2 c z (\sqrt{t} + z) \exp\left( -\sqrt{t}/z \right)}{\gamma} \leq t + z^2,
\]
since $t = 9 z^2 \ln^2(1 + 1/\gamma)$, completing the proof.
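The change-of-variables step above can be sanity-checked numerically; a minimal sketch (the specific values of z and γ are arbitrary choices of ours, and the infinite upper limit is truncated where the integrand is negligible):

```python
import numpy as np

# Check the closed form:  integral_t^inf exp(-sqrt(alpha)/z) d(alpha)
#                         = 2 z (sqrt(t) + z) exp(-sqrt(t)/z)
gamma = 0.3
z = 0.7
t = 9 * z**2 * np.log(1 + 1 / gamma)**2

alpha = np.linspace(t, t + 200 * z**2, 2_000_000)        # truncated tail is negligible
f = np.exp(-np.sqrt(alpha) / z)
numeric = np.sum((f[1:] + f[:-1]) / 2 * np.diff(alpha))  # trapezoid rule
closed = 2 * z * (np.sqrt(t) + z) * np.exp(-np.sqrt(t) / z)
```

With these values the numerical and closed-form results agree to well within a tenth of a percent.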
Finally Part 5 is also part of Lemma F.4.
G Relating Adversarial Label Noise and the Agnostic Setting
In this section we study the agnostic setting of [KSS94, KKMS05] and describe how our results imply
constant factor approximations in that model. In the agnostic model, data (x, y) is generated from a dis-
tribution D over ℜd × 1,−1. For a given concept class C , let OPT be the error of the best classifier
in C . In other words, OPT = argminf∈CerrD(f) = argminf∈CPr(x,y)∼D[f(x) 6= y]. The goal of
the learning algorithm is to output a hypothesis h which is nearly as good as f , i.e., given ǫ > 0, we want
errD(h) ≤ c · OPT + ǫ, where c is the approximation factor. Any result in the adversarial model that we
study, translates into a result for the agnostic setting via the following lemma.
Lemma G.1. For a given concept class C and distribution D, if there exists an algorithm in the adversarial
noise model which runs in time poly(d, 1/ǫ) and tolerates a noise rate of η = Ω(ǫ), then there exists an
algorithm for (C,D) in the agnostic setting which runs in time poly(d, 1/ǫ) and achieves errorO(OPT+ǫ).
Proof. Let f∗ be the optimal halfspace with error OPT . In the adversarial setting, w.r.t. f∗, the noise
rate η will be exactly OPT . Set ǫ′ = c(OPT + ǫ) as input to the algorithm for the adversarial model.
By the guarantee of the algorithm we will get a hypothesis h such that Pr(x,y)∼D[h(x) 6= f∗(x)] ≤ ǫ′ =c(OPT+ǫ). Hence by triangle inequality, we have errD(h) ≤ errD(f∗)+c(OPT+ǫ) = O(OPT+ǫ).
For the case when C is the class of origin-centered halfspaces in ℜd and the marginal of D is the uniform distribution over Sd−1, the above lemma along with Theorem 1.2 implies that we can output a halfspace of accuracy O(OPT + ǫ) in time poly(d, 1/ǫ). The work of [KKMS05] achieves a guarantee of O(OPT + ǫ) in time exponential in 1/ǫ by doing L2 regression to learn a low-degree polynomial2.
H Proof of VC lemmas
In this section, we apply some standard VC tools to establish some lemmas about estimates of expectations.
Definition H.1. Say that a set F of real-valued functions with a common domain X shatters x1, . . . , xd ∈ X if there are thresholds t1, . . . , td such that
$$\left\{\left(\operatorname{sign}(f(x_1) - t_1), \ldots, \operatorname{sign}(f(x_d) - t_d)\right) : f \in F\right\} = \{-1, 1\}^d.$$
The pseudo-dimension of F is the size of the largest set shattered by F.
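As a toy illustration of Definition H.1 (not from the paper): the class of affine functions f(x) = wx + b on the reals shatters the two points {0, 1} with thresholds t1 = t2 = 0, so its pseudo-dimension is at least 2. The grid of (w, b) values below is an arbitrary choice that happens to realize all four sign patterns.

```python
# Toy check of Definition H.1: affine functions on R shatter {0, 1}.
from itertools import product

def sign(v):
    # Ties broken toward +1, a common convention.
    return 1 if v >= 0 else -1

points = [0.0, 1.0]
thresholds = [0.0, 0.0]

# Sweep a small grid of (w, b) parameters and record realized sign patterns.
patterns = set()
for w, b in product([-2.0, 0.0, 2.0], [-1.0, 1.0]):
    f = lambda x, w=w, b=b: w * x + b
    patterns.add(tuple(sign(f(x) - t) for x, t in zip(points, thresholds)))

# All four patterns in {-1, 1}^2 are realized: the set is shattered.
print(patterns == set(product([-1, 1], repeat=2)))  # True
```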
We will use the following bound.
Lemma H.2 (see [AB99]). Let F be a set of functions from a common domain X to [a, b], let d be the pseudo-dimension of F, and let D be a probability distribution over X. Then, for
$$m = O\left(\frac{(b-a)^2}{\alpha^2}\left(d + \log(1/\delta)\right)\right),$$
if x1, . . . , xm are drawn independently at random according to D, then with probability 1 − δ, for all f ∈ F,
$$\left| \mathbf{E}_{x\sim D}(f(x)) - \frac{1}{m}\sum_{t=1}^{m} f(x_t) \right| \le \alpha.$$
2They further show that L1 regression can achieve a stronger guarantee of OPT + ǫ.
H.1 Proofs of Lemma C.10 and Lemma D.10
The pseudo-dimension of the set of linear combinations of d variables is known to be d [Pol11]. Since, for any non-increasing function ψ : R → R and any F, the pseudo-dimension of {ψ ◦ f : f ∈ F} is at most that of F (see [Pol11]), the pseudo-dimension of {ℓ(w, ·) : w ∈ ℜd} is at most d.
Let D′ be the distribution obtained by conditioning D on the event that ||x|| < R (||x|| < 1 for the uniform distribution). For ℓ ≤ nk, the total variation distance between the joint distribution of ℓ draws from D′ and ℓ draws from D is at most $\frac{\delta}{4(k + k^2)}$, so it suffices to prove (29) and (30) with respect to D′ ((17) and (18), respectively, for the uniform distribution). Applying Lemma D.6 and Lemma H.2 then completes the proof.
H.2 Proof of Lemma D.2
Define fa by fa(x) = (a · x)2. The pseudo-dimension of the set of all such functions is O(d) [KLS09]. As in the proof of Lemma D.10, w.l.o.g. all x have ||x||2 ≤ R, and applying Lemma H.2 completes the proof.
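A small Monte Carlo illustration (not from the paper) of the uniform convergence that Lemma H.2 provides for the class fa(x) = (a · x)2: for x uniform on the unit sphere in ℜd and a unit vector a, E[(a · x)2] = 1/d by symmetry, and empirical means over a common sample stay close to this for several directions a simultaneously. The directions and sample size below are arbitrary choices for the demonstration.

```python
# Monte Carlo sanity check: empirical means of f_a(x) = (a . x)^2 over a
# common sample are uniformly close to the true expectation 1/d.
import math
import random

random.seed(0)
d, m = 3, 20000

def unit_sphere_point():
    """Draw x uniformly from S^{d-1} by normalizing a Gaussian vector."""
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

sample = [unit_sphere_point() for _ in range(m)]

# A few arbitrary unit-vector directions a; for each, E[(a . x)^2] = 1/d.
directions = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1 / math.sqrt(3)] * 3]

max_dev = max(
    abs(sum(sum(ai * xi for ai, xi in zip(a, x)) ** 2 for x in sample) / m - 1.0 / d)
    for a in directions
)
print(max_dev < 0.05)  # True: all empirical means are within 0.05 of 1/d
```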