The Power of Localization
for Efficiently Learning Linear Separators with Noise
Pranjal Awasthi
Maria Florina Balcan
Philip M. Long
January 3, 2014
Abstract
We introduce a new approach for designing computationally efficient learning algorithms that are
tolerant to noise, one of the most fundamental problems in learning theory. We demonstrate the effec-
tiveness of our approach by designing algorithms with improved noise tolerance guarantees for learning
linear separators, the most widely studied and used concept class in machine learning.
We consider two of the most challenging noise models studied in learning theory, the malicious
noise model of Valiant [Val85, KL88] and the adversarial label noise model of Kearns, Schapire, and
Sellie [KSS94]. For malicious noise, where the adversary can corrupt an η fraction of both the label part
and the feature part of the observations, we provide a polynomial-time algorithm for learning linear separators
in R^d under the uniform distribution with near information-theoretically optimal noise tolerance of
η = Ω(ε). This improves significantly over the previously best known results of [KKMS05, KLS09]. For the
adversarial label noise model, where the distribution over the feature vectors is unchanged and the overall
probability of a noisy label is constrained to be at most η, we give a polynomial-time algorithm for learning
linear separators in R^d under the uniform distribution that can handle a noise rate of η = Ω(ε). This improves
significantly over the results of [KKMS05], which either required runtime super-exponential in 1/ε (ours
is polynomial in 1/ε) or tolerated less noise.
In the case that the distribution is isotropic log-concave, we present a polynomial-time algorithm
for the malicious noise model that tolerates Ω(ε/log²(1/ε)) noise, and a polynomial-time algorithm for
the adversarial label noise model that also handles Ω(ε/log²(1/ε)) noise. Both of these also improve on
results from [KLS09]. In particular, in the case of malicious noise, unlike previous results, our noise
tolerance has no dependence on the dimension d of the space.
A particularly nice feature of our algorithms is that they can naturally exploit the power of active
learning, a widely studied modern learning paradigm in which the learning algorithm receives the labels
of examples only when it asks for them. We show that in this model, our algorithms achieve
a label complexity whose dependence on the error parameter ε is exponentially better than that of any
passive algorithm. This provides the first polynomial-time active learning algorithm for learning linear
separators in the presence of adversarial label noise, as well as the first analysis of active learning under
the challenging malicious noise model.
Our algorithms and analysis combine several ingredients including aggressive localization, hinge
loss minimization, and a novel localized and soft outlier removal procedure. Our work illustrates an un-
expected use of localization techniques (previously used for obtaining better sample complexity results)
in order to obtain better noise-tolerant polynomial-time algorithms.
1 Introduction
Overview. Dealing with noisy data is one of the main challenges in machine learning and is a highly
active area of research. In this work we study the noisy learnability of linear separators, arguably the
most popular class of functions used in practice [CST00]. Linear separators are at the heart of methods
ranging from support vector machines (SVMs) to logistic regression to deep networks, and their learnability
has been the subject of intense study for over 50 years. Learning linear separators from correctly labeled
(non-noisy) examples is a very well understood problem with simple efficient algorithms like Perceptron
being effective both in the classic passive learning setting [KV94, Vap98] and in the more modern active
learning framework [Das11]. However, for noisy settings, except for the special case of uniform random
noise, very few positive algorithmic results exist even for passive learning. In the context of theoretical
computer science more broadly, problems of noisy learning are related to seminal results in approximation-hardness
[ABSS93, GR06] and cryptographic assumptions [BFKL94, Reg05], are connected to other classic
questions in learning theory (e.g., learning DNF formulas [KSS94]), and appear as barriers in differential
privacy [GHRU11]. Hence, not surprisingly, designing efficient algorithms for learning linear separators in
the presence of adversarial noise (see definitions below) is of great importance.
In this paper we present new techniques for designing efficient algorithms for learning linear separators
in the presence of malicious and adversarial noise. These are two of the most challenging noise models
studied in learning theory. The models were originally proposed for a setting in which the algorithm must
work for an arbitrary, unknown distribution. As we will see, however, the bounds on the amount of noise
tolerated in this setting were very weak, and no significant progress was made for many years. This gave
rise to the question of the role that the distribution played in determining the limits of noise tolerance.
A breakthrough result of [KKMS05] and subsequent work of [KLS09] showed that indeed better bounds
on the level of noise tolerance can be obtained for the uniform and more generally isotropic log-concave
distributions. In this paper, we significantly improve these results. For the malicious noise case, where
the adversary can corrupt both the label part and the feature part of the observation (and has unbounded
computational power and access to the entire history of the learning algorithm’s computation), we design an
efficient algorithm that can tolerate a near-optimal amount of malicious noise (within a constant factor of the
statistical limit) for the uniform distribution, and that also significantly improves over the previously known
results for log-concave distributions. In particular, unlike previous works, our noise tolerance limit has no
dependence on the dimension d of the space. We also show similar improvements for adversarial label noise,
and furthermore show that our algorithms can naturally exploit the power of active learning. Active learning
is a widely studied modern learning paradigm, where the learning algorithm only receives the classifications
of examples when it asks for them. We show that in this model, our algorithms achieve a label complexity
whose dependence on the error parameter ε is exponentially better than that of any passive algorithm. This
provides the first polynomial-time active learning algorithm for learning linear separators in the presence of
adversarial label noise, solving an open problem posed in [BBL06, Mon06]. It also provides the
first analysis showing the benefits of active learning over passive learning under the challenging malicious
noise model.
Overall, our work illustrates an unexpected use of localization techniques (previously used for obtaining
better sample complexity results) in order to obtain better noise-tolerant polynomial-time algorithms. Our
work brings a new set of algorithmic and analysis techniques, including localization and soft outlier removal,
that we believe will have applications in learning theory and optimization more broadly.
In the following, we start by formally defining the learning models we consider; we then present the most
relevant prior work, and then our main results and techniques.
Passive and Active Learning. Noise Models. In this work we consider the problem of learning linear
separators in two important learning paradigms: the classic passive learning setting and the more modern
active learning scenario. As is typical [KV94, Vap98], we assume that there exists a distribution D over R^d
and a fixed unknown target function w∗. In the noise-free setting, in the passive supervised learning model
the algorithm is given access to a distribution oracle EX(D, w∗) from which it can get training samples
(x, sign(w∗ · x)) where x ∼ D. The goal of the algorithm is to output a hypothesis w such that
err_D(w) = Pr_{x∼D}[sign(w∗ · x) ≠ sign(w · x)] ≤ ε. In the active learning model [CAL94, Das11] the learning algorithm
is given as input a pool of unlabeled examples drawn from the distribution oracle. The algorithm can then
query for the labels of examples of its choice from the pool. The goal is to produce a hypothesis of low error
while also optimizing for the number of label queries (also known as label complexity). The hope is that in
the active learning setting we can output a classifier of small error by using many fewer label requests than
in the passive learning setting by actively directing the queries to informative examples (while keeping the
number of unlabeled examples polynomial).
In this work we focus on two important and realistic noise models. The first one is the malicious noise
model of [Val85, KL88] where samples are generated as follows: with probability (1 − η) a random pair
(x, y) is output where x ∼ D and y = sign(w∗ · x); with probability η the adversary can output an arbitrary
pair (x, y) ∈ R^d × {−1, 1}. We will call η the noise rate. Each of the adversary’s examples can depend
on the state of the learning algorithm and also on the previous draws of the adversary. We will denote the
malicious oracle by EX_η(D, w∗). The goal, however, remains that of achieving an arbitrarily good predictive
approximation to the underlying target function with respect to the underlying distribution, that is, to output
a hypothesis w such that Pr_{x∼D}[sign(w∗ · x) ≠ sign(w · x)] ≤ ε.
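For concreteness, the malicious example oracle is easy to simulate. The sketch below is a rough, illustrative analogue of EX_η(D, w∗) with D taken to be uniform over the unit sphere; the `adversary` callback is a hypothetical stand-in for the adversary's arbitrary (and, in the model, history-dependent) choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def malicious_oracle(w_star, eta, d, adversary):
    """One draw from a simulated EX_eta(D, w*): with probability 1 - eta,
    a clean pair (x, sign(w* . x)) with x uniform on the unit sphere;
    with probability eta, whatever pair the adversary chooses."""
    if rng.random() < eta:
        return adversary()  # arbitrary pair in R^d x {-1, 1}
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)  # normalizing a Gaussian gives a uniform sphere point
    return x, int(np.sign(w_star @ x))
```

A worst-case adversary in the sense discussed later in the paper might, for instance, always return `(-w_star, 1)`, concentrating noise directly opposite the target.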
In this paper, we consider an extension of the malicious noise model [Val85, KL88] to the active
learning model, as follows. There are two oracles, an example generation oracle and a label revealing oracle.
The example generation oracle works as usual in the malicious noise model: with probability (1 − η) a
random pair (x, y) is generated where x ∼ D and y = sign(w∗ · x); with probability η the adversary can
output an arbitrary pair (x, y) ∈ R^d × {−1, 1}. In the active learning setting, unlike the standard malicious
noise model, when an example (x, y) is generated, the algorithm only receives x, and must make a separate
call to the label revealing oracle to get y. The goal of the algorithm is still to output a hypothesis w such that
Pr_{x∼D}[sign(w∗ · x) ≠ sign(w · x)] ≤ ε.
In the adversarial label noise model, before any examples are generated, the adversary may choose a joint
distribution P over R^d × {−1, 1} whose marginal distribution over R^d is D and such that
Pr_{(x,y)∼P}[sign(w∗ · x) ≠ y] ≤ η. In the active learning model, we again have two oracles, an example generation oracle and a
label revealing oracle. We note that the results from our theorems in this model translate immediately into
similar guarantees for the agnostic model of [KSS94] (used routinely in both passive and active learning,
e.g., [KKMS05, BBL06, Han07]) – see Appendix G for details.
We will be interested in algorithms that run in time poly(d, 1/ε) and use poly(d, 1/ε) samples. In
addition, for the active learning scenario we want our algorithms to also optimize the number of label
requests. In particular, we want the number of labeled examples to depend only polylogarithmically on 1/ε.
The goal then is to quantify, for a given value of ε, the tolerable noise rate η(ε) which would still allow us to
design an efficient (passive or active) learning algorithm.
Previous Work. In the context of passive learning, Kearns and Li’s analysis [KL88] implies that halfspaces
can be learned with respect to arbitrary distributions in polynomial time while tolerating a
malicious noise rate of Ω(ε/d). A slight variant of a construction due to Kearns and Li [KL88] shows that
malicious noise at a rate greater than ε/(1 + ε) cannot be tolerated by algorithms learning halfspaces when the
distribution is uniform over the unit sphere. The Ω(ε/d) bound for the distribution-free case was not improved
for many years. Kalai et al. [KKMS05] showed that, when the distribution is uniform, the poly(d, 1/ε)-time
averaging algorithm tolerates malicious noise at a rate Ω(ε/√d). They also described an improvement to
Ω(ε/d^{1/4}) based on the observation that uniform examples will tend to be well separated, so that pairs of
examples that are too close to one another can be removed, which limits an adversary’s ability to coordinate
the effects of its noisy examples. [KLS09] analyzed another approach to limiting the coordination
of the noisy examples: an outlier removal procedure that uses PCA to find any direction u onto which
projecting the training data leads to suspiciously high variance, and removes the examples with the
most extreme values after projecting onto any such u. Their algorithm tolerates malicious noise at a rate
Ω(ε²/log(d/ε)) under the uniform distribution.
Motivated by the fact that many modern learning applications have massive amounts of unannotated or
unlabeled data, there has been significant interest in machine learning in designing active learning algorithms
that most efficiently utilize the available data, while minimizing the need for human intervention. Over
the past decade there has been substantial progress on understanding the underlying statistical
principles of active learning, and several general characterizations have been developed for describing when
active learning could have an advantage over the classic passive supervised learning paradigm both in the
noise free settings and in the agnostic case [FSST97, Das05, BBL06, BBZ07, Han07, DHM07, CN07,
BHW08, Kol10, BHLZ10, Wan11, Das11, RR11, BH12]. However, despite many efforts, except for very
simple noise models (random classification noise [BF13] and linear noise [DGS12]), to date there are no
known computationally efficient algorithms with provable guarantees in the presence of noise. In particular,
there are no computationally efficient algorithms for the agnostic case, and furthermore no result exists
showing the benefits of active learning over passive learning in the malicious noise model, where the feature
part of the examples can be corrupted as well. We discuss additional related work in Appendix A.
1.1 Our Results
1. We give a poly(d, 1/ε)-time algorithm for learning linear separators in R^d under the uniform distribution
that can handle a noise rate of η = Ω(ε), where ε is the desired error parameter. Our algorithm
(outlined in Section 3) is quite different from those in [KKMS05] and [KLS09] and improves significantly
on the noise robustness of [KKMS05], by roughly a factor of d^{1/4}, and on the noise robustness
of [KLS09], by a factor of log(d/ε)/ε. Our noise tolerance is near-optimal and is within a constant factor of the
statistical lower bound of ε/(1 + ε). In particular we show the following.
Theorem 1.1. There is a polynomial-time algorithm A_um for learning linear separators with respect
to the uniform distribution over the unit ball in R^d in the presence of malicious noise such that
an Ω(ε) upper bound on η suffices to imply that for any ε, δ > 0, the output w of A_um satisfies
Pr_{(x,y)∼D}[sign(w · x) ≠ sign(w∗ · x)] ≤ ε with probability at least 1 − δ.

2. For the adversarial label noise model, we give a poly(d, 1/ε)-time algorithm for learning with respect to
the uniform distribution that tolerates a noise rate of Ω(ε).
Theorem 1.2. There is a polynomial-time algorithm A_ul for learning linear separators with respect
to the uniform distribution over the unit ball in R^d in the presence of adversarial label noise such
that an Ω(ε) upper bound on η suffices to imply that for any ε, δ > 0, the output w of A_ul satisfies
Pr_{(x,y)∼D}[sign(w · x) ≠ sign(w∗ · x)] ≤ ε with probability at least 1 − δ.
As a restatement of the above theorem, in the agnostic setting considered in [KKMS05], we can output
a halfspace of error at most O(η + α) in time poly(d, 1/α). The previous best result of [KKMS05]
achieves this by learning a low-degree polynomial, in time whose dependence on ε is exponential.
3. We obtain similar results for the case of isotropic log-concave distributions.
Theorem 1.3. There is a polynomial-time algorithm A_ilcm for learning linear separators with respect
to any isotropic log-concave distribution in R^d in the presence of malicious noise such that an
Ω(ε/log²(1/ε)) upper bound on η suffices to imply that for any ε, δ > 0, the output w of A_ilcm satisfies
Pr_{(x,y)∼D}[sign(w · x) ≠ sign(w∗ · x)] ≤ ε with probability at least 1 − δ.

This improves on the best previous bound of Ω(ε³/log²(d/ε)) on the noise rate [KLS09]. Notice that our
noise tolerance bound has no dependence on d.
Theorem 1.4. There is a polynomial-time algorithm A_ilcl for learning linear separators with respect
to any isotropic log-concave distribution in R^d in the presence of adversarial label noise such that an
Ω(ε/log²(1/ε)) upper bound on η suffices to imply that for any ε, δ > 0, the output w of A_ilcl
satisfies Pr_{(x,y)∼D}[sign(w · x) ≠ sign(w∗ · x)] ≤ ε with probability at least 1 − δ.

This improves on the best previous bound of Ω(ε³/log(1/ε)) on the noise rate [KLS09].
4. A particularly nice feature of our algorithms is that they can naturally exploit the power of active
learning. We show that in this model, the label complexity of both algorithms depends only polylogarithmically
on 1/ε, where ε is the desired error rate, while still using only a polynomial number
of unlabeled samples (for the uniform distribution, the dependence of the number of labels on ε is
O(log(1/ε))). Our efficient algorithm that tolerates adversarial label noise solves an open problem
posed in [BBL06, Mon06]. Furthermore, our paper provides the first active learning algorithm for
learning linear separators in the presence of a non-trivial amount of adversarial noise that can affect not
only the label part, but also the feature part.
Our work exploits the power of localization for designing noise-tolerant polynomial-time algorithms.
Such localization techniques have been used for analyzing sample complexity for passive learning (see
[BBM05, BBL05, Zha06, BLL09, BL13]) or for designing active learning algorithms (see [BBZ07, Kol10,
Han11, BL13]). In order to make such a localization strategy computationally efficient and tolerate mali-
cious noise we introduce several key ingredients described in Section 1.2.
We note that all our algorithms are proper in that they return a linear separator. (Linear models can be
evaluated efficiently, and are otherwise easy to work with.) We summarize our results in Tables 1 and 2.
Table 1: Comparison with previous poly(d, 1/ε)-time algs. for uniform distribution

Passive Learning   Prior work                        Our work
malicious          η = Ω(ε/d^{1/4}) [KKMS05]         η = Ω(ε)
                   η = Ω(ε²/log(d/ε)) [KLS09]
adversarial        η = Ω(ε/√(log(1/ε))) [KKMS05]     η = Ω(ε)

Active Learning (malicious and adversarial)    NA    η = Ω(ε)
1.2 Techniques
Hinge Loss Minimization As minimizing the 0-1 loss in the presence of noise is NP-hard [JP78, GJ90],
a natural approach is to minimize a surrogate convex loss that acts as a proxy for the 0-1 loss. A common
Table 2: Comparison with previous poly(d, 1/ε)-time algorithms for isotropic log-concave distributions

Passive Learning   Prior work                        Our work
malicious          η = Ω(ε³/log²(d/ε)) [KLS09]       η = Ω(ε/log²(1/ε))
adversarial        η = Ω(ε³/log(1/ε)) [KLS09]        η = Ω(ε/log²(1/ε))

Active Learning (malicious and adversarial)    NA    η = Ω(ε/log²(1/ε))
choice in machine learning is to use the hinge loss, defined as ℓ_τ(w, x, y) = max(0, 1 − y(w · x)/τ), and, for
a set T of examples, we let ℓ_τ(w, T) = (1/|T|) Σ_{(x,y)∈T} ℓ_τ(w, x, y). Here τ is a parameter that changes during
training. It can be shown that minimizing hinge loss with an appropriate normalization factor can tolerate a
noise rate of Ω(ε²/√d) under the uniform distribution over the unit ball in R^d. This is also the limit for such
a strategy, since a more powerful malicious adversary can concentrate all the noise directly opposite to
the target vector w∗ and make sure that the hinge loss is no longer a faithful proxy for the 0-1 loss.
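The rescaled hinge loss and its empirical average over a set T translate directly into code; a minimal sketch (function names are ours):

```python
import numpy as np

def hinge_loss(w, x, y, tau):
    """Rescaled hinge loss l_tau(w, x, y) = max(0, 1 - y (w . x) / tau)."""
    return max(0.0, 1.0 - y * np.dot(w, x) / tau)

def avg_hinge_loss(w, T, tau):
    """Empirical average (1/|T|) sum of l_tau over a set T of (x, y) pairs."""
    return sum(hinge_loss(w, x, y, tau) for x, y in T) / len(T)
```

Note that shrinking τ magnifies the loss of borderline points, which is exactly the lever the localization argument below exploits.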
Localization in the instance and concept space Our first key insight is that by using an iterative lo-
calization technique, we can limit the harm caused by an adversary at each stage and hence can still do
hinge-loss minimization despite significantly more noise. In particular, the iterative style algorithm we pro-
pose proceeds in stages and at stage k, we have a hypothesis vector wk of a certain error rate. The goal in
stage k is to produce a new vector wk+1 of error rate half of wk. In order to halve the error rate, we focus
on a band of size bk = Θ(2−k√d) around the boundary of the linear classifier whose normal vector is wk,
i.e. Swk,bk = x : |wk · x| < bk. For the rest of the paper, we will repeatedly refer to this key region
of borderline examples as “the band”. The key observation made in [BBZ07] is that outside the band, all
the classifiers still under consideration (namely those hypotheses within radius rk of the previous weight
vector wk) will have very small error. Furthermore, the probability mass of this band under the original
distributions is small enough, so that in order to make the desired progress we only need to find a hypothesis
of constant error rate over the data distribution conditioned on being within margin bk of wk. This insight
has been crucially used in the [BBZ07] in order to obtain active learning algorithms with improved label
complexity ignoring computational complexity considerations1 .
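The band and sampling conditioned on it can be sketched as follows. This is plain rejection sampling and purely illustrative (the algorithm itself filters the oracle's stream rather than resampling); names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def in_band(x, w, b):
    """Membership in the band S_{w,b} = {x : |w . x| < b}."""
    return abs(np.dot(w, x)) < b

def sample_band(w, b, d, n):
    """Rejection-sample n points uniform on the unit sphere in R^d,
    conditioned on lying in the band around w."""
    out = []
    while len(out) < n:
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)  # uniform direction on the sphere
        if in_band(x, w, b):
            out.append(x)
    return out
```

Since the band has probability mass roughly proportional to b√d under the uniform distribution, a polynomial number of unlabeled draws suffices to fill it at every stage.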
In this work, we show the surprising fact that this idea can be extended and adapted to produce polyno-
mial time algorithms with improved noise tolerance as well! Not only do we use this localization idea for
different purposes, but our analysis significantly departs from [BBZ07]. To obtain our results, we exploit
several new ideas: (1) the performance of the rescaled hinge loss minimization in smaller and smaller bands,
(2) a careful variance analysis, and (3) another type of localization — we develop and analyze a novel soft
and localized outlier removal procedure. In particular, we first show that if we minimize a variant of the
hinge loss that is rescaled depending on the width of the band, it remains a faithful enough proxy for the 0-1
error even when there is significantly more noise. As a first step towards this goal, consider the setting where
we pick τ_k proportional to b_k, the size of the band, and r_k proportional to the error rate of w_k, and then
minimize a normalized hinge loss function ℓ_{τ_k}(w, x, y) = max(0, 1 − y(w · x)/τ_k) over vectors
w ∈ B(w_k, r_k). We first show that w∗ has small hinge loss within the band. Furthermore, within the band the adversarial
examples cannot hurt the hinge loss of w∗ by a lot. To see this, notice that if the malicious noise rate is η,
then within S_{w_{k−1},b_k} the effective noise rate is Θ(η 2^k). Also, the maximum value of the hinge loss for vectors
w ∈ B(w_k, 2^{−k}) is O(√d). Hence the maximum amount by which the adversary can affect the hinge loss
1 We note that the localization considered by [BBZ07] is more aggressive than those considered in the disagreement-based
active learning literature [BBL06, Han07, Kol10, Han11, Wan11] and earlier in passive learning [BBM05, BBL05, Zha06].
is O(η 2^k √d). Using this approach we get a noise tolerance of Ω(ε/√d).
In order to get a much better noise tolerance in the adversarial or agnostic setting, we crucially exploit
a careful analysis of the variance of w · x for vectors w close to the current vector w_{k−1}; this yields a
much tighter bound on the amount by which an adversary can “hurt” the hinge loss, which in turn leads to an
improved noise tolerance of Ω(ε).

For the case of malicious noise, we additionally need to deal with the presence of outliers, i.e., points
not generated from the uniform distribution. We do this by introducing a soft localized outlier removal
procedure at each stage (described next). This procedure assigns a weight to each data point indicating how
“noisy” the point is. We then minimize the weighted hinge loss. This, combined with the variance analysis
mentioned above, leads to a noise tolerance of Ω(ε) in the malicious case as well.
Soft Localized Outlier Removal Outlier removal techniques have been studied before in the context of
learning problems [BFKV97, KLS09]. The goal of outlier removal is to limit the ability of the adversary to
coordinate the effects of noisy examples – excessive such coordination is detected and removed. Our outlier
removal procedure (see Figure 2) is similar in spirit to that of [KLS09] with two key differences. First, as
in [KLS09], we will use the variance of the examples in a particular direction to measure their coordination.
However, due to the fact that in round k we are minimizing the hinge loss only with respect to vectors that
are close to w_{k−1}, we only need to limit the variance in these directions. This variance is Θ(b_k²), which is
much smaller than 1/d. This allows us to limit the harm of the adversary to a greater extent than was possible
in the analysis of [KLS09]. The second difference is that, unlike previous outlier removal techniques, we
do not remove any examples but instead weigh them appropriately and then minimize the weighted hinge
loss. The weights indicate how noisy a given example is. We show that these weights can be computed
by solving a linear program with infinitely many constraints. We then show how to design an efficient
separation oracle for the linear program using recent general-purpose techniques from the optimization
community [SZ03, BM13].
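The paper's procedure solves the linear program of Figure 2 exactly, via a separation oracle. The sketch below is only a crude heuristic analogue of the same idea: sample directions w near u, and whenever the weighted second moment along some w exceeds the variance budget, down-weight the points most extreme along that direction. The constants, the sampling, and the update rule are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_outlier_weights(X, u, r, sigma2, c=4.0, n_dirs=200, step=0.2):
    """Heuristic soft outlier removal: keep weights q in [0, 1] such that the
    q-weighted second moment along sampled directions near u stays below
    c * sigma2, down-weighting the most extreme points when it does not."""
    n, d = X.shape
    q = np.ones(n)
    for _ in range(n_dirs):
        delta = rng.standard_normal(d)
        w = u + r * delta / np.linalg.norm(delta)  # a direction near u
        w /= np.linalg.norm(w)
        proj2 = (X @ w) ** 2
        if q @ proj2 / n <= c * sigma2:
            continue  # this direction already satisfies the variance bound
        worst = np.argsort(proj2)[-max(1, n // 20):]
        q[worst] = np.maximum(0.0, q[worst] - step)
    return q
```

The crucial difference from hard outlier removal is visible here: no point is ever discarded outright; its influence on the subsequent weighted hinge loss minimization is merely reduced.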
In Section 4 we show that our results hold for a more general class of distributions which we call
admissible distributions. From Section 4 it also follows that our results can be extended to β-log-concave
distributions (for small enough β). Such distributions, for instance, can capture mixtures of log-concave
distributions [BL13].
2 Preliminaries
Our algorithms and analysis will use the hinge loss defined as ℓ_τ(w, x, y) = max(0, 1 − y(w · x)/τ), and, for
a set T of examples, we let ℓ_τ(w, T) = (1/|T|) Σ_{(x,y)∈T} ℓ_τ(w, x, y). Here τ is a parameter that changes during
training. Similarly, the expected hinge loss w.r.t. D is defined as L_τ(w, D) = E_{x∼D}(ℓ_τ(w, x, sign(w∗ · x))).
Our analysis will also consider the distribution D_{w,γ} obtained by conditioning D on membership in the band,
i.e., the set {x : ‖x‖_2 = 1, |w · x| ≤ γ}.

Since it is very natural, for clarity of exposition we present our algorithms directly in the active learning
model. We will prove that our active algorithm only uses a polynomial number of unlabeled samples, which
then immediately implies a guarantee for the passive learning setting as well. At a high level, our algorithms
are iterative learning algorithms that operate in rounds. In each round k, we focus our attention on
points that fall near the current hypothesized decision boundary w_{k−1} and use them to obtain a new
vector w_k of lower error. In the malicious noise case, in round k we first do a soft outlier removal and then
minimize the hinge loss normalized appropriately by τ_k. A formal description appears in Figure 1, and a formal
description of the outlier removal procedure appears in Figure 2. We will present specific choices of the
Figure 1 COMPUTATIONALLY EFFICIENT ALGORITHM TOLERATING MALICIOUS NOISE
Input: allowed error rate ε; probability of failure δ; an oracle that returns x, for (x, y) sampled from
EX_η(f, D), and an oracle for getting the label of an example; a sequence of unlabeled sample sizes
n_k > 0, k ∈ Z+; a sequence of labeled sample sizes m_k > 0; a sequence of cut-off values b_k > 0; a
sequence of hypothesis space radii r_k > 0; a sequence of removal rates ξ_k; a sequence of variance bounds
σ_k²; precision value κ; weight vector w_0.
1. Draw n_1 examples and put them into a working set W.
2. For k = 1, . . . , s = ⌈log2(1/ε)⌉:
(a) Apply the algorithm from Figure 2 to W with parameters u ← w_{k−1}, γ ← b_{k−1}, r ← r_k, ξ ← ξ_k,
σ² ← σ_k², and let q : W → [0, 1] be the output function. Normalize q to form a probability distribution
p over W.
(b) Choose m_k examples from W according to p and reveal their labels. Call this set T.
(c) Find v_k ∈ B(w_{k−1}, r_k) with ‖v_k‖_2 ≤ 1 that approximately minimizes the training hinge loss over T:
ℓ_{τ_k}(v_k, T) ≤ min_{w ∈ B(w_{k−1},r_k) ∩ B(0,1)} ℓ_{τ_k}(w, T) + κ/8.
Normalize v_k to have unit length, yielding w_k = v_k/‖v_k‖_2.
(d) Clear the working set W.
(e) Until n_{k+1} additional data points are put in W: given x for (x, f(x)) obtained from EX_η(f, D), if
|w_k · x| ≥ b_k then reject x, else put it into W.
Output: weight vector w_s of error at most ε with probability 1 − δ.
parameters of the algorithms in the following sections.
The description of the algorithm and its analysis are simplified if we assume that it starts with a preliminary
weight vector w_0 whose angle with the target w∗ is acute, i.e., that satisfies θ(w_0, w∗) < π/2. We show
in Appendix B that this is without loss of generality for the types of problems we consider.
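Putting the pieces together, a toy, noise-free rendering of the outer loop of Figure 1 might look as follows. All constants, the learning rate, and the subgradient solver are illustrative stand-ins: the paper's guarantees rely on the specific parameter choices of Theorem 3.1 and on the soft outlier removal step, both omitted here since the simulated data is clean:

```python
import numpy as np

rng = np.random.default_rng(2)

def toy_localized_learner(d=5, eps=0.25, n_band=1000, lr=0.05, steps=200):
    """Toy sketch of Figure 1's loop on clean data: at round k, collect
    points in the band around w_{k-1}, query their labels, and take
    projected subgradient steps on the hinge loss rescaled by tau_k."""
    w_star = np.zeros(d); w_star[0] = 1.0          # hidden target
    w = np.zeros(d); w[0] = 1.0; w[1] = 1.0
    w /= np.linalg.norm(w)                         # acute initial angle with w*
    s = int(np.ceil(np.log2(1.0 / eps)))
    for k in range(1, s + 1):
        b = 2.0 ** (-k) / np.sqrt(d)               # band width b_k
        tau = b                                    # rescaled hinge parameter
        X = np.empty((0, d))
        while len(X) < n_band:                     # rejection-sample the band
            cand = rng.standard_normal((8 * n_band, d))
            cand /= np.linalg.norm(cand, axis=1, keepdims=True)
            X = np.vstack([X, cand[np.abs(cand @ w) < b]])
        X = X[:n_band]
        y = np.sign(X @ w_star)                    # label queries
        v = w.copy()
        for _ in range(steps):                     # minimize hinge loss in band
            active = y * (X @ v) / tau < 1.0
            grad = -(y[active, None] * X[active]).sum(0) / (len(X) * tau)
            v = v - lr * grad
            v /= max(1.0, np.linalg.norm(v))       # stay inside the unit ball
        w = v / np.linalg.norm(v)
    return w, w_star
```

The point of the sketch is the structure: labels are requested only inside ever-narrower bands, so the label complexity scales with the number of rounds, log(1/ε), rather than with 1/ε.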
3 Learning with respect to uniform distribution with malicious noise
Let S^{d−1} denote the unit sphere in R^d. In this section we focus on the case where the marginal distribution D
is the uniform distribution over S^{d−1}, and present our results for malicious noise. We present the analysis
of our algorithm directly in the active learning model, and give a proof sketch of its correctness in
Theorem 3.1 below. The proof of Theorem 1.1 follows immediately as a corollary. Complete proof details
are in Appendix C.
Theorem 3.1. Let w∗ be the (unit length) target weight vector. There are absolute positive constants
c_1, . . . , c_4 and a polynomial p such that an Ω(ε) upper bound on η suffices to imply that for any ε, δ > 0,
using the algorithm from Figure 1 with ε_0 = 1/8, cut-off values b_k = c_1 2^{−k} d^{−1/2}, radii r_k = c_2 2^{−k} π,
κ = c_3, τ_k = c_4 2^{−k} d^{−1/2} for k ≥ 1, ξ_k = cκ², σ_k² = r_k²/(d − 1) + b_{k−1}², a number n_k = p(d, 2^k, log(1/δ)) of
unlabeled examples in round k and a number m_k = O(d(d + log(k/δ))) of labeled examples in round k,
after s = ⌈log2(1/ε)⌉ iterations, we find w_s satisfying err(w_s) = Pr_{(x,y)∼D}[sign(w_s · x) ≠ sign(w∗ · x)] ≤ ε
with probability ≥ 1 − δ.
Figure 2 LOCALIZED SOFT OUTLIER REMOVAL PROCEDURE
Input: a set S = (x_1, x_2, . . . , x_n) of samples; the reference unit vector u; desired radius r; a parameter ξ
specifying the desired bound on the fraction of clean examples removed; a variance bound σ².
1. Find q : S → [0, 1] satisfying the following constraints:
(a) for all x ∈ S, 0 ≤ q(x) ≤ 1
(b) (1/|S|) Σ_{x∈S} q(x) ≥ 1 − ξ
(c) for all w ∈ B(u, r) ∩ B(0, 1), (1/|S|) Σ_{x∈S} q(x)(w · x)² ≤ cσ²
Output: A function q : S → [0, 1].
3.1 Proof Sketch of Theorem 3.1
We may assume without loss of generality that all examples, including noisy examples, fall in S^{d−1}. This is
because any example that falls outside S^{d−1} can be easily identified by the algorithm as noisy and removed,
effectively lowering the noise rate.
A first key insight is that using techniques from [BBZ07], we may reduce our problem to a subproblem
concerning learning with respect to a distribution obtained by conditioning on membership in the band. In
particular, in Appendix C.1, we prove that, for a sufficiently small absolute constant κ, Theorem 3.2 stated
below, together with proofs of its computational, sample and label complexity bounds, suffices to prove
Theorem 3.1.
Theorem 3.2. After round k of the algorithm in Figure 1, with probability at least 1 − δ/(k + k²), we have
err_{D_{w_{k−1},b_{k−1}}}(w_k) ≤ κ.
The proof of Theorem 3.2 follows from a series of steps summarized in the lemmas below. First, we
bound the hinge loss of the target w∗ within the band S_{w_{k−1},b_{k−1}}. Since we are analyzing a particular
round k, to reduce clutter in the formulas, for the rest of this section let us refer to ℓ_{τ_k} simply as ℓ and
to L_{τ_k}(·, D_{w_{k−1},b_{k−1}}) as L(·).
Lemma 3.3. L(w*) ≤ κ/12.
Proof Sketch: Notice that y(w* · x) is never negative, so, on any clean example (x, y), we have ℓ(w*, x, y) = max{0, 1 − y(w* · x)/τ_k} ≤ 1, and, furthermore, w* pays a non-zero hinge loss only inside the region where |w* · x| < τ_k. Hence,
L(w*) ≤ Pr_{D_{w_{k−1},b_{k−1}}}(|w* · x| ≤ τ_k) = Pr_{x∼D}(|w* · x| ≤ τ_k and |w_{k−1} · x| ≤ b_{k−1}) / Pr_{x∼D}(|w_{k−1} · x| ≤ b_{k−1}).
Using standard tail bounds (see Eq. 1 in Appendix C), we can lower bound the denominator: Pr_{x∼D}(|w_{k−1} · x| < b_{k−1}) ≥ c′1 b_{k−1} √d for a constant c′1. Also, the numerator is at most Pr_{x∼D}(|w* · x| ≤ τ_k) ≤ c′2 τ_k √d for another constant c′2. Hence, we have
L(w*) ≤ (c′2 τ_k √d) / (c′1 b_{k−1} √d) ≤ κ/12,
for the appropriate choice of constants c′1 and c′2 and making κ small enough.
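The ratio of band probabilities used in this proof sketch is easy to sanity-check numerically. The following sketch (our own illustration, not from the paper) estimates the conditional probability Pr(|w* · x| ≤ τ given |w_{k−1} · x| ≤ b) for the uniform distribution on the sphere by rejection sampling:

```python
import math
import random

def sphere_point(d, rng):
    """Uniform point on S^{d-1} via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in g))
    return [t / norm for t in g]

def margin_fraction_in_band(w_star, w_prev, tau, b, n=20000, seed=1):
    """Estimate Pr(|w_star.x| <= tau | |w_prev.x| <= b) for x uniform on
    the sphere: the conditional probability bounding L(w*) in Lemma 3.3."""
    rng = random.Random(seed)
    d = len(w_star)
    def dot(a, c):
        return sum(s * t for s, t in zip(a, c))
    in_band = hits = 0
    for _ in range(n):
        x = sphere_point(d, rng)
        if abs(dot(w_prev, x)) <= b:
            in_band += 1
            if abs(dot(w_star, x)) <= tau:
                hits += 1
    return hits / max(in_band, 1)
```

With τ a small constant fraction of b and w* close to w_{k−1}, the estimate is a small constant, as the lemma requires.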
During round k we can decompose the working set W into the set of “clean” examples W_C, which are drawn from D_{w_{k−1},b_{k−1}}, and the set of “dirty” or malicious examples W_D, which are output by the adversary. Next, we will relate the hinge loss of vectors over the weighted set W to the hinge loss over the clean examples W_C. In order to do this we will need the following guarantee from the outlier removal subroutine of Figure 2.
Theorem 3.4. There is a constant c and a polynomial p such that, if n ≥ p(1/η, d, 1/ξ, 1/δ, 1/γ) examples are drawn from the distribution D_{u,γ} (each replaced with an arbitrary unit-length vector with probability η < 1/4), then by using the algorithm in Figure 2 with σ² = r²/(d−1) + γ², we have that with probability 1 − δ the output q satisfies the following: (a) Σ_{x∈S} q(x) ≥ (1 − ξ)|S|, and (b) for all unit length w such that ‖w − u‖_2 ≤ r, (1/|S|) Σ_{x∈S} q(x)(w · x)² ≤ cσ². Furthermore, the algorithm can be implemented in polynomial time.
The key points in proving this theorem are the following. We show that the vector q* which assigns weight 1 to examples in W_C and weight 0 to examples in W_D is a feasible solution to the linear program in Figure 2. In order to do this, we first show that the fraction of dirty examples in round k is not too large, i.e., w.h.p. we have |W_D| = O(η|S|). Next, we use the improved variance bound from Lemma C.2 regarding E[(w · x)²] for all w close to u; this bound is r²/(d−1) + γ². The proof of feasibility follows easily by combining the variance bound with standard VC tools. In the appendix we also show how to solve the linear program in polynomial time. The complete proof of Theorem 3.4 is in Appendix C.
As explained in the introduction, the soft outlier removal procedure allows us to get a much more refined bound on the hinge loss over the clean set W_C, i.e., ℓ(w, W_C), as compared to the hinge loss over the weighted set W, i.e., ℓ(w, p). This is formalized in the following lemma. Here ℓ(w, W_C) and ℓ(w, p) are defined with respect to the true unrevealed labels that the adversary has committed to.
Lemma 3.5. There are absolute constants c1, c2 and c3 such that, for large enough d, with probability 1 − δ/(2(k + k²)), if we define z_k = √(r_k²/(d−1) + b_{k−1}²), then for any w ∈ B(w_{k−1}, r_k), we have
ℓ(w, W_C) ≤ ℓ(w, p) + c1 (η/ǫ)(1 + z_k/τ_k) + κ/32
and
ℓ(w, p) ≤ 2ℓ(w, W_C) + κ/32 + c2 (η/ǫ) + c3 √(η/ǫ) (z_k/τ_k).
A detailed proof of Lemma 3.5 is given in Appendix C; here we give a few of the ideas. The loss ℓ(w, x, y) on a particular example can be upper bounded by 1 + |w · x|/τ. One source of difference between ℓ(w, W_C), the loss on the clean examples, and ℓ(w, p), the loss minimized by the algorithm, is the loss on the (total fractional) dirty examples that were not deleted by the soft outlier removal. By using the Cauchy-Schwarz inequality, the (weighted) sum of 1 + |w · x|/τ over those surviving noisy examples can be bounded in terms of the variance in the direction w and the (total fractional) number of surviving dirty examples. Our soft outlier detection allows us to bound the variance of the surviving noisy examples by Θ(z_k²). Another way that ℓ(w, W_C) can differ from ℓ(w, p) is the effect of deleting clean examples. We can similarly use the variance on the clean examples to bound this in terms of z_k. Finally, we can flesh out the detailed bound by exploiting the (soft counterparts of) the facts that most examples are clean and few examples are excluded.
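The Cauchy-Schwarz step described above can be checked numerically on any fixed weighting. The helper below is our own illustration with hypothetical data: it compares the p-weighted loss bound 1 + |w · x|/τ summed over the (surviving) dirty examples against the bound in terms of the dirty mass and the weighted second moment.

```python
import math

def dirty_loss_and_bound(xs, q, w, tau, dirty):
    """Return (lhs, rhs) where lhs is the p-weighted sum of 1 + |w.x|/tau
    over the dirty indices, and rhs is the Cauchy-Schwarz bound:
    (dirty p-mass) + sqrt(dirty p-mass) * sqrt(weighted 2nd moment) / tau."""
    total = sum(q)
    p = [qi / total for qi in q]           # normalize q into a distribution
    def dot(a, b):
        return sum(s * t for s, t in zip(a, b))
    lhs = sum(p[i] * (1.0 + abs(dot(w, xs[i])) / tau) for i in dirty)
    mass = sum(p[i] for i in dirty)
    second_moment = sum(p[i] * dot(w, xs[i]) ** 2 for i in range(len(xs)))
    rhs = mass + math.sqrt(mass) * math.sqrt(second_moment) / tau
    return lhs, rhs
```

By Cauchy-Schwarz, `lhs <= rhs` holds for every choice of weights, direction, and dirty set.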
Given these, the proof of Theorem 3.2 can be summarized as follows.
Let E = err_{D_{w_{k−1},b_{k−1}}}(w_k) = err_{D_{w_{k−1},b_{k−1}}}(v_k) be the probability that we want to bound. Applying VC theory, w.h.p. all sampling estimates of expected loss are accurate to within κ/32, so we may assume w.l.o.g. that this is the case. Since, for each error, the hinge loss is at least 1, we have E ≤ L(v_k). Applying Lemma 3.5 and VC theory, we get
E ≤ ℓ(v_k, T) + c1 (η/ǫ)(1 + z_k/τ_k) + κ/8.
Since v_k approximately minimizes the hinge loss, VC theory implies
E ≤ ℓ(w*, p) + c1 (η/ǫ)(1 + z_k/τ_k) + κ/3.
Once again applying Lemma 3.5 and VC theory yields
E ≤ 2L(w*) + c1 (η/ǫ)(1 + z_k/τ_k) + c2 (η/ǫ) + c3 √(η/ǫ) (z_k/τ_k) + κ/2.
Since L(w*) ≤ κ/12, we get
E ≤ κ/6 + c1 (η/ǫ)(1 + z_k/τ_k) + c2 (η/ǫ) + c3 √(η/ǫ) (z_k/τ_k) + κ/2.
Now notice that z_k/τ_k is Θ(1). Hence an Ω(ǫ) bound on η suffices to imply, w.h.p., that err_{D_{w_{k−1},b_{k−1}}}(w_k) ≤ κ.
4 Learning with respect to admissible distributions with malicious noise
One of our main results (Theorem 1.3) concerns isotropic log-concave distributions. A probability distribution is isotropic log-concave if its density can be written as exp(−ψ(x)) for a convex function ψ, its mean is 0, and its covariance matrix is I.
In this section, we extend the analysis of the previous section and show that it applies to isotropic log-concave distributions, and in fact to an even more general class of distributions, which we call admissible distributions. In particular, this class includes the isotropic log-concave distributions in R^d and the uniform distributions over the unit ball in R^d.
Definition 4.1. A sequence D_4, D_5, ... of probability distributions over R^4, R^5, ... respectively is λ-admissible if it satisfies the following conditions.
(1) There are c1, c2, c3 > 0 such that, for all d ≥ 4, for x drawn from D_d and any unit length u ∈ R^d: (a) for all a, b ∈ [−c1, c1] for which a ≤ b, we have Pr(u · x ∈ [a, b]) ≥ c2|b − a|, and for all a, b ∈ R for which a ≤ b, Pr(u · x ∈ [a, b]) ≤ c3|b − a|.
(2) For any c4 > 0, there is a c5 > 0 such that, for all d ≥ 4, the following holds: if u and v are two unit vectors in R^d with θ(u, v) = α ≤ π/2, then Pr_{x∼D_d}[sign(u · x) ≠ sign(v · x) and |v · x| ≥ c5 α] ≤ c4 α.
(3) There is an absolute constant c6 such that, for any d ≥ 4 and any two unit vectors u and v in R^d, we have c6 θ(v, u) ≤ Pr_{x∼D_d}(sign(u · x) ≠ sign(v · x)).
(4) There is a constant c8 such that, for all constants c7, for all d ≥ 4, for any a such that ‖a‖_2 ≤ 1 and ‖u − a‖ ≤ r, and for any 0 < γ < c7, we have E_{x∼D_{d,u,γ}}[(a · x)²] ≤ c8 log^λ(1 + 1/γ)(r² + γ²).
(5) There is a constant c9 such that Pr_{x∼D}(‖x‖ > α) ≤ c9 exp(−α/√d).
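Condition (1) is easy to probe by simulation for a concrete admissible family. The sketch below is our own illustration: it estimates Pr(u · x ∈ [a, b]) for the standard Gaussian N(0, I_d), which is isotropic log-concave; since the marginal u · x is a standard one-dimensional Gaussian, d plays no role, and on a constant-length interval the probability is sandwiched between c2|b − a| and c3|b − a| for suitable constants.

```python
import random

def gaussian_interval_prob(a, b, n=50000, seed=2):
    """Monte Carlo estimate of Pr(u.x in [a, b]) for x ~ N(0, I_d) and a
    unit vector u; the marginal u.x is a standard Gaussian, so we sample
    it directly.  Used to sanity-check condition (1) of Definition 4.1."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if a <= rng.gauss(0.0, 1.0) <= b)
    return hits / n
```

For the interval [−0.5, 0.5] the true probability is about 0.383, i.e. roughly 0.38 per unit of interval length, comfortably between constants such as c2 = 0.2 and c3 = 0.4 on that interval.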
For the case of admissible distributions we have the following theorem, which is proved in Appendix D.
Theorem 4.2. Let a distribution D over R^d be chosen from a λ-admissible sequence of distributions, and let w* be the (unit length) target weight vector. There are settings of the parameters of the algorithm A from Figure 1 such that an Ω(ǫ / log^λ(1/ǫ)) upper bound on the rate η of malicious noise suffices to imply that, for any ǫ, δ > 0, using a number n_k = poly(d, M_k, log(1/δ)) of unlabeled examples in round k, a number m_k = O(d log(d/(ǫδ))(d + log(k/δ))) of labeled examples in round k ≥ 1, and a w_0 such that θ(w_0, w*) < π/2, after s = O(log(1/ǫ)) iterations the algorithm finds w_s satisfying err(w_s) ≤ ǫ with probability ≥ 1 − δ. If the support of D is bounded in a ball of radius R(d), then m_k = O(R(d)² (d + log(k/δ))) label requests suffice.
The above theorem contains Theorem 1.3 as a special case, since any isotropic log-concave distribution is 2-admissible (see Appendix F.2 for a proof).
5 Adversarial label noise
The intuition in the case of adversarial label noise is the same as for malicious noise, except that, because the
adversary cannot change the marginal distribution over the instances, it is not necessary to perform outlier
removal. Bounds for learning with adversarial label noise are not corollaries of bounds for learning with
malicious noise, however, because, while the marginal distribution over the instances for all the examples,
clean and noisy, is not affected by the adversary, the marginal distribution over the clean examples is changed
(because the examples whose labels are flipped are removed from the distribution over clean examples).
Theorem 1.2 and Theorem 1.4, which concern adversarial label noise, can be proved by combining the
analysis in Appendix E with the facts that the uniform distribution and i.l.c. distributions are 0-admissible
and 2-admissible respectively.
6 Discussion
Localization in this paper refers to the practice of narrowing the focus of a learning algorithm to a restricted
range of possibilities (which we know to be safe given the information so far), thereby reducing sensitivity
of estimates of the quality of these possibilities based on random data; this in turn leads to better noise
tolerance in our work. (Note that, while the examples in the band in round k do not occupy a neighborhood
in feature space, they concern differences between hypotheses in a neighborhood around wk−1.) We note
that the idea of localization in the concept space is traditionally used in statistical learning theory both in
supervised and active learning for getting sharper rates [BBL05, BLL09, Kol10]. Furthermore, the idea of
localization in the instance space has been used in margin-based analysis of active learning [BBZ07, BL13].
In this work we used localization in both senses in order to get polynomial-time algorithms with better noise
tolerance. It would be interesting to further exploit this idea for other concept spaces.
References
[AB99] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge
University Press, 1999.
[ABS10] P. Awasthi, A. Blum, and O. Sheffet. Improved guarantees for agnostic learning of disjunctions.
COLT, 2010.
[ABSS93] S. Arora, L. Babai, J. Stern, and Z. Sweedyk. The hardness of approximate optima in lat-
tices, codes, and systems of linear equations. In Proceedings of the 1993 IEEE 34th Annual
Foundations of Computer Science, 1993.
[Bau90] E. B. Baum. The perceptron algorithm is fast for nonmalicious distributions. Neural Compu-
tation, 2:248–260, 1990.
[BBL05] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[BBL06] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
[BBM05] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of
Statistics, 33(4):1497–1537, 2005.
[BBZ07] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.
[BF13] M.-F. Balcan and V. Feldman. Statistical active learning algorithms. NIPS, 2013.
[BFKL94] Avrim Blum, Merrick L. Furst, Michael J. Kearns, and Richard J. Lipton. Cryptographic prim-
itives based on hard learning problems. In Proceedings of the 13th Annual International Cryp-
tology Conference on Advances in Cryptology, 1994.
[BFKV97] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning
noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
[BGMN05] F. Barthe, O. Guedon, S. Mendelson, and A. Naor. A probabilistic approach to the geometry of
the pn-ball. The Annals of Probability, 33(2):480–513, 2005.
[BH12] M.-F. Balcan and S. Hanneke. Robust interactive learning. In COLT, 2012.
[BHLZ10] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without con-
straints. In NIPS, 2010.
[BHW08] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In
COLT, 2008.
[BL13] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-
concave distributions. In Conference on Learning Theory, 2013.
[BLL09] N. H. Bshouty, Y. Li, and P. M. Long. Using the doubling dimension to analyze the generaliza-
tion of learning algorithms. JCSS, 2009.
[BM13] D. Bienstock and A. Michalka. Polynomial solvability of variants of the trust-region subprob-
lem, 2013. Optimization Online.
[BSS12] A. Birnbaum and S. Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-
accuracy tradeoffs. NIPS, 2012.
[Byl94] T. Bylander. Learning linear threshold functions in the presence of classification noise. In
Conference on Computational Learning Theory, 1994.
[CAL94] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine
Learning, 15(2), 1994.
[CGZ10] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Learning noisy linear classifiers via adaptive
and selective sampling. Machine Learning, 2010.
[CN07] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.
[CST00] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other
kernel-based learning methods. Cambridge University Press, 2000.
[Das05] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, volume 18, 2005.
[Das11] S. Dasgupta. Active learning. Encyclopedia of Machine Learning, 2011.
[DGS12] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and
multiple teachers. JMLR, 2012.
[DHM07] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. NIPS,
20, 2007.
[FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities
and halfspaces. In FOCS, pages 563–576, 2006.
[FSST97] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by
committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
[GHRU11] Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunc-
tions and the statistical query barrier. In Proceedings of the 43rd annual ACM symposium on
Theory of computing, 2011.
[GJ90] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory
of NP-Completeness. 1990.
[GR06] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise.
In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science,
2006.
[GR09] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal
on Computing, 39(2):742–765, 2009.
[GSSS13] A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient pool-based active learning of halfspaces.
In ICML, 2013.
[Han07] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
[Han11] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361,
2011.
[JP78] D. S. Johnson and F. Preparata. The densest hemisphere problem. Theoretical Computer
Science, 6(1):93 – 107, 1978.
[KKMS05] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically
learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of
Computer Science, 2005.
[KL88] Michael Kearns and Ming Li. Learning in the presence of malicious errors. In Proceedings of
the twentieth annual ACM symposium on Theory of computing, 1988.
[KLS09] A. R. Klivans, P. M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise.
Journal of Machine Learning Research, 10, 2009.
[Kol10] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning.
Journal of Machine Learning Research, 11:2457–2485, 2010.
[KSS94] Michael J. Kearns, Robert E. Schapire, and Linda M. Sellie. Toward efficient agnostic learning.
Mach. Learn., 17(2-3), November 1994.
[KV94] M. Kearns and U. Vazirani. An introduction to computational learning theory. MIT Press,
Cambridge, MA, 1994.
[LS06] P. M. Long and R. A. Servedio. Attribute-efficient learning of decision lists and linear threshold
functions under unconcentrated distributions. NIPS, 2006.
[LS11] P. M. Long and R. A. Servedio. Learning large-margin halfspaces with more malicious noise.
NIPS, 2011.
[LV07] L. Lovasz and S. Vempala. The geometry of logconcave functions and sampling algorithms.
Random Structures and Algorithms, 30(3):307–358, 2007.
[Mon06] Claire Monteleoni. Efficient algorithms for general active learning. In Proceedings of the 19th
annual conference on Learning Theory, 2006.
[Pol11] D. Pollard. Convergence of Stochastic Processes. Springer Series in Statistics. 2011.
[Reg05] Oded Regev. On lattices, learning with errors, random linear codes, and cryptography. In
Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, 2005.
[RR11] M. Raginsky and A. Rakhlin. Lower bounds for passive and active learning. In NIPS, 2011.
[Ser01] Rocco A. Servedio. Smooth boosting and learning with malicious noise. In 14th Annual Con-
ference on Computational Learning Theory and 5th European Conference on Computational
Learning Theory, 2001.
[SZ03] J. Sturm and S. Zhang. On cones of nonnegative quadratic functions. Mathematics of Opera-
tions Research, 28:246–267, 2003.
[Val85] L. G. Valiant. Learning disjunctions of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, 1985.
[Vap98] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[Vem10] S. Vempala. A random-sampling-based algorithm for learning intersections of halfspaces.
JACM, 57(6), 2010.
[Wan11] L. Wang. Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active
Learning. JMLR, 2011.
[Zha06] T. Zhang. Information theoretical upper and lower bounds for statistical estimation. IEEE
Transactions on Information Theory, 52(4):1307–1321, 2006.
A Additional Related Work
Passive Learning Blum et al. [BFKV97] considered noise-tolerant learning of halfspaces under a more
idealized noise model, known as the random noise model, in which the label of each example is flipped
with a certain probability, independently of the feature vector. Some other, less closely related, work on
efficient noise-tolerant learning of halfspaces includes [Byl94, BFKV97, FGKP06, GR09, Ser01, ABS10,
LS11, BSS12].
Active Learning As we have mentioned, most prior theoretical work on active learning focuses on either
sample complexity bounds (without regard for efficiency) or on providing polynomial time algorithms in the
noiseless case or under simple noise models (random classification noise [BF13] or linear noise [CGZ10, DGS12]).
In [CGZ10, DGS12] online learning algorithms in the selective sampling framework are presented,
where labels must be actively queried before they are revealed. Under the assumption that the label condi-
tional distribution is a linear function determined by a fixed target vector, they provide bounds on the regret
of the algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of
generating the instances. As pointed out in [DGS12], these results can also be converted to a distributional
PAC setting where instances xt are drawn i.i.d. In this setting they obtain exponential improvement in label
complexity over passive learning. These interesting results and techniques are not directly comparable to
ours. Our framework is not restricted to halfspaces. Another important difference is that (as pointed out
in [GSSS13]) the exponential improvement they give is not possible in the noiseless version of their set-
ting. In other words, the addition of linear noise defined by the target makes the problem easier for active
sampling. By contrast RCN can only make the classification task harder than in the realizable case.
Recently, [BF13] gave the first polynomial-time algorithms for actively learning thresholds, balanced
rectangles, and homogeneous linear separators under log-concave distributions in the presence of random
classification noise. Active learning with respect to isotropic log-concave distributions in the absence of
noise was studied in [BL13].
B Initializing with vector w0
Suppose we have an algorithm B as a subroutine that works given access to such a w_0. Then we can obtain an algorithm A that works without it, as follows; we describe the procedure for general admissible distributions. With probability 1, for a random u, either u or −u has an acute angle with w*. We may then run B with both choices, with ǫ set to πc6/4, where c6 is the constant in Definition 4.1. Then we can use hypothesis testing on O(log(1/δ)) examples to find, with high probability, a hypothesis w′ with error less than πc6/4. Part 3 of Definition 4.1 then implies that θ(w′, w*) < π/4, so A may set w_0 = w′ and call B again.
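The hypothesis-testing step can be sketched as follows. This is our own illustration (the candidate set and labeled sample are hypothetical): among the candidate hypotheses, return the one with the lowest empirical error on the labeled examples.

```python
def pick_initial_direction(candidates, labeled_examples):
    """Hypothesis-testing step sketched above: return the candidate
    halfspace with the lowest empirical error on the labeled sample.
    With O(log(1/delta)) examples, the survivor w' has error < pi*c6/4
    with high probability, hence an acute angle with w*."""
    def dot(a, b):
        return sum(s * t for s, t in zip(a, b))
    def empirical_error(w):
        wrong = sum(1 for x, y in labeled_examples
                    if (1 if dot(w, x) >= 0 else -1) != y)
        return wrong / len(labeled_examples)
    return min(candidates, key=empirical_error)
```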
C Proof of Theorem 3.1
We start by stating properties of the distribution D which will be useful in the analysis below.
1. [Bau90, BBZ07, KKMS05] For any C > 0, there are c1, c2 > 0 such that, for x drawn from the uniform distribution over S^{d−1} and any unit length u ∈ R^d:
• for all a, b ∈ [−C/√d, C/√d] for which a ≤ b, we have
c1|b − a|√d ≤ Pr(u · x ∈ [a, b]) ≤ c2|b − a|√d, (1)
• and if b ≥ 0, we have
Pr(u · x > b) ≤ (1/2) e^{−db²/2}. (2)
2. [BBZ07, BL13] For any c6 > 0, there is a c7 > 0 such that, for all d ≥ 4, the following holds. Let u and v be two unit vectors in R^d, and assume that θ(u, v) = α ≤ π/2. Then
Pr_{x∼D_d}[sign(u · x) ≠ sign(v · x) and |v · x| ≥ c7 α/√d] ≤ c6 α. (3)
C.1 Margin based analysis
The proof of Theorem 3.1 follows the high-level structure of the proof of [BBZ07]; the new element is the application of Theorem C.4, which analyzes the performance of the hinge loss minimization algorithm for learning inside the band, which in turn applies Theorem C.1, which analyzes the benefits of our new localized outlier removal procedure.
Proof (of Theorem 1.1): We prove by induction on k that after k ≤ s iterations, we have err_D(w_k) ≤ 2^{−(k+1)} with probability at least 1 − δ(1 − 1/(k + 1))/2.
When k = 0, all that is required is err_D(w_0) ≤ 1/2.
Assume now the claim is true for k − 1 (k ≥ 1). By the induction hypothesis, with probability at least 1 − δ(1 − 1/k)/2, w_{k−1} has error at most 2^{−k}. This implies θ(w_{k−1}, w*) ≤ π 2^{−k}.
Let us define S_{w_{k−1},b_{k−1}} = {x : |w_{k−1} · x| ≤ b_{k−1}} and its complement S̄_{w_{k−1},b_{k−1}} = {x : |w_{k−1} · x| > b_{k−1}}. Since w_{k−1} has unit length and v_k ∈ B(w_{k−1}, r_k), we have θ(w_{k−1}, v_k) ≤ r_k, which in turn implies θ(w_{k−1}, w_k) ≤ r_k.
Applying Equation 3 to bound the error rate outside the band, we have both
Pr_x[(w_{k−1} · x)(w_k · x) < 0, x ∈ S̄_{w_{k−1},b_{k−1}}] ≤ 2^{−(k+4)} and
Pr_x[(w_{k−1} · x)(w* · x) < 0, x ∈ S̄_{w_{k−1},b_{k−1}}] ≤ 2^{−(k+4)}.
Taking the sum, we obtain Pr_x[(w_k · x)(w* · x) < 0, x ∈ S̄_{w_{k−1},b_{k−1}}] ≤ 2^{−(k+3)}. Therefore, we have
err(w_k) ≤ (err_{D_{w_{k−1},b_{k−1}}}(w_k)) Pr(S_{w_{k−1},b_{k−1}}) + 2^{−(k+3)}.
Let c′2 be the constant from Equation 1. We have Pr(S_{w_{k−1},b_{k−1}}) ≤ 2c′2 b_{k−1} √d, which implies
err(w_k) ≤ (err_{D_{w_{k−1},b_{k−1}}}(w_k)) 2c′2 b_{k−1} √d + 2^{−(k+3)} ≤ 2^{−(k+1)} ((err_{D_{w_{k−1},b_{k−1}}}(w_k)) 4c1c′2 + 1/2).
Recall that D_{w_{k−1},b_{k−1}} is the distribution obtained by conditioning D on the event that x ∈ S_{w_{k−1},b_{k−1}}. Applying Theorem C.4, with probability 1 − δ/(2(k + k²)), w_k has error at most κ = 1/(8c1c′2) within S_{w_{k−1},b_{k−1}}, implying that err(w_k) ≤ 2^{−(k+1)}, completing the induction, and therefore showing that, with probability at least 1 − δ, O(log(1/ǫ)) iterations suffice to achieve err(w_k) ≤ ǫ.
A polynomial number of unlabeled samples is required by the algorithm, and the number of labeled examples required is Σ_k m_k = O(d(d + log log(1/ǫ) + log(1/δ)) log(1/ǫ)).
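The induction above is driven by the iterative loop of Figure 1. The following sketch is our own simplified illustration of that loop's structure: outlier removal is omitted, the hinge minimization is a plain projected subgradient method that returns the best iterate seen (so it never does worse than its starting point), and all constants are placeholders rather than the values fixed in the theorem.

```python
import math
import random

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [t / n for t in v]

def hinge_min_in_ball(examples, center, radius, tau, steps=200, lr=0.05):
    """Approximately minimize the tau-hinge loss over B(center, radius)
    intersected with the unit ball, by projected subgradient descent;
    a simplified stand-in for the paper's convex optimization step."""
    def loss(w):
        return sum(max(0.0, 1.0 - y * dot(w, x) / tau)
                   for x, y in examples) / len(examples)
    w, best_w, best_loss = list(center), list(center), loss(center)
    d, m = len(center), len(examples)
    for _ in range(steps):
        g = [0.0] * d
        for x, y in examples:
            if y * dot(w, x) < tau:          # active hinge term
                for j in range(d):
                    g[j] -= y * x[j] / (tau * m)
        w = [w[j] - lr * g[j] for j in range(d)]
        diff = [w[j] - center[j] for j in range(d)]
        nd = math.sqrt(dot(diff, diff))
        if nd > radius:                       # project onto B(center, radius)
            w = [center[j] + radius * diff[j] / nd for j in range(d)]
        nw = math.sqrt(dot(w, w))
        if nw > 1.0:                          # project onto the unit ball
            w = [t / nw for t in w]
        if loss(w) < best_loss:
            best_w, best_loss = list(w), loss(w)
    return best_w

def localized_learner(w0, label, d, s, m=100, c1=1.0, c2=0.3, c4=1.0, seed=4):
    """Skeleton of the Figure 1 loop: in round k, rejection-sample the band
    |w_{k-1}.x| <= b_{k-1} from the uniform sphere, query labels, minimize
    the tau_k-hinge loss over B(w_{k-1}, r_k), and normalize."""
    rng = random.Random(seed)
    w = list(w0)
    for k in range(1, s + 1):
        b = c1 * 2.0 ** (-(k - 1)) / math.sqrt(d)
        r = c2 * 2.0 ** (-k) * math.pi
        tau = c4 * 2.0 ** (-k) / math.sqrt(d)
        band = []
        while len(band) < m:
            x = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
            if abs(dot(w, x)) <= b:
                band.append((x, label(x)))
        w = normalize(hinge_min_in_ball(band, w, r, tau))
    return w
```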
C.2 Analysis of the outlier removal subroutine
The analysis of the learning algorithm uses the following theorem (same as Theorem 3.4 in the main body)
about the outlier removal subroutine of Figure 2.
Theorem C.1. There is a polynomial p such that, if n ≥ p(1/η, d, 1/ξ, 1/δ, 1/γ) examples are drawn from the distribution D_{u,γ} (each replaced with an arbitrary unit-length vector with probability η < 1/4), then, with probability 1 − δ, the output q of the algorithm in Figure 2 satisfies the following:
• Σ_{x∈S} q(x) ≥ (1 − ξ)|S| (a fraction 1 − ξ of the weight is retained);
• for all unit length w such that ‖w − u‖_2 ≤ r,
(1/|S|) Σ_{x∈S} q(x)(w · x)² ≤ 2(r²/(d−1) + γ²). (4)
Furthermore, the algorithm can be implemented in polynomial time.
Our proof of Theorem 3.4 proceeds through a series of lemmas. We point out that in the analysis below we treat each element x_i ∈ S as distinct (even if x_i = x_j for some j ≠ i). Obviously, a feasible q satisfies the requirements of the theorem, so all we need to show is that
• there is a feasible solution q, and
• we can simulate a separation oracle: given a provisional solution q, we can find a linear constraint violated by q in polynomial time.
We start by proving that there is a feasible q. First of all, a Chernoff bound implies that n ≥ poly(1/η, 1/δ) suffices for it to be the case that, with probability 1 − δ, at most a 2η fraction of the members of S are noisy. Let us assume from now on that this is the case.
We will show that q*, which sets q*(x, y) = 0 for each noisy point and q*(x, y) = 1 for each non-noisy point, is feasible. First we bound E[(a · x)²] for all vectors a close to u. This is formalized in the following lemma.
Lemma C.2. For all a such that ‖u − a‖_2 ≤ r and ‖a‖_2 ≤ 1,
E_{x∼U_{u,γ}}[(a · x)²] ≤ r²/(d−1) + γ².
Proof. W.l.o.g. we may assume that u = (1, 0, 0, ..., 0). We can write x = (x_1, x_2, ..., x_d) as x = (x_1, x′), so that x′ is chosen uniformly over all vectors in R^{d−1} of length at most √(1 − x_1²). Let us decompose E_{x∼U_{u,γ}}[(a · x)²] into parts that we can analyze separately, as follows:
E_{x∼U_{u,γ}}[(a · x)²] = a_1² E_{x∼U_{u,γ}}[x_1²] + 2a_1 Σ_{i=2}^{d} a_i E_{x∼U_{u,γ}}[x_1 x_i] + E_{x∼U_{u,γ}}[(x′ · a)²]. (5)
The quantity E_{x∼U_{u,γ}}[(x′ · a)²] is at most the expectation of (x′ · a)² when x′ = (0, x_2, ..., x_d) is sampled uniformly from the unit ball in R^{d−1}. Thus, since ‖u − a‖_2 ≤ r implies Σ_{i=2}^{d} a_i² ≤ r²,
E_{x∼U_{u,γ}}[(x′ · a)²] ≤ (1/(d−1)) Σ_{i=2}^{d} a_i² ≤ r²/(d−1). (6)
Furthermore, since |x_1| ≤ γ when x is drawn from U_{u,γ}, we have
E_{x∼U_{u,γ}}[x_1²] ≤ γ². (7)
Finally, by symmetry, E_{x∼U_{u,γ}}[x_1 x_i] = 0 for all i ≥ 2. Putting this together with (7), (6) and (5) completes the proof.
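The bound of Lemma C.2 can be checked by simulation. The sketch below is our own illustration: it rejection-samples the band |x_1| ≤ γ of the uniform distribution over the unit ball and estimates the second moment in a direction a close to u = e_1.

```python
import math
import random

def band_second_moment(a, gamma, n=20000, seed=5):
    """Monte Carlo estimate of E[(a.x)^2] for x uniform over the unit ball
    in R^d, conditioned on |x_1| <= gamma (the band around u = e_1).
    Lemma C.2 bounds this by r^2/(d-1) + gamma^2 when ||e_1 - a|| <= r
    and ||a|| <= 1."""
    rng = random.Random(seed)
    d = len(a)
    total, kept = 0.0, 0
    while kept < n:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in g))
        radius = rng.random() ** (1.0 / d)     # uniform radius for the ball
        x = [radius * t / norm for t in g]
        if abs(x[0]) <= gamma:                 # rejection-sample the band
            kept += 1
            total += sum(ai * xi for ai, xi in zip(a, x)) ** 2
    return total / n
```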
Next, we use VC tools to show the following bound for clean examples.
Lemma C.3. If we draw ℓ times i.i.d. from D to form C, then with probability 1 − δ, we have that for any unit length a,
(1/ℓ) Σ_{x∈C} (a · x)² ≤ E[(a · x)²] + √(O(d log(ℓ/δ)(d + log(1/δ))) / ℓ).
Proof: See Appendix H.
The above two lemmas imply that n = poly(d, 1/η, 1/δ, 1/γ) suffices for it to be the case that, for all a ∈ B(u, r),
(1/|S|) Σ_x q*(x)(a · x)² ≤ 2E[(a · x)²] ≤ 2(r²/(d−1) + γ²),
so that q∗ is feasible.
So what is left is to prove that the convex program has a separation oracle. First, it is easy to check whether, for all x ∈ S, 0 ≤ q(x) ≤ 1, and whether Σ_{x∈S} q(x) ≥ (1 − ξ)|S|; an algorithm can first do that. If these checks pass, then it needs to check whether there is a w ∈ B(u, r) with ‖w‖_2 ≤ 1 such that
(1/|S|) Σ_{x∈S} q(x)(w · x)² > c(r²/(d−1) + γ²).
This can be done by finding a w ∈ B(u, r) with ‖w‖_2 ≤ 1 that maximizes Σ_{x∈S} q(x)(w · x)², and checking it.
Suppose X is a matrix with a row for each x ∈ S, where the row is √(q(x)) x. Then Σ_{x∈S} q(x)(w · x)² = wᵀXᵀXw, and maximizing this over w is equivalent to minimizing wᵀ(−XᵀX)w subject to ‖w − u‖_2 ≤ r and ‖w‖ ≤ 1. Since −XᵀX is symmetric, problems of this form are known to be solvable in polynomial time [SZ03] (see [BM13]).
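If the constraint ‖w − u‖ ≤ r is dropped, the maximization reduces to a top-eigenvector computation. The sketch below is our own simplified stand-in for the trust-region solver of [SZ03, BM13]: it maximizes the weighted variance over all unit w by power iteration on XᵀX.

```python
import math
import random

def max_weighted_variance(S, q, iters=200, seed=6):
    """Maximize sum_x q(x) (w.x)^2 over ALL unit-length w (the B(u, r)
    constraint of the actual oracle is dropped) by power iteration on
    X^T X, where X has rows sqrt(q(x)) x.  Returns (w, value)."""
    d = len(S[0])
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        # apply the weighted covariance operator: v = sum_i q_i (w.x_i) x_i
        v = [0.0] * d
        for x, qi in zip(S, q):
            coef = qi * sum(wj * xj for wj, xj in zip(w, x))
            for j in range(d):
                v[j] += coef * x[j]
        nv = math.sqrt(sum(t * t for t in v)) or 1.0
        w = [t / nv for t in v]
    val = sum(qi * sum(wj * xj for wj, xj in zip(w, x)) ** 2
              for x, qi in zip(S, q))
    return w, val
```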
C.3 The error within a band in each iteration
At each iteration, the algorithm of Figure 1 concentrates its attention on examples in the band. Our next
theorem (same as Theorem 3.2 in the main body) analyzes its error on these examples.
Theorem C.4. After round k of the algorithm in Figure 1, with probability 1 − δ/(k + k²), we have
err_{D_{w_{k−1},b_{k−1}}}(w_k) ≤ κ.
We will prove Theorem C.4 using the series of lemmas below. First, we bound the hinge loss of the target w* within the band S_{w_{k−1},b_{k−1}}. Since we are analyzing a particular round k, to reduce clutter in the formulas, for the rest of this section, let us refer to
• ℓ_{τ_k} simply as ℓ,
• L_{τ_k}(·, D_{w_{k−1},b_{k−1}}) as L(·).
Lemma C.5. L(w*) ≤ κ/12.
Proof. Notice that y(w* · x) is never negative, so, on any clean example (x, y), we have
ℓ(w*, x, y) = max{0, 1 − y(w* · x)/τ_k} ≤ 1,
and, furthermore, w* pays a non-zero hinge loss only inside the region where |w* · x| < τ_k. Hence,
L(w*) ≤ Pr_{D_{w_{k−1},b_{k−1}}}(|w* · x| ≤ τ_k) = Pr_{x∼D}(|w* · x| ≤ τ_k and |w_{k−1} · x| ≤ b_{k−1}) / Pr_{x∼D}(|w_{k−1} · x| ≤ b_{k−1}).
Let c′1 and c′2 be the constants from Equation (1). We can lower bound the denominator: Pr_{x∼D}(|w_{k−1} · x| < b_{k−1}) ≥ 2c′1 b_{k−1} √d. Also, the numerator is at most Pr_{x∼D}(|w* · x| ≤ τ_k) ≤ 2c′2 τ_k √d. Hence, we have
L(w*) ≤ (2c′2 τ_k) / (2c′1 b_{k−1}) = κ/12 (by setting c4 = c′1 and c1 = c′2/2).
During round k we can decompose the working set W into the set of “clean” examples W_C, which are drawn from D_{w_{k−1},b_{k−1}}, and the set of “dirty” or malicious examples W_D, which are output by the adversary. We next show that the fraction of dirty examples in round k is not too large.
Lemma C.6. With probability 1 − δ/(6(k + k²)),
|W_D| ≤ 8c1c4 η n_k 2^k. (8)
Proof. From Equation 1 and the setting of our parameters, the probability that an example falls in S_{w_{k−1},b_{k−1}} is at least 2^{−k}/(2c1c4). Therefore, with probability 1 − δ/(12(k + k²)), the number of examples we must draw before we encounter n_k examples that fall within S_{w_{k−1},b_{k−1}} is at most 4c1c4 n_k 2^k. The probability that each unlabeled example we draw is noisy is at most η. Applying a Chernoff bound, with probability at least 1 − δ/(12(k + k²)),
|W_D| ≤ 8c1c4 η n_k 2^k,
completing the proof.
Next, we bound the loss on an example in terms of the norm of x.
Lemma C.7. For any w ∈ B(w_{k−1}, r_k) and all x,
ℓ(w, x, y) ≤ (4c2π/c4) √d.
Proof. A simple calculation shows
ℓ(w, x, y) ≤ 1 + |w · x|/τ_k ≤ 1 + (|w_{k−1} · x| + ‖w − w_{k−1}‖_2 ‖x‖_2)/τ_k ≤ 1 + (b_{k−1} + r_k)/τ_k ≤ (4c2π/c4) √d.
Recall that the total variation distance between two probability distributions is the maximum difference between the probabilities that they assign to any event. We can think of q as a soft indicator function for “keeping” examples, and so interpret the inequality Σ_{x∈W} q(x) ≥ (1 − ξ)|W| as roughly saying that most examples are kept. This means that the distribution p obtained by normalizing q is close to the uniform distribution over W. We make this precise in the following lemma.
Lemma C.8. The total variation distance between p and the uniform distribution over W is at most ξ.
Proof. Lemma 1 of [LS06] implies that the total variation distance ρ between p and the uniform distribution over W satisfies
ρ = 1 − Σ_{x∈W} min{q(x), 1/|W|}.
Since q(x) ≤ 1 for all x, we have Σ_{x∈W} q(x) ≤ |W|, so that
ρ ≤ 1 − (1/|W|) Σ_{x∈W} min{q(x), 1}.
Again, since q(x) ≤ 1, we have
ρ ≤ 1 − (1 − ξ)|W| / |W| = ξ.
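Lemma C.8 can be verified directly on any weighting q. The helper below is our own illustration; it computes the total variation distance as half the L1 distance between the normalized weighting p and the uniform distribution over W.

```python
def tv_to_uniform(q):
    """Total variation distance between p = q / sum(q) and the uniform
    distribution over W, i.e. half the L1 distance.  When q <= 1 and
    sum(q) >= (1 - xi)|W|, Lemma C.8 says this is at most xi."""
    n = len(q)
    z = sum(q)
    return sum(abs(qi / z - 1.0 / n) for qi in q) / 2.0
```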
Next, we relate the average hinge loss when examples are weighted according to p, i.e., ℓ(w, p), to the hinge loss averaged over the clean examples W_C, i.e., ℓ(w, W_C). This relationship is tighter than what a uniform bound on the variance would give, since, within the band, projecting the data onto directions close to w_{k−1} leads to much smaller variance. Specifically, we prove the following lemma (the same as Lemma 3.5 in the main body, but with precise constants). Here ℓ(w, W_C) and ℓ(w, p) are defined with respect to the true unrevealed labels that the adversary has committed to.
Lemma C.9. Define z_k = √(r_k²/(d−1) + b_{k−1}²). For large enough d, with probability 1 − δ/(2(k + k²)), for any w ∈ B(w_{k−1}, r_k), we have
ℓ(w, W_C) ≤ ℓ(w, p) + (32c1c4 η/ǫ)(1 + z_k/τ_k) + κ/32 (9)
and
ℓ(w, p) ≤ 2ℓ(w, W_C) + κ/32 + 8c1c4 η/ǫ + √(32c1c4 η/ǫ) (z_k/τ_k). (10)
Proof. As in the analysis of the outlier removal procedure, we treat each element (x, y) ∈ W as distinct. Fix an arbitrary w ∈ B(w_{k−1}, r_k). By the guarantee of Theorem C.1, Lemma C.6, and Lemmas C.2 and C.3, we know that, with probability 1 − δ/(2(k + k²)),
(1/|W|) Σ_{x∈W} q(x)(w · x)² ≤ 4z_k², (11)
together with
|W_D| ≤ 8c1c4 η n_k 2^k (12)
and
(1/|W_C|) Σ_{(x,y)∈W_C} (w · x)² ≤ 2z_k². (13)
Assume that (11), (12) and (13) all hold.
Since Σ_{x∈W} q(x) ≥ (1 − ξ_k)|W| ≥ |W|/2, (11) implies
Σ_{x∈W} p(x)(w · x)² ≤ 8z_k². (14)
First, let us bound the weighted loss on the noisy examples in the training set. In particular, we will show that
Σ_{(x,y)∈W_D} p(x) ℓ(w, x, y) ≤ C_0 η 2^k + ξ_k + √(2c′C_0 η 2^k) (z_k/τ_k). (15)
To see this, notice that
\[
\begin{aligned}
\sum_{(x,y) \in W_D} p(x) \ell(w, x, y) &= \sum_{(x,y) \in W_D} p(x) \max\left\{ 0, 1 - \frac{y (w \cdot x)}{\tau_k} \right\} \\
&\leq \Pr_p(W_D) + \frac{1}{\tau_k} \sum_{(x,y) \in W_D} p(x) |w \cdot x| = \Pr_p(W_D) + \frac{1}{\tau_k} \sum_{(x,y) \in W} p(x) \mathbf{1}_{W_D}(x, y) |w \cdot x| \\
&\leq \Pr_p(W_D) + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W} p(x) \mathbf{1}_{W_D}(x, y)} \sqrt{\sum_{(x,y) \in W} p(x) (w \cdot x)^2} \quad \text{(by the Cauchy-Schwarz inequality)} \\
&\leq \Pr_p(W_D) + \sqrt{8 \Pr_p(W_D)} \, \frac{z_k}{\tau_k} \leq 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k},
\end{aligned}
\]
where the second-to-last inequality follows by (14) and the last one follows by Lemma C.8 and (12).
Similarly, we will show that
\[
\sum_{(x,y) \in W} p(x) \ell(w, x, y) \leq 1 + \frac{4 z_k}{\tau_k}. \tag{16}
\]
To see this, notice that
\[
\sum_{(x,y) \in W} p(x) \ell(w, x, y) = \sum_{(x,y) \in W} p(x) \max\left\{ 0, 1 - \frac{y (w \cdot x)}{\tau_k} \right\} \leq 1 + \frac{1}{\tau_k} \sum_{x \in W} p(x) |w \cdot x| \leq 1 + \frac{1}{\tau_k} \sqrt{\sum_{x \in W} p(x) (w \cdot x)^2} \leq 1 + \frac{4 z_k}{\tau_k},
\]
where the last step follows by (14). Next, we have
\[
\begin{aligned}
\ell(w, W_C) &= \frac{1}{|W_C|} \sum_{(x,y) \in W} \Big[ q(x) \ell(w, x, y) + \big( \mathbf{1}_{W_C}(x, y) - q(x) \big) \ell(w, x, y) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \sum_{(x,y) \in W_C} (1 - q(x)) \ell(w, x, y) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \sum_{(x,y) \in W_C} (1 - q(x)) \Big( 1 + \frac{|w \cdot x|}{\tau_k} \Big) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sum_{(x,y) \in W_C} (1 - q(x)) |w \cdot x| \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W_C} (1 - q(x))^2} \sqrt{\sum_{(x,y) \in W_C} (w \cdot x)^2} \Big]
\end{aligned}
\]
by the Cauchy-Schwarz inequality. Recall that 0 ≤ q(x) ≤ 1 and $\sum_{x \in W} q(x) \geq (1 - \xi_k)|W|$. Thus,
\[
\begin{aligned}
\ell(w, W_C) &\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sqrt{\xi_k |W|} \sqrt{\sum_{x \in W_C} (w \cdot x)^2} \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| \Big] + \frac{\sqrt{\xi_k |W| \, |W_C| \, 2 z_k^2}}{|W_C| \tau_k}
\end{aligned}
\]
by (13). Since |WC | ≥ |W |/2, we have
\[
\ell(w, W_C) \leq \frac{1}{|W_C|} \sum_{(x,y) \in W} q(x) \ell(w, x, y) + 2 \xi_k + \frac{\sqrt{4 \xi_k z_k^2}}{\tau_k}.
\]
We have chosen ξk small enough that
\[
\begin{aligned}
\ell(w, W_C) &\leq \frac{1}{|W_C|} \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \kappa/32 \\
&= \frac{\sum_{(x,y) \in W} q(x)}{|W_C|} \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&= \ell(w, p) + \left( \frac{\sum_{(x,y) \in W} q(x)}{|W_C|} - 1 \right) \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&\leq \ell(w, p) + \left( \frac{|W|}{|W_C|} - 1 \right) \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&\leq \ell(w, p) + \left( \frac{|W|}{|W_C|} - 1 \right) \left( 1 + \frac{4 z_k}{\tau_k} \right) + \kappa/32.
\end{aligned}
\]
Applying (12) yields (9).
Also,
\[
\begin{aligned}
\ell(w, p) &= \sum_{(x,y) \in W} p(x) \ell(w, x, y) = \sum_{(x,y) \in W_C} p(x) \ell(w, x, y) + \sum_{(x,y) \in W_D} p(x) \ell(w, x, y) \\
&\leq \sum_{(x,y) \in W_C} p(x) \ell(w, x, y) + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k} \quad \text{(by (15))} \\
&\leq \frac{\sum_{(x,y) \in W_C} q(x) \ell(w, x, y)}{\sum_{(x,y) \in W_C} q(x)} + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k} \\
&\leq \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{\sum_{(x,y) \in W_C} q(x)} + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k} \quad \text{(since } q(x) \leq 1 \text{ for all } x \text{)} \\
&\leq \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{|W_C| - \xi_k |W|} + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k} \\
&\leq 2 \ell(w, W_C) + 8 c_1 c_4 \eta 2^k + \xi_k + \sqrt{64 c_1 c_4 \eta 2^k} \, \frac{z_k}{\tau_k},
\end{aligned}
\]
by (8), which in turn implies (10).
Finally, we need some bounds about estimates of the hinge loss.

Lemma C.10. With probability $1 - \frac{\delta}{2(k+k^2)}$, for all w ∈ B(wk−1, rk),
\[
|L(w) - \ell(w, W_C)| \leq \kappa/32 \tag{17}
\]
and
\[
|\ell(w, p) - \ell(w, T)| \leq \kappa/32. \tag{18}
\]
Proof. See Appendix H.
Proof of Theorem C.4. By Lemma C.10, with probability $1 - \frac{\delta}{2(k+k^2)}$, for all w ∈ B(wk−1, rk), (17) and (18) hold. Also with probability $1 - \frac{\delta}{2(k+k^2)}$, both (9) and (10) hold. Let us assume from here on that all of these hold.
Then we have
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &= \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(v_k) \\
&\leq L(v_k) \quad \text{(since for each error, the hinge loss is at least 1)} \\
&\leq \ell(v_k, W_C) + \kappa/32 \quad \text{(by (17))} \\
&\leq \ell(v_k, p) + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/16 \quad \text{(by (9))} \\
&\leq \ell(v_k, T) + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/8 \quad \text{(by (18))} \\
&\leq \ell(w^*, T) + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/4 \quad \text{(since } w^* \in B(w_{k-1}, r_k) \text{)} \\
&\leq \ell(w^*, p) + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/3 \quad \text{(by (18))}.
\end{aligned}
\]
This, together with (10) and (17), gives
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &\leq 2 \ell(w^*, W_C) + \frac{8 c_1 c_4 \eta}{\epsilon} + \sqrt{32 c_1 c_4 \eta / \epsilon} \, \frac{z_k}{\tau_k} + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + 2\kappa/5 \\
&\leq 2 L(w^*) + \frac{8 c_1 c_4 \eta}{\epsilon} + \sqrt{32 c_1 c_4 \eta / \epsilon} \, \frac{z_k}{\tau_k} + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/2 \\
&\leq \kappa/3 + \frac{8 c_1 c_4 \eta}{\epsilon} + \sqrt{32 c_1 c_4 \eta / \epsilon} \, \frac{z_k}{\tau_k} + \frac{32 c_1 c_4 \eta}{\epsilon} \left( 1 + \frac{z_k}{\tau_k} \right) + \kappa/2,
\end{aligned}
\]
by Lemma C.5.
Now notice that zk/τk is Θ(1). Hence an Ω(ǫ) bound on η suffices to imply that $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$ with probability $1 - \frac{\delta}{k+k^2}$.
D Proof of Theorem 4.2
Throughout this section, assume that the clean training examples are obtained by labeling data drawn according to a distribution D over Rd chosen from a λ-admissible sequence. The main algorithm and the outlier removal procedure remain the same with the following parameters.
D.1 Parameters for the algorithm
The parameters of the algorithm are set as follows. Let $M = \max\left\{\frac{2}{c_6 \pi}, 2\right\}$, where c6 is from Definition 4.1. Let c′1 be the value of c5 in part 2 of Definition 4.1 corresponding to the case where c4 is $\frac{c_6}{4M}$; then let $b_k = c'_1 M^{-k}$.
Let c′6 and c′7 be c2 and c3 respectively, from part 1 of Definition 4.1. Let $r_k = \min\{M^{-(k-1)}/c_6, \pi/2\}$, where c6 is from Definition 4.1, and $\kappa = \frac{1}{4 c'_1 c'_7 M}$. Finally, let $\tau_k = \frac{c_2 \min\{b_{k-1}, c_1\} \kappa}{6 c_3}$, where c1, c2 and c3 are the values from Definition 4.1. Let $z_k^2 = r_k^2 + b_{k-1}^2$ and $\xi_k = c \min\left\{\kappa, \frac{\kappa^2 \tau_k^2}{z_k^2}\right\}$. The value of $\sigma_k^2$ for the outlier removal procedure is $\ln^\lambda\left(1 + \frac{1}{b_{k-1}}\right)(r_k^2 + b_{k-1}^2)$.
D.2 Analysis of the outlier removal subroutine
The analysis of the learning algorithm uses the following lemma about the outlier removal subroutine of
Figure 2.
Theorem D.1. For any C > 0, there is a constant c and a polynomial p such that, for all ξ > 2η and all 0 < γ < C , if n ≥ p(1/η, d, 1/ξ, 1/δ, 1/γ), then, with probability 1 − δ, the output q of the algorithm in Figure 2 satisfies the following:
• $\sum_{x \in S} q(x) \geq (1 - \xi)|S|$ (a fraction 1 − ξ of the weight is retained)
• For all unit length w such that ‖w − u‖2 ≤ r,
\[
\frac{1}{|S|} \sum_{x \in S} q(x) (w \cdot x)^2 \leq c \ln^\lambda\left( 1 + \frac{1}{\gamma} \right) (r^2 + \gamma^2). \tag{19}
\]
Furthermore, the algorithm can be implemented in polynomial time.
Almost identically to the previous section, our proof of Theorem D.1 proceeds through a series of lemmas. Again, we would like to point out that in the analysis below we will treat each element xi ∈ S as distinct (even if xi = xj for some j ≠ i). Obviously, a feasible q satisfies the requirements of the lemma. So all we need to show is
• there is a feasible solution q, and
• we can simulate a separation oracle: given a provisional solution q, we can find a linear constraint
violated by q in polynomial time.
We will start by working on proving that there is a feasible q. First of all, a Chernoff bound implies that
n ≥ poly(1/η, 1/δ) suffices for it to be the case that, with probability 1 − δ, at most 2η members of S are
noisy. Let us assume from now on that this is the case.
We will show that q∗ which sets q∗(x) = 0 for each noisy point, and q∗(x) = 1 for each non-noisy
point, is feasible.
First, we use VC tools to show that, if enough examples are chosen, a bound like part 4 of Definition 4.1,
but averaged over the clean examples, likely holds for all relevant directions.
Lemma D.2. If we draw ℓ times i.i.d. from D to form C , with probability 1 − δ, we have that for any unit length a,
\[
\frac{1}{\ell} \sum_{x \in C} (a \cdot x)^2 \leq \mathbf{E}[(a \cdot x)^2] + \sqrt{\frac{O(d \log(\ell/\delta)(d + \log(1/\delta)))}{\ell}}.
\]
Proof: See Appendix H.
Lemma D.2 and part 4 of Definition 4.1 together directly imply that
\[
n = \mathrm{poly}\left( d, 1/\eta, 1/\delta, \frac{1}{c (r^2 + \gamma^2) \ln^\lambda(1 + 1/\gamma)} \right) = \mathrm{poly}(d, 1/\eta, 1/\delta, 1/\gamma)
\]
suffices for it to be the case that, for all w ∈ B(u, r),
\[
\frac{1}{|S|} \sum_{(x,y)} q^*(x) (w \cdot x)^2 \leq 2 \mathbf{E}[(w \cdot x)^2] \leq 2 c_8 (r^2 + \gamma^2) \ln^\lambda(1 + 1/\gamma),
\]
so that, if c = 2c8, we have that q∗ is feasible.
So what is left is to prove that the convex program has a separation oracle. First, it is easy to check whether, for all x, 0 ≤ q(x) ≤ 1, and whether $\sum_{x \in S} q(x) \geq (1 - \xi)|S|$. An algorithm can first do that. If these checks pass, then it needs to check whether there is a w ∈ B(u, r) with ||w||2 ≤ 1 such that
\[
\frac{1}{|S|} \sum_{x \in S} q(x) (w \cdot x)^2 > c \ln^\lambda\left( 1 + \frac{1}{\gamma} \right) (r^2 + \gamma^2).
\]
This can be done by finding w ∈ B(u, r) with ||w||2 ≤ 1 that maximizes $\sum_{x \in S} q(x)(w \cdot x)^2$, and checking it.
Suppose X is a matrix with a row for each x ∈ S, where the row is $\sqrt{q(x)}\, x$. Then $\sum_{x \in S} q(x)(w \cdot x)^2 = w^T X^T X w$, and maximizing this over w is an equivalent problem to minimizing $w^T (-X^T X) w$ subject to ‖w − u‖2 ≤ r and ||w|| ≤ 1. Since −XTX is symmetric, problems of this form are known to be solvable in polynomial time [SZ03] (see [BM13]).
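For the special case where only the ||w|| ≤ 1 constraint is active, the maximization reduces to a top-eigenvector computation; below is a minimal numpy sketch of that case (illustrative only — the full oracle must also handle the ‖w − u‖2 ≤ r ball, via the trust-region techniques cited above, and all variable names here are ours):

```python
import numpy as np

def max_weighted_second_moment(X, q):
    """Maximize (1/|S|) * sum_i q_i (w . x_i)^2 over unit vectors w only
    (the ||w - u|| <= r constraint of the full oracle is omitted here).
    With A having rows sqrt(q_i) x_i, the objective is the Rayleigh
    quotient of A^T A, so the maximizer is its top eigenvector."""
    A = X * np.sqrt(q)[:, None]
    M = A.T @ A                      # equals sum_i q_i x_i x_i^T
    vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return vecs[:, -1], vals[-1] / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
q = rng.uniform(0.0, 1.0, size=200)
w_top, best = max_weighted_second_moment(X, q)
```

Comparing `best` against the right-hand side of the displayed inequality is then exactly the separation check for this restricted case.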
D.3 The error within a band in each iteration
At each iteration, the algorithm of Figure 1 concentrates its attention on examples in the band. Our next theorem analyzes its error on these examples.

Theorem D.3. After round k of the algorithm in Figure 1, with probability $1 - \frac{\delta}{k+k^2}$, we have $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$.
We will prove Theorem D.3 using a series of lemmas below. First, we bound the hinge loss of the target w∗ within the band $S_{w_{k-1}, b_{k-1}}$. Since we are analyzing a particular round k, to reduce clutter in the formulas, for the rest of this section, let us refer to
• ℓτk simply as ℓ,
• Lτk(·,Dwk−1,bk−1) as L(·).
Lemma D.4. L(w∗) ≤ κ/6.
Proof. Notice that y(w∗ · x) is never negative, so, on any clean example (x, y), we have
\[
\ell(w^*, x, y) = \max\left\{ 0, 1 - \frac{y (w^* \cdot x)}{\tau_k} \right\} \leq 1,
\]
and, furthermore, w∗ pays a non-zero hinge loss only inside the region where |w∗ · x| < τk. Hence,
\[
L(w^*) \leq \Pr_{D_{w_{k-1}, b_{k-1}}}(|w^* \cdot x| \leq \tau_k) = \frac{\Pr_{x \sim D}(|w^* \cdot x| \leq \tau_k \text{ and } |w_{k-1} \cdot x| \leq b_{k-1})}{\Pr_{x \sim D}(|w_{k-1} \cdot x| \leq b_{k-1})}.
\]
Using part 1 of Definition 4.1, for the values of c1 and c2 in that definition, we can lower bound the denominator:
\[
\Pr_{x \sim D}(|w_{k-1} \cdot x| < b_{k-1}) \geq c_2 \min\{b_{k-1}, c_1\}.
\]
Part 1 of Definition 4.1 also implies that the numerator is at most
\[
\Pr_{x \sim D}(|w^* \cdot x| \leq \tau_k) \leq c_3 \tau_k.
\]
Hence, we have
\[
L(w^*) \leq \frac{c_3 \tau_k}{c_2 \min\{b_{k-1}, c_1\}} = \kappa/6.
\]
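The τ-rescaled hinge loss used throughout these proofs is simple to state in code. A small sketch (the data and variable names are our own construction) that also checks the two facts used above — on correctly labeled points the per-example loss is at most 1, and it vanishes once |w∗ · x| ≥ τ:

```python
import numpy as np

def hinge_loss(w, X, y, tau):
    """Per-example tau-rescaled hinge loss: max(0, 1 - y (w . x) / tau)."""
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau)

rng = np.random.default_rng(1)
w_star = np.array([1.0, 0.0, 0.0])
X = rng.normal(size=(500, 3))
y = np.sign(X @ w_star)          # clean labels, so y (w* . x) = |w* . x| >= 0
tau = 0.3
losses = hinge_loss(w_star, X, y, tau)
```

On clean examples the loss of w∗ is thus supported only on the margin region |w∗ · x| < τ, which is what the probability ratio above bounds.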
During round k we can decompose the working set W into the set of “clean” examples WC which are drawn from Dwk−1,bk−1 and the set of “dirty” or malicious examples WD which are output by the adversary. We will next show that the fraction of dirty examples in round k is not too large.

Lemma D.5. There is an absolute positive constant C0 such that, with probability $1 - \frac{\delta}{6(k+k^2)}$,
\[
|W_D| \leq C_0 \eta n_k M^k. \tag{20}
\]
Proof. From Equation 1 and the setting of our parameters, the probability that an example falls in $S_{w_{k-1}, b_{k-1}}$ is at least Ω(M−k). Therefore, with probability $1 - \frac{\delta}{12(k+k^2)}$, the number of examples we must draw before we encounter nk examples that fall within $S_{w_{k-1}, b_{k-1}}$ is at most O(nkMk). The probability that each unlabeled example we draw is noisy is at most η. Applying a Chernoff bound, with probability at least $1 - \frac{\delta}{12(k+k^2)}$,
\[
|W_D| \leq C_0 \eta n_k M^k,
\]
completing the proof.
Next, we bound the loss on an example in terms of the norm of x.

Lemma D.6. There is a constant c such that, for any w ∈ B(wk−1, rk), and all x in the band $S_{w_{k-1}, b_{k-1}}$,
\[
\ell(w, x, y) \leq c (1 + ||x||_2).
\]
Proof.
\[
\ell(w, x, y) \leq 1 + \frac{|w \cdot x|}{\tau_k} \leq 1 + \frac{|w_{k-1} \cdot x| + \|w - w_{k-1}\|_2 ||x||_2}{\tau_k} \leq 1 + \frac{b_{k-1} + r_k ||x||_2}{\tau_k},
\]
and substituting the parameter settings of Section D.1 for bk−1, rk and τk gives the claimed bound.
If the support of D is bounded, Lemma D.6 gives a useful worst-case bound on the loss. Next, we give a high-probability bound that holds for all λ-admissible distributions.

Lemma D.7. For an absolute constant c, with probability $1 - \frac{\delta}{6(k+k^2)}$,
\[
\max_{x \in W_C} ||x||_2 \leq c \sqrt{d} \ln\left( \frac{|W_C| k}{\delta} \right).
\]
Proof. Applying part 5 of Definition 4.1, together with a union bound, we have
\[
\Pr(\exists x \in W_C, ||x|| > \alpha) \leq c_9 |W_C| \exp(-\alpha/\sqrt{d}),
\]
and $\alpha = \sqrt{d} \ln\left( \frac{12 c_9 |W_C| k^2}{\delta} \right)$ makes the RHS at most $\frac{\delta}{6(k+k^2)}$.
Recall that the total variation distance between two probability distributions is the maximum difference between the probabilities that they assign to any event.
We can think of q as a soft indicator function for “keeping” examples, and so interpret the inequality $\sum_{x \in W} q(x) \geq (1 - \xi)|W|$ as roughly akin to saying that most examples are kept. This means that the distribution p obtained by normalizing q is close to the uniform distribution over W . We make this precise in the following lemma.

Lemma D.8. The total variation distance between p and the uniform distribution over W is at most ξ.
Proof. Lemma 1 of [LS06] implies that the total variation distance ρ between p and the uniform distribution over W satisfies
\[
\rho = 1 - \sum_{x \in W} \min\left\{ p(x), \frac{1}{|W|} \right\}.
\]
Since $\sum_{x \in W} q(x) \leq |W|$, we have $p(x) \geq q(x)/|W|$ for all x, so that
\[
\rho \leq 1 - \frac{1}{|W|} \sum_{x \in W} \min\{ q(x), 1 \}.
\]
Again, since q(x) ≤ 1, we have
\[
\rho \leq 1 - \frac{(1 - \xi)|W|}{|W|} = \xi.
\]
Next, we will relate the average hinge loss when examples are weighted according to p, i.e., ℓ(w, p), to the hinge loss averaged over clean examples WC , i.e., ℓ(w,WC ). Here ℓ(w,WC ) and ℓ(w, p) are defined with respect to the true unrevealed labels that the adversary has committed to.

Lemma D.9. There are absolute constants c1, c2 and c3 such that, for large enough d, with probability $1 - \frac{\delta}{2(k+k^2)}$, if we define $z_k = \sqrt{r_k^2 + b_{k-1}^2}$, then for any w ∈ B(wk−1, rk), we have
\[
\ell(w, W_C) \leq \ell(w, p) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/32 \tag{21}
\]
and
\[
\ell(w, p) \leq 2 \ell(w, W_C) + \kappa/32 + \frac{c_2 \eta}{\epsilon} + c_3 \sqrt{\eta/\epsilon} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k}. \tag{22}
\]
Proof. As in the analysis of the outlier removal procedure, we will treat each element (x, y) ∈ W as distinct. Fix an arbitrary w ∈ B(wk−1, rk). By the guarantee of Theorem D.1, Lemma D.5, part 5 of Definition 4.1, part 4 of Definition 4.1, and Lemma D.2, we know that, with probability $1 - \frac{\delta}{2(k+k^2)}$,
\[
\frac{1}{|W|} \sum_{x \in W} q(x) (w \cdot x)^2 \leq c' \ln^\lambda(1 + 1/b_k) z_k^2, \tag{23}
\]
together with
\[
|W_D| \leq C_0 \eta n_k M^k \tag{24}
\]
(for an absolute constant C0) and
\[
\frac{1}{|W_C|} \sum_{(x,y) \in W_C} (w \cdot x)^2 \leq c'' \ln^\lambda(1 + 1/b_k) z_k^2, \tag{25}
\]
for an absolute constant c′′. Assume that (23), (24) and (25) all hold.
Since $\sum_{x \in W} q(x) \geq (1 - \xi_k)|W| \geq |W|/2$, we have that (23) implies
\[
\sum_{x \in W} p(x) (w \cdot x)^2 \leq 2 c' \ln^\lambda(1 + 1/b_k) z_k^2. \tag{26}
\]
First, let us bound the weighted loss on noisy examples in the training set. In particular, we will show that
\[
\sum_{(x,y) \in W_D} p(x) \ell(w, x, y) \leq C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k}. \tag{27}
\]
To see this, notice that
\[
\begin{aligned}
\sum_{(x,y) \in W_D} p(x) \ell(w, x, y) &= \sum_{(x,y) \in W_D} p(x) \max\left\{ 0, 1 - \frac{y (w \cdot x)}{\tau_k} \right\} \\
&\leq \Pr_p(W_D) + \frac{1}{\tau_k} \sum_{(x,y) \in W_D} p(x) |w \cdot x| = \Pr_p(W_D) + \frac{1}{\tau_k} \sum_{(x,y) \in W} p(x) \mathbf{1}_{W_D}(x, y) |w \cdot x| \\
&\leq \Pr_p(W_D) + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W} p(x) \mathbf{1}_{W_D}(x, y)} \sqrt{\sum_{(x,y) \in W} p(x) (w \cdot x)^2} \quad \text{(by the Cauchy-Schwarz inequality)} \\
&\leq \Pr_p(W_D) + \sqrt{2 c' \Pr_p(W_D)} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \leq C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k},
\end{aligned}
\]
where the second-to-last inequality follows by (26) and the last one by Lemma D.8 and (24).
Similarly, we will show that
\[
\sum_{(x,y) \in W} p(x) \ell(w, x, y) \leq 1 + \frac{\sqrt{2 c'} \ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k}. \tag{28}
\]
To see this, notice that
\[
\sum_{(x,y) \in W} p(x) \ell(w, x, y) = \sum_{(x,y) \in W} p(x) \max\left\{ 0, 1 - \frac{y (w \cdot x)}{\tau_k} \right\} \leq 1 + \frac{1}{\tau_k} \sum_{(x,y) \in W} p(x) |w \cdot x| \leq 1 + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W} p(x) (w \cdot x)^2} \leq 1 + \frac{\sqrt{2 c'} \ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k},
\]
by (26).
Next, we have
\[
\begin{aligned}
\ell(w, W_C) &= \frac{1}{|W_C|} \sum_{(x,y) \in W} \Big[ q(x) \ell(w, x, y) + \big( \mathbf{1}_{W_C}(x, y) - q(x) \big) \ell(w, x, y) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \sum_{(x,y) \in W_C} (1 - q(x)) \ell(w, x, y) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \sum_{(x,y) \in W_C} (1 - q(x)) \Big( 1 + \frac{|w \cdot x|}{\tau_k} \Big) \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sum_{(x,y) \in W_C} (1 - q(x)) |w \cdot x| \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sqrt{\sum_{(x,y) \in W_C} (1 - q(x))^2} \sqrt{\sum_{(x,y) \in W_C} (w \cdot x)^2} \Big]
\end{aligned}
\]
by the Cauchy-Schwarz inequality. Recall that 0 ≤ q(x) ≤ 1 and $\sum_{(x,y) \in W} q(x) \geq (1 - \xi_k)|W|$. Thus,
\[
\begin{aligned}
\ell(w, W_C) &\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| + \frac{1}{\tau_k} \sqrt{\xi_k |W|} \sqrt{\sum_{(x,y) \in W_C} (w \cdot x)^2} \Big] \\
&\leq \frac{1}{|W_C|} \Big[ \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \xi_k |W| \Big] + \frac{\sqrt{\xi_k |W| \, |W_C| \, c'' \ln^\lambda(1 + 1/b_k) z_k^2}}{|W_C| \tau_k}
\end{aligned}
\]
by (25). Since |WC | ≥ |W |/2, we have
\[
\ell(w, W_C) \leq \frac{1}{|W_C|} \sum_{(x,y) \in W} q(x) \ell(w, x, y) + 2 \xi_k + \frac{\sqrt{2 \xi_k c'' \ln^\lambda(1 + 1/b_k) z_k^2}}{\tau_k}.
\]
We have chosen ξk small enough that
\[
\begin{aligned}
\ell(w, W_C) &\leq \frac{1}{|W_C|} \sum_{(x,y) \in W} q(x) \ell(w, x, y) + \kappa/32 \\
&= \frac{\sum_{(x,y) \in W} q(x)}{|W_C|} \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&= \ell(w, p) + \left( \frac{\sum_{(x,y) \in W} q(x)}{|W_C|} - 1 \right) \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&\leq \ell(w, p) + \left( \frac{|W|}{|W_C|} - 1 \right) \sum_{(x,y) \in W} p(x) \ell(w, x, y) + \kappa/32 \\
&\leq \ell(w, p) + \left( \frac{|W|}{|W_C|} - 1 \right) \left( 1 + \frac{\sqrt{2 c'} \ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/32.
\end{aligned}
\]
Applying (24) yields (21).
Also,
\[
\begin{aligned}
\ell(w, p) &= \sum_{(x,y) \in W} p(x) \ell(w, x, y) = \sum_{(x,y) \in W_C} p(x) \ell(w, x, y) + \sum_{(x,y) \in W_D} p(x) \ell(w, x, y) \\
&\leq \sum_{(x,y) \in W_C} p(x) \ell(w, x, y) + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \quad \text{(by (27))} \\
&\leq \frac{\sum_{(x,y) \in W_C} q(x) \ell(w, x, y)}{\sum_{(x,y) \in W_C} q(x)} + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \\
&\leq \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{\sum_{(x,y) \in W_C} q(x)} + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \quad \text{(since } q(x) \leq 1 \text{ for all } x \text{)} \\
&\leq \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{|W_C| - \xi_k |W|} + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \\
&\leq 2 \ell(w, W_C) + C_0 \eta M^k + \xi_k + \sqrt{2 c' C_0 \eta M^k} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k},
\end{aligned}
\]
by (24), which in turn implies (22).
Finally, we need some bounds about estimates of the hinge loss.

Lemma D.10. With probability $1 - \frac{\delta}{2(k+k^2)}$, for all w ∈ B(wk−1, rk),
\[
|L(w) - \ell(w, W_C)| \leq \kappa/32 \tag{29}
\]
and
\[
|\ell(w, p) - \ell(w, T)| \leq \kappa/32. \tag{30}
\]
Proof. See Appendix H.
Proof of Theorem D.3. By Lemma D.10, with probability $1 - \frac{\delta}{2(k+k^2)}$, for all w ∈ B(wk−1, rk), (29) and (30) hold. Also with probability $1 - \frac{\delta}{2(k+k^2)}$, both (21) and (22) hold. Let us assume from here on that all of these hold.
Then we have
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &= \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(v_k) \\
&\leq L(v_k) \quad \text{(since for each error, the hinge loss is at least 1)} \\
&\leq \ell(v_k, W_C) + \kappa/16 \quad \text{(by (29))} \\
&\leq \ell(v_k, p) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/8 \quad \text{(by (21))} \\
&\leq \ell(v_k, T) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/4 \quad \text{(by (30))} \\
&\leq \ell(w^*, T) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/4 \quad \text{(since } w^* \in B(w_{k-1}, r_k) \text{)} \\
&\leq \ell(w^*, p) + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/3 \quad \text{(by (30))}.
\end{aligned}
\]
This, together with (22) and (29), gives
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &\leq 2 \ell(w^*, W_C) + \frac{c_2 \eta}{\epsilon} + c_3 \sqrt{\eta/\epsilon} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + 2\kappa/5 \\
&\leq 2 L(w^*) + \frac{c_2 \eta}{\epsilon} + c_3 \sqrt{\eta/\epsilon} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/2 \\
&\leq \kappa/3 + \frac{c_2 \eta}{\epsilon} + c_3 \sqrt{\eta/\epsilon} \, \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} + \frac{c_1 \eta}{\epsilon} \left( 1 + \frac{\ln^{\lambda/2}(1 + 1/b_k) \, z_k}{\tau_k} \right) + \kappa/2,
\end{aligned}
\]
by Lemma D.4.
Now notice that zk/τk is Θ(1). Hence an $\Omega\!\left(\frac{\epsilon}{\log^\lambda(1/\epsilon)}\right)$ bound on η suffices to imply that $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$ with probability $1 - \frac{\delta}{k+k^2}$.
D.4 Putting it together
Now we are ready to put everything together. The proof of Theorem 4.2 follows the high level structure of
the proof of [BBZ07]; the new element is the application of Theorem D.3 which analyzes the performance
of the hinge loss minimization algorithm for learning inside the band, which in turn applies Theorem D.1,
which analyzes the benefits of our new localized outlier removal procedure.
Proof (of Theorem 4.2): We will prove by induction on k that after k ≤ s iterations, we have errD(wk) ≤ M−k with probability 1 − δ(1 − 1/(k + 1))/2.
When k = 0, all that is required is errD(w0) ≤ 1.
Assume now the claim is true for k − 1 (k ≥ 1). Then by the induction hypothesis, we know that with probability at least 1 − δ(1 − 1/k)/2, wk−1 has error at most M−(k−1). Using part 3 of Definition 4.1, this implies that θ(wk−1, w∗) ≤ M−(k−1)/c6. This in turn implies θ(wk−1, w∗) ≤ π/2. (When k = 1, this is by assumption, and otherwise it is implied by part 3 of Definition 4.1.)
Let us define $S_{w_{k-1}, b_{k-1}} = \{ x : |w_{k-1} \cdot x| \leq b_{k-1} \}$ and $\bar{S}_{w_{k-1}, b_{k-1}} = \{ x : |w_{k-1} \cdot x| > b_{k-1} \}$. Since wk−1 has unit length, and vk ∈ B(wk−1, rk), we have θ(wk−1, vk) ≤ rk, which in turn implies θ(wk−1, wk) ≤ min{M−(k−1)/c6, π/2}.
Applying part 2 of Definition 4.1 to bound the error rate outside the band, we have both
\[
\Pr_x\left[ (w_{k-1} \cdot x)(w_k \cdot x) < 0, \, x \in \bar{S}_{w_{k-1}, b_{k-1}} \right] \leq \frac{M^{-k}}{4}
\]
and
\[
\Pr_x\left[ (w_{k-1} \cdot x)(w^* \cdot x) < 0, \, x \in \bar{S}_{w_{k-1}, b_{k-1}} \right] \leq \frac{M^{-k}}{4}.
\]
Taking the sum, we obtain $\Pr_x\left[ (w_k \cdot x)(w^* \cdot x) < 0, \, x \in \bar{S}_{w_{k-1}, b_{k-1}} \right] \leq \frac{M^{-k}}{2}$. Therefore, we have
\[
\mathrm{err}(w_k) \leq \left( \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \right) \Pr(S_{w_{k-1}, b_{k-1}}) + \frac{M^{-k}}{2}.
\]
Since $\Pr(S_{w_{k-1}, b_{k-1}}) \leq 2 c'_7 b_{k-1}$, this implies
\[
\mathrm{err}(w_k) \leq \left( \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \right) 2 c'_7 b_{k-1} + \frac{M^{-k}}{2} \leq M^{-k} \left( \left( \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \right) 2 c'_1 c'_7 M + 1/2 \right).
\]
Recall that Dwk−1,bk−1 is the distribution obtained by conditioning D on the event that x ∈ Swk−1,bk−1.
Applying Theorem D.3, with probability $1 - \frac{\delta}{2(k+k^2)}$, wk has error at most $\kappa = \frac{1}{4 c'_1 c'_7 M}$ within Swk−1,bk−1, implying that err(wk) ≤ M−k, completing the proof of the induction, and therefore showing that, with probability at least 1 − δ, O(log(1/ǫ)) iterations suffice to achieve err(wk) ≤ ǫ.
A polynomial number of unlabeled samples are required by the algorithm, and the number of labeled examples required by the algorithm is $\sum_k m_k = O(d(d + \log \log(1/\epsilon) + \log(1/\delta)) \log(1/\epsilon))$.
E Proof of Theorem E.1
In this section, we describe an algorithm for learning a λ-admissible distribution in the presence of adversarial label noise. As before, we assume that the algorithm has access to w0 such that θ(w0, w∗) < π/2. This can be shown to be without loss of generality exactly as in the case of malicious noise.

Theorem E.1. Let D be a distribution over Rd chosen from a λ-admissible sequence of distributions. Let w∗ be the (unit length) target weight vector. There are absolute positive constants c′1, ..., c′4 and M > 1 and a polynomial p such that an $\Omega\!\left(\frac{\epsilon}{\log^\lambda(1/\epsilon)}\right)$ upper bound on the rate η of adversarial label noise suffices to imply that for any ǫ, δ > 0, using the algorithm from Figure 3 with cut-off values $b_k = c'_1 M^{-k}$, radii $r_k = c'_2 M^{-k}$, $\kappa = c'_3$, $\tau_k = c'_4 M^{-k}$ for k ≥ 1, a number $n_k = p(d, M^k, \log(1/\delta))$ of unlabeled examples
Figure 3 COMPUTATIONALLY EFFICIENT ALGORITHM TOLERATING ADVERSARIAL LABEL NOISE
Input: allowed error rate ǫ, probability of failure δ, an oracle that returns x, for (x, y) sampled from EXη(f,D), and an oracle for getting the label from an example; a sequence of sample sizes mk > 0; a sequence of cut-off values bk > 0; a sequence of hypothesis space radii rk > 0; a precision value κ > 0
1. Draw m1 examples and put them into a working set W .
2. For k = 1, . . . , s = ⌈log2(1/ǫ)⌉
(a) Find vk ∈ B(wk−1, rk) to approximately minimize training hinge loss over W s.t. ‖vk‖2 ≤ 1: $\ell_{\tau_k}(v_k, W) \leq \min_{w \in B(w_{k-1}, r_k) \cap B(0,1)} \ell_{\tau_k}(w, W) + \kappa/8$.
(b) Normalize vk to have unit length, yielding $w_k = \frac{v_k}{\|v_k\|_2}$.
(c) Clear the working set W .
(d) Until mk+1 additional data points are put in W , given x for (x, f(x)) obtained from EXη(f,D), if |wk · x| ≥ bk, then reject x, else put it into W .
Output: Weight vector ws of error at most ǫ with probability 1 − δ.
in round k and a number $m_k = O\left( d \log\left( \frac{d}{\epsilon \delta} \right) (d + \log(k/\delta)) \right)$ of labeled examples in round k ≥ 1, and w0 such that θ(w0, w∗) < π/2, after s = ⌈log2(1/ǫ)⌉ iterations, we find a separator ws satisfying $\mathrm{err}(w_s) = \Pr_{(x,y) \sim D}[\mathrm{sign}(w_s \cdot x) \neq \mathrm{sign}(w^* \cdot x)] \leq \epsilon$ with probability at least 1 − δ.
If the support of D is bounded in a ball of radius R(d), then $m_k = O\left( R(d)^2 (d + \log(k/\delta)) \right)$ label requests suffice.
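To make step 2(a) of Figure 3 concrete, here is a small self-contained sketch of the approximate hinge-loss minimization over B(wk−1, rk) ∩ B(0, 1). It is a heuristic projected-subgradient stand-in with invented parameter names; the theorem only requires that some polynomial-time approximate minimizer be used, and sequential projection onto the two balls is itself only a heuristic feasibility step:

```python
import numpy as np

def project(v, center, radius):
    """Euclidean projection of v onto the ball B(center, radius)."""
    d = v - center
    n = np.linalg.norm(d)
    return v if n <= radius else center + (radius / n) * d

def min_hinge_in_ball(X, y, center, radius, tau, steps=300, lr=0.05):
    """Approximately minimize the tau-rescaled hinge loss over
    B(center, radius) intersected with B(0, 1), returning the best
    iterate seen (so the result is never worse than the start)."""
    def loss(w):
        return np.maximum(0.0, 1.0 - y * (X @ w) / tau).mean()
    w = project(center, np.zeros_like(center), 1.0)  # feasible start if ||center|| <= 1
    best_w, best_l = w.copy(), loss(w)
    for _ in range(steps):
        active = (1.0 - y * (X @ w) / tau) > 0       # examples with positive hinge loss
        grad = -((active * y)[:, None] * X).mean(axis=0) / tau
        w = project(w - lr * grad, center, radius)
        w = project(w, np.zeros_like(center), 1.0)
        if loss(w) < best_l:
            best_w, best_l = w.copy(), loss(w)
    return best_w, best_l

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
w_true = np.array([1.0, 0.0, 0.0, 0.0])
y = np.sign(X @ w_true)
center = 0.9 * w_true                    # stand-in for w_{k-1}
v_k, loss_k = min_hinge_in_ball(X, y, center, radius=0.5, tau=0.5)
```

The outer loop of Figure 3 then normalizes `v_k` and uses the result to filter the next working set by the band test |wk · x| < bk.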
To prove Theorem E.1, all we need is Theorem E.2 below, which bounds the error inside the band in the case of adversarial label noise. Substituting this lemma for Theorem D.3 in the proof of Theorem 4.2 suffices to prove Theorem E.1. (In particular, for the rest of this subsection, rk, bk, κ and τk are set as in the proof of Theorem 4.2.)

Theorem E.2. During round k of the algorithm in Figure 3, with probability $1 - \frac{\delta}{k+k^2}$, we have $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$.

We will prove Theorem E.2 using a series of lemmas.
Define ℓ and L as in the proof of Theorem D.3.
First, Lemma D.4, that L(w∗) ≤ κ/6, also applies here, using exactly the same proof.
From here, the proof is organized a little differently than before. There are two main structural differences. First, previously we analyzed a relatively large set of unlabeled examples on which the algorithm performed soft outlier removal, before subsampling and training. Here, since the algorithm will not perform outlier removal, we may analyze the underlying distribution in place of the large unlabeled sample. The second difference is that, whereas before we separately analyzed the clean examples and the dirty examples, here we will analyze properties of the noisy portion of the underlying distribution, and, instead of comparing it with the clean portion, we will compare it with the distribution that would be obtained by fixing the incorrect labels. One reason that this is more convenient is that the marginal over the instances of this “fixed” distribution is D (whereas the marginal of the clean examples, in general, is not).
Let P be the joint distribution used by the algorithm, which includes the noisy labels chosen by the adversary. Let N = {(x, y) : sign(w∗ · x) ≠ y} consist of the noisy examples, so that P(N) ≤ η. Let P̄ be the joint distribution obtained by applying the correct labels. Let Pk be the distribution on the examples given
to the algorithm in round k (obtained by conditioning P to examples that fall within the band), and let P̄k be the corresponding joint distribution with clean labels.
The key lemma here relates the expected loss with respect to Pk to the expected loss with respect to P̄k.

Lemma E.3. There is an absolute positive constant c such that, if we define $z_k = \sqrt{r_k^2 + b_{k-1}^2}$, then for any w ∈ B(wk−1, rk), we have
\[
\left| \mathbf{E}_{(x,y) \sim P_k}(\ell(w, x, y)) - \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w, x, y)) \right| \leq c \sqrt{M^k \eta} \, \frac{z_k \log^{\lambda/2}(1 + 1/\gamma)}{\tau_k}. \tag{31}
\]
Proof. Fix an arbitrary w ∈ B(wk−1, rk). Recalling that N is the set of noisy examples, and that the marginals of Pk and P̄k on the instances are the same, we have
\[
\begin{aligned}
\left| \mathbf{E}_{(x,y) \sim P_k}(\ell(w, x, y)) - \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w, x, y)) \right| &= \left| \mathbf{E}_{(x,y) \sim P_k}\big( \ell(w, x, y) - \ell(w, x, \mathrm{sign}(w^* \cdot x)) \big) \right| \\
&= \left| \mathbf{E}_{(x,y) \sim P_k}\big( \mathbf{1}_{(x,y) \in N} (\ell(w, x, y) - \ell(w, x, -y)) \big) \right| \\
&\leq \mathbf{E}_{(x,y) \sim P_k}\big( \mathbf{1}_{(x,y) \in N} |\ell(w, x, y) - \ell(w, x, -y)| \big) \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim P_k}\left( \mathbf{1}_{(x,y) \in N} \frac{|w \cdot x|}{\tau_k} \right) = \frac{2}{\tau_k} \mathbf{E}_{(x,y) \sim P_k}\big( \mathbf{1}_{(x,y) \in N} |w \cdot x| \big) \\
&\leq \frac{2}{\tau_k} \sqrt{\Pr_{(x,y) \sim P_k}(N)} \times \sqrt{\mathbf{E}_{(x,y) \sim P_k}((w \cdot x)^2)}
\end{aligned}
\]
by the Cauchy-Schwarz inequality. Part 1 of Definition 4.1 implies that
\[
\Pr_{(x,y) \sim P_k}(N) \leq \frac{\Pr_{(x,y) \sim P}(N)}{\Pr_{(x,y) \sim P}(S_{w_{k-1}, b_{k-1}})} \leq c M^k \eta,
\]
for an absolute constant c, and part 4 of Definition 4.1 implies $\mathbf{E}_{(x,y) \sim P_k}((w \cdot x)^2) \leq c z_k^2 \log^\lambda(1 + 1/\gamma)$.
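The empirical analogue of this Cauchy-Schwarz argument holds sample-by-sample and is easy to sanity-check. A small sketch (our own construction: Gaussian inputs stand in for the band distribution, and a random subset of flipped labels plays the role of N):

```python
import numpy as np

def tau_hinge(w, X, y, tau):
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau)

rng = np.random.default_rng(2)
n, d, tau = 1000, 5, 0.4
X = rng.normal(size=(n, d))
w_star = np.eye(d)[0]
y_clean = np.sign(X @ w_star)
noisy = rng.random(n) < 0.05            # indicator of the "noisy" set N
y = np.where(noisy, -y_clean, y_clean)

w = w_star + 0.1 * rng.normal(size=d)   # any comparison direction
lhs = abs(tau_hinge(w, X, y, tau).mean() - tau_hinge(w, X, y_clean, tau).mean())
rhs = (2 / tau) * np.sqrt(noisy.mean()) * np.sqrt(((X @ w) ** 2).mean())
```

Here `lhs` is the loss gap between the noisy and corrected labelings and `rhs` is the Cauchy-Schwarz bound; the inequality lhs ≤ rhs holds for any realization, since the hinge loss changes by at most 2|w · x|/τ when a label flips.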
Proof of Theorem E.2. Let
\[
\mathrm{cleaned}(W) = \{ (x, \mathrm{sign}(w^* \cdot x)) : (x, y) \in W \}.
\]
Exploiting the fact that $\ell(w, x, y) = O\left( \sqrt{d \log\left( \frac{d}{\epsilon \delta} \right)} \right)$ for all (x, y) ∈ Swk−1,bk−1 and w ∈ B(wk−1, rk), as in the proof of Lemma D.10, with probability $1 - \frac{\delta}{k+k^2}$, for all w ∈ B(wk−1, rk), we have
\[
|\mathbf{E}_{(x,y) \sim P_k}(\ell(w, x, y)) - \ell(w, W)| \leq \kappa/32, \quad \text{and} \quad |\mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w, x, y)) - \ell(w, \mathrm{cleaned}(W))| \leq \kappa/32. \tag{32}
\]
Then we have, for absolute constants c1 and c2, the following:
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) &\leq \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w_k, x, y)) \quad \text{(since for each error, the hinge loss is at least 1)} \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(v_k, x, y)) \quad \text{(since } \|v_k\|_2 \geq 1/2 \text{)} \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim P_k}(\ell(v_k, x, y)) + c_1 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} \quad \text{(by Lemma E.3)} \\
&\leq 2 \ell(v_k, W) + c_1 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/16 \quad \text{(by (32))} \\
&\leq 2 \ell(w^*, W) + c_1 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/8 \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim P_k}(\ell(w^*, x, y)) + c_1 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/4 \quad \text{(by (32))} \\
&\leq 2 \, \mathbf{E}_{(x,y) \sim \bar{P}_k}(\ell(w^*, x, y)) + c_2 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/4 \quad \text{(by Lemma E.3)} \\
&\leq c_2 \sqrt{\frac{\eta}{\epsilon}} \times \frac{z_k \log^{\lambda/2}(1 + 1/b_k)}{\tau_k} + \kappa/2,
\end{aligned}
\]
since L(w∗) ≤ κ/6. Since zk/τk = Θ(1), there is a constant c3 such that η ≤ c3ǫ/ logλ(1 + 1/bk) suffices for $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$, completing the proof.
F Admissibility
F.1 Uniform distribution is 0-admissible
We will show the properties in Definition 4.1 hold for the uniform distribution with λ = 0. Part 1 is an easy
consequence of the corresponding known lemmas about the uniform distribution on the unit ball.
Lemma F.1 (see [Bau90, BBZ07, KKMS05]). For any C > 0, there are c1, c2 > 0 such that, for x drawn from the uniform distribution over $\sqrt{d}\, S^{d-1}$ and any unit length u ∈ Rd, (a) for all a, b ∈ [−C,C] for which a ≤ b, we have c1|b − a| ≤ Pr(u · x ∈ [a, b]) ≤ c2|b − a|, and (b) if b ≥ 0, we have $\Pr(u \cdot x > b) \leq \frac{1}{2} e^{-b^2/2}$.
To prove part 2, we will use a lemma from [BL13] that generalizes and strengthens a key lemma from [BBZ07].

Lemma F.2 (Theorem 4 of [BL13]). For any c1 > 0, there is a c2 > 0 such that the following holds. Let u and v be two unit vectors in Rd, and assume that θ(u, v) = α < π/2. If D is isotropic log-concave in Rd, then Prx∼D[sign(u · x) ≠ sign(v · x) and |v · x| ≥ c2α] ≤ c1α.

This has the following corollary, which proves part 2.

Lemma F.3. For any c1 > 0, there is a c2 > 0 such that the following holds for all d ≥ 4. Let u and v be two unit vectors in Rd, and assume that θ(u, v) = α < π/2. If D is uniform over Sd−1, then $\Pr_{x \sim D}[\mathrm{sign}(u \cdot x) \neq \mathrm{sign}(v \cdot x) \text{ and } |v \cdot x| \geq c_2 \alpha / \sqrt{d}] \leq c_1 \alpha$.
Proof. Consider the distribution D′ obtained by sampling from D and scaling the result up by a factor of √d.
We claim that the projection D′′ of D′ onto the space spanned by u and v is isotropic log-concave. This will imply Lemma F.3 by applying Lemma F.2, since the event in question only concerns u · x and v · x.
Assume without loss of generality that the span of u and v is
\[
T = \{ (x_1, x_2, 0, 0, ..., 0) : x_1, x_2 \in \mathbb{R} \}.
\]
The fact that D′′ is isotropic follows from the fact that D′ is isotropic, and the fact that it is log-concave follows from the known fact that, if (x1, ..., xd) is sampled uniformly from Sd−1, then the distribution of (x1, x2) is log-concave (see Corollary 4 of [BGMN05]).
Part 3 of Definition 4.1 holds trivially in the case of the uniform distribution.
The fact that part 4 of Definition 4.1 holds in the case of the isotropic rescaling of the uniform distribution
U over the surface of the unit ball follows immediately from Lemma C.2.
Part 5 follows from the fact that D is isotropic log-concave (see Lemma F.4 below).
F.2 Isotropic log-concave is 2-admissible
Part 1 of Definition 4.1 is part of the following lemma.

Lemma F.4 ([LV07]). Assume that D is isotropic log-concave in Rd and let f be its density function.
(a) We have $\Pr_{x \sim D}[\,||x||_2 \geq \alpha \sqrt{d}\,] \leq e^{-\alpha + 1}$. If d = 1 then $\Pr_{x \sim D}[x \in [a, b]] \leq |b - a|$.
(b) All marginals of D are isotropic log-concave.
(c) If d = 1 we have f(0) ≥ 1/8 and f(x) ≤ 1 for all x.
(d) There is an absolute constant c such that, if d = 1, f(x) > c for all x ∈ [−1/9, 1/9].
Part 2 is Lemma F.2.
Part 3 is implicit in [Vem10] (see Lemma 3 of [BL13]).
In order to prove part 4, we will use the following lemma.
Lemma F.5. For any C > 0, there exists a constant c such that, for any isotropic log-concave distribution D, for any a such that ‖a‖2 ≤ 1 and ||u − a||2 ≤ r, for any 0 < γ < C , and for any K ≥ 4, we have
\[
\Pr_{x \sim D_{u,\gamma}}\left( |a \cdot x| > K \sqrt{r^2 + \gamma^2} \right) \leq \frac{c}{\gamma} e^{-K}.
\]
Proof. W.l.o.g. we may assume that u = (1, 0, 0, · · · , 0). Let a′ = (a2, ..., ad), and, for a random x = (x1, x2, ..., xd) drawn from Du,γ , let x′ = (x2, ..., xd). Let $p = \Pr_{x \sim D_{u,\gamma}}\left( |a \cdot x| > K \sqrt{r^2 + \gamma^2} \right)$ be the probability that we want to bound. We may rewrite p as
\[
p = \frac{\Pr_{x \sim D}\left( |a \cdot x| > K \sqrt{r^2 + \gamma^2} \text{ and } |x_1| \leq \gamma \right)}{\Pr_{x \sim D}(|x_1| \leq \gamma)}. \tag{33}
\]
Lemma F.4 implies that there is a positive constant c1 such that the denominator satisfies the following lower bound:
\[
\Pr_{x \sim D}(|x_1| \leq \gamma) \geq c_1 \min\{\gamma, 1/9\} \geq \frac{c_1 \gamma}{9C}. \tag{34}
\]
So now, we just need an upper bound on the numerator. We have
\[
\begin{aligned}
\Pr_{x \sim D}\left( |a \cdot x| > K \sqrt{r^2 + \gamma^2} \text{ and } |x_1| \leq \gamma \right) &\leq \Pr_{x \sim D}\left( |a' \cdot x'| > K \sqrt{r^2 + \gamma^2} - \gamma \right) \\
&\leq \Pr_{x \sim D}\left( |a' \cdot x'| > (K-1) \sqrt{r^2 + \gamma^2} \right) \\
&\leq \Pr_{x \sim D}\left( |a' \cdot x'| > (K-1) r \right) \\
&\leq \Pr_{x \sim D}\left( \left| \left( \frac{a'}{||a'||_2} \right) \cdot x' \right| > K - 1 \right) \leq e^{-(K-1)},
\end{aligned}
\]
by Lemma F.4, since the marginal distribution over x′ is isotropic log-concave. Combining with (33) and (34) completes the proof.
Now we’re ready to prove Part 4.

Lemma F.6. For any C , there is a constant c such that, for all 0 < γ ≤ C , for all a such that ‖u − a‖2 ≤ r and ‖a‖2 ≤ 1,
\[
\mathbf{E}_{x \sim D_{u,\gamma}}((a \cdot x)^2) \leq c (r^2 + \gamma^2) \ln^2(1 + 1/\gamma).
\]
Proof: Let $z = \sqrt{r^2 + \gamma^2}$. Setting, with foresight, $t = 9 z^2 \ln^2(1 + 1/\gamma)$, we have
\[
\mathbf{E}_{x \sim D_{u,\gamma}}((a \cdot x)^2) = \int_0^\infty \Pr_{x \sim D_{u,\gamma}}((a \cdot x)^2 \geq \alpha) \, d\alpha \leq t + \int_t^\infty \Pr_{x \sim D_{u,\gamma}}((a \cdot x)^2 \geq \alpha) \, d\alpha. \tag{35}
\]
Since $t \geq 4\sqrt{r^2 + \gamma^2}$, Lemma F.5 implies that, for an absolute constant c, we have
\[
\mathbf{E}_{x \sim D_{u,\gamma}}((a \cdot x)^2) \leq t + \frac{c}{\gamma} \int_t^\infty \exp\left( -\left( \frac{\alpha}{r^2 + \gamma^2} \right)^{1/2} \right) d\alpha.
\]
Now, we want to evaluate the integral. Since $z = \sqrt{r^2 + \gamma^2}$, we have
\[
\int_t^\infty \exp\left( -\sqrt{\frac{\alpha}{r^2 + \gamma^2}} \right) d\alpha = \int_t^\infty \exp\left( -\sqrt{\alpha}/z \right) d\alpha.
\]
Using the change of variables u2 = α, we get
\[
\int_t^\infty \exp\left( -\sqrt{\alpha}/z \right) d\alpha = 2 \int_{\sqrt{t}}^\infty u \exp(-u/z) \, du = 2 z (\sqrt{t} + z) \exp\left( -\sqrt{t}/z \right).
\]
Putting it together, we get
\[
\mathbf{E}_{x \sim D_{u,\gamma}}((a \cdot x)^2) \leq t + \frac{2 c z (\sqrt{t} + z) \exp\left( -\sqrt{t}/z \right)}{\gamma} \leq t + z^2,
\]
since $t = 9 z^2 \ln^2(1 + 1/\gamma)$, completing the proof.
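The change-of-variables step above can be sanity-checked numerically; a minimal sketch (the specific values of z and γ are arbitrary choices of ours, and the infinite upper limit is truncated where the integrand is negligible):

```python
import numpy as np

# Check the closed form:  integral_t^inf exp(-sqrt(alpha)/z) d(alpha)
#                         = 2 z (sqrt(t) + z) exp(-sqrt(t)/z)
gamma = 0.3
z = 0.7
t = 9 * z**2 * np.log(1 + 1 / gamma)**2

alpha = np.linspace(t, t + 200 * z**2, 2_000_000)        # truncated tail is negligible
f = np.exp(-np.sqrt(alpha) / z)
numeric = np.sum((f[1:] + f[:-1]) / 2 * np.diff(alpha))  # trapezoid rule
closed = 2 * z * (np.sqrt(t) + z) * np.exp(-np.sqrt(t) / z)
```

With these values the numerical and closed-form results agree to well within a tenth of a percent.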
Finally Part 5 is also part of Lemma F.4.
G Relating Adversarial Label Noise and the Agnostic Setting
In this section we study the agnostic setting of [KSS94, KKMS05] and describe how our results imply
constant factor approximations in that model. In the agnostic model, data (x, y) is generated from a dis-
tribution D over ℜd × 1,−1. For a given concept class C , let OPT be the error of the best classifier
in C . In other words, OPT = argminf∈CerrD(f) = argminf∈CPr(x,y)∼D[f(x) 6= y]. The goal of
the learning algorithm is to output a hypothesis h which is nearly as good as f , i.e., given ǫ > 0, we want
errD(h) ≤ c · OPT + ǫ, where c is the approximation factor. Any result in the adversarial model that we
study, translates into a result for the agnostic setting via the following lemma.
Lemma G.1. For a given concept class C and distribution D, if there exists an algorithm in the adversarial
noise model which runs in time poly(d, 1/ǫ) and tolerates a noise rate of η = Ω(ǫ), then there exists an
algorithm for (C,D) in the agnostic setting which runs in time poly(d, 1/ǫ) and achieves errorO(OPT+ǫ).
Proof. Let f∗ be the optimal halfspace with error OPT . In the adversarial setting, w.r.t. f∗, the noise
rate η will be exactly OPT . Set ǫ′ = c(OPT + ǫ) as input to the algorithm for the adversarial model.
By the guarantee of the algorithm we will get a hypothesis h such that Pr(x,y)∼D[h(x) 6= f∗(x)] ≤ ǫ′ =c(OPT+ǫ). Hence by triangle inequality, we have errD(h) ≤ errD(f∗)+c(OPT+ǫ) = O(OPT+ǫ).
For the case when C is the class of origin-centered halfspaces in ℜd and the marginal of D is the uniform distribution over Sd−1, the above lemma along with Theorem 1.2 implies that we can output a halfspace of accuracy O(OPT + ǫ) in time poly(d, 1/ǫ). The work of [KKMS05] achieves a guarantee of O(OPT + ǫ) in time exponential in 1/ǫ by doing L2 regression to learn a low-degree polynomial2.
H Proof of VC lemmas
In this section, we apply some standard VC tools to establish some lemmas about estimates of expectations.
Definition H.1. Say that a set F of real-valued functions with a common domain X shatters x1, . . . , xd ∈ X if there are thresholds t1, . . . , td such that
$$\left\{\left(\operatorname{sign}(f(x_1) - t_1), \ldots, \operatorname{sign}(f(x_d) - t_d)\right) : f \in F\right\} = \{-1, 1\}^d.$$
The pseudo-dimension of F is the size of the largest set shattered by F.
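As a toy illustration of Definition H.1 (not from the paper): the class of affine functions f(x) = wx + b on the reals shatters the two points {0, 1} with thresholds t1 = t2 = 0, so its pseudo-dimension is at least 2. The grid of (w, b) values below is an arbitrary choice that happens to realize all four sign patterns.

```python
# Toy check of Definition H.1: affine functions on R shatter {0, 1}.
from itertools import product

def sign(v):
    # Ties broken toward +1, a common convention.
    return 1 if v >= 0 else -1

points = [0.0, 1.0]
thresholds = [0.0, 0.0]

# Sweep a small grid of (w, b) parameters and record realized sign patterns.
patterns = set()
for w, b in product([-2.0, 0.0, 2.0], [-1.0, 1.0]):
    f = lambda x, w=w, b=b: w * x + b
    patterns.add(tuple(sign(f(x) - t) for x, t in zip(points, thresholds)))

# All four patterns in {-1, 1}^2 are realized: the set is shattered.
print(patterns == set(product([-1, 1], repeat=2)))  # True
```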
We will use the following bound.
Lemma H.2 (see [AB99]). Let F be a set of functions from a common domain X to [a, b], let d be the pseudo-dimension of F, and let D be a probability distribution over X. Then, for
$$m = O\left(\frac{(b-a)^2}{\alpha^2}\left(d + \log(1/\delta)\right)\right),$$
if x1, . . . , xm are drawn independently at random according to D, then with probability 1 − δ, for all f ∈ F,
$$\left| \mathbf{E}_{x\sim D}(f(x)) - \frac{1}{m}\sum_{t=1}^{m} f(x_t) \right| \le \alpha.$$
2They further show that L1 regression can achieve a stronger guarantee of OPT + ǫ.
H.1 Proofs of Lemma C.10 and Lemma D.10
The pseudo-dimension of the set of linear combinations of d variables is known to be d [Pol11]. Since, for any non-increasing function ψ : R → R and any F, the pseudo-dimension of {ψ ◦ f : f ∈ F} is at most that of F (see [Pol11]), the pseudo-dimension of {ℓ(w, ·) : w ∈ ℜd} is at most d.
Let D′ be the distribution obtained by conditioning D on the event that ||x|| < R (||x|| < 1 for the uniform distribution). For ℓ ≤ nk, the total variation distance between the joint distribution of ℓ draws from D′ and ℓ draws from D is at most $\frac{\delta}{4(k + k^2)}$, so it suffices to prove (29) and (30) with respect to D′ ((17) and (18), respectively, for the uniform distribution). Applying Lemma D.6 and Lemma H.2 then completes the proof.
H.2 Proof of Lemma D.2
Define fa by fa(x) = (a · x)2. The pseudo-dimension of the set of all such functions is O(d) [KLS09]. As in the proof of Lemma D.10, w.l.o.g. all x have ||x||2 ≤ R, and applying Lemma H.2 completes the proof.
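A small Monte Carlo illustration (not from the paper) of the uniform convergence that Lemma H.2 provides for the class fa(x) = (a · x)2: for x uniform on the unit sphere in ℜd and a unit vector a, E[(a · x)2] = 1/d by symmetry, and empirical means over a common sample stay close to this for several directions a simultaneously. The directions and sample size below are arbitrary choices for the demonstration.

```python
# Monte Carlo sanity check: empirical means of f_a(x) = (a . x)^2 over a
# common sample are uniformly close to the true expectation 1/d.
import math
import random

random.seed(0)
d, m = 3, 20000

def unit_sphere_point():
    """Draw x uniformly from S^{d-1} by normalizing a Gaussian vector."""
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

sample = [unit_sphere_point() for _ in range(m)]

# A few arbitrary unit-vector directions a; for each, E[(a . x)^2] = 1/d.
directions = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1 / math.sqrt(3)] * 3]

max_dev = max(
    abs(sum(sum(ai * xi for ai, xi in zip(a, x)) ** 2 for x in sample) / m - 1.0 / d)
    for a in directions
)
print(max_dev < 0.05)  # True: all empirical means are within 0.05 of 1/d
```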