Classification with imperfect training labels
Richard J. Samworth
University of Cambridge
39th Conference on Applied Statistics in Ireland (CASI 2019), Dundalk, Ireland
15 May 2019
Collaborators
Tim Cannings and Yingying Fan
Supervised classification
Classification and label noise
With perfect labels in the binary response setting, we observe
$(X_1, Y_1), \ldots, (X_n, Y_n) \stackrel{\mathrm{iid}}{\sim} P$, taking values in $\mathbb{R}^d \times \{0, 1\}$.

Task: predict the class $Y$ of a new observation $X$, where $(X, Y) \sim P$ independently of the training data.

In many modern applications, however, it may be too expensive, difficult or time-consuming to determine class labels perfectly:

Uncorrupted: $(X_1, 1), (X_2, 1), (X_3, 0), (X_4, 0), \ldots, (X_n, 0)$
Corrupted: $(X_1, 1), (X_2, 0), (X_3, 0), (X_4, 0), \ldots, (X_n, 1)$
Existing work
The topic has been well studied in the machine learning/computer science literature (Frénay and Kabán, 2014; Frénay and Verleysen, 2014).

- Lachenbruch (1966): LDA with zero intercept is consistent with ρ-homogeneous noise, where each observation is mislabelled independently with probability ρ ∈ (0, 1/2).
- Okamoto and Nobuhiro (1997) consider the k-nearest neighbour classifier with n = 32 and small k: '...the predictive accuracy of 1-NN is strongly affected by ... class noise'.
- Ghosh et al. (2015): 'Many standard algorithms such as SVM perform poorly in the presence of label noise'.

Other work seeks to identify mislabelled observations and flip or remove them.
Motivating example
[Two scatter plots of the simulated training data (left and right panels).]

Priors $\pi_0 = 0.9$, $\pi_1 = 0.1$. Class conditionals $X \mid Y = 0 \sim N_2\bigl((-1, 0)^\top, I_2\bigr)$, $X \mid Y = 1 \sim N_2\bigl((1, 0)^\top, I_2\bigr)$, $n = 1000$.
Left: no noise; right: ρ-homogeneous noise with ρ = 0.3.
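For readers who want to reproduce the picture, here is a minimal simulation sketch of this data-generating mechanism (my own Python code, not part of the talk; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
n, rho, pi1 = 1000, 0.3, 0.1          # sample size, noise level and P(Y = 1) from the slide
mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])

# True labels and class conditionals X | Y = r ~ N2(mu_r, I2)
Y = rng.binomial(1, pi1, size=n)
X = np.where(Y[:, None] == 1, mu1, mu0) + rng.standard_normal((n, 2))

# rho-homogeneous noise: each label is flipped independently with probability rho
flip = rng.binomial(1, rho, size=n)
Y_tilde = np.where(flip == 1, 1 - Y, Y)
```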
Risks in motivating example
[Plot of misclassification error against log(n).]

Misclassification error for predicting the true label of the test point, for the knn (black), SVM (red) and LDA (blue) classifiers. Solid lines: no label noise; dashed lines: 0.3-homogeneous label noise.
Statistical setting

Let $(X, Y, \tilde{Y}), (X_1, Y_1, \tilde{Y}_1), \ldots, (X_n, Y_n, \tilde{Y}_n)$ be i.i.d. triples taking values in $\mathcal{X} \times \{0, 1\} \times \{0, 1\}$.

We observe $(X_1, \tilde{Y}_1), \ldots, (X_n, \tilde{Y}_n)$ and $X$. The task is to predict $Y$.

- For $x \in \mathcal{X}$, define the regression function
  $$\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$$
  and its corrupted version
  $$\tilde{\eta}(x) := \mathbb{P}(\tilde{Y} = 1 \mid X = x).$$
- For $x \in \mathcal{X}$ and $r \in \{0, 1\}$, the conditional noise probabilities are
  $$\rho_r(x) := \mathbb{P}(\tilde{Y} \neq Y \mid X = x, Y = r).$$

We also write $P_X$ for the marginal distribution of $X$.
Classifiers
A classifier $C$ is a (measurable) function from $\mathcal{X}$ to $\{0, 1\}$.

The risk $R(C) := \mathbb{P}\{C(X) \neq Y\}$ is minimised by the Bayes classifier
$$C^{\mathrm{Bayes}}(x) := \begin{cases} 1 & \text{if } \eta(x) \geq 1/2 \\ 0 & \text{otherwise.} \end{cases}$$

A classifier $C_n$, depending on the training data, is said to be consistent if $R(C_n) \to R(C^{\mathrm{Bayes}})$ as $n \to \infty$.

The corrupted risk $\tilde{R}(C) := \mathbb{P}\{C(X) \neq \tilde{Y}\}$ is minimised by the corrupted Bayes classifier
$$\tilde{C}^{\mathrm{Bayes}}(x) := \begin{cases} 1 & \text{if } \tilde{\eta}(x) \geq 1/2 \\ 0 & \text{otherwise.} \end{cases}$$
General finite-sample result
Let $S := \{x \in \mathcal{X} : \eta(x) = 1/2\}$, let $B := \{x \in S^c : \rho_0(x) + \rho_1(x) < 1\}$ and let
$$A := \biggl\{x \in B : \frac{\rho_1(x) - \rho_0(x)}{\{2\eta(x) - 1\}\{1 - \rho_0(x) - \rho_1(x)\}} < 1\biggr\}.$$

Theorem.
(i) $P_X\bigl(A \,\triangle\, \{x \in B : \tilde{C}^{\mathrm{Bayes}}(x) = C^{\mathrm{Bayes}}(x)\}\bigr) = 0$.

(ii) Now suppose there exist $\rho^* < 1/2$ and $a^* < 1$ such that $P_X\bigl(\{x \in S^c : \rho_0(x) + \rho_1(x) > 2\rho^*\}\bigr) = 0$, and
$$P_X\biggl(\biggl\{x \in B : \frac{\rho_1(x) - \rho_0(x)}{\{2\eta(x) - 1\}\{1 - \rho_0(x) - \rho_1(x)\}} > a^*\biggr\}\biggr) = 0.$$
Then, for any classifier $C$,
$$R(C) - R(C^{\mathrm{Bayes}}) \leq \frac{\tilde{R}(C) - \tilde{R}(\tilde{C}^{\mathrm{Bayes}})}{(1 - 2\rho^*)(1 - a^*)}.$$
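As a quick sanity check (a worked special case added here, not on the slide): with ρ-homogeneous noise we have $\rho_0(x) = \rho_1(x) = \rho < 1/2$, so the numerator $\rho_1(x) - \rho_0(x)$ vanishes and the conditions of (ii) hold with $\rho^* = \rho$ and $a^* = 0$. The bound then reads
$$R(C) - R(C^{\mathrm{Bayes}}) \leq \frac{\tilde{R}(C) - \tilde{R}(\tilde{C}^{\mathrm{Bayes}})}{1 - 2\rho},$$
so any classifier whose excess risk on the corrupted problem tends to zero is automatically consistent for the clean problem too.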
Discussion
- This result is particularly useful when the classifier $C$ is trained using the noisy labels, i.e. with $(X_1, \tilde{Y}_1), \ldots, (X_n, \tilde{Y}_n)$, since then the training and test data in $\tilde{R}(C)$ have the same distribution.
- We can then find conditions under which a classifier trained with imperfect labels will remain consistent for classifying uncorrupted test data points.

For specific classifiers and under stronger conditions, we can provide further control of the excess risk $R(C) - R(C^{\mathrm{Bayes}})$.
The k-nearest neighbour classifier
For $x \in \mathbb{R}^d$, let $(X_{(1)}, \tilde{Y}_{(1)}), \ldots, (X_{(n)}, \tilde{Y}_{(n)})$ be the reordering of the corrupted training data pairs such that
$$\|X_{(1)} - x\| \leq \ldots \leq \|X_{(n)} - x\|.$$
Define
$$C^{\mathrm{knn}}(x) := \begin{cases} 1 & \text{if } \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}_{\{\tilde{Y}_{(i)} = 1\}} \geq 1/2 \\ 0 & \text{otherwise.} \end{cases}$$

Corollary. Assume the conditions of part (ii) of the lemma. If $k = k_n \to \infty$, but $k/n \to 0$, then
$$R(C^{\mathrm{knn}}) - R(C^{\mathrm{Bayes}}) \to 0$$
as $n \to \infty$.
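A minimal practical counterpart (my own sketch using scikit-learn; the default $k \approx \sqrt{n}$ is just one choice satisfying $k \to \infty$ and $k/n \to 0$):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_on_noisy_labels(X_train, Y_tilde, X_test, k=None):
    """k-nearest neighbour classifier fitted to the corrupted labels Y_tilde."""
    n = len(Y_tilde)
    if k is None:
        k = max(1, int(np.sqrt(n)))   # k -> infinity with k/n -> 0, as in the corollary
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, Y_tilde)
    return clf.predict(X_test)        # predictions of the *true* class labels
```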
Further assumptions
- Label noise: Assume the conditions of part (ii) of the lemma and that
  $$\rho_0(x) = g(\eta(x)) \quad \text{and} \quad \rho_1(x) = g(1 - \eta(x)),$$
  where $g : (0, 1) \to [0, 1)$ is twice differentiable. Assume that $g'(1/2) > 2g(1/2) - 1$ and that $g''$ is uniformly continuous.
- Distribution (Cannings et al., 2018): Among other technical conditions, assume that $P_X$ has a density $f$, that $\eta$ is twice continuously differentiable with $\inf_{x_0 \in S} \|\eta'(x_0)\| > 0$, and that
  $$\int_{\mathbb{R}^d} \|x\|^\alpha f(x)\, dx < \infty.$$
- For $\beta \in (0, 1/2)$, let
  $$K_\beta := \{\lceil (n-1)^\beta \rceil, \ldots, \lfloor (n-1)^{1-\beta} \rfloor\}.$$
Asymptotic expansion
Theorem. Under our assumptions, we have two cases:

(i) Suppose that $d \geq 5$ and $\alpha > \frac{4d}{d-4}$, and let $\nu_{n,k} := k^{-1} + (k/n)^{4/d}$. Then there exist $B_1 = B_1(d, P) > 0$, $B_2 = B_2(d, P) \geq 0$ such that for each $\beta \in (0, 1/2)$,
$$R(C^{\mathrm{knn}}) - R(C^{\mathrm{Bayes}}) = \frac{B_1}{k\{1 - 2g(1/2) + g'(1/2)\}^2} + B_2 \Bigl(\frac{k}{n}\Bigr)^{4/d} + o(\nu_{n,k})$$
as $n \to \infty$, uniformly for $k \in K_\beta$.

(ii) Suppose that either $d \leq 4$, or, $d \geq 5$ and $\alpha \leq \frac{4d}{d-4}$. Then for each $\varepsilon > 0$ and $\beta \in (0, 1/2)$, we have
$$R(C^{\mathrm{knn}}) - R(C^{\mathrm{Bayes}}) = \frac{B_1}{k\{1 - 2g(1/2) + g'(1/2)\}^2} + o\Bigl(\frac{1}{k} + \Bigl(\frac{k}{n}\Bigr)^{\frac{\alpha}{\alpha + d} - \varepsilon}\Bigr)$$
as $n \to \infty$, uniformly for $k \in K_\beta$.
Relative asymptotic performance
Given $k$ to be used by the knn classifier in the noiseless case, let
$$k_g := \bigl\lfloor \{1 - 2g(1/2) + g'(1/2)\}^{-2d/(d+4)} k \bigr\rfloor.$$
This coupling reflects the ratio of the optimal choices of $k$ in the corrupted and uncorrupted settings.

Corollary. Under the assumptions of part (i) of the theorem, and provided $B_2 > 0$, we have that for any $\beta \in (0, 1/2)$,
$$\frac{R(C^{\mathrm{knn}}_{k_g}) - R(C^{\mathrm{Bayes}})}{R(C^{\mathrm{knn}}_{k}) - R(C^{\mathrm{Bayes}})} \to \frac{1}{\{1 - 2g(1/2) + g'(1/2)\}^{8/(d+4)}}$$
as $n \to \infty$, uniformly for $k \in K_\beta$, where the numerator is the excess risk of the knn classifier trained on the corrupted labels with $k_g$ neighbours and the denominator that of the knn classifier trained on perfect labels with $k$ neighbours.

If $g'(1/2) > 2g(1/2)$, then the label noise improves the asymptotic performance!
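A short heuristic for this coupling (my own derivation, filling in a step that is implicit on the slide): write $c := 1 - 2g(1/2) + g'(1/2)$. Part (i) of the theorem gives a leading-order excess risk of $B_1/(c^2 k) + B_2(k/n)^{4/d}$ when training on corrupted labels, and $B_1/k + B_2(k/n)^{4/d}$ with perfect labels. Setting the derivative in $k$ to zero in the corrupted case,
$$\frac{B_1}{c^2 k^2} = \frac{4 B_2}{d}\,\frac{k^{4/d - 1}}{n^{4/d}} \quad\Longrightarrow\quad k_{\mathrm{opt}} \propto c^{-2d/(d+4)}\, n^{4/(d+4)},$$
which is exactly the rescaling in $k_g$; substituting back shows that the minimised excess risk is multiplied by $c^{-8/(d+4)}$, matching the limit in the corollary.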
Intuition
For $x \in S^c$, we have
$$\tilde{\eta}(x) - 1/2 = \{1 - \rho_1(x)\}\eta(x) + \rho_0(x)\{1 - \eta(x)\} - 1/2
= \{\eta(x) - 1/2\}\biggl\{1 - \rho_0(x) - \rho_1(x) + \frac{\rho_0(x) - \rho_1(x)}{2\eta(x) - 1}\biggr\}.$$

But, writing $t := \eta(x) - 1/2$,
$$1 - \rho_0(x) - \rho_1(x) + \frac{\rho_0(x) - \rho_1(x)}{2\eta(x) - 1}
= 1 - g(1/2 + t) - g(1/2 - t) + \frac{g(1/2 + t) - g(1/2 - t)}{2t} \;\xrightarrow{\,t \to 0\,}\; 1 - 2g(1/2) + g'(1/2).$$
Estimated regret ratios
Model: $X \mid Y = r \sim N_5(\mu_r, I_5)$, where $\mu_1 = (3/2, 0, 0, 0, 0)^\top = -\mu_0$, $\pi_1 = 0.5$.
Labels: Let $g(1/2 + t) = 0 \vee \min\{g_0(1 + h_0 t), 2g_0\}$, then set $\rho_0(x) = g(\eta(x))$ and $\rho_1(x) = g(1 - \eta(x))$.

[Plot of estimated regret ratio against log(n) for the settings in the table below.]

| $g_0$ | $h_0$ | Asymptotic RR |
|-------|-------|---------------|
| 0.1   | −1    | 1.37          |
| 0.1   | 0     | 1.22          |
| 0.1   | 1     | 1.10          |
| 0.1   | 2     | 1             |
| 0.1   | 3     | 0.92          |
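The asymptotic regret ratios in the table follow from the corollary on the 'Relative asymptotic performance' slide, since here $d = 5$, $g(1/2) = g_0$ and $g'(1/2) = g_0 h_0$. A minimal numerical check (my own sketch):

```python
# Asymptotic regret ratio {1 - 2 g(1/2) + g'(1/2)}^{-8/(d+4)} for the noise
# function g(1/2 + t) = max(0, min(g0 * (1 + h0 * t), 2 * g0)).
d, g0 = 5, 0.1
for h0 in (-1, 0, 1, 2, 3):
    c = 1 - 2 * g0 + g0 * h0
    print(f"h0 = {h0:+d}: asymptotic RR = {c ** (-8 / (d + 4)):.2f}")
# Prints 1.37, 1.22, 1.10, 1.00 and 0.92, matching the table above.
```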
Support Vector Machines
Let $\mathcal{H}$ denote an RKHS, and let $L(y, t) := \max\{0, 1 - (2y - 1)t\}$ denote the hinge loss function. The SVM classifier is given by
$$C^{\mathrm{SVM}}(x) := \begin{cases} 1 & \text{if } \hat{f}(x) \geq 0 \\ 0 & \text{otherwise,} \end{cases}$$
where
$$\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{H}} \biggl\{\frac{1}{n} \sum_{i=1}^{n} L(\tilde{Y}_i, f(X_i)) + \lambda \|f\|_{\mathcal{H}}^2\biggr\}.$$

We focus on the case where $\mathcal{H}$ has the Gaussian radial basis reproducing kernel function $K(x, x') := \exp(-\sigma^2 \|x - x'\|^2)$, for $\sigma > 0$.
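A rough practical counterpart (my own sketch using scikit-learn's SVC; the translation $C \approx 1/(2n\lambda)$ and $\texttt{gamma} = \sigma^2$ is my assumption, and SVC additionally fits an unpenalised intercept, so the match to the formulation above is only approximate):

```python
from sklearn.svm import SVC

def svm_on_noisy_labels(X_train, Y_tilde, X_test, lam=1e-3, sigma=1.0):
    """Gaussian-kernel SVM fitted to the corrupted labels Y_tilde."""
    n = len(Y_tilde)
    clf = SVC(kernel="rbf",
              C=1.0 / (2 * n * lam),    # approximate translation of the penalty lambda
              gamma=sigma ** 2)         # K(x, x') = exp(-gamma * ||x - x'||^2)
    clf.fit(X_train, Y_tilde)
    return clf.predict(X_test)
```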
SVM asymptotic analysis
If $P_X$ is compactly supported and $\lambda = \lambda_n$ is chosen appropriately then this SVM classifier is consistent in the uncorrupted labels case (Steinwart, 2005).

Corollary. Assume the conditions of our lemma, and suppose that $P_X$ is compactly supported. If $\lambda = \lambda_n \to 0$ but $\frac{n\lambda_n}{|\log \lambda_n|^{d+1}} \to \infty$, then
$$R(C^{\mathrm{SVM}}) - R(C^{\mathrm{Bayes}}) \to 0$$
as $n \to \infty$.
SVM assumptions
1. We say that the distribution $P$ satisfies the margin assumption with parameter $\gamma_1 \in [0, \infty)$ if there exists $\kappa_1 > 0$ such that
$$P_X\bigl(\{x \in \mathbb{R}^d : 0 < |\eta(x) - 1/2| \leq t\}\bigr) \leq \kappa_1 t^{\gamma_1}$$
for all $t > 0$.

2. Let $S_+ := \{x \in \mathbb{R}^d : \eta(x) > 1/2\}$ and $S_- := \{x \in \mathbb{R}^d : \eta(x) < 1/2\}$, and for $x \in \mathbb{R}^d$, let $\tau_x := \inf_{x' \in S \cup S_+} \|x - x'\| + \inf_{x' \in S \cup S_-} \|x - x'\|$. Say $P$ has geometric noise exponent $\gamma_2 \in [0, \infty)$ if there exists $\kappa_2 > 0$ such that
$$\int_{\mathbb{R}^d} |2\eta(x) - 1| \exp\Bigl(-\frac{\tau_x^2}{t^2}\Bigr)\, dP_X(x) \leq \kappa_2 t^{\gamma_2 d}$$
for all $t > 0$.
Rate of convergence
With perfect labels and when $P_X\bigl(B(0, 1)\bigr) = 1$, the excess risk of the SVM classifier is $O(n^{-\Gamma + \varepsilon})$ for every $\varepsilon > 0$, where
$$\Gamma := \begin{cases} \frac{\gamma_2}{2\gamma_2 + 1} & \text{if } \gamma_2 \leq \frac{\gamma_1 + 2}{2\gamma_1} \\ \frac{2\gamma_2(\gamma_1 + 1)}{2\gamma_2(\gamma_1 + 2) + 3\gamma_1 + 4} & \text{otherwise} \end{cases}$$
(Steinwart and Scovel, 2007).

Theorem. Suppose that $P$ has margin parameter $\gamma_1 \in [0, \infty]$, geometric noise exponent $\gamma_2 \in (0, \infty)$ and $P_X\bigl(B(0, 1)\bigr) = 1$. Assume the conditions of the lemma and that $\rho_0(x) = g(\eta(x))$, $\rho_1(x) = g(1 - \eta(x))$, where $g : (0, 1) \to [0, 1)$ is differentiable at $1/2$.

Let $\lambda = \lambda_n := n^{-(\gamma_2 + 1)\Gamma/\gamma_2}$ and $\sigma = \sigma_n := n^{\Gamma/(\gamma_2 d)}$. Then
$$R(C^{\mathrm{SVM}}) - R(C^{\mathrm{Bayes}}) = O(n^{-\Gamma + \varepsilon})$$
as $n \to \infty$, for every $\varepsilon > 0$.
Linear Discriminant Analysis
Suppose that $P_r = N_d(\mu_r, \Sigma)$ for $r = 0, 1$. Then
$$C^{\mathrm{Bayes}}(x) = \begin{cases} 1 & \text{if } \log\bigl(\frac{\pi_1}{\pi_0}\bigr) + \bigl(x - \frac{\mu_0 + \mu_1}{2}\bigr)^\top \Sigma^{-1}(\mu_1 - \mu_0) \geq 0 \\ 0 & \text{otherwise.} \end{cases}$$

Define
$$C^{\mathrm{LDA}}(x) := \begin{cases} 1 & \text{if } \log\bigl(\frac{\hat{\pi}_1}{\hat{\pi}_0}\bigr) + \bigl(x - \frac{\hat{\mu}_0 + \hat{\mu}_1}{2}\bigr)^\top \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0) \geq 0 \\ 0 & \text{otherwise,} \end{cases}$$
where $\hat{\pi}_r := n^{-1} \sum_{i=1}^{n} \mathbb{1}_{\{\tilde{Y}_i = r\}}$, $\hat{\mu}_r := \sum_{i=1}^{n} X_i \mathbb{1}_{\{\tilde{Y}_i = r\}} / \sum_{i=1}^{n} \mathbb{1}_{\{\tilde{Y}_i = r\}}$, and
$$\hat{\Sigma} := \frac{1}{n - 2} \sum_{i=1}^{n} \sum_{r=0}^{1} (X_i - \hat{\mu}_r)(X_i - \hat{\mu}_r)^\top \mathbb{1}_{\{\tilde{Y}_i = r\}}.$$
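A direct transcription of these estimators (my own sketch, assuming `X` is an $n \times d$ NumPy array and `Y_tilde` a 0/1 integer vector of observed labels):

```python
import numpy as np

def lda_train(X, Y_tilde):
    """Plug-in quantities for the LDA classifier, computed from corrupted labels."""
    n = len(Y_tilde)
    pi_hat = np.array([np.mean(Y_tilde == r) for r in (0, 1)])
    mu_hat = np.array([X[Y_tilde == r].mean(axis=0) for r in (0, 1)])
    centred = X - mu_hat[Y_tilde]                 # X_i - mu_hat_{Y_tilde_i}
    Sigma_hat = centred.T @ centred / (n - 2)     # pooled within-class covariance
    return pi_hat, mu_hat, Sigma_hat

def lda_classify(x, pi_hat, mu_hat, Sigma_hat):
    """Evaluate the LDA discriminant from the slide at a point x."""
    direction = np.linalg.solve(Sigma_hat, mu_hat[1] - mu_hat[0])
    score = np.log(pi_hat[1] / pi_hat[0]) + (x - mu_hat.mean(axis=0)) @ direction
    return int(score >= 0)
```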
LDA asymptotic analysis
Theorem. Assume we have ρ-homogeneous noise ($\rho < 1/2$) and suppose that $P_r = N_d(\mu_r, \Sigma)$, for $r = 0, 1$. Then
$$\lim_{n \to \infty} C^{\mathrm{LDA}}(x) = \begin{cases} 1 & \text{if } c_0 + \bigl(x - \frac{\mu_0 + \mu_1}{2}\bigr)^\top \Sigma^{-1}(\mu_1 - \mu_0) > 0 \\ 0 & \text{if } c_0 + \bigl(x - \frac{\mu_0 + \mu_1}{2}\bigr)^\top \Sigma^{-1}(\mu_1 - \mu_0) < 0, \end{cases}$$
where $c_0$ can be expressed in terms of $\Delta^2 := (\mu_1 - \mu_0)^\top \Sigma^{-1}(\mu_1 - \mu_0)$, $\rho$ and $\pi_1$. As a consequence,
$$\lim_{n \to \infty} R(C^{\mathrm{LDA}}) = \pi_0 \Phi\Bigl(\frac{c_0}{\Delta} - \frac{\Delta}{2}\Bigr) + \pi_1 \Phi\Bigl(-\frac{c_0}{\Delta} - \frac{\Delta}{2}\Bigr) \geq R(C^{\mathrm{Bayes}}), \qquad (1)$$
with equality if $\pi_0 = \pi_1 = 1/2$. Moreover, for each $\rho \in (0, 1/2)$ and $\pi_0 \neq \pi_1$, there is a unique value of $\Delta > 0$ for which we have equality in (1).
LDA with ρ-homogeneous noise
[Plot of misclassification error against log(n).]

Here, $X \mid \{Y = r\} \sim N_5(\mu_r, I_5)$, where $\mu_1 = (\frac{3}{2}, 0, \ldots, 0)^\top = -\mu_0 \in \mathbb{R}^5$, and $\pi_1 = 0.9$.
No label noise (black), ρ-homogeneous noise for ρ = 0.1 (red), 0.2 (blue), 0.3 (green) and 0.4 (purple). The dotted lines show our asymptotic limit.
Summary
- The knn and SVM classifiers remain consistent with label noise under mild assumptions on the noise mechanism and data distribution.
- Under stronger conditions, the rate of convergence of the excess risk for these classifiers is preserved.
- However, the LDA classifier is typically not consistent, unless the class priors are equal (even with homogeneous noise).

Main reference:

- Cannings, T. I., Fan, Y. and Samworth, R. J. (2018) Classification with imperfect training labels. https://arxiv.org/abs/1805.11505.
Other references

- Cannings, T. I., Berrett, T. B. and Samworth, R. J. (2018) Local nearest neighbour classification with applications to semi-supervised learning. https://arxiv.org/abs/1704.00642v2.
- Frénay, B. and Kabán, A. (2014) A comprehensive introduction to label noise. Proc. Euro. Symp. Artificial Neural Networks, 667–676.
- Frénay, B. and Verleysen, M. (2014) Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst., 25, 845–869.
- Ghosh, A., Manwani, N. and Sastry, P. S. (2015) Making risk minimization tolerant to label noise. Neurocomputing, 160, 93–107.
- Lachenbruch, P. A. (1966) Discriminant analysis when the initial samples are misclassified. Technometrics, 8, 657–662.
- Okamoto, S. and Nobuhiro, Y. (1997) An average-case analysis of the k-nearest neighbor classifier for noisy domains. In Proc. 15th Int. Joint Conf. Artif. Intell., 1, 238–243.
- Steinwart, I. (2005) Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Inf. Theory, 51, 128–142.
- Steinwart, I. and Scovel, C. (2007) Fast rates for support vector machines using Gaussian kernels. Ann. Statist., 35, 575–607.