Transcript of arXiv:2105.00303v1 [cs.LG] 1 May 2021
RATT: Leveraging Unlabeled Data to Guarantee Generalization
Saurabh Garg 1 Sivaraman Balakrishnan 1 J. Zico Kolter 1 Zachary C. Lipton 1
Abstract

To assess generalization, machine learning scientists typically either (i) bound the generalization gap and then (after training) plug in the empirical risk to obtain a bound on the true risk; or (ii) validate empirically on holdout data. However, (i) typically yields vacuous guarantees for overparameterized models; and (ii) shrinks the training set and its guarantee erodes with each re-use of the holdout set. In this paper, we leverage unlabeled data to produce generalization bounds. After augmenting our (labeled) training set with randomly labeled data, we train in the standard fashion. Whenever classifiers achieve low error on the clean data but high error on the random data, our bound ensures that the true risk is low. We prove that our bound is valid for 0-1 empirical risk minimization and with linear classifiers trained by gradient descent. Our approach is especially useful in conjunction with deep learning due to the early learning phenomenon whereby networks fit true labels before noisy labels, but requires one intuitive assumption. Empirically, on canonical computer vision and NLP tasks, our bound provides non-vacuous generalization guarantees that track actual performance closely. This work enables practitioners to certify generalization even when (labeled) holdout data is unavailable and provides insights into the relationship between random label noise and generalization. Code is available at https://github.com/acmi-lab/RATT_generalization_bound.
1. Introduction

Typically, machine learning scientists establish generalization in one of two ways. One approach, favored by learning theorists, places an a priori bound on the gap between the empirical and true risks, usually in terms of the complexity of the hypothesis class. After fitting the model on the available data, one can plug in the empirical risk to obtain a guarantee on the true risk. The second approach, favored by practitioners, involves splitting the available data into training and holdout partitions, fitting the models on the former and estimating the population risk with the latter.

1Carnegie Mellon University. Correspondence to: Saurabh Garg <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).
Surely, both approaches are useful, with the former providing theoretical insights and the latter guiding the development of a vast array of practical technology. Nevertheless, both methods have drawbacks. Most a priori generalization bounds rely on uniform convergence and thus fail to explain the ability of overparameterized networks to generalize (Zhang et al., 2016; Nagarajan & Kolter, 2019b). On the other hand, provisioning a holdout dataset restricts the amount of labeled data available for training. Moreover, risk estimates based on holdout sets lose their validity with successive re-use of the holdout data due to adaptive overfitting (Murphy, 2012; Dwork et al., 2015; Blum & Hardt, 2015). However, recent empirical studies suggest that on large benchmark datasets, adaptive overfitting is surprisingly absent (Recht et al., 2019).
In this paper, we propose Randomly Assign, Train and Track (RATT), a new method that leverages unlabeled data to provide a post-training bound on the true risk (i.e., the population error). Here, we assign random labels to a fresh batch of unlabeled data, augmenting the clean training dataset with these randomly labeled points. Next, we train on this data, following standard risk minimization practices. Finally, we track the error on the randomly labeled portion of training data, estimating the error on the mislabeled portion and using this quantity to upper bound the population error.
Counterintuitively, we guarantee generalization by guaranteeing overfitting. Specifically, we prove that Empirical Risk Minimization (ERM) with 0-1 loss leads to lower error on the mislabeled training data than on the mislabeled population. Thus, if despite minimizing the loss on the combined training data, we nevertheless have high error on the mislabeled portion, then the (mislabeled) population error will be even higher. Then, by complementarity, the (clean) population error must be low. Finally, we show how to obtain this guarantee using randomly labeled (vs. mislabeled) data, thus enabling us to incorporate unlabeled data.
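The three steps of RATT (randomly assign, train, track) can be sketched end-to-end. The sketch below uses a toy nearest-centroid classifier on synthetic Gaussian data purely for illustration; the data, the model choice, and all variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 0: synthetic binary task with labels in {-1, +1}.
n, m, d = 1000, 200, 10
mu = np.ones(d)
y_clean = rng.choice([-1, 1], size=n)
x_clean = y_clean[:, None] * mu + rng.normal(size=(n, d))

# Step 1 (Randomly Assign): draw fresh points from the input marginal
# and assign their labels uniformly at random.
x_unlabeled = rng.choice([-1, 1], size=m)[:, None] * mu + rng.normal(size=(m, d))
y_random = rng.choice([-1, 1], size=m)

# Step 2 (Train): fit on the union of clean and randomly labeled data.
X = np.vstack([x_clean, x_unlabeled])
Y = np.concatenate([y_clean, y_random])
w = (X * Y[:, None]).mean(axis=0)  # toy class-mean (centroid) direction

# Step 3 (Track): error on the clean vs. randomly labeled portions.
err_clean = np.mean(np.sign(x_clean @ w) != y_clean)
err_random = np.mean(np.sign(x_unlabeled @ w) != y_random)
# Low err_clean together with err_random near 0.5 is the signature
# that certifies low population error.
```

Because the toy classifier learns the true class structure rather than memorizing the random labels, `err_clean` is near 0 while `err_random` hovers near chance (0.5), which is exactly the regime in which the bound developed below is tight.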
To expand the applicability of our idea beyond ERM on 0-1
[Figure 1: plot omitted in transcript (accuracy vs. epoch curves for MLP and ResNet, showing test accuracy and the predicted bound).]

Figure 1. Predicted lower bound on the clean population error with ResNet and MLP on binary CIFAR. Results aggregated over 5 seeds. '*' denotes the best test performance achieved when training with only clean data and the same hyperparameters (except for the stopping point). The bound predicted by RATT (RHS in (2)) closely tracks the population accuracy on clean data.
error, we prove corresponding results for a linear classifier trained by gradient descent to minimize squared loss. Furthermore, leveraging the connection between early stopping and $\ell_2$-regularization in linear models (Ali et al., 2018; 2020; Suggala et al., 2018), our results extend to early-stopped gradient descent. Because we make no assumptions on the data distribution, our results on linear models hold for more complex models such as kernel regression and neural networks in the Neural Tangent Kernel (NTK) regime (Jacot et al., 2018; Du et al., 2018; 2019; Allen-Zhu et al., 2019b; Chizat et al., 2019).
Addressing practical deep learning models, our guarantee requires an additional (reasonable) assumption. Our experiments show that the bound yields non-vacuous guarantees that track test error across several major architectures on a range of benchmark datasets for computer vision and Natural Language Processing (NLP). Because, in practice, overparameterized deep networks exhibit an early learning phenomenon, fitting clean data before mislabeled data (Liu et al., 2020; Arora et al., 2019; Li et al., 2019), our procedure yields tight bounds in the early phases of learning. Experimentally, we confirm the early learning phenomenon in standard Stochastic Gradient Descent (SGD) training and illustrate the effectiveness of weight decay combined with large initial learning rates in avoiding interpolation to mislabeled data while maintaining fit on the training data, strengthening the guarantee provided by our method.
To be clear, we do not advocate RATT as a blanket replacement for the holdout approach. Our main contribution is to introduce a new theoretical perspective on generalization and to provide a method that may be applicable even when the holdout approach is unavailable. Of interest, unlike generalization bounds based on uniform convergence that restrict the complexity of the hypothesis class (Neyshabur et al., 2018; 2015; 2017b; Bartlett et al., 2017; Nagarajan & Kolter, 2019a), our post hoc bounds depend only on the fit to mislabeled data. We emphasize that our theory does not guarantee a priori that early learning should take place but only a posteriori that when it does, we can provide non-vacuous bounds on the population error. Conceptually, this finding underscores the significance of the early learning phenomenon in the presence of noisy labels and motivates further work to explain why it occurs.
2. Preliminaries

By $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ we denote the Euclidean norm and inner product, respectively. For a vector $v \in \mathbb{R}^d$, we use $v_j$ to denote its $j$th entry, and for an event $E$ we let $\mathbb{I}[E]$ denote the binary indicator of the event.

Suppose we have a multiclass classification problem with input domain $\mathcal{X} \subseteq \mathbb{R}^d$ and label space $\mathcal{Y} = \{1, 2, \ldots, k\}$.¹ By $\mathcal{D}$, we denote the distribution over $\mathcal{X} \times \mathcal{Y}$. A dataset $S := \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$ contains $n$ points sampled i.i.d. from $\mathcal{D}$. By $S$, $T$, and $\widetilde{S}$, we denote the (uniform) empirical distribution over points in datasets $S$, $T$, and $\widetilde{S}$, respectively. Let $\mathcal{F}$ be a class of hypotheses mapping $\mathcal{X}$ to $\mathbb{R}^k$. A training algorithm $\mathcal{A}$ takes a dataset $S$ and returns a classifier $f(\mathcal{A}, S) \in \mathcal{F}$. When the context is clear, we drop the parentheses for convenience. Given a classifier $f$ and datum $(x, y)$, we denote the 0-1 error (i.e., classification error) on that point by $\mathcal{E}(f(x), y) := \mathbb{I}\left[y \notin \operatorname{argmax}_{j \in \mathcal{Y}} f_j(x)\right]$. We express the population error on $\mathcal{D}$ as $\mathcal{E}_{\mathcal{D}}(f) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathcal{E}(f(x), y)\right]$ and the empirical error on $S$ as $\mathcal{E}_S(f) := \mathbb{E}_{(x,y)\sim S}\left[\mathcal{E}(f(x), y)\right] = \frac{1}{n}\sum_{i=1}^n \mathcal{E}(f(x_i), y_i)$.

Throughout, we consider a random label assignment procedure: draw $x \sim \mathcal{D}_{\mathcal{X}}$ (the underlying distribution over $\mathcal{X}$), and then assign a label sampled uniformly at random. We denote a randomly labeled dataset by $\widetilde{S} := \{(x_i, y_i)\}_{i=1}^m \sim \widetilde{\mathcal{D}}^m$, where $\widetilde{\mathcal{D}}$ is the distribution of randomly labeled data. By $\mathcal{D}'$, we denote the mislabeled distribution that corresponds to selecting examples $(x, y)$ according to $\mathcal{D}$ and then re-assigning the label by sampling among the incorrect labels $y' \neq y$ (renormalizing the label marginal).
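As a concrete illustration, the random label assignment procedure above can be sketched in a few lines. This is a minimal NumPy sketch; the function name and the array-based dataset representation are our own illustrative choices, not from the paper:

```python
import numpy as np

def randomly_label(x_unlabeled, num_classes, rng=None):
    """Assign labels uniformly at random to unlabeled inputs.

    Simulates a draw from the distribution D-tilde described in the
    text: x ~ D_X, then y ~ Uniform over the label space.
    """
    rng = np.random.default_rng(rng)
    m = len(x_unlabeled)
    if num_classes == 2:
        # Binary case uses labels in {-1, +1} (see footnote 1).
        y = rng.choice([-1, 1], size=m)
    else:
        y = rng.integers(1, num_classes + 1, size=m)
    return x_unlabeled, y

# Usage: 100 unlabeled 5-dimensional points, 10 classes.
x = np.random.default_rng(0).normal(size=(100, 5))
_, y_rand = randomly_label(x, num_classes=10, rng=0)
```

In the multiclass case roughly a $1/k$ fraction of these labels will coincide with the true labels (the portion denoted $\widetilde{S}_C$ later), which is exactly the structure the analysis exploits.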
3. Generalization Bound for RATT with ERM

We now present our generalization bound and proof sketches for ERM on the 0-1 loss (full proofs in App. A). For any dataset $T$, ERM returns the classifier $\hat{f}$ that minimizes the empirical error:

$$\hat{f} := \operatorname{argmin}_{f \in \mathcal{F}} \mathcal{E}_T(f). \quad (1)$$

¹For binary classification, we use $\mathcal{Y} = \{-1, 1\}$.
We focus first on binary classification. Assume we have a clean dataset $S \sim \mathcal{D}^n$ of $n$ points and a randomly labeled dataset $\widetilde{S} \sim \widetilde{\mathcal{D}}^m$ of $m$ ($< n$) points, with labels in $\widetilde{S}$ assigned uniformly at random. We show that with 0-1 loss minimization on the union of $S$ and $\widetilde{S}$, we obtain a classifier whose error on $\mathcal{D}$ is upper bounded by a function of the empirical errors on clean data $\mathcal{E}_S$ (lower is better) and on randomly labeled data $\mathcal{E}_{\widetilde{S}}$ (higher is better):

Theorem 1. For any classifier $\hat{f}$ obtained by ERM (1) on dataset $S \cup \widetilde{S}$, for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\mathcal{E}_{\mathcal{D}}(\hat{f}) \le \mathcal{E}_S(\hat{f}) + 1 - 2\mathcal{E}_{\widetilde{S}}(\hat{f}) + \left(\sqrt{2\mathcal{E}_{\widetilde{S}}(\hat{f})} + 2 + \frac{m}{2n}\right)\sqrt{\frac{\log(4/\delta)}{m}}. \quad (2)$$

In short, this theorem tells us that if after training on both clean and randomly labeled data, we achieve low error on the clean data but high error (close to $1/2$) on the randomly labeled data, then low population error is guaranteed. Note that because the labels in $\widetilde{S}$ are assigned randomly, the error $\mathcal{E}_{\widetilde{S}}(f)$ for any fixed predictor $f$ (not dependent on $\widetilde{S}$) will be approximately $1/2$. Thus, if ERM produces a classifier that has not fit to the randomly labeled data, then $(1 - 2\mathcal{E}_{\widetilde{S}}(\hat{f}))$ will be approximately 0, and our error will be determined by the fit to clean data. The final term accounts for finite sample error; notably, it (i) does not depend on the complexity of the hypothesis class, and (ii) approaches 0 at a $O(1/\sqrt{m})$ rate (for $m < n$).
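To make the bound concrete, the RHS of (2) can be computed directly from the two empirical errors. This is a minimal sketch; the function name and the example numbers are ours:

```python
import math

def ratt_bound(err_clean, err_random, n, m, delta=0.1):
    """Upper bound on the clean population error, per Eq. (2).

    err_clean:  empirical 0-1 error on the n clean training points.
    err_random: empirical 0-1 error on the m randomly labeled points.
    """
    slack = (math.sqrt(2 * err_random) + 2 + m / (2 * n)) \
        * math.sqrt(math.log(4 / delta) / m)
    return err_clean + 1 - 2 * err_random + slack

# A classifier that fits the clean data well (5% error) while staying
# near chance (45% error) on the randomly labeled points gets a small
# bound on population error; one that memorizes the random labels
# (error 0 on them) gets a vacuous bound (> 1).
tight = ratt_bound(err_clean=0.05, err_random=0.45, n=50000, m=5000)
vacuous = ratt_bound(err_clean=0.0, err_random=0.0, n=50000, m=5000)
```

With these illustrative numbers, `tight` comes out well below 1 while `vacuous` exceeds 1, matching the qualitative discussion above.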
Our proof strategy unfolds in three steps. First, in Lemma 1 we bound $\mathcal{E}_{\mathcal{D}}(\hat{f})$ in terms of the error on the mislabeled subset of $\widetilde{S}$. Next, in Lemmas 2 and 3, we show that the error on the mislabeled subset can be accurately estimated using only clean and randomly labeled data.

To begin, assume that we actually knew the original labels for the randomly labeled data. By $\widetilde{S}_C$ and $\widetilde{S}_M$, we denote the clean and mislabeled portions of the randomly labeled data, respectively (with $\widetilde{S} = \widetilde{S}_M \cup \widetilde{S}_C$). Note that for binary classification, a lower bound on the mislabeled population error $\mathcal{E}_{\mathcal{D}'}(\hat{f})$ directly upper bounds the error on the original population $\mathcal{E}_{\mathcal{D}}(\hat{f})$. Thus we only need to prove that the empirical error on the mislabeled portion of our data is lower than the error on unseen mislabeled data, i.e., $\mathcal{E}_{\widetilde{S}_M}(\hat{f}) \le \mathcal{E}_{\mathcal{D}'}(\hat{f}) = 1 - \mathcal{E}_{\mathcal{D}}(\hat{f})$ (up to $O(1/\sqrt{m})$).
Lemma 1. Assume the same setup as in Theorem 1. Then for any $\delta > 0$, with probability at least $1 - \delta$ over the random draws of mislabeled data $\widetilde{S}_M$, we have

$$\mathcal{E}_{\mathcal{D}}(\hat{f}) \le 1 - \mathcal{E}_{\widetilde{S}_M}(\hat{f}) + \sqrt{\frac{\log(1/\delta)}{m}}. \quad (3)$$
Proof Sketch. The main idea of our proof is to regard the clean portion of the data ($S \cup \widetilde{S}_C$) as fixed. Then, there exists a classifier $f^*$ that is optimal over draws of the mislabeled data $\widetilde{S}_M$. Formally,

$$f^* := \operatorname{argmin}_{f \in \mathcal{F}} \mathcal{E}_{\overline{\mathcal{D}}}(f),$$

where $\overline{\mathcal{D}}$ is a combination of the empirical distribution over correctly labeled data $S \cup \widetilde{S}_C$ and the (population) distribution over mislabeled data $\mathcal{D}'$. Recall that $\hat{f} := \operatorname{argmin}_{f \in \mathcal{F}} \mathcal{E}_{S \cup \widetilde{S}}(f)$. Since $\hat{f}$ minimizes 0-1 error on $S \cup \widetilde{S}$, we have $\mathcal{E}_{S \cup \widetilde{S}}(\hat{f}) \le \mathcal{E}_{S \cup \widetilde{S}}(f^*)$. Moreover, since $f^*$ is independent of $\widetilde{S}_M$, we have with probability at least $1 - \delta$ that

$$\mathcal{E}_{\widetilde{S}_M}(f^*) \le \mathcal{E}_{\mathcal{D}'}(f^*) + \sqrt{\frac{\log(1/\delta)}{m}}.$$

Finally, since $f^*$ is the optimal classifier on $\overline{\mathcal{D}}$, we have $\mathcal{E}_{\overline{\mathcal{D}}}(f^*) \le \mathcal{E}_{\overline{\mathcal{D}}}(\hat{f})$. Combining the above steps and using the fact that $\mathcal{E}_{\mathcal{D}} = 1 - \mathcal{E}_{\mathcal{D}'}$, we obtain the desired result.
While the LHS in (3) depends on the unknown portion $\widetilde{S}_M$, our goal is to use unlabeled data (with randomly assigned labels) for which the mislabeled portion cannot be readily identified. Fortunately, we do not need to identify the mislabeled points to estimate the error on these points in aggregate, $\mathcal{E}_{\widetilde{S}_M}(\hat{f})$. Note that because the label marginal is uniform, approximately half of the data will be correctly labeled and the remaining half will be mislabeled. Consequently, we can utilize the value of $\mathcal{E}_{\widetilde{S}}(\hat{f})$ and an estimate of $\mathcal{E}_{\widetilde{S}_C}(\hat{f})$ to lower bound $\mathcal{E}_{\widetilde{S}_M}(\hat{f})$. We formalize this as follows:
Lemma 2. Assume the same setup as Theorem 1. Then for any $\delta > 0$, with probability at least $1 - \delta$ over the random draws of $\widetilde{S}$, we have

$$\left|2\mathcal{E}_{\widetilde{S}}(\hat{f}) - \mathcal{E}_{\widetilde{S}_C}(\hat{f}) - \mathcal{E}_{\widetilde{S}_M}(\hat{f})\right| \le 2\sqrt{\frac{\mathcal{E}_{\widetilde{S}}(\hat{f})\log(4/\delta)}{2m}}.$$
To complete the argument, we show that due to the exchangeability of the clean data $S$ and the clean portion of the randomly labeled data $\widetilde{S}_C$, we can estimate the error on the latter, $\mathcal{E}_{\widetilde{S}_C}(\hat{f})$, by the error on the former, $\mathcal{E}_S(\hat{f})$.

Lemma 3. Assume the same setup as Theorem 1. Then for any $\delta > 0$, with probability at least $1 - \delta$ over the random draws of $\widetilde{S}_C$ and $S$, we have

$$\left|\mathcal{E}_{\widetilde{S}_C}(\hat{f}) - \mathcal{E}_S(\hat{f})\right| \le \left(1 + \frac{m}{2n}\right)\sqrt{\frac{\log(2/\delta)}{m}}.$$
Lemma 3 establishes a tight bound on the difference of the error of classifier $\hat{f}$ on $\widetilde{S}_C$ and on $S$. The proof uses Hoeffding's inequality for randomly sampled points from a fixed population (Hoeffding, 1994; Bardenet et al., 2015).

Having established these core components, we can now summarize the proof strategy for Theorem 1. We bound the population error on clean data (the term on the LHS of (2)) in three steps: (i) use Lemma 1 to upper bound the error on the clean distribution, $\mathcal{E}_{\mathcal{D}}(\hat{f})$, by the error on mislabeled training data, $\mathcal{E}_{\widetilde{S}_M}(\hat{f})$; (ii) approximate $\mathcal{E}_{\widetilde{S}_M}(\hat{f})$ by $\mathcal{E}_{\widetilde{S}_C}(\hat{f})$ and the error on randomly labeled training data (i.e., $\mathcal{E}_{\widetilde{S}}(\hat{f})$) using Lemma 2; and (iii) use Lemma 3 to estimate $\mathcal{E}_{\widetilde{S}_C}(\hat{f})$ using the error on clean training data ($\mathcal{E}_S(\hat{f})$).
Comparison with Rademacher bound. Our bound in Theorem 1 shows that we can upper bound the clean population error of a classifier by estimating its accuracy on the clean and randomly labeled portions of the training data. Next, we show that our bound's dominating term is upper bounded by the Rademacher complexity (Shalev-Shwartz & Ben-David, 2014), a standard distribution-dependent complexity measure.

Proposition 1. Fix a randomly labeled dataset $\widetilde{S} \sim \widetilde{\mathcal{D}}^m$. Then for any classifier $f \in \mathcal{F}$ (possibly dependent on $\widetilde{S}$)² and for any $\delta > 0$, with probability at least $1 - \delta$ over random draws of $\widetilde{S}$, we have

$$1 - 2\mathcal{E}_{\widetilde{S}}(f) \le \mathbb{E}_{\varepsilon, x}\left[\sup_{f \in \mathcal{F}} \frac{\sum_i \varepsilon_i f(x_i)}{m}\right] + \sqrt{\frac{2\log(2/\delta)}{m}},$$

where $\varepsilon$ is drawn from a uniform distribution over $\{-1, 1\}^m$ and $x$ is drawn from $\mathcal{D}_{\mathcal{X}}^m$.
In other words, the proposition above highlights that the accuracy on the randomly labeled data is never larger than the Rademacher complexity of $\mathcal{F}$ w.r.t. the underlying distribution over $\mathcal{X}$, implying that our bound is never looser than a bound based on Rademacher complexity. The proof follows by application of the bounded difference condition and McDiarmid's inequality (McDiarmid, 1989). We now discuss extensions of Theorem 1 to regularized ERM and multiclass classification.
Extension to regularized ERM. Consider any function $R: \mathcal{F} \to \mathbb{R}$, e.g., a regularizer that penalizes some measure of complexity for functions in class $\mathcal{F}$. Consider the following regularized ERM:

$$\hat{f} := \operatorname{argmin}_{f \in \mathcal{F}} \mathcal{E}_S(f) + \lambda R(f), \quad (4)$$

where $\lambda$ is a regularization constant. If the regularization coefficient is independent of the training data $S \cup \widetilde{S}$, then our guarantee (Theorem 1) holds. Formally,

Theorem 2. For any regularization function $R$, assume we perform regularized ERM as in (4) on $S \cup \widetilde{S}$ and obtain a classifier $\hat{f}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\mathcal{E}_{\mathcal{D}}(\hat{f}) \le \mathcal{E}_S(\hat{f}) + 1 - 2\mathcal{E}_{\widetilde{S}}(\hat{f}) + \left(\sqrt{2\mathcal{E}_{\widetilde{S}}(\hat{f})} + 2 + \frac{m}{2n}\right)\sqrt{\frac{\log(1/\delta)}{m}}.$$

²We restrict $\mathcal{F}$ to functions which output a label in $\mathcal{Y} = \{-1, 1\}$.
A key insight here is that the proof of Theorem 1 treats the clean data $S$ as fixed and considers random draws of the mislabeled portion. Thus a data-independent regularization function does not alter our chain of arguments and hence has no impact on the resulting inequality. We prove this result formally in App. A.

We note one immediate corollary of Theorem 2: when learning any function $f$ parameterized by $w$ with an $L_2$-norm penalty on the parameters $w$, the population error of $\hat{f}$ is determined by the error on the clean training data as long as the error on randomly labeled data is high (close to $1/2$).
Extension to multiclass classification. Thus far, we have addressed binary classification. We now extend these results to the multiclass setting. As before, we obtain datasets $S$ and $\widetilde{S}$. Here, random labels are assigned uniformly among all classes.

Theorem 3. For any regularization function $R$, assume we perform regularized ERM as in (4) on $S \cup \widetilde{S}$ and obtain a classifier $\hat{f}$. For a multiclass classification problem with $k$ classes, for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\mathcal{E}_{\mathcal{D}}(\hat{f}) \le \mathcal{E}_S(\hat{f}) + (k-1)\left(1 - \frac{k}{k-1}\mathcal{E}_{\widetilde{S}}(\hat{f})\right) + c\sqrt{\frac{\log(4/\delta)}{2m}}, \quad (5)$$

for some constant $c \le 2k + \sqrt{k} + \frac{m}{n}\sqrt{k}$.

We first discuss the implications of Theorem 3. Besides the empirical error on clean data, the dominating term in the above expression is given by $(k-1)\left(1 - \frac{k}{k-1}\mathcal{E}_{\widetilde{S}}(\hat{f})\right)$. For any fixed predictor $f$ (not dependent on $\widetilde{S}$), the error $\mathcal{E}_{\widetilde{S}}(f)$ would be approximately $(k-1)/k$, while for $\hat{f}$ the difference evaluates to the accuracy on the randomly labeled data. Note that for binary classification, (5) simplifies to Theorem 1.

The core of our proof involves obtaining an inequality similar to (3). While for binary classification, we could upper bound $\mathcal{E}_{\widetilde{S}_M}$ with $1 - \mathcal{E}_{\mathcal{D}}$ (in the proof of Lemma 1), for multiclass classification, error on the mislabeled data and accuracy on the clean data in the population are not so directly related. To establish an inequality analogous to (3), we break the error on the (unknown) mislabeled data into two parts: one term corresponds to predicting the true label on mislabeled data, and the other corresponds to predicting neither the true label nor the assigned (mis-)label. Finally, we relate these errors to their population counterparts to establish an inequality similar to (3).
4. Generalization Bound for RATT with Gradient Descent

In the previous section, we presented results with ERM on 0-1 loss. While minimizing the 0-1 loss is hard in general, these results provide important theoretical insights. In this section, we show parallel results for linear models trained with Gradient Descent (GD).

To begin, we introduce the setup and some additional notation. For simplicity, we begin the discussion with binary classification with $\mathcal{X} = \mathbb{R}^d$. Define a linear function $f(x; w) := w^T x$ for some $w \in \mathbb{R}^d$ and $x \in \mathcal{X}$. Given training set $S$, we suppose that the parameters of the linear function are obtained via gradient descent on the following $L_2$-regularized problem:

$$\mathcal{L}_S(w; \lambda) := \sum_{i=1}^n (w^T x_i - y_i)^2 + \lambda \|w\|_2^2, \quad (6)$$

where $\lambda \ge 0$ is a regularization parameter. Our choice to analyze squared loss minimization for linear networks is motivated in part by its analytical convenience, and follows recent theoretical work which analyzes neural networks trained via squared loss minimization in the Neural Tangent Kernel (NTK) regime, where they are well approximated by linear networks (Jacot et al., 2018; Arora et al., 2019; Du et al., 2019; Hu et al., 2019). Moreover, recent research suggests that for classification tasks, squared loss minimization performs comparably to cross-entropy loss minimization (Muthukumar et al., 2020; Hui & Belkin, 2020).

For a given training set $S$, we use $S^{(i)}$ to denote the training set $S$ with the $i$th point removed. We now introduce one stability condition:

Condition 1 (Hypothesis Stability). We have $\beta$ hypothesis stability if our training algorithm $\mathcal{A}$ satisfies the following for all $i \in \{1, 2, \ldots, n\}$:

$$\mathbb{E}_{S, (x,y)\sim\mathcal{D}}\left[\left|\mathcal{E}(f(x), y) - \mathcal{E}(f^{(i)}(x), y)\right|\right] \le \frac{\beta}{n},$$

where $f^{(i)} := f(\mathcal{A}, S^{(i)})$ and $f := f(\mathcal{A}, S)$.

This condition is similar to a notion of stability called hypothesis stability (Bousquet & Elisseeff, 2002; Kearns & Ron, 1999; Elisseeff et al., 2003). Intuitively, Condition 1 states that the empirical leave-one-out error and the average population error of leave-one-out classifiers are close. This condition is mild and does not guarantee generalization. We discuss the implications in more detail in App. B.3.

Now we present the main result of this section. As before, we assume access to a clean dataset $S = \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$ and a randomly labeled dataset $\widetilde{S} = \{(x_i, y_i)\}_{i=n+1}^{n+m} \sim \widetilde{\mathcal{D}}^m$. Let $X = [x_1, x_2, \cdots, x_{m+n}]$ and $y = [y_1, y_2, \cdots, y_{m+n}]$. Fix a positive learning rate $\eta \le 1/\left(\left\|X^T X\right\|_{\mathrm{op}} + \lambda\right)$ and an initialization $w_0 = 0$. Consider the following gradient descent iterates to minimize objective (6) on $S \cup \widetilde{S}$:

$$w_t = w_{t-1} - \eta \nabla_w \mathcal{L}_{S \cup \widetilde{S}}(w_{t-1}; \lambda) \quad \forall t = 1, 2, \ldots. \quad (7)$$

Then the iterates $\{w_t\}$ converge to the limiting solution $\hat{w} = \left(X^T X + \lambda I\right)^{-1} X^T y$. Define $\hat{f}(x) := f(x; \hat{w})$.
Theorem 4. Assume that this gradient descent algorithm satisfies Condition 1 with $\beta = O(1)$. Then for any $\delta > 0$, with probability at least $1 - \delta$ over the random draws of datasets $\widetilde{S}$ and $S$, we have:

$$\mathcal{E}_{\mathcal{D}}(\hat{f}) \le \mathcal{E}_S(\hat{f}) + 1 - 2\mathcal{E}_{\widetilde{S}}(\hat{f}) + \sqrt{\frac{4}{\delta}\left(\frac{1}{m} + \frac{3\beta}{m+n}\right)} + \left(\sqrt{2\mathcal{E}_{\widetilde{S}}(\hat{f})} + 1 + \frac{m}{2n}\right)\sqrt{\frac{\log(4/\delta)}{m}}. \quad (8)$$
With a mild regularity condition, we establish the same bound on GD training with squared loss, notably with the same dominating term on the population error as in Theorem 1. In App. B.2, we present the extension to multiclass classification, where we again obtain a result parallel to Theorem 3.
Proof Sketch. Because squared loss minimization does not imply 0-1 error minimization, we cannot use arguments from Lemma 1. This is the main technical difficulty. To compare the 0-1 error at a training point with that at an unseen point, we use the closed-form expression for $\hat{w}$. We show that the training error on mislabeled points is less than the population error on the distribution of mislabeled data (parallel to Lemma 1).

For a mislabeled training point $(x_i, y_i)$ in $\widetilde{S}$, we show that

$$\mathbb{I}\left[y_i x_i^T \hat{w} \le 0\right] \le \mathbb{I}\left[y_i x_i^T \hat{w}^{(i)} \le 0\right], \quad (9)$$

where $\hat{w}^{(i)}$ is the classifier obtained by leaving out the $i$th point from the training set. Intuitively, this condition states that the training error at a training point is less than the leave-one-out error at that point, i.e., the error obtained by removing that point and re-training. Using Condition 1, we then relate the average leave-one-out error (over the index $i$ of the RHS in (9)) to the population error on the mislabeled distribution to obtain an inequality similar to (3).
Extensions to kernel regression. Since the result in Theorem 4 does not impose any regularity conditions on the underlying distribution over $\mathcal{X} \times \mathcal{Y}$, our guarantees extend straightforwardly to kernel regression by using the transformation $x \to \phi(x)$ for some feature transform function $\phi$. Furthermore, recent literature has pointed out a concrete connection between neural networks and kernel regression with the so-called Neural Tangent Kernel (NTK), which holds in a certain regime where weights do not change much during training (Jacot et al., 2018; Du et al., 2019; 2018; Chizat et al., 2019). Using this concrete correspondence, our bounds on the clean population error (Theorem 4) extend to wide neural networks operating in the NTK regime.
Extensions to early stopped GD. Often in practice, gradient descent is stopped early. We now provide theoretical evidence that our guarantees may continue to hold for an early stopped GD iterate. Concretely, we show that in expectation, the outputs of the GD iterates are close to those of a problem with data-independent regularization (as considered in Theorem 2). First, we introduce some notation. By $\mathcal{L}_S(w)$, we denote the objective in (6) with $\lambda = 0$. Consider the GD iterates defined in (7). Let $\widetilde{w}_\lambda = \operatorname{argmin}_w \mathcal{L}_S(w; \lambda)$. Define $f_t(x) := f(x; w_t)$ as the solution at the $t$th iterate and $\widetilde{f}_\lambda(x) := f(x; \widetilde{w}_\lambda)$ as the regularized solution. Let $\kappa$ be the condition number of the population covariance matrix and let $s_{\min}$ be the minimum positive singular value of the empirical covariance matrix.

Proposition 2 (informal). For $\lambda = \frac{1}{t\eta}$, we have

$$\mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\left[\left(f_t(x) - \widetilde{f}_\lambda(x)\right)^2\right] \le c(t, \eta) \cdot \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\left[f_t(x)^2\right],$$

where $c(t, \eta) \approx \kappa \cdot \min\left(0.25, \frac{1}{s_{\min}^2 t^2 \eta^2}\right)$. An equivalent guarantee holds for a point $x$ sampled from the training data.

The proposition above states that for large enough $t$, GD iterates stay close to a regularized solution with a data-independent regularization constant. Together with our guarantees in Theorem 4 for the regularized solution with $\lambda = \frac{1}{t\eta}$, Proposition 2 shows that our guarantees with RATT may hold for early stopped GD. See the formal result in App. B.4.
Remark. Proposition 2 only bounds the expected squared difference between the $t$th gradient descent iterate and a corresponding regularized solution. The expected squared difference and the expected difference of classification errors (what we wish to bound) are not related, in general. However, they can be related under standard low-noise (margin) assumptions. For instance, under the Tsybakov noise condition (Tsybakov et al., 1997; Yao et al., 2007), we can lower-bound the expression on the LHS of Proposition 2 with the difference of expected classification error.
Extensions to deep learning. Note that the main lemma underlying our bound on (clean) population error states that when training on a mixture of clean and randomly labeled data, we obtain a classifier whose empirical error on the mislabeled training data is lower than its population error on the distribution of mislabeled data. We prove this for ERM on 0-1 loss (Lemma 1). For linear models (and networks in the NTK regime), we obtained this result by assuming hypothesis stability and relating the training error at a datum to the leave-one-out error (Theorem 4). However, to extend our bound to deep models we must assume that training on the mixture of random and clean data leads to overfitting on the random mixture. Formally:

Assumption 1. Let $\hat{f}$ be a model obtained by training with an algorithm $\mathcal{A}$ on a mixture of clean data $S$ and randomly labeled data $\widetilde{S}$. Then with probability $1 - \delta$ over the random draws of mislabeled data $\widetilde{S}_M$, we assume that the following condition holds:

$$\mathcal{E}_{\widetilde{S}_M}(\hat{f}) \le \mathcal{E}_{\mathcal{D}'}(\hat{f}) + c\sqrt{\frac{\log(1/\delta)}{2m}},$$

for a fixed constant $c > 0$.
Under Assumption 1, our results in Theorems 1, 2 and 3 extend beyond ERM with the 0-1 loss to general learning algorithms. We include the formal result in App. B.5. Note that given the ability of neural networks to interpolate the data, this assumption seems uncontroversial in the later stages of training. Moreover, concerning the early phases of training, recent research has shown that learning dynamics for complex deep networks resemble those for linear models (Nakkiran et al., 2019; Hu et al., 2020), much like the wide neural networks that we do analyze. Together, these arguments help to justify Assumption 1 and hence the applicability of our bound in deep learning. Motivated by our analysis of linear models trained with gradient descent, we discuss conditions in App. B.6 which imply Assumption 1 for constant values $\delta > 0$. In the next section, we empirically demonstrate the applicability of our bounds for deep models.
5. Empirical Study and Implications

Having established our framework theoretically, we now demonstrate its utility experimentally. First, for linear models and wide networks in the NTK regime where our guarantee holds, we confirm that our bound is not only valid, but closely tracks the generalization error. Next, we show that in practical deep learning settings, optimizing cross-entropy loss by SGD, the expression for our (0-1) ERM bound nevertheless tracks test performance closely and, in numerous experiments on diverse models and datasets, is never violated empirically.
Datasets. To verify our results on linear models, we consider a toy dataset, where the class-conditional distribution $p(x|y)$ for each label is Gaussian. For binary tasks, we use binarized CIFAR-10 (first 5 classes vs. rest) (Krizhevsky & Hinton, 2009), binary MNIST (0-4 vs. 5-9) (LeCun et al., 1998) and the IMDb sentiment analysis dataset (Maas et al., 2011). For the multiclass setup, we use MNIST and CIFAR-10.
[Figure 2: plots omitted in transcript. Three panels of accuracy vs. the predicted bound: (a) underparameterized model (MSE and CE), (b) MNIST (SGD, early stop, weight decay), (c) ELMo-LSTM and BERT.]

Figure 2. We plot the accuracy and corresponding bound (RHS in (2)) at $\delta = 0.1$ for binary classification tasks. Results aggregated over 3 seeds. (a) Accuracy vs. fraction of unlabeled data (w.r.t. clean data) in the toy setup with a linear model trained with GD. (b) Accuracy vs. fraction of unlabeled data for a 2-layer wide network trained with SGD on binary MNIST. With SGD and no regularization (red curve in (b)), we interpolate the training data and hence the predicted lower bound is 0. However, with early stopping (or weight decay) we obtain tight guarantees. (c) Accuracy vs. gradient iteration on the IMDb dataset with the unlabeled fraction fixed at 0.2. In plot (c), '*' denotes the best test accuracy with the same hyperparameters and training only on clean data. See App. C for exact hyperparameter values.
Architectures. To simulate the NTK regime, we experiment with 2-layer wide networks both (i) with the second layer fixed at random initialization; and (ii) updating both layers' weights. For vision datasets (e.g., MNIST and CIFAR10), we consider (fully connected) multilayer perceptrons (MLPs) with ReLU activations and ResNet18 (He et al., 2016). For the IMDb dataset, we train Long Short-Term Memory networks (LSTMs; Hochreiter & Schmidhuber (1997)) with ELMo embeddings (Peters et al., 2018) and fine-tune an off-the-shelf uncased BERT model (Devlin et al., 2018; Wolf et al., 2020).
Methodology. To bound the population error, we require access to both clean and unlabeled data. For toy datasets, we obtain unlabeled data by sampling from the underlying distribution over $\mathcal{X}$. For image and text datasets, we hold out a small fraction of the clean training data and discard their labels to simulate unlabeled data. We use the random labeling procedure described in Sec. 2. After augmenting clean training data with randomly labeled data, we train in the standard fashion. See App. C for experimental details.
Underparameterized linear models. On toy Gaussian data, we train linear models with GD to minimize cross-entropy loss and mean squared error. Varying the fraction of randomly labeled data, we observe that the accuracy on clean unseen data is barely impacted (Fig. 2(a)). This highlights that in low dimensional models, adding randomly labeled data to the clean dataset (in the toy setup) has minimal effect on the performance on unseen clean data. Moreover, we find that RATT offers a tight lower bound on the unseen clean data accuracy. We observe the same behavior with Stochastic Gradient Descent (SGD) training (ref. App. C). Observe that the predicted bound goes up as the fraction of unlabeled data increases. While the accuracy as dictated by the dominating term in the RHS of (2) decreases with an increase in the fraction of unlabeled data, we observe a relatively sharper decrease in the $O(1/\sqrt{m})$ term of the bound, leading to an overall increase in the predicted accuracy bound. In this toy setup, we also evaluated a kernel regression bound from Bartlett & Mendelson (2002) (Theorem 21); however, the predicted kernel regression bound remains vacuous.
Wide Nets Next, we consider MNIST binary classification with a wide 2-layer fully connected network. In experiments with SGD training on MSE loss without early stopping or weight decay regularization, we find that adding extra randomly labeled data hurts the unseen clean performance (Fig. 2(b)). Additionally, due to the perfect fit on the training data, our bound is rendered vacuous. However, with early stopping (or weight decay), we observe close to zero performance difference with additional randomly labeled data. Alongside, we obtain tight bounds on the accuracy on unseen clean data, paying only a small (often negligible) price for incorporating randomly labeled data. Similar results hold for SGD and GD and when cross-entropy loss is substituted for MSE (ref. App. C).
Deep Nets We verify our findings on (i) ResNet-18 and 5-layer MLPs trained on binary CIFAR (Fig. 1); and (ii) ELMo-LSTM and BERT-Base models fine-tuned on the IMDb dataset (Fig. 2(c)). See App. C for additional results with deep models on binary MNIST. We fix the amount of unlabeled data at 20% of the clean dataset size and train all models with standard hyperparameters. Consistently, we find that our predicted bounds are never violated in practice. As training proceeds, the fit on the mislabeled data increases, with perfect overfitting in the interpolation regime rendering our bounds vacuous. However, with early stopping, our bound predicts test performance closely. For example, on the IMDb dataset with BERT fine-tuning, we predict 79.8 as the accuracy of the classifier, when the true
Leveraging Unlabeled Data to Guarantee Generalization
Dataset   Model    Pred. Acc.   Test Acc.   Best Acc.
MNIST     MLP      93.1         97.4        97.9
MNIST     ResNet   96.8         98.8        98.9
CIFAR10   MLP      48.4         54.2        60.0
CIFAR10   ResNet   76.4         88.9        92.3
Table 1. Results on multiclass classification tasks. By pred. acc. we refer to the dominating term in the RHS of (5). At the given sample size and δ = 0.1, the remaining term evaluates to 30.7, decreasing our predicted accuracy by the same. Test acc. denotes the corresponding accuracy on unseen clean data. Best acc. is the best achievable accuracy when training on just the clean data (with the same hyperparameters except the stopping point). Note that across all tasks our predicted bound is tight and the gap between the best accuracy and test accuracy is small. Exact hyperparameters are included in App. C.
performance is 88.04 (and the best achievable performance on unseen data is 92.45). Additionally, we observe that our method tracks the performance from the beginning of training, not just towards the end.
Finally, we verify our multiclass bound on MNIST and CIFAR10 with deep MLPs and ResNets (see results in Table 1 and per-epoch curves in App. C). As before, we fix the amount of unlabeled data at 20% of the clean dataset and minimize cross-entropy loss via SGD. In all four settings, our bound predicts non-vacuous performance on unseen data. In App. C, we investigate our approach on CIFAR100, showing that even though our bound grows pessimistic with greater numbers of classes, the error on the mislabeled data nevertheless tracks population accuracy.
6. Discussion and Connections to Prior Work
Implicit bias in deep learning Several recent lines of research attempt to explain the generalization of neural networks despite massive overparameterization via the implicit bias of gradient descent (Soudry et al., 2018; Gunasekar et al., 2018a;b; Ji & Telgarsky, 2019; Chizat & Bach, 2020). Noting that even for overparameterized linear models, there exist multiple parameters capable of overfitting the training data (with arbitrarily low loss), of which some generalize well and others do not, they seek to characterize the favored solution. Notably, Soudry et al. (2018) find that for linear networks, gradient descent converges (slowly) to the max-margin solution. A complementary line of work focuses on the early phases of training, finding both empirically (Rolnick et al., 2017; Arpit et al., 2017) and theoretically (Arora et al., 2019; Li et al., 2020; Liu et al., 2020) that even in the presence of a small amount of mislabeled data, gradient descent is biased to fit the clean data first during the initial phases of training. However, to the best of our knowledge, no prior work leverages this phenomenon to obtain generalization guarantees on the clean data, which is the primary
focus of our work. Our method exploits this phenomenon to produce non-vacuous generalization bounds. Even when we cannot prove a priori that models will fit the clean data well while performing badly on the mislabeled data, we can observe that it indeed happens (often in practice), and thus, a posteriori, provide tight bounds on the population error. Moreover, by using regularizers like early stopping or weight decay, we can accentuate this phenomenon, enabling our framework to provide even tighter guarantees.
Generalization bounds Conventionally, generalization in machine learning has been studied through the lens of uniform convergence bounds (Blumer et al., 1989; Vapnik, 1999). Representative works on understanding generalization in overparameterized networks within this framework include Neyshabur et al. (2015; 2017b;a; 2018); Dziugaite & Roy (2017); Bartlett et al. (2017); Arora et al. (2018); Li & Liang (2018); Zhou et al. (2018); Allen-Zhu et al. (2019a); Nagarajan & Kolter (2019a). However, uniform convergence based bounds typically remain numerically loose relative to the true generalization error. Several works have also questioned the ability of uniform convergence based approaches to explain generalization in overparameterized models (Zhang et al., 2016; Nagarajan & Kolter, 2019b). Subsequently, recent works have proposed unconventional ways to derive generalization bounds (Negrea et al., 2020; Zhou et al., 2020). In a similar spirit, we depart from complexity-based approaches to generalization bounds in our work. In particular, we leverage unlabeled data to derive a post-hoc generalization bound. Our work provides guarantees on overparameterized networks by using early stopping or weight decay regularization, preventing a perfect fit on the training data. Notably, in our framework, models can perfectly fit the clean portion of the data, so long as they nevertheless fit the mislabeled data poorly.
Leveraging noisy data to provide generalization guarantees In parallel work, Bansal et al. (2020) presented an upper bound on the generalization gap of linear classifiers trained on representations learned via self-supervision. Under certain noise-robustness and rationality assumptions on the training procedure, the authors obtained bounds dependent on the complexity of the linear classifier and independent of the complexity of the representations. By contrast, we present generalization bounds for supervised learning that are non-vacuous by virtue of the early learning phenomenon. While both frameworks highlight how robustness to random label corruptions can be leveraged to obtain bounds that do not depend directly on the complexity of the underlying hypothesis class, our framework, methodology, claims, and generalization results are very different from theirs.
Other related work. A long line of work relates early-stopped GD to a corresponding regularized solution (Friedman & Popescu, 2003; Yao et al., 2007; Suggala et al., 2018; Ali et al., 2018; Neu & Rosasco, 2018; Ali et al., 2020). In the most relevant work, Ali et al. (2018) and Suggala et al. (2018) address a regression task, theoretically relating the solutions of early-stopped GD and a regularized problem obtained with a data-independent regularization coefficient. Towards understanding generalization, numerous stability conditions have been discussed (Kearns & Ron, 1999; Bousquet & Elisseeff, 2002; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010). Hardt et al. (2016) study the uniform stability property to obtain generalization guarantees with early-stopped SGD. While we assume a benign stability condition to relate leave-one-out performance with population error, we do not rely on any stability condition that implies generalization.
7. Conclusion and Future work
Our work introduces a new approach for obtaining generalization bounds that do not directly depend on the underlying complexity of the model class. For linear models, we provably obtain a bound in terms of the fit on randomly labeled data added during training. Our findings raise a number of questions to be explored next. While our empirical findings and theoretical results with 0-1 loss hold absent further assumptions and shed light on why the bound may apply to more general models, we hope to extend our proof that overfitting (in terms of classification error) to the finite sample of mislabeled data occurs with SGD training to broader classes of models and loss functions. We hope to build on some early results (Nakkiran et al., 2019; Hu et al., 2020) which provide evidence that deep models behave like linear models in the early phases of training. We also wish to extend our framework to the interpolation regime. Since many important aspects of neural network learning take place within the early epochs (Achille et al., 2017; Frankle et al., 2020), including gradient dynamics converging to a very small subspace (Gur-Ari et al., 2018), we might imagine operationalizing our bounds in the interpolation regime by discarding the randomly labeled data after the initial stages of training.
Acknowledgements
SG thanks Divyansh Kaushik for help with NLP code. This material is based on research sponsored by the Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory, DARPA, or the U.S. Government. SB acknowledges funding from the NSF grants DMS-1713003 and CIF-1763734, as well as Amazon AI and a Google Research Scholar Award. ZL acknowledges Amazon AI, Salesforce Research, Facebook, UPMC, Abridge, the PwC Center, the Block Center, the Center for Machine Learning and Health, and the CMU Software Engineering Institute (SEI), for their generous support of ACMI Lab's research on machine learning under distribution shift.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, 2016.
Abou-Moustafa, K. and Szepesvari, C. An exponential Efron-Stein inequality for Lq stable learning rules. arXiv preprint arXiv:1903.05457, 2019.
Achille, A., Rovere, M., and Soatto, S. Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856, 2017.
Ali, A., Kolter, J. Z., and Tibshirani, R. J. A continuous-time view of early stopping for least squares. arXiv preprint arXiv:1810.10082, 2018.
Ali, A., Dobriban, E., and Tibshirani, R. J. The implicit regularization of stochastic gradient flow for least squares. arXiv preprint arXiv:2003.07802, 2020.
Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pp. 6158–6169, 2019a.
Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019b.
Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 233–242. JMLR.org, 2017.
Bansal, Y., Kaplun, G., and Barak, B. For self-supervised learning, rationality implies generalization, provably. arXiv preprint arXiv:2010.08508, 2020.
Bardenet, R., Maillard, O.-A., et al. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361–1385, 2015.
Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249, 2017.
Blum, A. and Hardt, M. The ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning, pp. 1006–1014. PMLR, 2015.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
Chizat, L. and Bach, F. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pp. 1305–1338. PMLR, 2020.
Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pp. 2937–2947, 2019.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685. PMLR, 2019.
Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. L. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 117–126, 2015.
Dziugaite, G. K. and Roy, D. M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
Elisseeff, A., Pontil, M., et al. Leave-one-out error and stability of learning algorithms with applications. NATO Science Series Sub Series III: Computer and Systems Sciences, 190:111–130, 2003.
Frankle, J., Schwab, D. J., and Morcos, A. S. The early phase of neural network training. arXiv preprint arXiv:2002.10365, 2020.
Friedman, J. and Popescu, B. E. Gradient directed regularization for linear regression and classification. Technical report, Statistics Department, Stanford University, 2003.
Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018a.
Gunasekar, S., Woodworth, B., Bhojanapalli, S., Neyshabur, B., and Srebro, N. Implicit regularization in matrix factorization. In 2018 Information Theory and Applications Workshop (ITA), pp. 1–10. IEEE, 2018b.
Gur-Ari, G., Roberts, D. A., and Dyer, E. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234. PMLR, 2016.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Hoeffding, W. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pp. 409–426. Springer, 1994.
Hu, W., Li, Z., and Yu, D. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. arXiv preprint arXiv:1905.11368, 2019.
Hu, W., Xiao, L., Adlam, B., and Pennington, J. The surprising simplicity of the early-time learning dynamics of neural networks. arXiv preprint arXiv:2006.14599, 2020.
Hui, L. and Belkin, M. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. arXiv preprint arXiv:2006.07322, 2020.
Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.
Ji, Z. and Telgarsky, M. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pp. 1772–1798. PMLR, 2019.
Kearns, M. and Ron, D. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 1998.
Li, M., Soltanolkotabi, M., and Oymak, S. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. arXiv preprint arXiv:1903.11680, 2019.
Li, M., Soltanolkotabi, M., and Oymak, S. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 4313–4324. PMLR, 2020.
Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166, 2018.
Liu, S., Niles-Weed, J., Razavian, N., and Fernandez-Granda, C. Early-learning regularization prevents memorization of noisy labels. arXiv preprint arXiv:2007.00151, 2020.
Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, 2011.
McDiarmid, C. On the method of bounded differences, pp. 148–188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989.
Mukherjee, S., Niyogi, P., Poggio, T., and Rifkin, R. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1):161–193, 2006.
Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
Muthukumar, V., Narang, A., Subramanian, V., Belkin, M., Hsu, D., and Sahai, A. Classification vs regression in overparameterized regimes: Does the loss function matter? arXiv preprint arXiv:2005.08054, 2020.
Nagarajan, V. and Kolter, J. Z. Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. arXiv preprint arXiv:1905.13344, 2019a.
Nagarajan, V. and Kolter, J. Z. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 11615–11626, 2019b.
Nakkiran, P., Kaplun, G., Kalimeris, D., Yang, T., Edelman, B. L., Zhang, F., and Barak, B. SGD on neural networks learns functions of increasing complexity. arXiv preprint arXiv:1905.11604, 2019.
Negrea, J., Dziugaite, G. K., and Roy, D. In defense of uniform convergence: Generalization via derandomization with an application to interpolating predictors. In International Conference on Machine Learning, pp. 7263–7272. PMLR, 2020.
Neu, G. and Rosasco, L. Iterate averaging as regularization for stochastic gradient descent. In Conference On Learning Theory, pp. 3222–3242. PMLR, 2018.
Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Conference on Learning Theory, pp. 1376–1401, 2015.
Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017a.
Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017b.
Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 2019.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proc. of NAACL, 2018.
Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019.
Rolnick, D., Veit, A., Belongie, S., and Shavit, N. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.
Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
Sherman, J. and Morrison, W. J. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
Suggala, A., Prasad, A., and Ravikumar, P. K. Connecting optimization and regularization paths. In Advances in Neural Information Processing Systems, pp. 10608–10619, 2018.
Tsybakov, A. B. et al. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948–969, 1997.
Vapnik, V. N. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, 2020.
Yao, Y., Rosasco, L., and Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Zhou, L., Sutherland, D. J., and Srebro, N. On uniform convergence and low-norm interpolation learning. arXiv preprint arXiv:2006.05942, 2020.
Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018.
Supplementary Material
Throughout this discussion, we will make frequent use of the following standard results concerning the exponential concentration of random variables:
Lemma 4 (Hoeffding's inequality for independent RVs (Hoeffding, 1994)). Let $Z_1, Z_2, \ldots, Z_n$ be independent bounded random variables with $Z_i \in [a, b]$ for all $i$. Then
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\left(Z_i - \mathbb{E}[Z_i]\right) \ge t\right) \le \exp\left(\frac{-2nt^2}{(b-a)^2}\right)$$
and
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\left(Z_i - \mathbb{E}[Z_i]\right) \le -t\right) \le \exp\left(\frac{-2nt^2}{(b-a)^2}\right)$$
for all $t \ge 0$.
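As a quick sanity check of Lemma 4, the tail bound can be compared against a Monte Carlo estimate for fair coin flips (bounded in $[0,1]$ with mean $1/2$). The function names below are illustrative:

```python
import math
import random

def hoeffding_tail(n, t, a=0.0, b=1.0):
    # one-sided tail bound from Lemma 4: exp(-2 n t^2 / (b - a)^2)
    return math.exp(-2.0 * n * t ** 2 / (b - a) ** 2)

def empirical_tail(n, t, trials=5000, seed=0):
    # fraction of trials where the centered mean of n fair coin flips exceeds t
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        exceed += (mean - 0.5) >= t
    return exceed / trials

n, t = 100, 0.1
assert empirical_tail(n, t) <= hoeffding_tail(n, t)  # the bound holds (loosely here)
```

For these parameters the bound evaluates to $e^{-2} \approx 0.135$, comfortably above the empirical tail frequency.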
Lemma 5 (Hoeffding's inequality for sampling without replacement (Hoeffding, 1994)). Let $\mathcal{Z} = (Z_1, Z_2, \ldots, Z_N)$ be a finite population of $N$ points with $Z_i \in [a, b]$ for all $i$. Let $X_1, X_2, \ldots, X_n$ be a random sample drawn without replacement from $\mathcal{Z}$. Then for all $t \ge 0$, we have
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right) \ge t\right) \le \exp\left(\frac{-2nt^2}{(b-a)^2}\right)$$
and
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right) \le -t\right) \le \exp\left(\frac{-2nt^2}{(b-a)^2}\right)\,,$$
where $\mu = \frac{1}{N}\sum_{i=1}^{N} Z_i$.
We now discuss one condition that generalizes the exponential concentration to dependent random variables.
Condition 2 (Bounded difference inequality). Let $\mathcal{Z}$ be some set and $\phi : \mathcal{Z}^n \to \mathbb{R}$. We say that $\phi$ satisfies the bounded difference assumption if there exist $c_1, c_2, \ldots, c_n \ge 0$ s.t. for all $i$, we have
$$\sup_{Z_1, Z_2, \ldots, Z_n, Z_i' \in \mathcal{Z}^{n+1}} \left|\phi(Z_1, \ldots, Z_i, \ldots, Z_n) - \phi(Z_1, \ldots, Z_i', \ldots, Z_n)\right| \le c_i\,.$$
Lemma 6 (McDiarmid's inequality (McDiarmid, 1989)). Let $Z_1, Z_2, \ldots, Z_n$ be independent random variables on a set $\mathcal{Z}$ and let $\phi : \mathcal{Z}^n \to \mathbb{R}$ satisfy the bounded difference inequality (Condition 2). Then for all $t > 0$, we have
$$\mathbb{P}\left(\phi(Z_1, Z_2, \ldots, Z_n) - \mathbb{E}\left[\phi(Z_1, Z_2, \ldots, Z_n)\right] \ge t\right) \le \exp\left(\frac{-2t^2}{\sum_{i=1}^{n} c_i^2}\right)$$
and
$$\mathbb{P}\left(\phi(Z_1, Z_2, \ldots, Z_n) - \mathbb{E}\left[\phi(Z_1, Z_2, \ldots, Z_n)\right] \le -t\right) \le \exp\left(\frac{-2t^2}{\sum_{i=1}^{n} c_i^2}\right)\,.$$
A. Proofs from Sec. 3
Additional notation Let $m_1$ be the number of mislabeled points ($\widetilde{S}_M$) and $m_2$ be the number of correctly labeled points ($\widetilde{S}_C$). Note $m_1 + m_2 = m$.
A.1. Proof of Theorem 1

Proof of Lemma 1. The main idea of our proof is to regard the clean portion of the data ($S \cup \widetilde{S}_C$) as fixed. Then, there exists an (unknown) classifier $f^\star$ that minimizes the expected risk calculated on the (fixed) clean data and (random draws of) the mislabeled data $\widetilde{S}_M$. Formally,
$$f^\star := \operatorname*{argmin}_{f \in \mathcal{F}} \; \mathcal{E}_{\widetilde{\mathcal{D}}}(f)\,, \quad (10)$$
where
$$\widetilde{\mathcal{D}} = \frac{n}{m+n}\, S + \frac{m_2}{m+n}\, \widetilde{S}_C + \frac{m_1}{m+n}\, \mathcal{D}'\,.$$
Note here that $\widetilde{\mathcal{D}}$ is a combination of the empirical distribution over the correctly labeled data $S \cup \widetilde{S}_C$ and the (population) distribution over mislabeled data $\mathcal{D}'$. Recall that
$$\widehat{f} := \operatorname*{argmin}_{f \in \mathcal{F}} \; \mathcal{E}_{S \cup \widetilde{S}}(f)\,. \quad (11)$$
Since $\widehat{f}$ minimizes the 0-1 error on $S \cup \widetilde{S}$, using ERM optimality on (11), we have
$$\mathcal{E}_{S \cup \widetilde{S}}(\widehat{f}) \le \mathcal{E}_{S \cup \widetilde{S}}(f^\star)\,. \quad (12)$$
Moreover, since $f^\star$ is independent of $\widetilde{S}_M$, using Hoeffding's bound, we have with probability at least $1 - \delta$ that
$$\mathcal{E}_{\widetilde{S}_M}(f^\star) \le \mathcal{E}_{\mathcal{D}'}(f^\star) + \sqrt{\frac{\log(1/\delta)}{2m_1}}\,. \quad (13)$$
Finally, since $f^\star$ is the optimal classifier on $\widetilde{\mathcal{D}}$, we have
$$\mathcal{E}_{\widetilde{\mathcal{D}}}(f^\star) \le \mathcal{E}_{\widetilde{\mathcal{D}}}(\widehat{f})\,. \quad (14)$$
Now, to relate (12) and (14), we multiply (13) by $\frac{m_1}{m+n}$ and add $\frac{n}{m+n}\mathcal{E}_S(f^\star) + \frac{m_2}{m+n}\mathcal{E}_{\widetilde{S}_C}(f^\star)$ to both sides. Hence, we can rewrite (13) as follows:
$$\mathcal{E}_{S \cup \widetilde{S}}(f^\star) \le \mathcal{E}_{\widetilde{\mathcal{D}}}(f^\star) + \frac{m_1}{m+n}\sqrt{\frac{\log(1/\delta)}{2m_1}}\,. \quad (15)$$
Now we combine equations (12), (15), and (14) to get
$$\mathcal{E}_{S \cup \widetilde{S}}(\widehat{f}) \le \mathcal{E}_{\widetilde{\mathcal{D}}}(\widehat{f}) + \frac{m_1}{m+n}\sqrt{\frac{\log(1/\delta)}{2m_1}}\,, \quad (16)$$
which implies
$$\mathcal{E}_{\widetilde{S}_M}(\widehat{f}) \le \mathcal{E}_{\mathcal{D}'}(\widehat{f}) + \sqrt{\frac{\log(1/\delta)}{2m_1}}\,. \quad (17)$$
Since $\widetilde{S}$ is obtained by randomly labeling an unlabeled dataset, we assume $2m_1 \approx m$.³ Moreover, using $\mathcal{E}_{\mathcal{D}'} = 1 - \mathcal{E}_{\mathcal{D}}$ we obtain the desired result.
Proof of Lemma 2. Recall $\mathcal{E}_{\widetilde{S}}(f) = \frac{m_1}{m}\mathcal{E}_{\widetilde{S}_M}(f) + \frac{m_2}{m}\mathcal{E}_{\widetilde{S}_C}(f)$. Hence, we have
$$2\mathcal{E}_{\widetilde{S}}(f) - \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_C}(f) = \left(\frac{2m_1}{m}\mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_M}(f)\right) + \left(\frac{2m_2}{m}\mathcal{E}_{\widetilde{S}_C}(f) - \mathcal{E}_{\widetilde{S}_C}(f)\right) \quad (18)$$
$$= \left(\frac{2m_1}{m} - 1\right)\mathcal{E}_{\widetilde{S}_M}(f) + \left(\frac{2m_2}{m} - 1\right)\mathcal{E}_{\widetilde{S}_C}(f)\,. \quad (19)$$
³Formally, with probability at least $1 - \delta$, we have $(m - 2m_1) \le \sqrt{m \log(1/\delta)/2}$.
Since the dataset is labeled uniformly at random, with probability at least $1 - \delta$, we have $\left(\frac{2m_1}{m} - 1\right) \le \sqrt{\frac{\log(1/\delta)}{2m}}$. Similarly, with probability at least $1 - \delta$, we have $\left(\frac{2m_2}{m} - 1\right) \le \sqrt{\frac{\log(1/\delta)}{2m}}$. Using the union bound, with probability at least $1 - \delta$, we have
$$2\mathcal{E}_{\widetilde{S}}(f) - \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_C}(f) \le \sqrt{\frac{\log(2/\delta)}{2m}}\left(\mathcal{E}_{\widetilde{S}_M}(f) + \mathcal{E}_{\widetilde{S}_C}(f)\right)\,. \quad (20)$$
Rearranging $\mathcal{E}_{\widetilde{S}_M}(f) + \mathcal{E}_{\widetilde{S}_C}(f)$ and using the inequality $1 - a \le \frac{1}{1+a}$, we have
$$2\mathcal{E}_{\widetilde{S}}(f) - \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_C}(f) \le 2\mathcal{E}_{\widetilde{S}}(f)\sqrt{\frac{\log(2/\delta)}{2m}}\,. \quad (21)$$
Proof of Lemma 3. In the set of correctly labeled points $S \cup \widetilde{S}_C$, we have $S$ as a random subset of $S \cup \widetilde{S}_C$. Hence, using Hoeffding's inequality for sampling without replacement (Lemma 5), we have with probability at least $1 - \delta$
$$\mathcal{E}_{\widetilde{S}_C}(\widehat{f}) - \mathcal{E}_{S \cup \widetilde{S}_C}(\widehat{f}) \le \sqrt{\frac{\log(1/\delta)}{2m_2}}\,. \quad (22)$$
Rewriting $\mathcal{E}_{S \cup \widetilde{S}_C}(\widehat{f})$ as $\frac{m_2}{m_2+n}\mathcal{E}_{\widetilde{S}_C}(\widehat{f}) + \frac{n}{m_2+n}\mathcal{E}_{S}(\widehat{f})$, we have with probability at least $1 - \delta$
$$\left(\frac{n}{n+m_2}\right)\left(\mathcal{E}_{\widetilde{S}_C}(\widehat{f}) - \mathcal{E}_{S}(\widehat{f})\right) \le \sqrt{\frac{\log(1/\delta)}{2m_2}}\,. \quad (23)$$
As before, assuming $2m_2 \approx m$, we have with probability at least $1 - \delta$
$$\mathcal{E}_{\widetilde{S}_C}(\widehat{f}) - \mathcal{E}_{S}(\widehat{f}) \le \left(1 + \frac{m_2}{n}\right)\sqrt{\frac{\log(1/\delta)}{m}} \le \left(1 + \frac{m}{2n}\right)\sqrt{\frac{\log(1/\delta)}{m}}\,. \quad (24)$$
Proof of Theorem 1. Having established these core intermediate results, we can now combine the above three lemmas to prove the main result. In particular, we bound the population error on clean data ($\mathcal{E}_{\mathcal{D}}(\widehat{f})$) as follows:

(i) First, use (17) to obtain an upper bound on the population error on clean data, i.e., with probability at least $1 - \delta/4$, we have
$$\mathcal{E}_{\mathcal{D}}(\widehat{f}) \le 1 - \mathcal{E}_{\widetilde{S}_M}(\widehat{f}) + \sqrt{\frac{\log(4/\delta)}{m}}\,. \quad (25)$$
(ii) Second, use (21) to relate the error on the mislabeled fraction with the error on the clean portion of the randomly labeled data and the error on the whole randomly labeled dataset, i.e., with probability at least $1 - \delta/2$, we have
$$-\mathcal{E}_{\widetilde{S}_M}(\widehat{f}) \le \mathcal{E}_{\widetilde{S}_C}(\widehat{f}) - 2\mathcal{E}_{\widetilde{S}}(\widehat{f}) + 2\mathcal{E}_{\widetilde{S}}(\widehat{f})\sqrt{\frac{\log(4/\delta)}{2m}}\,. \quad (26)$$
(iii) Finally, use (24) to relate the error on the clean portion of the randomly labeled data and the error on the clean training data, i.e., with probability at least $1 - \delta/4$, we have
$$\mathcal{E}_{\widetilde{S}_C}(\widehat{f}) - \mathcal{E}_{S}(\widehat{f}) \le \left(1 + \frac{m}{2n}\right)\sqrt{\frac{\log(4/\delta)}{m}}\,. \quad (27)$$
Using the union bound on the above three steps, we have with probability at least $1 - \delta$:
$$\mathcal{E}_{\mathcal{D}}(\widehat{f}) \le \mathcal{E}_{S}(\widehat{f}) + 1 - 2\mathcal{E}_{\widetilde{S}}(\widehat{f}) + \left(\sqrt{2}\,\mathcal{E}_{\widetilde{S}}(\widehat{f}) + 2 + \frac{m}{2n}\right)\sqrt{\frac{\log(4/\delta)}{m}}\,. \quad (28)$$
A.2. Proof of Proposition 1

Proof of Proposition 1. For a classifier $f : \mathcal{X} \to \{-1, 1\}$, we have $1 - 2\,\mathbb{I}\left[f(x) \ne y\right] = y \cdot f(x)$. Hence, by the definition of $\mathcal{E}$, we have
$$1 - 2\mathcal{E}_{\widetilde{S}}(f) = \frac{1}{m}\sum_{i=1}^{m} y_i \cdot f(x_i) \le \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} y_i \cdot f(x_i)\,. \quad (29)$$
Note that for fixed inputs $(x_1, x_2, \ldots, x_m)$ in $\widetilde{S}$, the labels $(y_1, y_2, \ldots, y_m)$ are random. Define $\phi_1(y_1, y_2, \ldots, y_m) := \sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} y_i \cdot f(x_i)$. We have the following bounded difference condition on $\phi_1$: for all $i$,
$$\sup_{y_1, \ldots, y_m, y_i' \in \{-1,1\}^{m+1}} \left|\phi_1(y_1, \ldots, y_i, \ldots, y_m) - \phi_1(y_1, \ldots, y_i', \ldots, y_m)\right| \le 1/m\,. \quad (30)$$
Similarly, we define $\phi_2(x_1, x_2, \ldots, x_m) := \mathbb{E}_{y_i \sim \mathcal{U}\{-1,1\}}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} y_i \cdot f(x_i)\right]$. We have the following bounded difference condition on $\phi_2$: for all $i$,
$$\sup_{x_1, \ldots, x_m, x_i' \in \mathcal{X}^{m+1}} \left|\phi_2(x_1, \ldots, x_i, \ldots, x_m) - \phi_2(x_1, \ldots, x_i', \ldots, x_m)\right| \le 1/m\,. \quad (31)$$
Using McDiarmid's inequality (Lemma 6) twice with Conditions (30) and (31), with probability at least $1 - \delta$, we have
$$\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} y_i \cdot f(x_i) - \mathbb{E}_{x,y}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} y_i \cdot f(x_i)\right] \le \sqrt{\frac{2\log(2/\delta)}{m}}\,. \quad (32)$$
Combining (29) and (32), we obtain the desired result.
A.3. Proof of Theorem 2

The proof of Theorem 2 follows similarly to the proof of Theorem 1. Note that the same results in Lemma 1, Lemma 2, and Lemma 3 hold in the regularized ERM case. However, the arguments in the proof of Lemma 1 change slightly. Hence, we state the lemma for regularized ERM and prove it here for completeness.

Lemma 7. Assume the same setup as Theorem 2. Then for any $\delta > 0$, with probability at least $1 - \delta$ over the random draws of mislabeled data $\widetilde{S}_M$, we have
$$\mathcal{E}_{\mathcal{D}}(\widehat{f}) \le 1 - \mathcal{E}_{\widetilde{S}_M}(\widehat{f}) + \sqrt{\frac{\log(1/\delta)}{m}}\,. \quad (33)$$
Proof. The main idea of the proof remains the same, i.e. regard the clean portion of the data (S Y rSC ) as fixed. Then, thereexists a classifier f˚ that is optimal over draws of the mislabeled data rSM .
Formally,
$$f^* := \operatorname*{argmin}_{f\in\mathcal{F}}\; \mathcal{E}_{\check D}(f) + \lambda R(f)\,, \quad (34)$$
where $\check D = \frac{n}{m+n}\,S + \frac{m_2}{m+n}\,\widetilde S_C + \frac{m_1}{m+n}\,D'$.
That is, $\check D$ is a combination of the empirical distribution over correctly labeled data $S \cup \widetilde S_C$ and the (population) distribution over mislabeled data $D'$. Recall that
$$\hat f := \operatorname*{argmin}_{f\in\mathcal{F}}\; \mathcal{E}_{S\cup\widetilde S}(f) + \lambda R(f)\,. \quad (35)$$
Since $\hat f$ minimizes the regularized 0-1 objective on $S \cup \widetilde S$, by ERM optimality in (35), we have
$$\mathcal{E}_{S\cup\widetilde S}(\hat f) + \lambda R(\hat f) \le \mathcal{E}_{S\cup\widetilde S}(f^*) + \lambda R(f^*)\,. \quad (36)$$
Moreover, since $f^*$ is independent of $\widetilde S_M$, using Hoeffding's bound, we have with probability at least $1-\delta$ that
$$\mathcal{E}_{\widetilde S_M}(f^*) \le \mathcal{E}_{D'}(f^*) + \sqrt{\frac{\log(1/\delta)}{2m_1}}\,. \quad (37)$$
Finally, since $f^*$ is the optimal classifier on $\check D$, we have
$$\mathcal{E}_{\check D}(f^*) + \lambda R(f^*) \le \mathcal{E}_{\check D}(\hat f) + \lambda R(\hat f)\,. \quad (38)$$
Now, to relate (36) and (38), we can rewrite (37) as follows:
$$\mathcal{E}_{S\cup\widetilde S}(f^*) \le \mathcal{E}_{\check D}(f^*) + \frac{m_1}{m+n}\sqrt{\frac{\log(1/\delta)}{2m_1}}\,. \quad (39)$$
After adding $\lambda R(f^*)$ to both sides of (39), we combine equations (36), (39), and (38) to get
$$\mathcal{E}_{S\cup\widetilde S}(\hat f) \le \mathcal{E}_{\check D}(\hat f) + \frac{m_1}{m+n}\sqrt{\frac{\log(1/\delta)}{2m_1}}\,, \quad (40)$$
which implies
$$\mathcal{E}_{\widetilde S_M}(\hat f) \le \mathcal{E}_{D'}(\hat f) + \sqrt{\frac{\log(1/\delta)}{2m_1}}\,. \quad (41)$$
As before, since $\widetilde S$ is obtained by randomly labeling an unlabeled dataset, we assume $2m_1 \approx m$. Moreover, using $\mathcal{E}_{D'} = 1 - \mathcal{E}_{D}$, we obtain the desired result.
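The last step uses the binary relation $\mathcal{E}_{D'} = 1 - \mathcal{E}_{D}$: flipping every label turns each correct prediction into an error and vice versa. A quick numpy sketch, with an arbitrary fixed linear classifier and synthetic data as stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=(n, 3))
# Noisy "clean" labels from an arbitrary linear rule.
y = np.sign(x @ np.array([1.0, -0.5, 0.25]) + 0.3 * rng.normal(size=n))
w = np.array([0.8, -0.4, 0.1])           # an arbitrary fixed classifier
pred = np.sign(x @ w)

err_clean = (pred != y).mean()           # ~ E_D(f)
err_flip = (pred != -y).mean()           # ~ E_{D'}(f): every label flipped

# In the binary case the two errors are exactly complementary pointwise.
assert abs(err_clean + err_flip - 1.0) < 1e-9
```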
A.4. Proof of Theorem 3
To prove our results in the multiclass case, we first state and prove lemmas parallel to those used in the proof of the balanced binary case. We then combine these results to obtain the result in Theorem 3.
Before stating the result, we define the mislabeled distribution $D'$ for any $D$. While $D'$ and $D$ share the same marginal distribution over inputs $\mathcal{X}$, the conditional distribution over labels $y$ given an input $x \sim D_{\mathcal{X}}$ is changed as follows: for any $x$, the probability mass function (PMF) over $y$ is defined as $p_{D'}(\cdot\,|\,x) := \frac{1 - p_{D}(\cdot\,|\,x)}{k-1}$, where $p_{D}(\cdot\,|\,x)$ is the PMF over $y$ for the distribution $D$.
Lemma 8. Assume the same setup as Theorem 3. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of mislabeled data $\widetilde S_M$, we have
$$\mathcal{E}_{D}(\hat f) \le (k-1)\left(1 - \mathcal{E}_{\widetilde S_M}(\hat f)\right) + (k-1)\sqrt{\frac{\log(1/\delta)}{m}}\,. \quad (42)$$
Proof. The main idea of the proof remains the same. We begin by regarding the clean portion of the data ($S \cup \widetilde S_C$) as fixed. Then, there exists a classifier $f^*$ that is optimal over draws of the mislabeled data $\widetilde S_M$.
However, in the multiclass case, we cannot as easily relate the population error on mislabeled data to the population accuracy on clean data. While for binary classification we could lower bound the population accuracy $1-\mathcal{E}_{D}$ with the empirical error on mislabeled data $\mathcal{E}_{\widetilde S_M}$ (in the proof of Lemma 1), for multiclass classification, the error on the mislabeled data and the accuracy on the clean data in the population are not so directly related. To establish (42), we break the error on the (unknown) mislabeled data into two parts: one term corresponds to predicting the true label on mislabeled data, and the other corresponds to predicting neither the true label nor the assigned (mis-)label. Finally, we relate these errors to their population counterparts to establish (42).
Formally,
$$f^* := \operatorname*{argmin}_{f\in\mathcal{F}}\; \mathcal{E}_{\check D}(f) + \lambda R(f)\,, \quad (43)$$
where $\check D = \frac{n}{m+n}\,S + \frac{m_2}{m+n}\,\widetilde S_C + \frac{m_1}{m+n}\,D'$.
That is, $\check D$ is a combination of the empirical distribution over correctly labeled data $S \cup \widetilde S_C$ and the (population) distribution over mislabeled data $D'$. Recall that
$$\hat f := \operatorname*{argmin}_{f\in\mathcal{F}}\; \mathcal{E}_{S\cup\widetilde S}(f) + \lambda R(f)\,. \quad (44)$$
Following the exact steps from the proof of Lemma 7, with probability at least $1-\delta$, we have
$$\mathcal{E}_{\widetilde S_M}(\hat f) \le \mathcal{E}_{D'}(\hat f) + \sqrt{\frac{\log(1/\delta)}{2m_1}}\,. \quad (45)$$
As before, since $\widetilde S$ is obtained by randomly labeling an unlabeled dataset, we assume $\frac{k}{k-1}m_1 \approx m$.
Now we relate $\mathcal{E}_{D'}(\hat f)$ with $\mathcal{E}_{D}(\hat f)$. Let $y_T$ denote the (unknown) true label for a mislabeled point $(x, y)$ (i.e., the label before it was replaced with a mislabel).
$$\mathbb{E}_{(x,y)\sim D'}\Big[\mathbb{I}\big[\hat f(x) \neq y\big]\Big] = \underbrace{\mathbb{E}_{(x,y)\sim D'}\Big[\mathbb{I}\big[\hat f(x)\neq y \wedge \hat f(x)\neq y_T\big]\Big]}_{\text{I}} + \underbrace{\mathbb{E}_{(x,y)\sim D'}\Big[\mathbb{I}\big[\hat f(x)\neq y \wedge \hat f(x)= y_T\big]\Big]}_{\text{II}}\,. \quad (46)$$
Clearly, term II equals the accuracy on clean unseen data, i.e.,
$$\text{II} = 1 - \mathbb{E}_{x,y\sim D}\Big[\mathbb{I}\big[\hat f(x)\neq y\big]\Big] = 1 - \mathcal{E}_{D}(\hat f)\,. \quad (47)$$
Next, we relate term I with the error on the unseen clean data. We show that term I equals the error on the unseen clean data scaled by $\frac{k-2}{k-1}$, where $k$ is the number of labels. Using the definition of the mislabeled distribution $D'$, we have
$$\text{I} = \frac{1}{k-1}\left(\mathbb{E}_{(x,y)\sim D}\left[\sum_{i\in\mathcal{Y},\, i\neq y}\mathbb{I}\big[\hat f(x)\neq i \wedge \hat f(x)\neq y\big]\right]\right) = \frac{k-2}{k-1}\,\mathcal{E}_{D}(\hat f)\,. \quad (48)$$
Combining the results in (47), (48), and (46), we have
$$\mathcal{E}_{D'}(\hat f) = 1 - \frac{1}{k-1}\,\mathcal{E}_{D}(\hat f)\,. \quad (49)$$
Finally, combining the result in (49) with equation (45), we have with probability $1-\delta$,
$$\mathcal{E}_{D}(\hat f) \le (k-1)\left(1 - \mathcal{E}_{\widetilde S_M}(\hat f)\right) + (k-1)\sqrt{\frac{k\log(1/\delta)}{2(k-1)m}}\,. \quad (50)$$
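The key multiclass identity (49), $\mathcal{E}_{D'}(f) = 1 - \mathcal{E}_{D}(f)/(k-1)$, is easy to verify by Monte Carlo. The sketch below uses purely synthetic labels and an arbitrary stand-in classifier, not anything from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 5, 500_000
y = rng.integers(0, k, size=n)                        # clean labels from D
# A stand-in classifier: correct half the time, otherwise some wrong class.
pred = (y + rng.integers(0, 2, size=n) * rng.integers(1, k, size=n)) % k
# Mislabels: uniform over the k-1 labels different from y, i.e., draws from D'.
y_mis = (y + rng.integers(1, k, size=n)) % k

err_clean = (pred != y).mean()
err_mis = (pred != y_mis).mean()
# Relation (49): E_{D'}(f) = 1 - E_D(f)/(k-1), up to Monte Carlo error.
gap = abs(err_mis - (1 - err_clean / (k - 1)))
```

Conditioning on whether the prediction is correct recovers the relation exactly: a correct prediction never matches a mislabel, while a wrong prediction matches the mislabel with probability $1/(k-1)$.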
Lemma 9. Assume the same setup as Theorem 3. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of $\widetilde S$, we have
$$\Big|\,k\,\mathcal{E}_{\widetilde S}(\hat f) - \mathcal{E}_{\widetilde S_C}(\hat f) - (k-1)\,\mathcal{E}_{\widetilde S_M}(\hat f)\,\Big| \le 2k\sqrt{\frac{\log(4/\delta)}{2m}}\,.$$
Proof. Recall $\mathcal{E}_{\widetilde S}(f) = \frac{m_1}{m}\,\mathcal{E}_{\widetilde S_M}(f) + \frac{m_2}{m}\,\mathcal{E}_{\widetilde S_C}(f)$. Hence, we have
$$k\,\mathcal{E}_{\widetilde S}(f) - (k-1)\,\mathcal{E}_{\widetilde S_M}(f) - \mathcal{E}_{\widetilde S_C}(f) = (k-1)\left(\frac{k m_1}{(k-1)m}\,\mathcal{E}_{\widetilde S_M}(f) - \mathcal{E}_{\widetilde S_M}(f)\right) + \left(\frac{k m_2}{m}\,\mathcal{E}_{\widetilde S_C}(f) - \mathcal{E}_{\widetilde S_C}(f)\right)$$
$$= k\left[\left(\frac{m_1}{m} - \frac{k-1}{k}\right)\mathcal{E}_{\widetilde S_M}(f) + \left(\frac{m_2}{m} - \frac{1}{k}\right)\mathcal{E}_{\widetilde S_C}(f)\right].$$
Since the dataset is randomly labeled, we have with probability at least $1-\delta$ that $\left(\frac{m_1}{m} - \frac{k-1}{k}\right) \le \sqrt{\frac{\log(1/\delta)}{2m}}$. Similarly, with probability at least $1-\delta$, $\left(\frac{m_2}{m} - \frac{1}{k}\right) \le \sqrt{\frac{\log(1/\delta)}{2m}}$. Using a union bound, we have with probability at least $1-\delta$
$$k\,\mathcal{E}_{\widetilde S}(f) - (k-1)\,\mathcal{E}_{\widetilde S_M}(f) - \mathcal{E}_{\widetilde S_C}(f) \le k\sqrt{\frac{\log(2/\delta)}{2m}}\left(\mathcal{E}_{\widetilde S_M}(f) + \mathcal{E}_{\widetilde S_C}(f)\right). \quad (51)$$
Lemma 10. Assume the same setup as Theorem 3. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of $\widetilde S_C$ and $S$, we have
$$\Big|\,\mathcal{E}_{\widetilde S_C}(\hat f) - \mathcal{E}_{S}(\hat f)\,\Big| \le 1.5\sqrt{\frac{k\log(2/\delta)}{2m}}\,.$$
Proof. In the set of correctly labeled points $S \cup \widetilde S_C$, $S$ is a random subset of $S \cup \widetilde S_C$. Hence, using Hoeffding's inequality for sampling without replacement (Lemma 5), we have with probability at least $1-\delta$
$$\mathcal{E}_{\widetilde S_C}(\hat f) - \mathcal{E}_{S\cup\widetilde S_C}(\hat f) \le \sqrt{\frac{\log(1/\delta)}{2m_2}}\,. \quad (52)$$
Rewriting $\mathcal{E}_{S\cup\widetilde S_C}(\hat f)$ as $\frac{m_2}{m_2+n}\,\mathcal{E}_{\widetilde S_C}(\hat f) + \frac{n}{m_2+n}\,\mathcal{E}_{S}(\hat f)$, we have with probability at least $1-\delta$
$$\left(\frac{n}{n+m_2}\right)\left(\mathcal{E}_{\widetilde S_C}(\hat f) - \mathcal{E}_{S}(\hat f)\right) \le \sqrt{\frac{\log(1/\delta)}{2m_2}}\,. \quad (53)$$
As before, assuming $k m_2 \approx m$, we have with probability at least $1-\delta$
$$\mathcal{E}_{\widetilde S_C}(\hat f) - \mathcal{E}_{S}(\hat f) \le \left(1+\frac{m_2}{n}\right)\sqrt{\frac{k\log(1/\delta)}{2m}} \le \left(1+\frac{1}{k}\right)\sqrt{\frac{k\log(1/\delta)}{2m}}\,. \quad (54)$$
Proof of Theorem 3. Having established these core intermediate results, we can now combine the above three lemmas. In particular, we bound the population error on clean data ($\mathcal{E}_{D}(\hat f)$) as follows:
(i) First, use (50) to obtain an upper bound on the population error on clean data, i.e., with probability at least $1-\delta/4$, we have
$$\mathcal{E}_{D}(\hat f) \le (k-1)\left(1 - \mathcal{E}_{\widetilde S_M}(\hat f)\right) + (k-1)\sqrt{\frac{k\log(4/\delta)}{2(k-1)m}}\,. \quad (55)$$
(ii) Second, use (51) to relate the error on the mislabeled fraction with the error on the clean portion of the randomly labeled data and the error on the whole randomly labeled dataset, i.e., with probability at least $1-\delta/2$, we have
$$-(k-1)\,\mathcal{E}_{\widetilde S_M}(\hat f) \le \mathcal{E}_{\widetilde S_C}(\hat f) - k\,\mathcal{E}_{\widetilde S}(\hat f) + k\sqrt{\frac{\log(4/\delta)}{2m}}\,. \quad (56)$$
(iii) Finally, use (54) to relate the error on the clean portion of the randomly labeled data to the error on clean training data, i.e., with probability $1-\delta/4$, we have
$$\mathcal{E}_{\widetilde S_C}(\hat f) \le \mathcal{E}_{S}(\hat f) + \left(1+\frac{m}{kn}\right)\sqrt{\frac{k\log(4/\delta)}{2m}}\,. \quad (57)$$
Using a union bound over the above three steps, we have with probability at least $1-\delta$:
$$\mathcal{E}_{D}(\hat f) \le \mathcal{E}_{S}(\hat f) + (k-1) - k\,\mathcal{E}_{\widetilde S}(\hat f) + \left(\sqrt{k(k-1)} + k + \sqrt{k} + \frac{m}{n\sqrt{k}}\right)\sqrt{\frac{\log(4/\delta)}{2m}}\,. \quad (58)$$
Simplifying the terms on the RHS of (58), we get the desired result.
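For concreteness, the simplified RHS of (58) can be evaluated directly from empirical quantities. The helper below is an illustrative sketch of the dominating terms only; the function name and the `delta` default are ours, not the paper's:

```python
import numpy as np

def ratt_multiclass_bound(err_clean, err_rand, k, m, n, delta=0.1):
    """Evaluate the simplified multiclass bound (58): err_clean = E_S(f),
    err_rand = E_{S~}(f), k classes, m randomly labeled and n clean
    training points. Illustrative sketch only."""
    slack = np.sqrt(k * (k - 1)) + k + np.sqrt(k) + m / (n * np.sqrt(k))
    return (err_clean + (k - 1) - k * err_rand
            + slack * np.sqrt(np.log(4 / delta) / (2 * m)))

# Example: a 10-class model with 5% train error whose error on randomly
# labeled data is near the (k-1)/k level expected without memorization.
bound = ratt_multiclass_bound(0.05, 0.9, k=10, m=8000, n=40000)
```

Note how the dominating term $(k-1) - k\,\mathcal{E}_{\widetilde S}$ vanishes when the error on randomly labeled data reaches the chance level $(k-1)/k$, leaving a bound close to the clean training error plus a $O(k\sqrt{\log(1/\delta)/m})$ slack.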
B. Proofs from Sec. 4

We suppose that the parameters of the linear function are obtained via gradient descent on the following $\ell_2$-regularized problem:
$$\mathcal{L}_{S}(w;\lambda) := \sum_{i=1}^{n}\big(w^Tx_i - y_i\big)^2 + \lambda\,\|w\|_2^2\,, \quad (59)$$
where $\lambda \ge 0$ is a regularization parameter. We assume access to a clean dataset $S = \{(x_i, y_i)\}_{i=1}^{n} \sim D^n$ and a randomly labeled dataset $\widetilde S = \{(x_i, y_i)\}_{i=n+1}^{n+m} \sim \widetilde D^m$. Let $X = [x_1, x_2, \cdots, x_{m+n}]$ and $y = [y_1, y_2, \cdots, y_{m+n}]$. Fix a positive learning rate $\eta$ such that $\eta \le 1/\big(\|X^TX\|_{\mathrm{op}} + \lambda/2\big)$ and an initialization $w_0 = 0$. Consider the following gradient descent iterates to minimize objective (59) on $S \cup \widetilde S$:
$$w_t = w_{t-1} - \eta\,\nabla_w\mathcal{L}_{S\cup\widetilde S}(w_{t-1};\lambda)\qquad\forall\, t = 1, 2, \ldots \quad (60)$$
Then the iterates $\{w_t\}$ converge to the limiting solution $\hat w = \big(X^TX + \lambda I\big)^{-1}X^Ty$. Define $\hat f(x) := f(x;\hat w)$.
B.1. Proof of Theorem 4
We use a standard result from linear algebra, namely the Sherman–Morrison formula (Sherman & Morrison, 1950) for matrix inversion:
Lemma 11 (Sherman & Morrison (1950)). Suppose $A \in \mathbb{R}^{n\times n}$ is an invertible square matrix and $u, v \in \mathbb{R}^n$ are column vectors. Then $A + uv^T$ is invertible iff $1 + v^TA^{-1}u \neq 0$, and in particular,
$$\big(A + uv^T\big)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}\,. \quad (61)$$
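A quick numerical check of the Sherman–Morrison identity (61); the matrix is made symmetric positive definite and the rank-one update uses $v = u$ so that $1 + u^TA^{-1}u > 0$ is guaranteed (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
B = rng.normal(size=(n, n))
A = B @ B.T + np.eye(n)          # symmetric positive definite, hence invertible
u = rng.normal(size=(n, 1))
v = u.copy()                     # rank-one update A + u u^T

Ainv = np.linalg.inv(A)
denom = 1.0 + float(v.T @ Ainv @ u)   # > 1 since A is SPD
lhs = np.linalg.inv(A + u @ v.T)
rhs = Ainv - (Ainv @ u @ v.T @ Ainv) / denom
assert np.allclose(lhs, rhs)
```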
For a given training set $S \cup \widetilde S$, define the leave-one-out error on mislabeled points in the training data as
$$\mathcal{E}_{\mathrm{LOO}}(\widetilde S_M) = \frac{\sum_{(x_i,y_i)\in\widetilde S_M}\mathcal{E}\big(f^{(i)}(x_i), y_i\big)}{\big|\widetilde S_M\big|}\,,$$
where $f^{(i)} := f\big(A, (S\cup\widetilde S)^{(i)}\big)$. To relate the empirical leave-one-out error and the population error via the hypothesis stability condition, we use the following lemma:
Lemma 12 (Bousquet & Elisseeff (2002)). For the leave-one-out error, we have
$$\mathbb{E}\left[\Big(\mathcal{E}_{D'}(\hat f) - \mathcal{E}_{\mathrm{LOO}}(\widetilde S_M)\Big)^2\right] \le \frac{1}{2m_1} + \frac{3\beta}{n+m}\,. \quad (62)$$
The proof of the above lemma is similar to the proof of Lemma 9 in Bousquet & Elisseeff (2002) and can be found in App. D. Before presenting the proof of Theorem 4, we introduce some more notation. Let $X_{(i)}$ denote the matrix of covariates with the $i$th point removed. Similarly, let $y_{(i)}$ be the array of responses with the $i$th point removed. Define the corresponding regularized GD solution as $\hat w_{(i)} = \big(X_{(i)}^TX_{(i)} + \lambda I\big)^{-1}X_{(i)}^Ty_{(i)}$, and define $\hat f_{(i)}(x) := f(x;\hat w_{(i)})$.
Proof of Theorem 4. Because squared loss minimization does not imply 0-1 error minimization, we cannot use arguments from Lemma 1. This is the main technical difficulty. To compare the 0-1 error at a training point with that at an unseen point, we use the closed-form expression for $\hat w$ and the Sherman–Morrison formula to upper bound the training error by the leave-one-out cross-validation error.
The proof is divided into three parts. In part one, we show that the 0-1 error on mislabeled points in the training set is lower than the leave-one-out error at those points. In part two, we relate this leave-one-out error to the population error on the mislabeled distribution using Condition 1. While the empirical leave-one-out error is an unbiased estimator of the average population error of leave-one-out classifiers, we need hypothesis stability to control the variance of the empirical leave-one-out error. Finally, in part three, we show that the error on the mislabeled training points can be estimated with just the randomly labeled and clean training data (as in the proof of Theorem 1).
Part 1. First we relate the training error to the leave-one-out error. For any training point $(x_i, y_i)$ in $\widetilde S \cup S$, we have
$$\mathcal{E}(\hat f(x_i), y_i) = \mathbb{I}\big[y_i\cdot x_i^T\hat w < 0\big] = \mathbb{I}\Big[y_i\cdot x_i^T\big(X^TX + \lambda I\big)^{-1}X^Ty < 0\Big] \quad (63)$$
$$= \mathbb{I}\Big[y_i\cdot x_i^T\underbrace{\big(X_{(i)}^TX_{(i)} + x_ix_i^T + \lambda I\big)^{-1}}_{\text{I}}\big(X_{(i)}^Ty_{(i)} + y_i\cdot x_i\big) < 0\Big]\,. \quad (64)$$
Letting $A = \big(X_{(i)}^TX_{(i)} + \lambda I\big)$ and using Lemma 11 on term I, we have
$$\mathcal{E}(\hat f(x_i), y_i) = \mathbb{I}\left[y_i\cdot x_i^T\left[A^{-1} - \frac{A^{-1}x_ix_i^TA^{-1}}{1+x_i^TA^{-1}x_i}\right]\big(X_{(i)}^Ty_{(i)} + y_i\cdot x_i\big) < 0\right] \quad (65)$$
$$= \mathbb{I}\left[y_i\cdot\left[\frac{x_i^TA^{-1}\big(1+x_i^TA^{-1}x_i\big) - x_i^TA^{-1}x_ix_i^TA^{-1}}{1+x_i^TA^{-1}x_i}\right]\big(X_{(i)}^Ty_{(i)} + y_i\cdot x_i\big) < 0\right] \quad (66)$$
$$= \mathbb{I}\left[y_i\cdot\left[\frac{x_i^TA^{-1}}{1+x_i^TA^{-1}x_i}\right]\big(X_{(i)}^Ty_{(i)} + y_i\cdot x_i\big) < 0\right]. \quad (67)$$
Since $1 + x_i^TA^{-1}x_i > 0$, we have
$$\mathcal{E}(\hat f(x_i), y_i) = \mathbb{I}\Big[y_i\cdot x_i^TA^{-1}\big(X_{(i)}^Ty_{(i)} + y_i\cdot x_i\big) < 0\Big] \quad (68)$$
$$= \mathbb{I}\Big[x_i^TA^{-1}x_i + y_i\cdot x_i^TA^{-1}X_{(i)}^Ty_{(i)} < 0\Big] \quad (69)$$
$$\le \mathbb{I}\Big[y_i\cdot x_i^TA^{-1}X_{(i)}^Ty_{(i)} < 0\Big] = \mathcal{E}(\hat f_{(i)}(x_i), y_i)\,. \quad (70)$$
Using (70), we have
$$\mathcal{E}_{\widetilde S_M}(\hat f) \le \mathcal{E}_{\mathrm{LOO}}(\widetilde S_M) := \frac{\sum_{(x_i,y_i)\in\widetilde S_M}\mathcal{E}\big(\hat f_{(i)}(x_i), y_i\big)}{\big|\widetilde S_M\big|}\,. \quad (71)$$
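The comparison (70) is a deterministic property of the ridge solution, so it can be checked pointwise. A small numpy sketch with synthetic data; the sizes and $\lambda$ are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 40, 8, 1.0
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def ridge(Xm, ym):
    # Closed-form regularized least-squares solution.
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(Xm.shape[1]), Xm.T @ ym)

w_full = ridge(X, y)
violations = 0
for i in range(n):
    w_loo = ridge(np.delete(X, i, axis=0), np.delete(y, i))
    err_train = float(y[i] * (X[i] @ w_full) < 0)  # 0-1 error of full model at x_i
    err_loo = float(y[i] * (X[i] @ w_loo) < 0)     # leave-one-out error at x_i
    violations += err_train > err_loo              # (70) says this never happens
assert violations == 0
```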
Part 2. We now relate the RHS in (71) to the population error on the mislabeled distribution. To do this, we leverage Condition 1 and Lemma 12. In particular, we have
$$\mathbb{E}_{S\cup\widetilde S_M}\left[\Big(\mathcal{E}_{D'}(\hat f) - \mathcal{E}_{\mathrm{LOO}}(\widetilde S_M)\Big)^2\right] \le \frac{1}{2m_1} + \frac{3\beta}{m+n}\,. \quad (72)$$
Using Chebyshev's inequality, with probability at least $1-\delta$, we have
$$\mathcal{E}_{\mathrm{LOO}}(\widetilde S_M) \le \mathcal{E}_{D'}(\hat f) + \sqrt{\frac{1}{\delta}\left(\frac{1}{2m_1} + \frac{3\beta}{m+n}\right)}\,. \quad (73)$$
Part 3. Combining (73) and (71), we have
$$\mathcal{E}_{\widetilde S_M}(\hat f) \le \mathcal{E}_{D'}(\hat f) + \sqrt{\frac{1}{\delta}\left(\frac{1}{2m_1} + \frac{3\beta}{m+n}\right)}\,. \quad (74)$$
Compare (74) with (17) in the proof of Lemma 1. We obtain a similar relationship between $\mathcal{E}_{\widetilde S_M}$ and $\mathcal{E}_{D'}$, but with polynomial rather than exponential concentration. In addition, since we only use concentration arguments to relate the mislabeled error to the errors on the clean and unlabeled portions of the randomly labeled data, we can directly use the results in Lemma 2 and Lemma 3. Therefore, combining the results in Lemma 2, Lemma 3, and (74) with a union bound, we have with probability at least $1-\delta$
$$\mathcal{E}_{D}(\hat f) \le \mathcal{E}_{S}(\hat f) + 1 - 2\,\mathcal{E}_{\widetilde S}(\hat f) + \left(\sqrt{2}\,\mathcal{E}_{\widetilde S}(\hat f) + 1 + \frac{m}{2n}\right)\sqrt{\frac{\log(4/\delta)}{m}} + \sqrt{\frac{4}{\delta}\left(\frac{1}{m} + \frac{3\beta}{m+n}\right)}\,. \quad (75)$$
B.2. Extension to multiclass classification
For multiclass problems with squared loss minimization, as is standard practice, we consider a one-hot encoding for the underlying label, i.e., a class label $c \in [k]$ is treated as $(0, \ldots, 0, 1, 0, \ldots, 0) \in \mathbb{R}^k$ (with the $c$th coordinate being 1). As before, we suppose that the parameters of the linear function are obtained via gradient descent on the following $\ell_2$-regularized problem:
$$\mathcal{L}_{S}(w;\lambda) := \sum_{i=1}^{n}\big\|w^Tx_i - y_i\big\|_2^2 + \lambda\sum_{j=1}^{k}\|w^j\|_2^2\,, \quad (76)$$
where $\lambda \ge 0$ is a regularization parameter. We assume access to a clean dataset $S = \{(x_i, y_i)\}_{i=1}^{n} \sim D^n$ and a randomly labeled dataset $\widetilde S = \{(x_i, y_i)\}_{i=n+1}^{n+m} \sim \widetilde D^m$. Let $X = [x_1, x_2, \cdots, x_{m+n}]$ and $y = [e_{y_1}, e_{y_2}, \cdots, e_{y_{m+n}}]$. Fix a positive learning rate $\eta$ such that $\eta \le 1/\big(\|X^TX\|_{\mathrm{op}} + \lambda/2\big)$ and an initialization $w_0 = 0$. Consider the following gradient descent iterates to minimize objective (76) on $S \cup \widetilde S$:
$$w_t^j = w_{t-1}^j - \eta\,\nabla_{w^j}\mathcal{L}_{S\cup\widetilde S}(w_{t-1};\lambda)\qquad\forall\, t = 1, 2, \ldots\ \text{and}\ j = 1, 2, \ldots, k\,. \quad (77)$$
Then the iterates $\{w_t^j\}$ for all $j = 1, 2, \cdots, k$ converge to the limiting solution $\hat w^j = \big(X^TX + \lambda I\big)^{-1}X^Ty^j$. Define $\hat f(x) := f(x;\hat w)$.

Theorem 5. Assume that this gradient descent algorithm satisfies Condition 1 with $\beta = O(1)$. Then for a multiclass classification problem with $k$ classes, for any $\delta > 0$, with probability at least $1-\delta$, we have:
$$\mathcal{E}_{D}(\hat f) \le \mathcal{E}_{S}(\hat f) + (k-1)\left(1 - \frac{k}{k-1}\,\mathcal{E}_{\widetilde S}(\hat f)\right) + \left(k + \sqrt{k} + \frac{m}{n\sqrt{k}}\right)\sqrt{\frac{\log(4/\delta)}{2m}} + \sqrt{k(k-1)}\sqrt{\frac{4}{\delta}\left(\frac{1}{m} + \frac{3\beta}{m+n}\right)}\,. \quad (78)$$
Proof. The proof of this theorem is divided into two parts. In the first part, we relate the error on the mislabeled samples to the population error on the mislabeled data. Similar to the proof of Theorem 4, we use the Sherman–Morrison formula to upper bound the training error by the leave-one-out error for each $\hat w^j$. The second part of the proof follows entirely from the proof of Theorem 3. In essence, the first part derives an equivalent of (45) for GD training with squared loss, and then the second part follows from the proof of Theorem 3.
Part 1. Consider a training point $(x_i, y_i)$ in $\widetilde S \cup S$. For simplicity, we use $c_i$ to denote the class of the $i$th point and $y_i$ for the corresponding one-hot embedding. Recall that the error at a multiclass point is given by $\mathcal{E}(\hat f(x_i), y_i) = \mathbb{I}\big[c_i \notin \operatorname{argmax} x_i^T\hat w\big]$. Thus, there exists a $j \in [k]$, $j \neq c_i$, such that we have
$$\mathcal{E}(\hat f(x_i), y_i) = \mathbb{I}\big[c_i \notin \operatorname{argmax} x_i^T\hat w\big] = \mathbb{I}\big[x_i^T\hat w^{c_i} < x_i^T\hat w^{j}\big] \quad (79)$$
$$= \mathbb{I}\Big[x_i^T\big(X^TX+\lambda I\big)^{-1}X^Ty^{c_i} < x_i^T\big(X^TX+\lambda I\big)^{-1}X^Ty^{j}\Big] \quad (80)$$
$$= \mathbb{I}\Big[x_i^T\underbrace{\big(X_{(i)}^TX_{(i)} + x_ix_i^T + \lambda I\big)^{-1}}_{\text{I}}\big(X_{(i)}^Ty^{c_i}_{(i)} + x_i - X_{(i)}^Ty^{j}_{(i)}\big) < 0\Big]\,. \quad (81)$$
Letting $A = \big(X_{(i)}^TX_{(i)} + \lambda I\big)$ and using Lemma 11 on term I, we have
$$\mathcal{E}(\hat f(x_i), y_i) = \mathbb{I}\left[x_i^T\left[A^{-1} - \frac{A^{-1}x_ix_i^TA^{-1}}{1+x_i^TA^{-1}x_i}\right]\big(X_{(i)}^Ty^{c_i}_{(i)} + x_i - X_{(i)}^Ty^{j}_{(i)}\big) < 0\right] \quad (82)$$
$$= \mathbb{I}\left[\left[\frac{x_i^TA^{-1}\big(1+x_i^TA^{-1}x_i\big) - x_i^TA^{-1}x_ix_i^TA^{-1}}{1+x_i^TA^{-1}x_i}\right]\big(X_{(i)}^Ty^{c_i}_{(i)} + x_i - X_{(i)}^Ty^{j}_{(i)}\big) < 0\right] \quad (83)$$
$$= \mathbb{I}\left[\left[\frac{x_i^TA^{-1}}{1+x_i^TA^{-1}x_i}\right]\big(X_{(i)}^Ty^{c_i}_{(i)} + x_i - X_{(i)}^Ty^{j}_{(i)}\big) < 0\right]. \quad (84)$$
Since $1 + x_i^TA^{-1}x_i > 0$, we have
$$\mathcal{E}(\hat f(x_i), y_i) = \mathbb{I}\Big[x_i^TA^{-1}\big(X_{(i)}^Ty^{c_i}_{(i)} + x_i - X_{(i)}^Ty^{j}_{(i)}\big) < 0\Big] \quad (85)$$
$$= \mathbb{I}\Big[x_i^TA^{-1}x_i + x_i^TA^{-1}X_{(i)}^Ty^{c_i}_{(i)} - x_i^TA^{-1}X_{(i)}^Ty^{j}_{(i)} < 0\Big] \quad (86)$$
$$\le \mathbb{I}\Big[x_i^TA^{-1}X_{(i)}^Ty^{c_i}_{(i)} - x_i^TA^{-1}X_{(i)}^Ty^{j}_{(i)} < 0\Big] = \mathcal{E}(\hat f_{(i)}(x_i), y_i)\,. \quad (87)$$
Using (87), we have
$$\mathcal{E}_{\widetilde S_M}(\hat f) \le \mathcal{E}_{\mathrm{LOO}}(\widetilde S_M) := \frac{\sum_{(x_i,y_i)\in\widetilde S_M}\mathcal{E}\big(\hat f_{(i)}(x_i), y_i\big)}{\big|\widetilde S_M\big|}\,. \quad (88)$$
We now relate the RHS in (88) to the population error on the mislabeled distribution. As before, we leverage Condition 1 and Lemma 12. Using (73) and (88), we have
$$\mathcal{E}_{\widetilde S_M}(\hat f) \le \mathcal{E}_{D'}(\hat f) + \sqrt{\frac{1}{\delta}\left(\frac{1}{2m_1} + \frac{3\beta}{m+n}\right)}\,. \quad (89)$$
. (89)
We have now derived a parallel to (45). Using the same arguments in the proof of Lemma 8, we have
EDp pfq ď pk ´ 1q´
1´ ErSMp pfq
¯
` pk ´ 1q
d
k
δpk ´ 1q
ˆ
1
2m1`
3β
m` n
˙
. (90)
Part 2. We now combine the results in Lemma 9 and Lemma 10 to obtain the final inequality in terms of quantities that can be computed from just the randomly labeled and clean data. As in the binary case, we obtain polynomial rather than exponential concentration. Combining (90) with Lemma 9 and Lemma 10, we have with probability at least $1-\delta$
$$\mathcal{E}_{D}(\hat f) \le \mathcal{E}_{S}(\hat f) + (k-1)\left(1 - \frac{k}{k-1}\,\mathcal{E}_{\widetilde S}(\hat f)\right) + \left(k + \sqrt{k} + \frac{m}{n\sqrt{k}}\right)\sqrt{\frac{\log(4/\delta)}{2m}} + \sqrt{k(k-1)}\sqrt{\frac{4}{\delta}\left(\frac{1}{m} + \frac{3\beta}{m+n}\right)}\,. \quad (91)$$
B.3. Discussion on Condition 1
The quantity on the LHS of Condition 1 measures how much the function learned by the algorithm (in terms of error on an unseen point) changes when one point in the training set is removed. We need the hypothesis stability condition to control the variance of the empirical leave-one-out error, in order to show concentration of the average leave-one-out error around the population error.
Additionally, we note that while the dominating term on the RHS of Theorem 4 matches the dominating term of the ERM bound in Theorem 1, Theorem 4 has a polynomial concentration term (dependence on $\sqrt{1/\delta}$ instead of $\sqrt{\log(1/\delta)}$). Since hypothesis stability only bounds the variance, the polynomial concentration is due to the use of Chebyshev's inequality instead of an exponential tail inequality (as in Lemma 1). Recent work has highlighted that a slightly stronger condition than hypothesis stability can be used to obtain exponential concentration for the leave-one-out error (Abou-Moustafa & Szepesvari, 2019), but we leave this for future work.
B.4. Formal statement and proof of Proposition 2
Before formally presenting the result, we introduce some notation. By $\mathcal{L}_S(w)$, we denote the objective in (59) with $\lambda = 0$. Assume the Singular Value Decomposition (SVD) $X = US^{1/2}V^T$, so that $X^TX = VSV^T$. Consider the GD iterates defined in (60). We now derive a closed-form expression for the $t$th iterate of gradient descent:
$$w_t = w_{t-1} + \eta\cdot X^T(y - Xw_{t-1}) = \big(I - \eta VSV^T\big)w_{t-1} + \eta X^Ty\,. \quad (92)$$
Rotating by $V^T$, we get
$$\widetilde w_t = (I - \eta S)\,\widetilde w_{t-1} + \eta\,\widetilde y\,, \quad (93)$$
where $\widetilde w_t = V^Tw_t$ and $\widetilde y = V^TX^Ty$. Assuming the initial point $w_0 = 0$ and applying the recursion in (93), we get
$$\widetilde w_t = S^{-1}\big(I - (I-\eta S)^t\big)\widetilde y\,. \quad (94)$$
Projecting the solution back to the original space, we have
$$w_t = VS^{-1}\big(I - (I-\eta S)^t\big)V^TX^Ty\,. \quad (95)$$
Define $f_t(x) := f(x;w_t)$ as the solution at the $t$th iterate. Let $\widetilde w_\lambda = \operatorname{argmin}_w \mathcal{L}_S(w;\lambda) = \big(X^TX+\lambda I\big)^{-1}X^Ty = V(S+\lambda I)^{-1}V^TX^Ty$, and define $\widetilde f_\lambda(x) := f(x;\widetilde w_\lambda)$ as the regularized solution. Let $\kappa$ be the condition number of the population covariance matrix, and let $s_{\min}$ be the minimum positive singular value of the empirical covariance matrix. Our proof idea is inspired by recent work relating the gradient flow solution and the regularized solution for regression problems (Ali et al., 2018). We will use the following lemma in the proof:
Lemma 13. For all $x \in [0,1]$ and for all $k \in \mathbb{N}$, we have (a) $\frac{kx}{1+kx} \le 1-(1-x)^k$ and (b) $1-(1-x)^k \le \frac{2kx}{kx+1}$.
Proof. Using $(1-x)^k \le \frac{1}{1+kx}$, we obtain part (a). For part (b), we numerically maximize $\frac{(1+kx)(1-(1-x)^k)}{kx}$ for all $k \ge 1$ and all $x \in [0,1]$.
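Both inequalities of Lemma 13 are straightforward to verify numerically over a grid, mirroring the numerical maximization used for part (b):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1001)
for k in (1, 2, 5, 20, 100):
    mid = 1 - (1 - x) ** k        # the quantity being sandwiched
    lo = k * x / (1 + k * x)      # Lemma 13(a) lower bound
    hi = 2 * k * x / (1 + k * x)  # Lemma 13(b) upper bound
    assert np.all(lo <= mid + 1e-12)
    assert np.all(mid <= hi + 1e-12)
```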
Proposition 3 (Formal statement of Proposition 2). Let $\lambda = \frac{1}{t\eta}$. For a training point $x$, we have
$$\mathbb{E}_{x\sim S}\Big[\big(f_t(x) - \widetilde f_\lambda(x)\big)^2\Big] \le c(t,\eta)\cdot\mathbb{E}_{x\sim S}\big[f_t(x)^2\big]\,,$$
where $c(t,\eta) := \min\!\big(0.25,\,\frac{1}{s_{\min}^2 t^2\eta^2}\big)$. Similarly, for a test point, we have
$$\mathbb{E}_{x\sim D_{\mathcal X}}\Big[\big(f_t(x) - \widetilde f_\lambda(x)\big)^2\Big] \le \kappa\cdot c(t,\eta)\cdot\mathbb{E}_{x\sim D_{\mathcal X}}\big[f_t(x)^2\big]\,.$$
Proof. We want to analyze the expected squared difference between the output of regularized linear regression with regularization constant $\lambda = \frac{1}{\eta t}$ and the gradient descent solution at the $t$th iterate. We separately expand the algebraic expression for the squared difference at a training point and at a test point. The main step is then to show that
$$\Big[S^{-1}\big(I - (I-\eta S)^t\big) - (S+\lambda I)^{-1}\Big] \preceq c(\eta,t)\cdot S^{-1}\big(I - (I-\eta S)^t\big)\,.$$
Part 1. First, we analyze the squared difference of the output at a training point (for simplicity, we refer to $S \cup \widetilde S$ as $S$), i.e.,
$$\mathbb{E}_{x\sim S}\Big[\big(f_t(x) - \widetilde f_\lambda(x)\big)^2\Big] = \|Xw_t - X\widetilde w_\lambda\|_2^2 \quad (96)$$
$$= \big\|XVS^{-1}\big(I-(I-\eta S)^t\big)V^TX^Ty - XV(S+\lambda I)^{-1}V^TX^Ty\big\|_2^2 \quad (97)$$
$$= \big\|XV\big(S^{-1}(I-(I-\eta S)^t) - (S+\lambda I)^{-1}\big)V^TX^Ty\big\|_2^2 \quad (98)$$
$$= y^TXV\underbrace{\Big(S^{-1}\big(I-(I-\eta S)^t\big) - (S+\lambda I)^{-1}\Big)}_{\text{I}}^{2}\,S\,V^TX^Ty\,. \quad (99)$$
We now separately consider term I. Substituting $\lambda = \frac{1}{t\eta}$, we get
$$S^{-1}\big(I-(I-\eta S)^t\big) - (S+\lambda I)^{-1} = S^{-1}\Big(\big(I-(I-\eta S)^t\big) - \big(I+\lambda S^{-1}\big)^{-1}\Big) \quad (100)$$
$$= \underbrace{S^{-1}\Big(\big(I-(I-\eta S)^t\big) - \big(I+(St\eta)^{-1}\big)^{-1}\Big)}_{A}\,. \quad (101)$$
We now separately bound the diagonal entries of the matrix $A$. By $s_i$ we denote the $i$th diagonal entry of $S$. Note that since $\eta \le 1/\|S\|_{\mathrm{op}}$, we have $\eta s_i \le 1$ for all $i$. Considering the $i$th (non-zero) diagonal term of the diagonal matrix $A$, we have
$$A_{ii} = \frac{1}{s_i}\left(1-(1-s_i\eta)^t - \frac{t\eta s_i}{1+t\eta s_i}\right) = \frac{1-(1-s_i\eta)^t}{s_i}\underbrace{\left(1 - \frac{t\eta s_i}{(1+t\eta s_i)\big(1-(1-s_i\eta)^t\big)}\right)}_{\text{II}} \quad (102)$$
$$\le \frac{1}{2}\left[\frac{1-(1-s_i\eta)^t}{s_i}\right]. \qquad\text{(using Lemma 13(b))}$$
Additionally, we can also show the following upper bound on term II:
$$1 - \frac{t\eta s_i}{(1+t\eta s_i)\big(1-(1-s_i\eta)^t\big)} = \frac{(1+t\eta s_i)\big(1-(1-s_i\eta)^t\big) - t\eta s_i}{(1+t\eta s_i)\big(1-(1-s_i\eta)^t\big)} \quad (103)$$
$$\le \frac{1-(1-s_i\eta)^t - t\eta s_i(1-s_i\eta)^t}{(1+t\eta s_i)\big(1-(1-s_i\eta)^t\big)} \quad (104)$$
$$\le \frac{1}{t\eta s_i}\,. \qquad\text{(using Lemma 13(a))}$$
Combining both upper bounds on each diagonal entry $A_{ii}$, we have
$$A \preceq c_1(\eta,t)\cdot S^{-1}\big(I-(I-\eta S)^t\big)\,, \quad (105)$$
where $c_1(\eta,t) = \min\!\big(0.5,\,\frac{1}{t s_{\min}\eta}\big)$. Plugging this into (99), we have
$$\mathbb{E}_{x\sim S}\Big[\big(f_t(x)-\widetilde f_\lambda(x)\big)^2\Big] \le c(\eta,t)\cdot y^TXV\big(S^{-1}(I-(I-\eta S)^t)\big)^2\,S\,V^TX^Ty \quad (106)$$
$$= c(\eta,t)\cdot y^TXV\big(S^{-1}(I-(I-\eta S)^t)\big)\,S\,\big(S^{-1}(I-(I-\eta S)^t)\big)V^TX^Ty \quad (107)$$
$$= c(\eta,t)\cdot\|Xw_t\|_2^2 \quad (108)$$
$$= c(\eta,t)\cdot\mathbb{E}_{x\sim S}\big[f_t(x)^2\big]\,, \quad (109)$$
where $c(\eta,t) = \min\!\big(0.25,\,\frac{1}{t^2 s_{\min}^2\eta^2}\big)$.
Part 2. By $\Sigma$ we denote the underlying true covariance matrix. We now consider the squared difference of the output at an unseen point:
$$\mathbb{E}_{x\sim D_{\mathcal X}}\Big[\big(f_t(x)-\widetilde f_\lambda(x)\big)^2\Big] = \mathbb{E}_{x\sim D_{\mathcal X}}\Big[\big|x^Tw_t - x^T\widetilde w_\lambda\big|^2\Big] \quad (110)$$
$$= \mathbb{E}_{x\sim D_{\mathcal X}}\Big[\big|x^TV\big(S^{-1}(I-(I-\eta S)^t) - (S+\lambda I)^{-1}\big)V^TX^Ty\big|^2\Big] \quad (111\text{–}112)$$
$$= y^TXV\big(S^{-1}(I-(I-\eta S)^t) - (S+\lambda I)^{-1}\big)V^T\Sigma V\big(S^{-1}(I-(I-\eta S)^t) - (S+\lambda I)^{-1}\big)V^TX^Ty \quad (113\text{–}114)$$
$$\le \sigma_{\max}\cdot y^TXV\underbrace{\Big(S^{-1}\big(I-(I-\eta S)^t\big) - (S+\lambda I)^{-1}\Big)}_{\text{I}}^{2}\,V^TX^Ty\,, \quad (115)$$
where $\sigma_{\max}$ is the maximum eigenvalue of the underlying covariance matrix $\Sigma$. Using the upper bound (105) on term I, we have
$$\mathbb{E}_{x\sim D_{\mathcal X}}\Big[\big(f_t(x)-\widetilde f_\lambda(x)\big)^2\Big] \le \sigma_{\max}\cdot c(\eta,t)\cdot y^TXV\big(S^{-1}(I-(I-\eta S)^t)\big)^2V^TX^Ty \quad (116)$$
$$= \kappa\cdot c(\eta,t)\cdot\sigma_{\min}\cdot\big\|V\big(S^{-1}(I-(I-\eta S)^t)\big)V^TX^Ty\big\|_2^2 \quad (117)$$
$$\le \kappa\cdot c(\eta,t)\cdot\big[V\big(S^{-1}(I-(I-\eta S)^t)\big)V^TX^Ty\big]^T\,\Sigma\,\big[V\big(S^{-1}(I-(I-\eta S)^t)\big)V^TX^Ty\big] \quad (118\text{–}119)$$
$$= \kappa\cdot c(\eta,t)\cdot\mathbb{E}_{x\sim D_{\mathcal X}}\big[\big|x^Tw_t\big|^2\big]\,. \quad (120)$$
B.5. Extension to deep learning
Under Assumption 1 (justified in App. B.6), we present the formal result parallel to Theorem 3.
Theorem 6. Consider a multiclass classification problem with $k$ classes. Under Assumption 1, for any $\delta > 0$, with probability at least $1-\delta$, we have
$$\mathcal{E}_{D}(\hat f) \le \mathcal{E}_{S}(\hat f) + (k-1)\left(1 - \frac{k}{k-1}\,\mathcal{E}_{\widetilde S}(\hat f)\right) + c'\sqrt{\frac{\log(4/\delta)}{2m}}\,, \quad (121)$$
for some constant $c' \le (c+1)k + \sqrt{k} + \frac{m}{n\sqrt{k}}$, where $c$ denotes the constant in Assumption 1.
The proof follows exactly as in steps (i) to (iii) of the proof of Theorem 3.
B.6. Justifying Assumption 1
Motivated by the analysis of linear models, we now discuss alternative (and weaker) conditions that imply Assumption 1. We need hypothesis stability (Condition 1) and the following assumption relating the training error and the leave-one-out error:
Assumption 2. Let $\hat f$ be a model obtained by training with algorithm $A$ on a mixture of clean data $S$ and randomly labeled data $\widetilde S$. Then we assume that
$$\mathcal{E}_{\widetilde S_M}(\hat f) \le \mathcal{E}_{\mathrm{LOO}}(\widetilde S_M)\,,$$
where $\hat f_{(i)} := f\big(A, (S\cup\widetilde S)^{(i)}\big)$ for $(x_i, y_i) \in \widetilde S_M$ and $\mathcal{E}_{\mathrm{LOO}}(\widetilde S_M) := \frac{\sum_{(x_i,y_i)\in\widetilde S_M}\mathcal{E}(\hat f_{(i)}(x_i),\,y_i)}{|\widetilde S_M|}$.
Intuitively, this assumption states that the error on a (mislabeled) datum $(x, y)$ included in the training set is less than the error on that datum $(x, y)$ obtained by a model trained on the training set $S \setminus \{(x, y)\}$. We proved this for linear models trained with GD in the proof of Theorem 5. Condition 1 with $\beta = O(1)$ and Assumption 2, together with Lemma 12, imply Assumption 1 with a polynomial residual term (instead of logarithmic in $1/\delta$):
$$\mathcal{E}_{\widetilde S_M}(\hat f) \le \mathcal{E}_{D'}(\hat f) + \sqrt{\frac{1}{\delta}\left(\frac{1}{m} + \frac{3\beta}{m+n}\right)}\,. \quad (122)$$
C. Additional experiments and details

C.1. Datasets
Toy Dataset. Assume fixed constants $\mu$ and $\sigma$. For a given label $y$, we simulate features $x$ in our toy classification setup as follows:
$$x := \mathrm{concat}[x_1, x_2]\quad\text{where}\quad x_1 \sim \mathcal{N}\big(y\cdot\mu,\, \sigma^2 I_{d\times d}\big)\ \text{and}\ x_2 \sim \mathcal{N}\big(0,\, \sigma^2 I_{d\times d}\big)\,.$$
In experiments throughout the paper, we fix the dimension $d = 100$, $\mu = 1.0$, and $\sigma = \sqrt{d}$. Intuitively, $x_1$ carries the information about the underlying label and $x_2$ is additional noise independent of the underlying label.
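The generative process above can be sketched as follows (the function name and defaults are ours):

```python
import numpy as np

def make_toy_dataset(n, d=100, mu=1.0, seed=0):
    """x = concat[x1, x2]: x1 ~ N(y*mu, sigma^2 I_d) is label-informative,
    x2 ~ N(0, sigma^2 I_d) is pure noise; sigma = sqrt(d) as in the text."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(d)
    y = rng.choice([-1.0, 1.0], size=n)                      # balanced labels
    x1 = y[:, None] * mu + sigma * rng.normal(size=(n, d))   # informative block
    x2 = sigma * rng.normal(size=(n, d))                     # pure-noise block
    return np.concatenate([x1, x2], axis=1), y

X, y = make_toy_dataset(1000)
assert X.shape == (1000, 200)
```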
CV datasets. We use MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky & Hinton, 2009). We produce a binary variant of each multiclass classification problem by mapping classes {0, 1, 2, 3, 4} to label 1 and {5, 6, 7, 8, 9} to label -1. For the CIFAR dataset, we also use the standard data augmentation of random crop and horizontal flip. The PyTorch code is as follows:

transforms.Compose([transforms.RandomCrop(32, padding=4),
                    transforms.RandomHorizontalFlip()])
NLP dataset. We use the IMDb sentiment analysis corpus (Maas et al., 2011).
C.2. Architecture Details
All experiments were run on NVIDIA GeForce RTX 2080 Ti GPUs. We used PyTorch (Paszke et al., 2019) and Keras with a TensorFlow (Abadi et al., 2016) backend for experiments.
Linear model. For the toy dataset, we use a linear model with scalar output and the same number of parameters as the number of input dimensions.
Wide nets. To simulate the NTK regime, we experiment with 2-layer wide nets. The PyTorch code for the 2-layer wide MLP is as follows:

nn.Sequential(nn.Flatten(),
              nn.Linear(input_dims, 200000, bias=True),
              nn.ReLU(),
              nn.Linear(200000, 1, bias=True))
We experiment with both (i) the second layer fixed at its random initialization and (ii) updating both layers' weights.
Deep nets for CV tasks. We consider a 4-layer MLP. The PyTorch code for the 4-layer MLP is as follows:

nn.Sequential(nn.Flatten(),
              nn.Linear(input_dim, 5000, bias=True),
              nn.ReLU(),
              nn.Linear(5000, 5000, bias=True),
              nn.ReLU(),
              nn.Linear(5000, 5000, bias=True),
              nn.ReLU(),
              nn.Linear(5000, num_label, bias=True))
For MNIST, we use 1000 nodes instead of 5000 in each hidden layer. We also experiment with convolutional nets; in particular, we use ResNet18 (He et al., 2016). Implementation adapted from: https://github.com/kuangliu/pytorch-cifar.git.
Deep nets for NLP. We use a simple LSTM model with embeddings initialized with ELMo embeddings (Peters et al., 2018). Code adapted from: https://github.com/kamujun/elmo_experiments/blob/master/elmo_experiment/notebooks/elmo_text_classification_on_imdb.ipynb
We also evaluate our bounds with a BERT model. In particular, we fine-tune an off-the-shelf uncased BERT model (Devlin et al., 2018). Code adapted from Hugging Face Transformers (Wolf et al., 2020): https://huggingface.co/transformers/v3.1.0/custom_datasets.html.
C.3. Additional experiments
Results with SGD on underparameterized linear models

Figure 3. Accuracy and the corresponding bound (RHS in (1)) at δ = 0.1 for the toy binary classification task: accuracy vs. fraction of unlabeled data (w.r.t. clean data) in the toy setup with a linear model trained with SGD. Results aggregated over 3 seeds; parallel to Fig. 2(a) with SGD.
Results with wide nets on binary MNIST

Figure 4. Accuracy and the corresponding bound (RHS in (1)) at δ = 0.1 for binary MNIST classification: accuracy vs. fraction of unlabeled data for a 2-layer wide network, training both layers with (a) GD with MSE loss and (b) SGD with CE loss, and training only the first layer with (c) SGD with MSE loss. Results aggregated over 3 seeds; parallel to Fig. 2(b).
Results on CIFAR10 and MNIST. We plot epoch-wise error curves for the results in Table 1 (Fig. 5 and Fig. 6). We observe the same trend as in Fig. 1. Additionally, we plot an oracle bound obtained by tracking the error on mislabeled data that was nevertheless predicted as the true label. To obtain an exact empirical value of the oracle bound, we need the underlying true labels for the randomly labeled data. While with access to only extra unlabeled data we cannot calculate the oracle bound, we note that the oracle bound is very tight and never violated in practice, underscoring an important aspect of generalization in multiclass problems. This highlights that an even stronger conjecture may hold in multiclass classification: the error on mislabeled data (where the true label was nevertheless predicted) lower bounds the population error on the distribution of mislabeled data, and hence the error on (a specific) mislabeled portion predicts the population accuracy on clean data. On the other hand, the dominating term in Theorem 3 is loose when compared with the oracle bound. The main reason, we believe, is the pessimistic upper bound (45) in the proof of Lemma 8. We leave an investigation of this gap for future work.
Results on CIFAR100. On CIFAR100, our bound in (5) yields vacuous bounds. However, the oracle bound explained above yields tight guarantees in the initial phase of learning (i.e., when the learning rate is less than 0.1) (Fig. 7).
C.4. Hyperparameter Details
Fig. 1. We use a clean training dataset of size 40,000. We fix the amount of unlabeled data at 20% of the clean size, i.e., we include an additional 8,000 points with randomly assigned labels. We use a test set of 10,000 points. For both MLP and ResNet,
Figure 5. Per-epoch curves for CIFAR10 corresponding to the results in Table 1: (a) MLP, (b) ResNet. As before, we plot just the dominating term in the RHS of (5) as the predicted bound. Additionally, we plot the lower bound predicted by the error on mislabeled data that was nevertheless predicted as the true label; we refer to this as the "Oracle bound". See text for more details.
Figure 6. Per-epoch curves for MNIST corresponding to the results in Table 1: (a) MLP, (b) ResNet. As before, we plot just the dominating term in the RHS of (5) as the predicted bound, together with the "Oracle bound" described in the text.
we use SGD with an initial learning rate of 0.1 and momentum 0.9. We fix the weight decay parameter at 5 × 10⁻⁴. After 100 epochs, we decay the learning rate to 0.01. We use an SGD batch size of 100.
Fig. 2(a). We generate a toy dataset according to the process described in Sec. C.1. We fix d = 100 and create a dataset of 50,000 points with balanced classes. Moreover, we sample additional covariates with the same procedure to create the randomly labeled dataset. For both SGD and GD training, we use a fixed learning rate of 0.1.
Fig. 2 (b) Similar to binary CIFAR, we use a clean training dataset of size 40,000 and fix the amount of unlabeled data at 20% of the clean dataset size. To train wide nets, we use a fixed learning rate of 0.001 with GD and SGD. We choose the weight decay parameter and the early stopping point that maximize our generalization bound (i.e., without peeking at unseen data). We use an SGD batch size of 100.
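The selection rule above can be sketched as an argmax over candidate configurations. Because the bound of Eq. (5) is not reproduced in this appendix, `predicted_bound` below is a hypothetical placeholder computed only from training-set statistics, and the run statistics are made up:

```python
# Choose the (weight_decay, stopping_epoch) pair maximizing the bound,
# never consulting held-out data.
def select_checkpoint(runs, predicted_bound):
    """runs: dict mapping (weight_decay, epoch) -> training-set statistics."""
    return max(runs, key=lambda cfg: predicted_bound(runs[cfg]))

# Toy usage: clean-data training error and error on randomly labeled data.
runs = {
    (1e-4, 10): {"clean_err": 0.10, "rand_err": 0.20},
    (1e-4, 20): {"clean_err": 0.05, "rand_err": 0.45},
    (1e-3, 20): {"clean_err": 0.08, "rand_err": 0.46},
}
# Placeholder score (NOT Eq. (5)): rewards low clean error and
# near-chance (0.5) error on the randomly labeled points.
best = select_checkpoint(runs, lambda s: s["rand_err"] - s["clean_err"])
```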
Fig. 2 (c) With the IMDb dataset, we use a clean dataset of size 20,000 and, as before, fix the amount of unlabeled data at 20% of the clean data. To train the ELMo model, we use the Adam optimizer with a fixed learning rate of 0.01 and weight decay $10^{-6}$ to minimize the cross-entropy loss. We train with batch size 32 for 3 epochs. To fine-tune the BERT model, we use the Adam optimizer with learning rate $5 \times 10^{-5}$ to minimize the cross-entropy loss. We train with a batch size of 16 for 1 epoch.
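For reference, the Adam update used in these runs can be written out as a textbook Adam step with the stated hyperparameters; the quadratic surrogate loss in the usage below is made up, and this is not the authors' training loop:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam step; lr = 0.01 with weight_decay = 1e-6 (ELMo) or lr = 5e-5 (BERT)."""
    grad = grad + weight_decay * w           # classic L2 regularization
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy run minimizing 0.5 * ||w||^2 (gradient is w) with the ELMo settings.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adam_step(w, grad=w, m=m, v=v, t=t, lr=0.01, weight_decay=1e-6)
```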
Table 1 For multiclass datasets, we train both MLP and ResNet with the same hyperparameters as described before. We sample a clean training dataset of size 40,000 and fix the amount of unlabeled data at 20% of the clean size. We use SGD with an initial learning rate of 0.1 and momentum 0.9. We fix the weight decay parameter at $5 \times 10^{-4}$. After 30 epochs for ResNet and after 50 epochs for MLP, we decay the learning rate to 0.01. We use SGD with batch size 100. For Fig. 7, we use the same hyperparameters as CIFAR10 training, except we now decay the learning rate after 100 epochs.
In all experiments, to identify the best possible accuracy on just the clean data, we use the exact same set of hyperparameters except the stopping point. We choose a stopping point that maximizes test performance.
[Figure 7 appears here: per-epoch test accuracy and oracle bound; y-axis: accuracy, x-axis: epoch.]

Figure 7. Lower bound predicted by the error on mislabeled data that were nevertheless predicted as their true label, with ResNet18 on CIFAR100. We refer to this as the "Oracle bound". See text for more details. The bound predicted by RATT (RHS in (5)) is vacuous.
C.5. Summary of experiments
Classification type | Model category                | Model                  | Dataset
------------------- | ----------------------------- | ---------------------- | -------------------------
Binary              | Low dimensional               | Linear model           | Toy Gaussian dataset
Binary              | Overparameterized linear nets | 2-layer wide net       | Binary MNIST
Binary              | Deep nets                     | MLP                    | Binary MNIST, Binary CIFAR
Binary              | Deep nets                     | ResNet                 | Binary MNIST, Binary CIFAR
Binary              | Deep nets                     | ELMo-LSTM model        | IMDb Sentiment Analysis
Binary              | Deep nets                     | BERT pre-trained model | IMDb Sentiment Analysis
Multiclass          | Deep nets                     | MLP                    | MNIST, CIFAR10
Multiclass          | Deep nets                     | ResNet                 | MNIST, CIFAR10, CIFAR100
D. Proof of Lemma 12

Proof of Lemma 12. Recall that we have a training set $S \cup \widetilde{S}$. We defined the leave-one-out error on mislabeled points as
$$\mathcal{E}_{\mathrm{LOO}}(\widetilde{S}_M) = \frac{\sum_{(x_i, y_i) \in \widetilde{S}_M} \mathcal{E}\big(f^{(i)}(x_i), y_i\big)}{\big|\widetilde{S}_M\big|},$$
where $f^{(i)} := f\big(\mathcal{A}, (S \cup \widetilde{S})^{(i)}\big)$. Define $S' := S \cup \widetilde{S}$. Let $(x, y)$ and $(x', y')$ be i.i.d. samples from $\mathcal{D}'$. Using Lemma 25 in Bousquet & Elisseeff (2002), we have
$$\begin{aligned}
\mathbb{E}\left[\Big(\mathcal{E}_{\mathcal{D}'}(\widehat{f}) - \mathcal{E}_{\mathrm{LOO}}(\widetilde{S}_M)\Big)^2\right]
&\le \mathbb{E}_{S', (x,y), (x',y')}\Big[\mathcal{E}\big(\widehat{f}(x), y\big)\, \mathcal{E}\big(\widehat{f}(x'), y'\big)\Big]
- 2\, \mathbb{E}_{S', (x,y)}\Big[\mathcal{E}\big(\widehat{f}(x), y\big)\, \mathcal{E}\big(f^{(i)}(x_i), y_i\big)\Big] \\
&\quad + \frac{m' - 1}{m'}\, \mathbb{E}_{S'}\Big[\mathcal{E}\big(f^{(i)}(x_i), y_i\big)\, \mathcal{E}\big(f^{(j)}(x_j), y_j\big)\Big]
+ \frac{1}{m'}\, \mathbb{E}_{S'}\Big[\mathcal{E}\big(f^{(i)}(x_i), y_i\big)\Big]. \qquad (123)
\end{aligned}$$

We can rewrite the equation above as:
$$\begin{aligned}
\mathbb{E}\left[\Big(\mathcal{E}_{\mathcal{D}'}(\widehat{f}) - \mathcal{E}_{\mathrm{LOO}}(\widetilde{S}_M)\Big)^2\right]
&\le \underbrace{\mathbb{E}_{S', (x,y), (x',y')}\Big[\mathcal{E}\big(\widehat{f}(x), y\big)\, \mathcal{E}\big(\widehat{f}(x'), y'\big) - \mathcal{E}\big(\widehat{f}(x), y\big)\, \mathcal{E}\big(f^{(i)}(x_i), y_i\big)\Big]}_{\mathrm{I}} \\
&\quad + \underbrace{\mathbb{E}_{S'}\Big[\mathcal{E}\big(f^{(i)}(x_i), y_i\big)\, \mathcal{E}\big(f^{(j)}(x_j), y_j\big) - \mathcal{E}\big(\widehat{f}(x), y\big)\, \mathcal{E}\big(f^{(i)}(x_i), y_i\big)\Big]}_{\mathrm{II}} \\
&\quad + \frac{1}{m'}\, \underbrace{\mathbb{E}_{S'}\Big[\mathcal{E}\big(f^{(i)}(x_i), y_i\big) - \mathcal{E}\big(f^{(i)}(x_i), y_i\big)\, \mathcal{E}\big(f^{(j)}(x_j), y_j\big)\Big]}_{\mathrm{III}}. \qquad (124)
\end{aligned}$$

We will now bound term III. Using the Cauchy-Schwarz inequality, we have
$$\mathbb{E}_{S'}\Big[\mathcal{E}\big(f^{(i)}(x_i), y_i\big) - \mathcal{E}\big(f^{(i)}(x_i), y_i\big)\, \mathcal{E}\big(f^{(j)}(x_j), y_j\big)\Big]^2
\le \mathbb{E}_{S'}\Big[\mathcal{E}\big(f^{(i)}(x_i), y_i\big)\Big]^2\, \mathbb{E}_{S'}\Big[1 - \mathcal{E}\big(f^{(j)}(x_j), y_j\big)\Big]^2 \qquad (125)$$
$$\le \frac{1}{4}. \qquad (126)$$

Note that since $(x_i, y_i)$, $(x_j, y_j)$, $(x, y)$, and $(x', y')$ are all drawn from the same distribution $\mathcal{D}'$, we directly incorporate the bounds on terms I and II from the proof of Lemma 9 in Bousquet & Elisseeff (2002). Combining this with (126) and our definition of hypothesis stability in Condition 1 gives the required claim.
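The leave-one-out error on mislabeled points used in this proof can be sketched numerically. The nearest-centroid learner below is a hypothetical stand-in for the algorithm $\mathcal{A}$, and the data, separation scale, and flipped-label set are made up for illustration:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """A stand-in learning algorithm: classify by the nearer class centroid."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda x: int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

def loo_error_mislabeled(X, y, mislabeled_idx):
    """E_LOO(S~_M): average 0-1 loss of f^(i) on each held-out mislabeled point."""
    errs = []
    for i in mislabeled_idx:
        mask = np.ones(len(y), dtype=bool)
        mask[i] = False                       # train on (S ∪ S~)^(i), i.e. without point i
        f_i = nearest_centroid_fit(X[mask], y[mask])
        errs.append(f_i(X[i]) != y[i])        # 0-1 loss E(f^(i)(x_i), y_i)
    return float(np.mean(errs))

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
X = rng.standard_normal((200, 5)) + 4.0 * y[:, None]    # well-separated classes
mislabeled_idx = np.arange(160, 200)                    # 20% of points get flipped labels
y_noisy = y.copy()
y_noisy[mislabeled_idx] = 1 - y_noisy[mislabeled_idx]
err = loo_error_mislabeled(X, y_noisy, mislabeled_idx)
```

Because the learner recovers the true labels, its error on the held-out mislabeled points (measured against their assigned labels) is close to one, mirroring the high random-label error that the paper's bound exploits.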