
Understanding Noise in Machine Learning

Runtian Zhai

School of Electronics Engineering and Computer Science
School of Mathematical Science (double major)

Peking University

[email protected]

July 16, 2019

A talk at UCLA.

Runtian Zhai (PKU) Understanding Noise July 16, 2019 1 / 45

Introduction: Noise is Everywhere

Noise is everywhere. Since as early as the 1920s, statisticians have been searching for ways to combat noise in collected data.
In machine learning the problem is more serious. Training sets are labeled by humans, and humans make mistakes. In computer vision, many images are corrupted, blurred, or compressed.
An even more dangerous kind of noise is the adversarial example: noise crafted to fool a particular classifier.


Learning with Noise

People have proposed various ways to learn with noise:
Some propose detection methods that identify noisy samples in a dataset so that they can be removed.
Others suggest that even noisy samples can be useful. For instance, co-teaching is proposed to help networks learn on noisy datasets.
Many defense methods have been proposed against adversarial examples. The most successful one so far is adversarial training, which trains on adversarial examples generated on the fly.


How Noise Affects Training

Outline

1 How Noise Affects Training
    Critical Learning Periods
    Different Kinds of Noise are Different
    Two Phases of Learning

2 Noise Fits More Slowly Than Clean Data
    Zhang’s Experiment and Its Explanation
    Measuring Dataset Complexity
    Detecting Noise

3 More Theoretical Approaches
    Influence Function
    Neural Tangent Kernel
    Fourier Analysis


How Noise Affects Training Critical Learning Periods

Critical Learning Periods

Critical Learning Periods in Deep Networks
Achille et al. (UCLA) [7], ICLR 2019

In biology, the first several weeks after an animal's birth, known as the critical learning period, are critical for its intellectual development.
The same holds in deep learning. If a network is trained on noisy images during the first several epochs, its performance stays impaired even if it is later trained on clean images.



Experiment I

To show that this biological behavior also exists in deep learning, the authors ran the following experiment:

They trained an All-CNN on CIFAR-10. During the first N epochs the network was trained on noisy images. After that, it was trained on clean images for another 160 epochs.
The noisy images were blurred images: each 32 × 32 image was downsampled to 8 × 8 and then upsampled back to 32 × 32 with bilinear interpolation.
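The blur described above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the slides do not say how the downsampling was done, so block averaging is assumed here, and the function names are hypothetical.

```python
import numpy as np

def downsample_mean(img, factor):
    """Downsample an (H, W, C) image by averaging factor x factor blocks.
    Block averaging is an assumption; the slides only say "downsample"."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def upsample_bilinear(img, out_h, out_w):
    """Upsample an (H, W, C) image with bilinear interpolation."""
    in_h, in_w, _ = img.shape
    # Map each output pixel center to a fractional source coordinate.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def blur_deficit(img):
    """32 x 32 -> 8 x 8 -> 32 x 32, as in Experiment I."""
    return upsample_bilinear(downsample_mean(img, 4), 32, 32)
```

Applying `blur_deficit` to every training image during the first N epochs, and the identity afterwards, would reproduce the shape of the schedule in this experiment.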



Result: Early Deficit Has Irremediable Negative Effect



Experiment II

They trained the network on noisy images for 40 epochs starting from epoch N, and on clean images in all other epochs. These 40 epochs are called the deficit window.
They measured how much test accuracy decreases for different choices of the window onset N.
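The deficit-window schedule can be written down directly. The sketch below is illustrative only; the helper names and the total epoch count in the usage note are hypothetical, and only the 40-epoch window length comes from the slide.

```python
def uses_noisy_data(epoch, onset, window=40):
    """True if `epoch` lies inside the deficit window [onset, onset + window)."""
    return onset <= epoch < onset + window

def schedule(total_epochs, onset, window=40):
    """Per-epoch data source for a run with a single deficit window."""
    return ["noisy" if uses_noisy_data(e, onset, window) else "clean"
            for e in range(total_epochs)]
```

For example, `schedule(200, onset=20)` marks epochs 20 to 59 as noisy and all others as clean; sweeping `onset` gives the family of runs compared in this experiment.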



Result: Early Epochs are More Critical


How Noise Affects Training Different Kinds of Noise are Different

Different Kinds of Noise

The authors repeated the first experiment with different kinds of noise:
Blur: 32 × 32 images downsampled to 8 × 8, then upsampled to 32 × 32 with bilinear interpolation.
Vertical flip: flip the images vertically.
Label permutation: use random labels.
Noise: replace all images with random noise.
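Three of these deficits are one-liners on a batch of images. The sketch below is an assumption-laden illustration: the function name is hypothetical, random labels are produced by shuffling the existing labels (one of several possible readings of "random labels"), and the random images are drawn uniformly from [0, 1). Blur is omitted to keep the sketch short.

```python
import numpy as np

def apply_deficit(images, labels, kind, rng):
    """images: (N, H, W, C) floats in [0, 1]; labels: (N,) integer class ids."""
    if kind == "vertical_flip":
        # Reverse the height axis of every image.
        return images[:, ::-1, :, :], labels
    if kind == "label_permutation":
        # Shuffle labels across samples, which decouples them from the images.
        return images, rng.permutation(labels)
    if kind == "noise":
        # Replace every image with uniform random pixels (assumed distribution).
        return rng.random(images.shape), labels
    raise ValueError(f"unknown deficit kind: {kind}")
```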

They also tested networks of different depth.



Different Kinds of Noise are Different

Compared with Blur, the effect of Noise is not as strong. For Vertical flip and Label permutation, the effect is very weak.
The deeper the network, the stronger the effect.


How Noise Affects Training Two Phases of Learning

Two Phases of Learning

The authors performed a Fisher information analysis of the training process:
They used the trace of the Fisher Information Matrix (FIM) to measure how much information the network had learned.
Training has two phases: in Phase I, the trace of the FIM rises quickly, showing that the network is learning; in Phase II, it drops dramatically (while performance is still improving), showing that the network starts to forget.
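To make "trace of the FIM" concrete, here is a minimal sketch for a softmax-linear classifier p(y|x) = softmax(Wx). This toy model is an assumption chosen because the gradient has a closed form; the paper studies deep networks, where the trace has to be estimated from sampled gradients. The trace is the expected squared norm of the gradient of log p(y|x), with y drawn from the model itself.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fisher_trace(W, X):
    """Trace of the FIM of p(y|x) = softmax(W x), averaged over the rows of X.

    W: (K, d) weights, X: (N, d) inputs.
    """
    P = softmax(X @ W.T)                 # (N, K) predicted class probabilities
    K = W.shape[0]
    x_sq = (X ** 2).sum(axis=1)          # ||x||^2 per sample
    total = 0.0
    for y in range(K):
        onehot = np.zeros(K)
        onehot[y] = 1.0
        # grad_W log p(y|x) = (onehot - p) x^T, so its squared norm factorizes.
        grad_sq = ((onehot - P) ** 2).sum(axis=1) * x_sq
        total += (P[:, y] * grad_sq).mean()   # weight by model probability of y
    return total
```

Logging `fisher_trace` once per epoch would give the kind of curve whose rise and fall defines the two phases.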


Two Phases of Learning (cont.)

Many other papers [10, 11] also found, from the optimization perspective, that training has two phases.
It is well known that during training, noise fits more slowly than clean data. Many recent papers argue that in Phase I the network fits clean data, whereas in Phase II it fits noise, so the network appears to forget useful information.
This also explains why early stopping works.


Noise Fits More Slowly Than Clean Data






Noise Fits More Slowly Than Clean Data Zhang’s Experiment and Its Explanation

Deep Networks Can Fit Random Labels

Understanding Deep Learning Requires Rethinking Generalization
Zhang et al. [5], ICLR 2017

In this paper, the authors ran the following experiment: they added many kinds of noise to CIFAR-10 (random labels, random pixels, Gaussian noise, etc.) and then trained an Inception model on it. They found that

Deep networks fit noisy data easily.
However, fitting noisy data takes much longer than fitting clean data.



Explaining the Results

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Arora et al. [4], ICML 2019

In this paper, the authors prove, for an overparameterized two-layer fully-connected network, that

GD (gradient descent) converges (achieves zero training loss) on datasets with random labels.
GD converges more slowly on random labels than on clean labels.
Label noise can harm generalization.



Basic Setting

A two-layer ReLU network with m neurons is

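$$f_{W,a}(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \sigma(w_r^\top x)$$

where $x \in \mathbb{R}^d$ is the input, $W = (w_1, \dots, w_m) \in \mathbb{R}^{d \times m}$ is the weight of the first layer, $a = (a_1, \dots, a_m)^\top \in \mathbb{R}^m$ is the weight of the second layer, and $\sigma$ is the ReLU activation. Assume $\|x\|_2 = 1$ and $|y| \le 1$.
At initialization, $w_r(0) \sim \mathcal{N}(0, \kappa^2 I)$ and $a_r \sim \mathrm{unif}(\{-1, 1\})$. Fix the second layer $a$ and train only the first layer $W$. Denote by $W(k)$ the value of $W$ at step $k$.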
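As a sanity check on the notation, the network and its initialization can be written in NumPy. This is a minimal sketch of the setting above; the function names are hypothetical.

```python
import numpy as np

def init_params(d, m, kappa, rng):
    """w_r(0) ~ N(0, kappa^2 I) as columns of W; a_r uniform on {-1, +1}."""
    W = kappa * rng.standard_normal((d, m))
    a = rng.choice([-1.0, 1.0], size=m)
    return W, a

def forward(W, a, X):
    """Evaluate f_{W,a} on each row of X, where X has shape (n, d)."""
    m = W.shape[1]
    hidden = np.maximum(X @ W, 0.0)      # ReLU of w_r^T x for every neuron r
    return hidden @ a / np.sqrt(m)
```

Only `W` would be updated during training; `a` stays at its random initial value, as in the setting above.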



Trajectory Based Analysis

Use MSE (Mean Square Error) as the loss function:

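$$\Phi(W) = \frac{1}{2} \sum_{i=1}^{n} (y_i - f_{W,a}(x_i))^2$$

Let the trajectory of the network be $u = (u_1, \dots, u_n)^\top$, where $u_i = f_{W,a}(x_i)$. Then the loss function is $\Phi(W) = \frac{1}{2} \|y - u\|_2^2$, where $y = (y_1, \dots, y_n)^\top$. Train with GD with learning rate $\eta$.
Define $H^{\infty}$ as a Gram matrix:

$$H^{\infty}_{ij} = \mathbb{E}_{w \sim \mathcal{N}(0, I)}\left[ x_i^\top x_j \, \mathbb{I}\{w^\top x_i \ge 0, \, w^\top x_j \ge 0\} \right] = \frac{x_i^\top x_j \left(\pi - \arccos(x_i^\top x_j)\right)}{2\pi}, \quad \forall i, j \in [n]$$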
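The closed form of the Gram matrix is easy to compute, and the expectation it comes from can be checked by sampling. The sketch below assumes unit-norm inputs, as in the basic setting; the function names and the sample count are hypothetical choices.

```python
import numpy as np

def gram_infinity(X):
    """Closed-form H^inf for unit-norm rows of X, shape (n, d)."""
    G = np.clip(X @ X.T, -1.0, 1.0)      # inner products, clipped for arccos
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def gram_monte_carlo(X, num_samples, rng):
    """Estimate the same matrix by sampling w ~ N(0, I)."""
    Wmc = rng.standard_normal((num_samples, X.shape[1]))
    active = (Wmc @ X.T >= 0).astype(float)       # indicator of w^T x_i >= 0
    both_active = active.T @ active / num_samples  # P(both indicators are 1)
    return (X @ X.T) * both_active
```

With unit-norm inputs every diagonal entry equals 1/2, since the indicator is 1 with probability 1/2 and the inner product of a unit vector with itself is 1.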



Main Theorem

Assumptions: The initial variance $\kappa^2$ and the learning rate $\eta$ are small enough, and the width $m$ is large enough.
Lemma: Under the above assumptions, during training the real trajectory $\{u(k)\}_{k=0}^{\infty}$ stays close to another sequence $\{\tilde{u}(k)\}_{k=0}^{\infty}$ that has a linear update rule: $\tilde{u}(k+1) = \tilde{u}(k) - \eta H^{\infty}(\tilde{u}(k) - y)$. By analyzing the dynamics of $\tilde{u}(k)$ we can prove that

$$\Phi(W(k)) \approx \frac{1}{2} \left\| (I - \eta H^{\infty})^k y \right\|_2^2$$

uniformly for all $k \ge 0$ with high probability.
If $H^{\infty}$ is positive definite, then $\Phi(W(k)) \to 0$ as $k \to \infty$, which implies that GD converges even if $y$ is random.
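The linear update rule can be simulated directly and compared with the closed-form loss. This sketch assumes the linear sequence starts at zero (which matches a small initialization scale) and works for any symmetric positive definite matrix passed as `H`; it does not train a real network, and the function names are hypothetical.

```python
import numpy as np

def linear_dynamics_losses(H, y, eta, steps):
    """Iterate u(k+1) = u(k) - eta * H (u(k) - y) from u(0) = 0.

    Returns the array of 0.5 * ||y - u(k)||^2 for k = 0, ..., steps.
    """
    u = np.zeros_like(y, dtype=float)
    losses = []
    for _ in range(steps + 1):
        losses.append(0.5 * np.sum((y - u) ** 2))
        u = u - eta * (H @ (u - y))
    return np.array(losses)

def closed_form_loss(H, y, eta, k):
    """0.5 * ||(I - eta H)^k y||^2."""
    M = np.linalg.matrix_power(np.eye(len(y)) - eta * H, k)
    return 0.5 * np.sum((M @ y) ** 2)
```

The two agree because the residual satisfies y - u(k+1) = (I - eta H)(y - u(k)), so the loss shrinks at every step whenever eta times the largest eigenvalue of H is below 1.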



Convergence Rate

Write the eigen-decomposition $H^{\infty} = \sum_{i=1}^{n} \lambda_i v_i v_i^\top$, then

$$\|y - \tilde{u}(k)\|_2^2 = \sum_{i=1}^{n} (1 - \eta \lambda_i)^{2k} (v_i^\top y)^2$$

Since $\tilde{u}(k)$ is very close to $u(k)$, the RHS can be used to estimate the convergence rate.
For a set of labels $y$, if they align with the top eigenvectors (i.e. $(v_i^\top y)^2$ is large for large $\lambda_i$), then GD converges quickly. Otherwise it converges more slowly.
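The role of label alignment can be seen by evaluating the right-hand side for two extreme label vectors. This is a minimal sketch for any symmetric positive definite `H`; the function name is hypothetical.

```python
import numpy as np

def spectral_residual(H, y, eta, k):
    """sum_i (1 - eta * lambda_i)^(2k) * (v_i^T y)^2 for symmetric H."""
    lam, V = np.linalg.eigh(H)      # eigenvalues in ascending order
    proj = V.T @ y                  # coordinates of y in the eigenbasis
    return np.sum((1 - eta * lam) ** (2 * k) * proj ** 2)
```

If `y` is the top eigenvector, the only surviving term decays at the fastest rate; if `y` is the bottom eigenvector, it decays at the slowest. Real label vectors fall in between, depending on how their energy is spread over the eigenvectors.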



Experimental Result

The authors showed experimentally that clean labels align well with the top eigenvectors, whereas random labels align randomly. This implies that GD fits noisy data more slowly.


Noise Fits More Slowly Than Clean Data Measuring Dataset Complexity

Analysis of Generalization

Theorem (Informal Version of Theorem 5.1)
If the underlying data distribution $\mathcal{D}$ is non-degenerate and the training set is sampled i.i.d. from $\mathcal{D}$, then for any 1-Lipschitz loss function $l : \mathbb{R} \times \mathbb{R} \to [0, 1]$ such that $l(y, y) = 0$, with probability at least $1 - \delta$ over the random initialization and the training samples, the two-layer ReLU network $f_{W(k),a}$ trained by GD has population loss $L_{\mathcal{D}}(f_{W(k),a}) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[l(f_{W(k),a}(x), y)]$ bounded as

$$L_{\mathcal{D}}(f_{W(k),a}) \le \sqrt{\frac{2 y^\top (H^{\infty})^{-1} y}{n}} + 3 \sqrt{\frac{\log(6/\delta)}{2n}} + o(1)$$



Dataset Complexity

$$L_{\mathcal{D}}(f_{W(k),a}) \le \sqrt{\frac{2 y^\top (H^{\infty})^{-1} y}{n}} + 3 \sqrt{\frac{\log(6/\delta)}{2n}} + o(1)$$

This formula implies that $\sqrt{2 y^\top (H^{\infty})^{-1} y / n}$ can be viewed as a complexity measure of the data. The more complex a dataset is, the harder it is for a deep model to fit it and to generalize on it.
To measure the complexity of different kinds of noise, I ran several experiments on noisy CIFAR-10, measuring its complexity with the above metric.
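Given a Gram matrix and a label vector, the complexity measure is a one-line computation. The sketch below takes the Gram matrix as an input rather than building it from images, and the function name is hypothetical; it is not the script used for the experiments in this talk.

```python
import numpy as np

def complexity(H, y):
    """sqrt(2 * y^T H^{-1} y / n) for a symmetric positive definite Gram matrix H."""
    n = len(y)
    # Solve H z = y instead of forming the inverse explicitly.
    return np.sqrt(2.0 * (y @ np.linalg.solve(H, y)) / n)
```

Because the inverse of `H` weights each eigen-direction by 1/λ, labels concentrated on small-eigenvalue directions get a large complexity, which matches the slow-convergence picture from the optimization analysis.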



My Experiment

Noise Type                    Complexity
No noise                      33.47
Random labels                 55.32
Same labels                   2.80
AE of a standard model        45.42
AE of an adv trained model    41.61
Gaussian (σ = 0.02)           30.17
Gaussian (σ = 0.05)           22.59

I arbitrarily selected two classes of CIFAR-10 (cars and birds) and selected 500 images from each class. Then I normalized all images to make sure that $\|x\|_2 \le 1$. The labels were set to {+1, −1}. I computed the complexity with the above formula.



My Experiment (cont.)

For random labels, the dataset becomes more complex (55.32 vs. 33.47), so it is harder to optimize and generalize.
If all labels are the same (+1), the complexity drops to 2.80, so it is very easy for a model to fit the dataset, as expected.



My Experiment (cont.)

Adversarial examples (AE) also make the dataset more complex. I generated AE for a normally trained model and for an adversarially trained model. The AE of the normally trained model turn out to be more complex (45.42 vs. 41.61).



My Experiment (cont.)

However, the metric fails to measure the complexity of Gaussian noise: the dataset with larger Gaussian noise added is simpler under this metric (22.59 for σ = 0.05 vs. 30.17 for σ = 0.02).


Noise Fits More Slowly Than Clean Data Detecting Noise

Detecting Noise

Since noise fits more slowly, we can detect noise in the following simple way: the more slowly a sample fits, the more likely it is noise.
A better method is co-teaching [12]. Two networks F1 and F2 are trained simultaneously and teach one another: F1 trains only on samples with small loss under F2, and vice versa. These samples are unlikely to be noise because F2 fits clean data faster than noise.
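The sample exchange at the heart of co-teaching can be sketched as follows. This covers only the selection step for one mini-batch, given per-sample losses from the two networks; the function names and the fixed `keep_ratio` parameter are hypothetical simplifications, and the gradient updates are left out.

```python
import numpy as np

def small_loss_indices(losses, keep_ratio):
    """Indices of the keep_ratio fraction of samples with the smallest loss."""
    k = max(1, int(round(keep_ratio * len(losses))))
    return np.argsort(losses)[:k]

def co_teaching_exchange(losses_f1, losses_f2, keep_ratio):
    """Return (samples F1 trains on, samples F2 trains on).

    Each network trains on the samples its peer considers clean.
    """
    train_f1 = small_loss_indices(losses_f2, keep_ratio)   # chosen by F2
    train_f2 = small_loss_indices(losses_f1, keep_ratio)   # chosen by F1
    return train_f1, train_f2
```

The same `small_loss_indices` helper also expresses the simpler detection rule: rank samples by loss and treat the largest-loss ones as suspected noise.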



Self Distillation

Inspired by co-teaching, Dong et al.* suggest that a model can teach itself: it can be taught by its previous checkpoints. At epoch N, the network trains only on samples that had small loss at epoch N − n0.
In addition, several ICML 2019 papers [1, 2, 3] address how to detect noisy data in the training set. All of them use the fact that noise fits more slowly than clean data.

* View this paper at http://www.runtianz.cn/doc/AIR.pdf. The link expires on Friday. This paper is under review. Please do not distribute.

More Theoretical Approaches






More Theoretical Approaches Influence Function

Influence Function: Detecting Outliers

Understanding Black-box Predictions via Influence Functions
Koh et al. [13], Best Paper, ICML 2017

The influence function measures how much influence a training sample has on the final classifier. It is a classical technique in statistics. Samples with large influence can be regarded as outliers.
Moreover, it can be computed efficiently with second-order optimization techniques.



Influence Function

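Let the training samples be $z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$. The model $F_\theta$ is parameterized by $\theta$. Let the empirical risk be $\frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta)$. The ERM (empirical risk minimizer) is given by $\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta)$.

Consider the change in $\hat{\theta}$ when a point $z$ is removed from the training set. Let the ERM after $z$ is removed be $\hat{\theta}_{-z}$. Statisticians tell us that

$$\hat{\theta}_{-z} - \hat{\theta} \approx -\frac{1}{n} I(z)$$

where

$$I(z) = \left. \frac{d \hat{\theta}_{\epsilon, z}}{d \epsilon} \right|_{\epsilon = 0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$

Here $\hat{\theta}_{\epsilon, z}$ is the minimizer of the empirical risk with $z$ upweighted by $\epsilon$, i.e. of $\frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta)$, and $H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta})$ is the Hessian.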
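For least-squares regression the Hessian and gradient have closed forms, so the removal approximation can be compared with actually refitting the model. The sketch below assumes the loss L(z, θ) = ½(xᵀθ − y)², which is a choice made for illustration and not a setting taken from the paper; the function names are hypothetical.

```python
import numpy as np

def fit_least_squares(X, y):
    """ERM for the squared loss 0.5 * (x^T theta - y)^2."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def influence(X, theta, x_z, y_z):
    """I(z) = -H^{-1} grad_theta L(z, theta) for the squared loss."""
    n = X.shape[0]
    H = X.T @ X / n                          # Hessian of the empirical risk
    grad = x_z * (x_z @ theta - y_z)         # gradient of the loss at z
    return -np.linalg.solve(H, grad)

def removal_shift_estimate(X, y, index):
    """Approximate theta_{-z} - theta by -(1/n) I(z), without refitting."""
    theta = fit_least_squares(X, y)
    return -influence(X, theta, X[index], y[index]) / X.shape[0]
```

Comparing `removal_shift_estimate(X, y, i)` with the difference between a refit on the data without sample `i` and the original fit shows how good the first-order approximation is; for this loss the two differ only by a factor that depends on the leverage of the removed point.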



Influence Function (cont.)

$I(z)$ is the influence function of $z$. The greater its norm, the more influence $z$ has on the model.
Furthermore, if we perturb $z = (x, y)$ to $z_\delta = (x + \delta, y)$, and the resulting ERM is $\hat{\theta}_{z_\delta, -z}$, then

$$\hat{\theta}_{z_\delta, -z} - \hat{\theta} \approx -\frac{1}{n} \left( I(z_\delta) - I(z) \right)$$

When $\delta$ is a certain kind of noise, we can estimate the complexity of that noise using this formula.

More Theoretical Approaches Neural Tangent Kernel

NTK: A Powerful Tool

Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Jacot et al. [6], NIPS 2018

The authors prove that in the infinite-width limit, a fully-connected network trained with GD evolves along the kernel gradient w.r.t. the NTK Θ, and that Θ converges in probability to a deterministic limiting kernel Θ∞.
Therefore, the kernel principal components of Θ∞ with the highest eigenvalues are fit first. Components with low eigenvalues can be regarded as noise.


More Theoretical Approaches Fourier Analysis

Fourier Analysis

Many recent papers propose to analyze the effect of noise using Fourieranalysis. For example:

On the Spectral Bias of Neural Networks
Rahaman et al. [8], ICML 2019

A Fourier Perspective on Model Robustness in Computer Vision
Yin et al. [9], arXiv:1906.08988



Fourier Analysis

Some results from Fourier analysis:
Neural networks tend to learn low-frequency functions, i.e. functions without local fluctuations.
Both Gaussian noise and adversarial examples are high-frequency noise. That is why normally trained networks are so vulnerable to them.
Common defense methods such as training with Gaussian noise and adversarial training improve robustness w.r.t. high-frequency noise, but reduce robustness w.r.t. low-frequency noise.
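One simple way to quantify "high-frequency" is the fraction of an image's spectral energy above a radial frequency cutoff. The sketch below is an illustration for single-channel images; the function name and the cutoff of 0.25 cycles per pixel are arbitrary assumptions, not values from the cited papers. It covers Gaussian noise only and makes no claim about adversarial perturbations.

```python
import numpy as np

def high_freq_energy_fraction(img, cutoff=0.25):
    """Fraction of spectral energy at radial frequency above `cutoff`.

    img: 2-D array; frequencies are in cycles per pixel.
    """
    spectrum = np.fft.fft2(img)
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    energy = np.abs(spectrum) ** 2
    return energy[radius > cutoff].sum() / energy.sum()
```

White Gaussian noise has a flat spectrum, so most of its energy counts as high-frequency simply because most frequency bins lie above the cutoff, whereas a smooth pattern concentrates its energy in a few low-frequency bins.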


Take-aways

Deep learning has critical learning periods. Noise in these periods degrades a network's performance significantly, while noise in other periods has a weaker effect.
There are two phases during training. In Phase I the network learns; in Phase II it forgets. The model achieves its best performance if training stops early, at the phase transition.
Neural networks fit noise more slowly than clean data, and we can use this fact to detect noise, as in co-teaching and self distillation.
Different kinds of noise have different complexity and different levels of impact on networks. Several metrics can be used to measure the complexity of noise, such as Arora's metric and influence functions.


Open Problem: Finding Relevant Data

Background: In semi-supervised learning (SSL), we have a huge amount of unlabeled data, and we only want to train on data that looks similar to the limited labeled data we have. The question is: how can we find data relevant to our task in a huge dataset?
Details: CIFAR-10 and CIFAR-100 are subsets of a large unlabeled dataset called 80 Million Tiny Images. I'd like to select the 500k images from it that are most relevant to CIFAR classification. To test the performance, I will run an SSL algorithm and look at its result.
Suggestions: We can try Arora's measurement of data complexity, influence functions, NTK analysis, Fourier analysis, etc. Please contact me if you are interested in solving this problem.


Any Questions?


Thank you.


References

[1] Thulasidasan et al. (2019). Combating Label Noise in Deep Learning using Abstention. ICML 2019.

[2] Shen et al. (2019). Learning with Bad Training Data via Iterative Trimmed Loss Minimization. ICML 2019.

[3] Chen et al. (2019). Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels. ICML 2019.

[4] Arora et al. (2019). Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. ICML 2019.

[5] Zhang et al. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017.

[6] Jacot et al. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NIPS 2018.

[7] Achille et al. (2019). Critical Learning Periods in Deep Networks. ICLR 2019.

[8] Rahaman et al. (2019). On the Spectral Bias of Neural Networks. ICML 2019.

[9] Yin et al. (2019). A Fourier Perspective on Model Robustness in Computer Vision. arXiv:1906.08988.

[10] Shwartz-Ziv et al. (2017). Opening the Black Box of Deep Neural Networks via Information. arXiv:1703.00810.

[11] Li et al. (2017). Convergence Analysis of Two-layer Neural Networks with ReLU Activation. NIPS 2017.

[12] Han et al. (2018). Co-teaching: Robust training of deep neural networks with extremely noisy labels. NIPS 2018.

[13] Koh et al. (2017). Understanding Black-box Predictions via Influence Functions. ICML 2017.

