Understanding Noise in Machine Learning
Runtian Zhai
School of Electronics Engineering and Computer Science
School of Mathematical Science (double major)
Peking University
July 16, 2019
A talk at UCLA.
Runtian Zhai (PKU) Understanding Noise July 16, 2019 1 / 45
Introduction: Noise is Everywhere
Noise is everywhere. Since as early as the 1920s, statisticians have been searching for ways to combat noise in collected data.
In machine learning, the problem is more serious. Training sets are labeled by humans, and humans make mistakes. In computer vision, many images are corrupted, blurred, or compressed.
An even more dangerous kind of noise is adversarial examples: crafted noise that aims to fool a particular classifier.
Learning with Noise
Many ways to learn with noise have been proposed:
Some propose detection methods that find noisy samples in a dataset so that they can be removed.
Others suggest that even noisy samples can be useful. For instance, co-teaching was proposed to help networks learn on noisy datasets.
Many defense methods have been proposed against adversarial examples. The most successful one so far is adversarial training, which trains on adversarial examples generated on the fly.
How Noise Affects Training
Outline
1 How Noise Affects Training
    Critical Learning Periods
    Different Kinds of Noise are Different
    Two Phases of Learning
2 Noise Fits More Slowly Than Clean Data
    Zhang’s Experiment and Its Explanation
    Measuring Dataset Complexity
    Detecting Noise
3 More Theoretical Approaches
    Influence Function
    Neural Tangent Kernel
    Fourier Analysis
Critical Learning Periods
Critical Learning Periods in Deep Networks
Achille et al. (UCLA) [7], ICLR 2019
In biology, the first several weeks after the birth of a baby animal, known as the critical learning period, are critical for its intellectual development.
The same holds in deep learning. If a network is trained on noisy images during the first several epochs, it can never reach high performance, even if it is trained on clean images later on.
Experiment I
To show that this biological behavior also exists in deep learning, the authors ran the following experiment:
They trained an All-CNN on CIFAR-10. During the first N epochs the network was trained on noisy images. After that the network was trained on clean images for another 160 epochs.
They used blurred images as the noisy images: first downsample the 32 × 32 images to 8 × 8, then upsample back to 32 × 32 with bilinear interpolation.
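The blur deficit described above can be sketched in NumPy. This is only an illustration under assumptions: the slides do not say how the downsampling was done, so 4 × 4 block averaging is assumed here, and the function names (`downsample_mean`, `upsample_bilinear`, `blur`) are hypothetical.

```python
import numpy as np

def downsample_mean(img, factor=4):
    """Downsample an (H, W, C) image by averaging non-overlapping factor x factor blocks.
    Block averaging is an assumption; the slides only state the target size."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def upsample_bilinear(img, out_h, out_w):
    """Bilinear upsampling of an (h, w, C) image with pixel-centre alignment."""
    h, w, _ = img.shape
    # Map each output pixel centre back to fractional source coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def blur(img):
    """32 x 32 -> 8 x 8 -> 32 x 32, as in the deficit described on the slide."""
    return upsample_bilinear(downsample_mean(img, 4), img.shape[0], img.shape[1])
```

Because every output pixel is a convex combination of input pixels, a constant image is left unchanged, which is a quick sanity check for the interpolation weights.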
Result: Early Deficit Has Irremediable Negative Effect
Experiment II
They trained the network on noisy images for 40 epochs starting from epoch N, and on clean images in all other epochs. These 40 epochs are called the deficit window.
They tested how much the test accuracy decreases for different choices of the window onset N.
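The deficit-window schedule can be written down in a few lines. This is a sketch, assuming 0-indexed epochs and a half-open window [N, N + 40); the names `in_deficit_window` and `data_schedule` are hypothetical.

```python
def in_deficit_window(epoch, onset, length=40):
    """True if `epoch` (0-indexed, an assumption) lies in [onset, onset + length)."""
    return onset <= epoch < onset + length

def data_schedule(total_epochs, onset, length=40):
    """Return, for each epoch, whether the network sees noisy or clean images."""
    return ["noisy" if in_deficit_window(e, onset, length) else "clean"
            for e in range(total_epochs)]
```

Sweeping `onset` over a range of values and recording the final test accuracy for each run would reproduce the shape of Experiment II.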
Result: Early Epochs are More Critical
Different Kinds of Noise
The authors repeated the first experiment with different kinds of noise:
Blur: 32 × 32 images downsampled to 8 × 8, then upsampled to 32 × 32 with bilinear interpolation.
Vertical flip: Flip the images vertically.
Label permutation: Use random labels.
Noise: All images are replaced by random noise.
They also tested networks of different depth.
Different Kinds of Noise are Different
For Noise the effect is not as strong as for Blur. For Vertical flip and Label permutation, the effect is very weak.
The deeper the network, the stronger the effect.
Two Phases of Learning
The authors performed a Fisher information analysis of the training process:
They used the trace of the Fisher Information Matrix (FIM) to measure how much information the network had learned.
The training period has two phases: in Phase I, the FIM trace rises quickly, showing that the network is learning; in Phase II, the FIM trace drops dramatically (while performance is still improving), showing that the network starts to forget.
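To make the quantity concrete, here is a minimal sketch of the FIM trace for a toy model. It assumes a linear softmax classifier p(y|x) = softmax(Wx), not the All-CNN used in the paper, and the slides do not say how the authors estimated the trace; `fisher_trace` is a hypothetical name. The trace is the expected squared norm of the gradient of log p(y|x), with y drawn from the model itself.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fisher_trace(W, X):
    """Trace of the FIM of p(y|x) = softmax(W x), averaged over the rows of X.
    The expectation over y ~ p(.|x) is computed exactly by summing over classes."""
    P = softmax(X @ W.T)                  # (n, C) class probabilities
    n, C = P.shape
    x_sq = (X ** 2).sum(axis=1)           # ||x||^2 for every sample
    total = 0.0
    for y in range(C):
        # d log p(y|x) / dW = (e_y - p) x^T, so its squared norm factorises.
        diff = -P.copy()
        diff[:, y] += 1.0
        grad_sq = (diff ** 2).sum(axis=1) * x_sq
        total += (P[:, y] * grad_sq).sum()
    return total / n
```

For this toy model the trace simplifies to the average of (1 - ||p||²)·||x||², so it is largest when predictions are uncertain and shrinks as the model becomes confident.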
Two Phases of Learning (cont.)
Many other papers [10, 11] also found, from the optimization perspective, that there are two phases during training.
It is well known that during training, noise fits more slowly than clean data. Many recent papers argue that in Phase I the network is fitting clean data and in Phase II it is fitting noise, so it seems as if the network is forgetting useful information.
This also explains why early stopping works.
Noise Fits More Slowly Than Clean Data
Deep Networks Can Fit Random Labels
Understanding Deep Learning Requires Rethinking Generalization
Zhang et al. [5], ICLR 2017
In this paper, the authors ran the following experiment: they added many kinds of noise to CIFAR-10 (random labels, random pixels, Gaussian, etc.) and then trained an Inception model on it. They found that
Deep networks fit noisy data easily.
However, fitting takes much longer than on clean data.
Explaining the Results
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Arora et al. [4], ICML 2019
In this paper, the authors prove for an overparameterized two-layer fully-connected network that
GD (gradient descent) can converge (achieve zero training loss) on datasets with random labels.
GD converges more slowly on random labels than on clean labels.
Label noise can harm generalization.
Basic Setting
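A two-layer ReLU network with m neurons is
$$f_{W,a}(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \,\sigma(w_r^\top x)$$
where $\sigma$ is the ReLU activation, $x \in \mathbb{R}^d$ is the input, $W = (w_1, \dots, w_m) \in \mathbb{R}^{d \times m}$ is the weight of the first layer and $a = (a_1, \dots, a_m)^\top \in \mathbb{R}^m$ is the weight of the second layer. Assume $\|x\|_2 = 1$ and $|y| \le 1$.
At initialization, $w_r(0) \sim \mathcal{N}(0, \kappa^2 I)$ and $a_r \sim \mathrm{unif}(\{-1, 1\})$. Fix the second layer $a$ and only train the first layer $W$. Denote by $W(k)$ the value of $W$ at step $k$.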
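The setting above translates directly into NumPy. This is a minimal sketch; `init_params` and `forward` are hypothetical names, and inputs are stored as rows of `X`.

```python
import numpy as np

def init_params(d, m, kappa, rng):
    """Initialise the two-layer network as on the slide."""
    W = kappa * rng.standard_normal((d, m))   # columns w_r(0) ~ N(0, kappa^2 I)
    a = rng.choice([-1.0, 1.0], size=m)       # a_r ~ unif({-1, +1}), never trained
    return W, a

def forward(W, a, X):
    """f_{W,a}(x) = (1 / sqrt(m)) * sum_r a_r * relu(w_r^T x) for each row x of X."""
    m = W.shape[1]
    return np.maximum(X @ W, 0.0) @ a / np.sqrt(m)
```

Since ReLU is positively homogeneous, scaling `W` by a positive constant scales the output by the same constant, which is why a small κ gives small initial outputs.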
Trajectory Based Analysis
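Use MSE (Mean Square Error) as the loss function:
$$\Phi(W) = \frac{1}{2} \sum_{i=1}^{n} \left(y_i - f_{W,a}(x_i)\right)^2$$
Let the trajectory of the network be $u = (u_1, \dots, u_n)^\top$, where $u_i = f_{W,a}(x_i)$. Then the loss function is $\Phi(W) = \frac{1}{2}\|y - u\|_2^2$, where $y = (y_1, \dots, y_n)^\top$. Train with GD with learning rate $\eta$.
Define $H^\infty$ as a Gram matrix:
$$H^\infty_{ij} = \mathbb{E}_{w \sim \mathcal{N}(0, I)}\left[x_i^\top x_j \,\mathbb{I}\{w^\top x_i \ge 0,\ w^\top x_j \ge 0\}\right] = \frac{x_i^\top x_j \left(\pi - \arccos(x_i^\top x_j)\right)}{2\pi}, \quad \forall i, j \in [n]$$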
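The closed form of the Gram matrix is easy to compute. The sketch below assumes the rows of `X` already have unit norm, as in the basic setting; `gram_h_infinity` is a hypothetical name.

```python
import numpy as np

def gram_h_infinity(X):
    """H_ij = x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2 * pi) for unit-norm rows of X."""
    G = np.clip(X @ X.T, -1.0, 1.0)   # clipping guards arccos against round-off
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)
```

Two quick checks follow from the formula: every diagonal entry is 1/2 (the angle is 0), and the entry for a pair of orthogonal or opposite inputs is 0.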
Main Theorem
Assumptions: The initial variance $\kappa^2$ and the learning rate $\eta$ are small enough, and the width $m$ is large enough.
Lemma: Under the above assumptions, during training the real trajectory $\{u(k)\}_{k=0}^{\infty}$ stays close to another sequence $\{\tilde{u}(k)\}_{k=0}^{\infty}$ which has a linear update rule: $\tilde{u}(k+1) = \tilde{u}(k) - \eta H^\infty(\tilde{u}(k) - y)$. By analyzing the dynamics of $\tilde{u}(k)$ we can prove that
$$\Phi(W(k)) \approx \frac{1}{2} \left\|(I - \eta H^\infty)^k y\right\|_2^2$$
uniformly for all $k \ge 0$ with high probability.
If $H^\infty$ is positive definite, then $\Phi(W(k)) \to 0$ as $k \to \infty$, which implies that GD converges even if $y$ is random.
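The step from the linear update rule to the loss formula can be spelled out as follows. This is a sketch, writing the auxiliary sequence as $\tilde{u}(k)$ and assuming it starts at $\tilde{u}(0) = 0$, which is plausible because a small $\kappa$ makes the initial outputs small.

```latex
\begin{aligned}
\tilde{u}(k+1) - y &= (I - \eta H^\infty)\,\bigl(\tilde{u}(k) - y\bigr) \\
\Rightarrow\quad \tilde{u}(k) - y &= (I - \eta H^\infty)^k \bigl(\tilde{u}(0) - y\bigr) = -(I - \eta H^\infty)^k y \\
\Rightarrow\quad \tfrac{1}{2}\,\|y - \tilde{u}(k)\|_2^2 &= \tfrac{1}{2}\,\bigl\|(I - \eta H^\infty)^k y\bigr\|_2^2
\end{aligned}
```

If $H^\infty$ is positive definite and $\eta$ is small enough that every eigenvalue of $I - \eta H^\infty$ lies in $[0, 1)$, the right-hand side shrinks geometrically to 0.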
Convergence Rate
Write the eigen-decomposition $H^\infty = \sum_{i=1}^{n} \lambda_i v_i v_i^\top$, then
$$\|y - \tilde{u}(k)\|_2^2 = \sum_{i=1}^{n} (1 - \eta\lambda_i)^{2k} (v_i^\top y)^2$$
Since $\tilde{u}(k)$ is very close to $u(k)$, the RHS can be used to estimate the convergence rate.
For a set of labels $y$, if they align with the top eigenvectors (i.e. $(v_i^\top y)^2$ is large for large $\lambda_i$), then GD converges quickly. Otherwise it converges more slowly.
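A small numerical sketch of this convergence-rate formula, using a hand-picked 2 × 2 matrix in place of the real Gram matrix (an assumption made purely for illustration; `predicted_residual` is a hypothetical name):

```python
import numpy as np

def predicted_residual(H, y, eta, k):
    """sum_i (1 - eta * lambda_i)^(2k) * (v_i^T y)^2 for a symmetric matrix H."""
    lam, V = np.linalg.eigh(H)     # eigenvalues ascending, eigenvectors as columns
    proj = V.T @ y                 # coordinates of y in the eigenbasis
    return float(np.sum((1.0 - eta * lam) ** (2 * k) * proj ** 2))

H = np.diag([2.0, 0.1])            # toy stand-in with one large and one small eigenvalue
eta, k = 0.1, 10
y_top = np.array([1.0, 0.0])       # labels aligned with the top eigenvector
y_bottom = np.array([0.0, 1.0])    # labels aligned with the bottom eigenvector

print(predicted_residual(H, y_top, eta, k))     # small: this direction is fit quickly
print(predicted_residual(H, y_bottom, eta, k))  # still large after the same number of steps
```

The same number of GD steps leaves far more residual when the labels sit in the low-eigenvalue direction, which is the mechanism the slide attributes to random labels.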
Experimental Result
The authors showed experimentally that clean labels align closely with the top eigenvectors, whereas random labels align randomly. This implies that GD fits noisy data more slowly.
Analysis of Generalization
Theorem (Informal Version of Theorem 5.1)
If the underlying data distribution $\mathcal{D}$ is non-degenerate and the training set is sampled i.i.d. from $\mathcal{D}$, then for any 1-Lipschitz loss function $l : \mathbb{R} \times \mathbb{R} \to [0, 1]$ such that $l(y, y) = 0$, with probability at least $1 - \delta$ over the random initialization and the training samples, the two-layer ReLU network $f_{W(k),a}$ trained by GD has population loss $L_{\mathcal{D}}(f_{W(k),a}) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[l(f_{W(k),a}(x), y)]$ bounded as
$$L_{\mathcal{D}}(f_{W(k),a}) \le \sqrt{\frac{2\, y^\top (H^\infty)^{-1} y}{n}} + 3\sqrt{\frac{\log(6/\delta)}{2n}} + o(1)$$
Dataset Complexity
$$L_{\mathcal{D}}(f_{W(k),a}) \le \sqrt{\frac{2\, y^\top (H^\infty)^{-1} y}{n}} + 3\sqrt{\frac{\log(6/\delta)}{2n}} + o(1)$$
This formula implies that $\sqrt{2\, y^\top (H^\infty)^{-1} y / n}$ can be viewed as a complexity measure of the data. The more complex a dataset is, the harder it is for a deep model to fit it and to generalize on it.
To measure the complexity of different kinds of noise, I ran several experiments on noisy CIFAR-10, measuring its complexity with the above metric.
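The complexity metric can be computed directly from the inputs and labels. The sketch below is not the code used for the experiments in this talk; it assumes unit-norm rows in `X` and an invertible Gram matrix, and the names `h_infinity` and `complexity` are hypothetical.

```python
import numpy as np

def h_infinity(X):
    """Gram matrix H_ij = x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2 * pi)."""
    G = np.clip(X @ X.T, -1.0, 1.0)   # clipping guards arccos against round-off
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

def complexity(X, y):
    """sqrt(2 * y^T H^{-1} y / n), using a linear solve instead of an explicit inverse."""
    H = h_infinity(X)
    n = len(y)
    return float(np.sqrt(2.0 * (y @ np.linalg.solve(H, y)) / n))
```

As a sanity check, for orthonormal inputs the Gram matrix is I/2, so any ±1 labelling has complexity exactly 2. Differences between labelings only appear once the inputs are correlated.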
My Experiment
Noise Type                                           Complexity
No noise                                             33.47
Random labels                                        55.32
Same labels                                          2.80
AE (adversarial examples) of a standard model        45.42
AE of an adv trained model                           41.61
Gaussian (σ = 0.02)                                  30.17
Gaussian (σ = 0.05)                                  22.59

I arbitrarily selected two classes of CIFAR-10 (cars and birds) and selected 500 images from each class. Then I normalized all images to make sure that ∥x∥2 ≤ 1. The labels were set to {+1, −1}. I computed the complexity with the above formula.
For random labels, the dataset becomes more complex (55.32 vs. 33.47), so it is harder to optimize and generalize.
If all labels are the same (+1), the complexity is much lower (2.80), so it is very easy for a model to fit the dataset, as expected.
Adversarial examples (AE) also make the dataset more complex. I generated AE for a normally trained model and for an adversarially trained model. The AE of the normally trained model turn out to be more complex (45.42 vs. 41.61).
However, the metric fails to measure the complexity of Gaussian noise. Under this metric, the dataset with larger Gaussian noise comes out as simpler (22.59 for σ = 0.05 vs. 30.17 for σ = 0.02, both below the 33.47 of the clean data).
Detecting Noise
Since noise fits more slowly, we can detect noise in the following simple way: the more slowly a sample fits, the more likely it is noise.
A better method is co-teaching [12]. Two networks F1 and F2 are trained simultaneously and teach one another: F1 only trains on samples with small loss on F2 and vice versa. These samples are unlikely to be noise because F2 fits clean data faster than noise.
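The sample-exchange step of co-teaching can be sketched as below. This only illustrates the small-loss selection described above, not the full training loop from [12]; `keep_ratio` and the function names are hypothetical, and how the kept fraction changes over training is left out.

```python
import numpy as np

def small_loss_indices(losses, keep_ratio):
    """Indices of the `keep_ratio` fraction of samples with the smallest loss."""
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    return np.argsort(losses)[:n_keep]

def coteaching_selection(losses_f1, losses_f2, keep_ratio):
    """Each network trains on the samples that the *other* network finds easy."""
    train_f1_on = small_loss_indices(losses_f2, keep_ratio)
    train_f2_on = small_loss_indices(losses_f1, keep_ratio)
    return train_f1_on, train_f2_on
```

For a mini-batch, one would compute per-sample losses under both networks, call `coteaching_selection`, and update each network only on the indices it receives.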
Self Distillation
Inspired by co-teaching, Dong et al.* suggest that a model can teach itself: it can be taught by its previous checkpoints. At epoch N, the network only trains on samples with small loss at epoch N − n0.
In addition, several papers at ICML 2019 [1, 2, 3] address how to detect noisy data in the training set. All of them use the fact that noise fits more slowly than clean data.
* View this paper at http://www.runtianz.cn/doc/AIR.pdf. The link expires on Friday. This paper is under review. Please do not distribute.
More Theoretical Approaches
Influence Function: Detecting Outliers
Understanding Black-box Predictions via Influence Functions
Koh et al. [13], Best Paper at ICML 2017
The influence function measures how much influence a training sample has on the final classifier. It is a classical technique in statistics. Samples with large influence can be regarded as outliers.
Moreover, it can be computed efficiently with second-order optimization techniques.
Influence Function
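Let the training samples be $z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$. The model $F_\theta$ is parameterized by $\theta$. Let the empirical risk be $\frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta)$. The ERM (empirical risk minimizer) is given by
$$\hat{\theta} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta)$$
Consider the change in $\hat{\theta}$ when a point $z$ is removed from the training set. Let the ERM after $z$ is removed be $\hat{\theta}_{-z}$. Statisticians tell us that
$$\hat{\theta}_{-z} - \hat{\theta} \approx -\frac{1}{n} I(z)$$
where
$$I(z) = \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$
Here $\hat{\theta}_{\epsilon,z} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta)$ is the minimizer after upweighting $z$ by $\epsilon$ (removing $z$ corresponds to $\epsilon = -\frac{1}{n}$), and $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta})$ is the Hessian.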
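The leave-one-out approximation can be checked numerically on a model where everything is available in closed form. The sketch below assumes ordinary least squares with loss L(z, θ) = ½(xᵀθ − y)² on synthetic data; the data, the seed, and the function name `influence` are illustrative assumptions, not part of the talk.

```python
import numpy as np

def influence(X, y, theta, x, y_x):
    """I(z) = -H^{-1} grad L(z, theta) for the loss L(z, theta) = 0.5 * (x^T theta - y)^2."""
    H = X.T @ X / len(y)               # Hessian of the empirical risk
    grad = (x @ theta - y_x) * x       # gradient of the loss at z = (x, y_x)
    return -np.linalg.solve(H, grad)

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

theta = np.linalg.lstsq(X, y, rcond=None)[0]               # ERM on all n points
theta_loo = np.linalg.lstsq(X[1:], y[1:], rcond=None)[0]   # ERM with z = (X[0], y[0]) removed

predicted_change = -influence(X, y, theta, X[0], y[0]) / n   # the approximation -(1/n) I(z)
actual_change = theta_loo - theta
print(predicted_change, actual_change)
```

For least squares the two vectors point in the same direction and differ only by a factor of one minus the leverage of the removed point, so the approximation improves as n grows.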
Influence Function (cont.)
$I(z)$ is the influence function of $z$. The greater its norm, the more influence $z$ has on the model.
Furthermore, if we perturb $z = (x, y)$ to $z_\delta = (x + \delta, y)$, and the resulting ERM is $\hat{\theta}_{z_\delta,-z}$, then moving a mass of $\frac{1}{n}$ from $z$ to $z_\delta$ gives
$$\hat{\theta}_{z_\delta,-z} - \hat{\theta} \approx \frac{1}{n}\left(I(z_\delta) - I(z)\right)$$
When $\delta$ is a certain kind of noise, we can estimate the complexity of that noise using this formula.
NTK: A Powerful Tool
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Jacot et al. [6], NIPS 2018
The authors prove that in the infinite-width limit, a fully-connected network trained with GD evolves along the kernel gradient w.r.t. the NTK Θ, and Θ converges in probability to a deterministic limiting kernel Θ∞.
Therefore, the kernel principal components of Θ∞ with the highest eigenvalues are fit first. Components with low eigenvalues can be regarded as noise.
Fourier Analysis
Many recent papers propose to analyze the effect of noise using Fourieranalysis. For example:
On the Spectral Bias of Neural Networks
Rahaman et al. [8], ICML 2019
A Fourier Perspective on Model Robustness in Computer Vision
Yin et al. [9], arXiv:1906.08988
Fourier Analysis
Some results from Fourier analysis:
Neural networks are biased towards learning low-frequency functions, which are functions without local fluctuations.
Both Gaussian noise and adversarial examples are high-frequency noise. That is why normally trained networks are so vulnerable to them.
Common defense methods such as training with Gaussian noise and adversarial training improve robustness w.r.t. high-frequency noise, but reduce robustness w.r.t. low-frequency noise.
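One simple way to quantify "high frequency" is the share of spectral energy outside a low-frequency disc. The sketch below is not the analysis from [8] or [9]; it assumes a single-channel image, and the function name and the `cutoff` parameter (measured in FFT bins) are hypothetical.

```python
import numpy as np

def high_freq_energy_fraction(img, cutoff):
    """Fraction of spectral energy at frequencies whose radius exceeds `cutoff` (in FFT bins)."""
    h, w = img.shape
    spectrum = np.fft.fftshift(np.fft.fft2(img))   # move the zero frequency to the centre
    energy = np.abs(spectrum) ** 2
    fy = np.arange(h) - h // 2                      # vertical frequency index per row
    fx = np.arange(w) - w // 2                      # horizontal frequency index per column
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    total = energy.sum()
    return float(energy[radius > cutoff].sum() / total) if total > 0 else 0.0

rng = np.random.default_rng(0)
gaussian_noise = rng.standard_normal((32, 32))
print(high_freq_energy_fraction(gaussian_noise, cutoff=8))
```

White Gaussian noise spreads its energy roughly evenly over all frequency bins, and most bins lie outside a small low-frequency disc, so most of its energy counts as high frequency under this measure.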
Take-aways
Deep learning has critical learning periods. Noise in these periods degrades a network's performance significantly, while noise in other periods does not have as strong an effect.
There are two phases during training. In Phase I the network learns; in Phase II it forgets. The model achieves optimal performance if training is stopped early at the phase transition.
Neural networks fit noise more slowly than clean data, and we can use this fact to detect noise, as in co-teaching and self distillation.
Different kinds of noise have different complexity and different levels of impact on networks. Several metrics can be used to measure the complexity of noise, such as Arora’s metric and influence functions.
Open Problem: Finding Relevant Data
Background: In semi-supervised learning (SSL), we have a huge amount of unlabeled data, and we only want to train on data that looks similar to the limited labeled data we have. The question is: how do we find data relevant to our task in a huge dataset?
Details: CIFAR-10 and CIFAR-100 are subsets of a large unlabeled dataset called 80 Million Tiny Images. I’d like to select the 500k images from it that are the most relevant for CIFAR classification. To test the performance I will run an SSL algorithm and look at its result.
Suggestions: We can try Arora’s measurement of data complexity, influence functions, NTK analysis, Fourier analysis, etc. Please contact me if you are interested in solving this problem.
Any Questions?
Thank you.
References
[1] Thulasidasan et al. (2019). Combating Label Noise in Deep Learning using Abstention. ICML 2019.
[2] Shen et al. (2019). Learning with Bad Training Data via Iterative Trimmed Loss Minimization. ICML 2019.
[3] Chen et al. (2019). Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels. ICML 2019.
[4] Arora et al. (2019). Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. ICML 2019.
[5] Zhang et al. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017.
[6] Jacot et al. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NIPS 2018.
[7] Achille et al. (2019). Critical Learning Periods in Deep Networks. ICLR 2019.
[8] Rahaman et al. (2019). On the Spectral Bias of Neural Networks. ICML 2019.
[9] Yin et al. (2019). A Fourier Perspective on Model Robustness in Computer Vision. arXiv:1906.08988.
[10] Shwartz-Ziv et al. (2017). Opening the Black Box of Deep Neural Networks via Information. arXiv:1703.00810.
[11] Li et al. (2017). Convergence Analysis of Two-layer Neural Networks with ReLU Activation. NIPS 2017.
[12] Han et al. (2018). Co-teaching: Robust training of deep neural networks with extremely noisy labels. NIPS 2018.
[13] Koh et al. (2017). Understanding Black-box Predictions via Influence Functions. ICML 2017.