Deyu Meng
Xi’an Jiaotong University, [email protected]
http://gr.xjtu.edu.cn/web/dymeng
Robust Deep Learning Based on Meta-learning
• Deep Learning
• Robust
• Meta-learning
The Success of Deep Learning Relies on
well-annotated & big data sets
What we think we have:
But what we really have is always:
Commonly Encountered Data Bias (low quality data)
Label noise Data noise Class imbalance
• Deep Learning
• Robust
• Meta-learning
Robust Machine Learning for Data Bias
Design specific optimization objective (especially, robust loss)
to make it robust to certain data bias:
Label noise | Data noise | Class imbalance
Lin, et al., TPAMI, 2018 | Yong, et al., TPAMI, 2018 | Meng, et al., Information Sciences, 2017
Two Critical Issues
Generalized Cross Entropy
Symmetric Cross Entropy
Bi-Tempered logistic Loss
Polynomial Soft Weighting loss
Focal loss
CT loss
Lin, et al., TPAMI, 2018
Xie, et al., TMI, 2018
Zhao, et al., AAAI, 2015
Amid, et al., NeurIPS, 2019
Wang, et al., ICCV, 2019
Zhang, et al., NeurIPS, 2018
Hyperparameter Tuning
Non-convexity
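For instance, the Generalized Cross Entropy listed above carries a single hyper-parameter q whose tuning is exactly this issue; a minimal NumPy sketch (the function name is ours, not the paper's code):

```python
import numpy as np

def generalized_ce(probs, labels, q=0.7):
    """Generalized cross entropy (Zhang & Sabuncu, NeurIPS 2018):
    L_q(p, y) = (1 - p_y^q) / q, where p_y is the predicted
    probability of the true class. q is the hyper-parameter."""
    p_y = probs[np.arange(len(labels)), labels]
    return float(np.mean((1.0 - p_y ** q) / q))

# Toy predictions over 3 classes for 2 samples.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
```

As q → 0 the loss recovers standard cross entropy, while q = 1 gives an MAE-like, noise-robust loss; choosing q in between is precisely the hyper-parameter tuning problem, and the loss is non-convex in the network parameters.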
• Deep Learning
• Robust
• Meta-learning
Training Data vs. Validation Data
Hyper-parameter tuning: by validation data
Training loss Validation loss
Θ* ≈ argmin_{Θ ∈ {Θ_1, Θ_2, ⋯, Θ_s}} (1/M) Σ_{i=1}^{M} L_i^m(w*(Θ))
✓ Low efficiency
✓ Low accuracy
✓ Search instead of optimization
✓ Heuristic instead of intelligent
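The search above can be made concrete. A toy sketch (a hypothetical ridge-regression setup, not the talk's actual model) that picks a regularization hyper-parameter Θ = λ from a finite grid by validation loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data standing in for a real training/validation split.
X_tr, X_val = rng.normal(size=(50, 5)), rng.normal(size=(20, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_tr = X_tr @ w_true + 0.1 * rng.normal(size=50)
y_val = X_val @ w_true + 0.1 * rng.normal(size=20)

def fit_ridge(lam):
    # w*(Theta): minimizer of the regularized training loss (closed form).
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(5), X_tr.T @ y_tr)

def val_loss(w):
    # (1/M) sum_i L_i^m(w): mean squared error on the validation set.
    return np.mean((X_val @ w - y_val) ** 2)

# Search over the finite candidate set {Theta_1, ..., Theta_s}.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: val_loss(fit_ridge(lam)))
```

Each candidate requires solving the inner training problem to completion, which is why this search is inefficient and only as accurate as the grid is fine.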
• The function of validation data is higher-level than that of training data
➢ Hyper-parameter tuning vs. classifier parameter learning
➢ Makes the model adaptable to the data at hand (general to specific)
• Validation data is different from training data!
➢ Teacher vs. student
➢ Ideal vs. real
➢ High quality vs. low quality
➢ Small scale vs. large scale
➢ Fixed vs. dynamic (relatively)
• What should we do?
➢ Lower the threshold for training data collection; raise the threshold for validation data selection
Intrinsic Functions of Validation Data
✓ Optimization instead of search
✓ Intelligent instead of heuristic (partially)
From Validation Loss Searching to Meta Loss Training
Hyper-parameter tuning: by meta data
Training loss Meta loss
Θ* = argmin_{Θ ∈ 𝒢} (1/M) Σ_{i=1}^{M} L_i^m(w*(Θ))
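Treating Θ as a continuous variable over 𝒢 turns the search into optimization. A toy sketch (same kind of hypothetical ridge problem; a finite-difference hyper-gradient stands in for the exact meta-gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr, X_val = rng.normal(size=(50, 5)), rng.normal(size=(20, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_tr = X_tr @ w_true + 0.1 * rng.normal(size=50)
y_val = X_val @ w_true + 0.1 * rng.normal(size=20)

def meta_loss(log_lam):
    # Inner problem: w*(Theta) minimizes the training loss (closed form here).
    lam = np.exp(log_lam)
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(5), X_tr.T @ y_tr)
    # Outer (meta) loss: (1/M) sum_i L_i^m(w*(Theta)) on the meta data.
    return np.mean((X_val @ w - y_val) ** 2)

# Gradient descent on Theta = log(lambda) via finite-difference hyper-gradient.
log_lam, eps, lr = 5.0, 1e-4, 0.5
start = meta_loss(log_lam)
for _ in range(100):
    g = (meta_loss(log_lam + eps) - meta_loss(log_lam - eps)) / (2 * eps)
    log_lam -= lr * g
```

Instead of enumerating candidates, Θ descends the meta loss directly, which is the shift from heuristic search to (partially) intelligent optimization.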
Many Recent Attempts
◆ Loss function.
Wu L, Tian F, Xia Y, et al. Learning to Teach with Dynamic Loss Functions. In NeurIPS, 2018: 6466-6477.
Huang C, Zhai S, Talbott W, et al. Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment. In ICML, 2019: 2891-2900.
Xu H, Zhang H, Hu Z, et al. AutoLoss: Learning Discrete Schedule for Alternate Optimization. In ICLR, 2019.
Li C, Yuan X, Lin C, et al. AM-LFS: AutoML for Loss Function Search. In ICCV, 2019: 8410-8419.
Grabocka J, Scholz R, Schmidt-Thieme L. Learning Surrogate Losses. arXiv preprint arXiv:1905.10108, 2019.
◆ Regularization.
Feng J, Simon N. Gradient-based regularization parameter selection for problems with nonsmooth penalty functions. Journal of Computational and Graphical Statistics, 2018, 27(2): 426-435.
Frecon J, Salzo S, Pontil M. Bilevel Learning of the Group Lasso Structure. In NeurIPS, 2018: 8301-8311.
Streeter M. Learning Optimal Linear Regularizers. In ICML, 2019: 5996-6004.
◆ Learner (NAS).
Zoph B, Le Q V. Neural Architecture Search with Reinforcement Learning. In ICLR, 2017.
Baker B, Gupta O, Naik N, et al. Designing Neural Network Architectures Using Reinforcement Learning. In ICLR, 2017.
Pham H, Guan M, Zoph B, et al. Efficient Neural Architecture Search via Parameter Sharing. In ICML, 2018: 4092-4101.
Zoph B, Vasudevan V, Shlens J, et al. Learning Transferable Architectures for Scalable Image Recognition. In CVPR, 2018: 8697-8710.
Liu H, Simonyan K, Yang Y. DARTS: Differentiable Architecture Search. In ICLR, 2019.
Xie S, Zheng H, Liu C, et al. SNAS: Stochastic Neural Architecture Search. In ICLR, 2019.
Liu C, Zoph B, Neumann M, et al. Progressive Neural Architecture Search. In ECCV, 2018: 19-34.
Many Recent Attempts
◆ Hyper-parameters learning.
Maclaurin D, Duvenaud D, Adams R. Gradient-Based Hyperparameter Optimization Through Reversible Learning. In ICML, 2015: 2113-2122.
Pedregosa F. Hyperparameter Optimization with Approximate Gradient. In ICML, 2016: 737-746.
Luketina J, Berglund M, Greff K, et al. Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters. In ICML, 2016: 2952-2960.
Franceschi L, Donini M, Frasconi P, et al. Forward and Reverse Gradient-Based Hyperparameter Optimization. In ICML, 2017: 1165-1173.
Franceschi L, Frasconi P, Salzo S, et al. Bilevel Programming for Hyperparameter Optimization and Meta-Learning. In ICML, 2018: 1563-1572.
◆ Gradients and learning rate.
Andrychowicz M, Denil M, Gomez S, et al. Learning to Learn by Gradient Descent by Gradient Descent. In NeurIPS, 2016.
Baydin A G, Cornish R, Rubio D M, et al. Online Learning Rate Adaptation with Hypergradient Descent. In ICLR, 2018.
Jacobsen A, Schlegel M, Linke C, et al. Meta-Descent for Online, Continual Prediction. In AAAI, 2019.
Metz L, et al. Understanding and Correcting Pathologies in the Training of Learned Optimizers. In ICML, 2019: 4556-4565.
Xu Z, Dai A M, Kemp J, et al. Learning an Adaptive Learning Rate Schedule. arXiv preprint arXiv:1909.09712, 2019.
◆ Sample reweighing.
Jiang L, Zhou Z, Leung T, et al. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In ICML, 2018: 2309-2318.
Ren M, Zeng W, Yang B, et al. Learning to Reweight Examples for Robust Deep Learning. In ICML, 2018: 4331-4340.
Shu J, Xie Q, Yi L, et al. Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. In NeurIPS, 2019.
Zhao S, Fard M M, Narasimhan H, et al. Metric-Optimized Example Weights. In ICML, 2019: 7533-7542.
• Deep Learning
• Robust
• Meta-learning
Generalized Cross Entropy
Symmetric Cross Entropy
Bi-Tempered logistic Loss
Polynomial Soft Weighting loss
Zhao, et al., AAAI, 2015
Amid, et al., NeurIPS, 2019
Wang, et al., ICCV, 2019
Zhang, et al., NeurIPS, 2018
Adaptively Learning the Robust Loss
Training loss Meta loss
Hyperparameter Learning by Meta Learning
Shu, et al., submitted, 2019
Experimental Results
Shu, et al., submitted, 2019
Experimental Results
✓ The hyper-parameter adaptively learned by meta-learning is actually not the optimal one for the original loss with its hyper-parameter fixed throughout the iterations.
✓ Meta-learning adaptively finds a proper hyper-parameter and simultaneously explores a good initialization of the network parameters under the current hyper-parameter, in a dynamic way.
✓ Such an adaptive learning manner is better suited to simultaneously obtaining optimal values for both than updating one while keeping the other fixed.
Shu, et al., submitted, 2019
What if the Model Contains a Large Number of Hyperparameters?
➢ Overfitting easily occurs (as in conventional machine learning)
➢ How can this issue be alleviated?
➢ Build a parametric prior representation (neither too large nor too small) for the hyperparameters (as in conventional machine learning)
➢ Learner vs. meta-learner
➢ Need to deeply understand the data as well as the learning problem!
✓ Multi-view learning, multi-task learning (parameter - similar)
✓ Subspace learning (matrix – low rank)
Training loss Meta loss
What if the Model Contains a Large Number of Hyperparameters?
• Deep Learning
• Robust
• Meta-learning
Deep Learning with Training Data Bias
Problem: big data often come with noisy labels or class imbalance.
Deep Networks tend to overfit to Training Data!
Zhang et al. (2017) found that deep neural networks easily fit (memorize) random labels.
Zhang C, Bengio S, Hardt M, et al. Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017 (Best Paper).
How to robustly train deep networks on training data bias to improve the generalization performance?
Related work: Learning with Training Data Bias
◆ Sample weighting methods
✓ Dataset resampling (Chawla et al., 2002)
✓ Instance re-weighting (Zadrozny, 2004)
✓ AdaBoost (Freund & Schapire, 1997)
✓ Hard example mining (Malisiewicz et al., 2011)
✓ Focal loss (Lin et al., 2018)
✓ Self-paced learning (Kumar et al., 2010)
✓ Iterative reweighting strategies (De la Torre & Black, 2003; Zhang & Sabuncu, 2018)
✓ Prediction variance method (Chang et al., 2017)
◆ Meta learning methods
✓ FWL (Dehghani et al., 2018)
✓ Learning to teach (Fan et al., 2018; Wu et al., 2018)
✓ MentorNet (Jiang et al., 2018)
✓ L2RW (Ren et al., 2018)
◆ Other methods
✓ GLC (Hendrycks et al., 2018)
✓ Reed (Reed et al., 2015)
✓ Co-teaching (Han et al., 2018)
✓ D2L (Ma et al., 2018)
✓ S-Model (Goldberger & Ben-Reuven, 2017)
Sample weighting methods
Existing studies define the curriculum as a hand-designed function for a specific task, with extra hyper-parameters that must be set.
Strategy | Regularizer G | Weight v*
Self-paced [Kumar et al., NIPS 2010] | −λ‖v‖₁ | v* = 𝕀(l_i ≤ λ)
Linear weighting [Jiang et al., AAAI 2015] | (λ/2) Σ_{i=1}^{n} (v_i² − 2v_i) | v* = max(0, 1 − l_i/λ)
Focal loss [Lin et al., ICCV 2017] | − | v* = (1 − exp(−l_i))^α
Hard example mining [Malisiewicz et al., ICCV 2011] | − | v* = 𝕀(l_i > λ(1 − y_i))
Prediction variance [Chang et al., NIPS 2017] | − | v* = (1/Z)(Var(l_i) + Var(l_i)/|l_i|)
⚫ Need to pre-specify the form of the weighting function
⚫ Need to manually set hyper-parameters
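The fixed weighting schemes in the table above are straightforward to write down; a minimal NumPy sketch of three of them (function names are ours, and λ and α still have to be set by hand):

```python
import numpy as np

def self_paced_weight(losses, lam):
    # Self-paced: v* = 1(l_i <= lambda) keeps only easy (low-loss) samples.
    return (losses <= lam).astype(float)

def linear_weight(losses, lam):
    # Linear weighting: v* = max(0, 1 - l_i / lambda), decaying with loss.
    return np.maximum(0.0, 1.0 - losses / lam)

def focal_weight(losses, alpha):
    # Focal-style: v* = (1 - exp(-l_i))^alpha up-weights hard samples.
    return (1.0 - np.exp(-losses)) ** alpha

# Per-sample losses for an easy, a medium, and a hard example.
losses = np.array([0.1, 0.5, 2.0])
```

Note the opposite monotonicity: self-paced and linear weighting down-weight high-loss samples (suited to noisy labels), while the focal-style rule up-weights them (suited to class imbalance), which is exactly why the weighting function's form must be pre-specified per task.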
Sample weighting methods
Meta Data and Meta Loss
Meta Data | Training Data
L2RW [Ren et al., ICML 2018]
Directly learning weights from training and meta data
Meta Data and Meta Loss
Meta Data | Training Data
Training Loss
Input Structure
Meta Loss
MentorNet [Jiang et al., ICML 2018]
The meta-learner is complex and hard to reproduce.
Very complex input | Very complex Θ
Our work
Meta-Weight-Net
Input: loss
Theta: MLP
Our work
Inner loop:
Outer loop:
Notation:
◆ Θ: parameters of the teacher
◆ 𝑤: parameters of the student
Meta-Weight-Net (Shu, et al., NeurIPS, 2019)
Our work
Step 5:
Step 6:
Step 7:
Shu, et al., NeurIPS, 2019
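Steps 5-7 above can be sketched end to end on a toy problem. The sketch below is illustrative only: the deep student is replaced by a linear model, Meta-Weight-Net by a two-parameter sigmoid weighting v(l) = σ(a·l + b), and backprop through the virtual update by finite differences; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set with 30% corrupted labels, plus a small clean meta set.
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true
y[:30] += 5.0 * rng.normal(size=30)
X_meta = rng.normal(size=(20, 5))
y_meta = X_meta @ w_true

def per_sample_losses(w):
    return 0.5 * (X @ w - y) ** 2

def weight_net(losses, theta):
    # Stand-in for Meta-Weight-Net: weight from loss via a sigmoid.
    a, b = theta
    z = np.clip(a * losses + b, -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-z))

def weighted_grad(w, v):
    return (X.T @ (v * (X @ w - y))) / len(y)

def meta_loss_after_virtual_step(w, theta, lr):
    # Step 5: virtual weighted step on training data, then meta loss.
    v = weight_net(per_sample_losses(w), theta)
    w_hat = w - lr * weighted_grad(w, v)
    return 0.5 * np.mean((X_meta @ w_hat - y_meta) ** 2)

w, theta, lr = np.zeros(5), np.array([0.0, 0.0]), 0.05
init_meta = meta_loss_after_virtual_step(w, theta, lr)
for _ in range(200):
    # Step 6: update the teacher Theta by a finite-difference meta-gradient.
    g = np.zeros(2)
    for j in range(2):
        e = np.zeros(2); e[j] = 1e-4
        g[j] = (meta_loss_after_virtual_step(w, theta + e, lr)
                - meta_loss_after_virtual_step(w, theta - e, lr)) / 2e-4
    theta = theta - 0.1 * g
    # Step 7: actual weighted update of the student with the new weights.
    v = weight_net(per_sample_losses(w), theta)
    w = w - lr * weighted_grad(w, v)
final_meta = 0.5 * np.mean((X_meta @ w - y_meta) ** 2)
```

Each outer iteration thus performs a virtual weighted step on training data (inner loop), updates the teacher Θ against the meta loss (outer loop), and only then takes the real weighted step on the student w.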
Our work
Shu, et al., NeurIPS, 2019
Experiments
Experimental Setup: Class Imbalance
Datasets: CIFAR-10 & CIFAR-100
Shu, et al., NeurIPS, 2019
Experimental Setup: Noisy Label
Datasets: CIFAR-10 & CIFAR-100
Shu, et al., NeurIPS, 2019
Stability Analysis of Meta-Weight-Net
Shu, et al., NeurIPS, 2019
Real Data Experiment
Shu, et al., NeurIPS, 2019
Insight: Adaptively Learn the Weight Function
Shu, et al., NeurIPS, 2019
Future research
◆Extension to other semi/weakly-supervised learning problems
◆Further improvement of Meta-Weight-Net
◆Multi-view learning, ensemble learning, domain adaptation
◆General hyper-parameter learning (meta-learner designing)
Jun Shu, Qian Zhao, Keyu Chen, Zongben Xu, Deyu Meng. Learning Adaptive Loss for Robust Learning with Noisy Labels. arXiv:2002.06482, 2020.
Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, Deyu Meng. Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. NeurIPS, 2019.