LIMITING THE DEEP LEARNING
YIPING LU, PEKING UNIVERSITY
DEEP LEARNING IS SUCCESSFUL, BUT…
[Figure: deep learning is resource-hungry: GPUs, PhD students, professors]
LARGER THE BETTER?
[Plot: test error keeps decreasing as models get larger]
Why not take the limit $p/n = +\infty$?
TWO LIMITS
Depth Limit
Width Limit
DEPTH LIMIT: ODE
[He et al. 2015]
[E. 2017] [Haber et al. 2017] [Lu et al. 2017]
[Chen et al. 2018]
BEYOND FINITE LAYER NEURAL NETWORKS
Going to the infinite-layer limit: a differential equation
ALSO THEORETICALLY GUARANTEED
TRADITIONAL WISDOM IN DEEP LEARNING?
BECOMING A HOT TOPIC NOW…
NeurIPS 2018 Best Paper Award: Neural Ordinary Differential Equations [Chen et al. 2018]
BRIDGING CONTROL AND LEARNING
Feedback is the learning loss
Neural Network ↔ Optimization
A better model? LM-ResNet (ICML 2018), DURR (ICLR 2019), Macaroon (submitted)
A new algorithm? YOPO (submitted)
Interpretability? PDE-Net (ICML 2018)
An W, Wang H, Sun Q, et al. A PID controller approach for stochastic optimization of deep networks. CVPR 2018: 8522-8531.
Chen T Q, Rubanova Y, Bettencourt J, et al. Neural ordinary differential equations. NeurIPS 2018: 6571-6583.
Li Q, Hao S. An optimal control approach to deep learning and applications to discrete-weight neural networks. arXiv:1803.01299, 2018.
Chang B, Meng L, Haber E, et al. Multi-level residual networks from dynamical systems view. arXiv:1710.10348, 2017.
Chang B, Meng L, Haber E, et al. Reversible architectures for arbitrarily deep residual neural networks. AAAI 2018.
Tao Y, Sun Q, Du Q, et al. Nonlocal neural networks, nonlocal diffusion and nonlocal modeling. NeurIPS 2018: 496-506.
HOW THE DIFFERENTIAL EQUATION VIEW HELPS NEURAL NETWORK DESIGN
PRINCIPLED NEURAL ARCHITECTURE DESIGN
NEURAL NETWORKS AS ODE SOLVERS
Dynamical system ↔ Neural network
Continuous limit ↔ Numerical approximation
WRN, ResNeXt, Inception-ResNet, PolyNet, SENet, etc.: new schemes to approximate the right-hand-side term
Why not change the way we discretize $u_t$?
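A minimal sketch of the correspondence: stacking residual blocks $x_{n+1} = x_n + \Delta t\, f(x_n)$ is exactly forward-Euler integration of $\dot{x} = f(x)$ (the residual branch f here is a toy placeholder, not the talk's architecture):

```python
import numpy as np

def f(x):
    """Residual branch; stands in for a conv-BN-ReLU block (toy placeholder)."""
    return np.tanh(x)

def resnet_forward(x0, num_layers=10, dt=0.1):
    """Stacking residual blocks == forward Euler on dx/dt = f(x)."""
    x = x0
    for _ in range(num_layers):
        x = x + dt * f(x)      # one residual block = one Euler step
    return x

x0 = np.array([1.0, -0.5])
print(resnet_forward(x0))
```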
MULTISTEP ARCHITECTURE?
Forward Euler (ResNet): $x_{n+1} = x_n + f(x_n)$ discretizes $\dot{x} = f(x)$
Linear multi-step scheme: $x_{n+1} = (1 - k_n) x_n + k_n x_{n-1} + f(x_n)$
Linear Multi-step Residual Network (LM-ResNet): only one more parameter per block
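A toy sketch of the LM-ResNet update with its single extra parameter $k_n$ per block (the residual branch and the values of $k_n$ are placeholders, not trained values):

```python
import numpy as np

def lm_resnet_forward(x0, f, k, dt=1.0):
    """Linear multi-step residual network:
       x_{n+1} = (1 - k_n) x_n + k_n x_{n-1} + dt * f(x_n),
       with one learnable scalar k_n per block."""
    x_prev, x = x0, x0                      # start the two-step scheme from x0
    for k_n in k:
        x_next = (1.0 - k_n) * x + k_n * x_prev + dt * f(x)
        x_prev, x = x, x_next
    return x

f = np.tanh                                  # placeholder residual branch
k = [0.1, 0.2, 0.3]                          # one extra parameter per block
print(lm_resnet_forward(np.array([1.0, -0.5]), f, k))
```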
EXPERIMENT
Plot of the learned momentum
Analysis via zero-stability
MODIFIED EQUATION VIEW
$(1 + k_n)\,\dot{u} + \frac{(1 - k_n)\Delta t}{2}\,\ddot{u}_n + o(\Delta t^3) = f(u)$
Learn a momentum: $x_{n+1} = (1 - k_n)x_n + k_n x_{n-1} + \Delta t\, f(x_n)$
[Su et al. 2016] [Dong et al. 2017]
UNDERSTANDING DROPOUT
Can noise avoid overfitting? Model the network as a stochastic differential equation:
$dX(t) = f(X(t), a(t))\,dt + g(X(t), t)\,dB_t, \quad X(0) = X_0$
Objective: $\mathbb{E}_{\text{data}}\left[\mathrm{loss}(X(T))\right]$
The numerical scheme only needs to converge weakly (in distribution)!
STOCHASTIC DEPTH AS AN EXAMPLE
$x_{n+1} = x_n + \eta_n f(x_n) = x_n + \mathbb{E}[\eta_n] f(x_n) + (\eta_n - \mathbb{E}[\eta_n]) f(x_n)$
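A sketch of stochastic depth read this way: dropping whole residual branches with probability $1-p$ realizes $x_{n+1} = x_n + \eta_n f(x_n)$ with $\eta_n \sim \mathrm{Bernoulli}(p)$, a weak (in-distribution) approximation of the noisy dynamics (toy placeholder branch):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_depth_forward(x0, f, num_layers=10, keep_prob=0.8):
    """x_{n+1} = x_n + eta_n * f(x_n), eta_n ~ Bernoulli(keep_prob).
       E[eta_n] f(x_n) is the drift; (eta_n - E[eta_n]) f(x_n) is the noise."""
    x = x0
    for _ in range(num_layers):
        eta = rng.binomial(1, keep_prob)   # drop the whole block with prob 1 - p
        x = x + eta * f(x)
    return x

print(stochastic_depth_forward(np.array([1.0, -0.5]), np.tanh))
```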
Natural Language
UNDERSTANDING NATURAL LANGUAGE PROCESSING
Use a neural network to extract features of a document.
[Diagram: word embeddings (Emb) → deep neural network → outputs y1 … y5]
Idea: consider every word in a document as a particle in an n-body system.
TRANSFORMER AS A SPLITTING SCHEME
FFN: advection
Self-attention: particle interaction
[Vaswani et al. 2017] [Devlin et al. 2018]
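A minimal PyTorch sketch (hypothetical sizes, plain residual form without layer norm) of reading one Transformer layer as a splitting step: a self-attention sub-step for particle interaction followed by an FFN sub-step for per-particle advection:

```python
import torch
import torch.nn as nn

class SplittingTransformerLayer(nn.Module):
    """One splitting step: interaction (self-attention) then advection (FFN)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):                     # x: (seq_len, batch, d_model)
        interaction, _ = self.attn(x, x, x)   # particles interact
        x = x + interaction                   # first residual sub-step
        x = x + self.ffn(x)                   # per-particle advection sub-step
        return x

x = torch.randn(5, 2, 64)                     # 5 words, batch of 2
print(SplittingTransformerLayer()(x).shape)
```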
A BETTER SPLITTING SCHEME
RESULT
Computer Vision
ONE NOISE LEVEL, ONE NET
Network 1: σ = 25
Network 2: σ = 35
WE WANT: one model
WE ALSO WANT GENERALIZATION
[Figure: BM3D vs. DnCNN comparison]
RETHINKING THE TRADITIONAL FILTERING APPROACH
Perona-Malik equation: input → processing
[Figure: stop too early and the result is still noisy; run too long and it is over-smoothed]
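A minimal NumPy sketch of one explicit Perona-Malik step, $u_t = \mathrm{div}(g(|\nabla u|)\nabla u)$ with an assumed edge-stopping function $g(s) = 1/(1+(s/\kappa)^2)$; too few steps leave the image noisy, too many over-smooth it, which is exactly the stopping-time question:

```python
import numpy as np

def perona_malik_step(u, dt=0.1, kappa=0.1):
    """One explicit step of u_t = div( g(|grad u|) * grad u )."""
    # differences to the four neighbours
    dn = np.roll(u, -1, axis=0) - u
    ds = np.roll(u,  1, axis=0) - u
    de = np.roll(u, -1, axis=1) - u
    dw = np.roll(u,  1, axis=1) - u
    g = lambda d: 1.0 / (1.0 + (d / kappa) ** 2)   # edge-stopping function
    return u + dt * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)

u = np.random.rand(64, 64)           # toy noisy image
for _ in range(20):                  # the stopping time is the delicate choice
    u = perona_malik_step(u)
```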
MOVING ENDPOINT CONTROL VS FIXED ENDPOINT CONTROL
The controlled dynamics follow a physical law; τ is the arrival time (e.g., when the rocket reaches the moon).
$\min \int_0^{\tau} R(w(t), t)\,dt \quad \text{s.t. } \dot{X} = f(X(t), w(t))$
τ is the first time the dynamics meet the target set X.
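A toy sketch of the contrast, with `dynamics` and `good_enough` as hypothetical placeholders: fixed-endpoint control always integrates for T steps, while moving-endpoint control stops at the first time τ the stopping rule fires:

```python
def run_fixed_endpoint(x, dynamics, T=8):
    """Fixed endpoint: always integrate for T steps."""
    for _ in range(T):
        x = dynamics(x)
    return x

def run_moving_endpoint(x, dynamics, good_enough, max_steps=100):
    """Moving endpoint: tau = first time the stopping criterion is met."""
    for t in range(max_steps):
        if good_enough(x):        # e.g., a learned no-reference quality score
            return x, t           # t plays the role of tau
        x = dynamics(x)
    return x, max_steps

# toy usage: shrink x until it falls below a threshold
x, tau = run_moving_endpoint(10.0, lambda x: 0.5 * x, lambda x: x < 0.1)
```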
DURR: GENERALIZES TO UNSEEN NOISE LEVELS
[Figure: DURR vs. DnCNN on unseen noise levels]
Physics
PHYSICS DISCOVERY
Our Work
CONV FILTERS AS DIFFERENTIAL OPERATORS
The stencil
0  1  0
1 -4  1
0  1  0
approximates the Laplacian $\Delta u = u_{xx} + u_{yy}$.
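A quick numerical check (using SciPy's generic convolve, an assumed tool choice) that the stencil above, scaled by $1/h^2$, approximates $\Delta u$:

```python
import numpy as np
from scipy.ndimage import convolve

h = 0.05
x, y = np.meshgrid(np.arange(0, 1, h), np.arange(0, 1, h), indexing="ij")
u = np.sin(np.pi * x) * np.sin(np.pi * y)

stencil = np.array([[0.,  1., 0.],
                    [1., -4., 1.],
                    [0.,  1., 0.]]) / h**2      # discrete Laplacian

lap_conv = convolve(u, stencil, mode="nearest")  # apply the filter
lap_true = -2 * np.pi**2 * u                     # exact Laplacian of this u

# interior points agree up to O(h^2); boundaries are polluted by padding
print(np.abs(lap_conv - lap_true)[2:-2, 2:-2].max())
```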
PDE-NET
PDE-NET 2.0
- Constrain the function space
- Theoretical recovery guarantee (coming soon)
- Symbolic discovery
ONGOING WORK
Shock Wave Architecture Search: Yufei Wang, Ziju Shen, Zichao Long, Bin Dong. Learning to Solve Conservation Laws.
HOW THE DIFFERENTIAL EQUATION VIEW HELPS OPTIMIZATION
CONSIDERING THE NEURAL NETWORK STRUCTURE
RELATED WORKS
Maximum Principle [Li et al. 2018a] [Li et al. 2018b]
Adjoint Method [Chen et al. 2018]
Multigrid [Chang et al. 2017]
Domain Decomposition
Our work: the ODE view captures the composition structure of the neural network!
TOWARDS DEEP LEARNING MODELS RESISTANT TO ADVERSARIAL ATTACKS
Problem:
- More capacity and stronger adversaries decrease transferability; robust networks are typically about 10 times wider.
- PGD adversarial training is expensive!
Can adversarial training be made cheaper?
Robust Optimization
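A standard PGD inner loop in PyTorch makes the cost concrete: each of the K attack steps needs a full forward and backward pass, so PGD adversarial training is roughly K times the cost of clean training (a generic sketch, not this talk's implementation):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L_inf PGD: every step is a full forward + backward pass through the net."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)   # project to the eps-ball
        delta = delta.detach().requires_grad_(True)              # pixel-range clamping omitted
    return (x + delta).detach()
```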
TAKE THE NEURAL NETWORK ARCHITECTURE INTO CONSIDERATION
Previous work: the adversary updater treats the network as a black box, requiring heavy gradient calculations.
YOPO: restrict the adversary updater to the first layer of the network.
DIFFERENTIAL GAME
[Diagram: trajectory, goal, Player 1, Player 2; network layer 1 and network layer 2: composition structure]
DECOUPLED TRAINING
Synthetic Gradients [Jaderberg et al. 2017]
Lifted Neural Networks [Askari et al. 2018] [Gu et al. 2018] [Li et al. 2019]
Delayed Gradient [Huo et al. 2018]
Block Coordinate Descent [Lau et al. 2018]
Can the control perspective help us understand decoupling?
Our idea: decouple gradient back-propagation from the adversary update.
YOPO (YOU ONLY PROPAGATE ONCE)
[Diagram: the first iteration runs a full back-propagation to obtain the first-layer adjoint p; the other iterations copy p and update only the first-layer quantities x]
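A schematic sketch of the decoupling idea rather than the exact YOPO algorithm: one full back-propagation yields the adjoint p at the first layer's output, and that frozen p drives several cheap perturbation updates that only touch the first layer; `first_layer` and `rest_of_net` are hypothetical module names:

```python
import torch
import torch.nn.functional as F

def yopo_style_adversary(first_layer, rest_of_net, x, y,
                         eps=8/255, alpha=2/255, outer=5, inner=3):
    """Decouple the adversary update from full back-propagation."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(outer):
        h = first_layer(x + delta)
        loss = F.cross_entropy(rest_of_net(h), y)
        p, = torch.autograd.grad(loss, h)          # the one full backward pass
        p = p.detach()                             # freeze the adjoint
        for _ in range(inner):                     # cheap: only the first layer
            h = first_layer(x + delta)
            g, = torch.autograd.grad((p * h).sum(), delta)
            delta = (delta + alpha * g.sign()).clamp(-eps, eps)
            delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()
```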
WHY DECOUPLING
RESULT
SO, WHAT IS AN INFINITELY WIDE NEURAL NETWORK?
Neural network as a particle system / linearized neural network
Rotskoff G M, Vanden-Eijnden E. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915, 2018.
Mei S, Montanari A, Nguyen P M. A mean field view of the landscape of two-layer neural networks. PNAS, 2018, 115(33): E7665-E7671.
Infinitely wide neural network = kernel method
Random Features + Neural Tangent Kernel
Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS 2018: 8571-8580.
Du S S, Lee J D, Li H, et al. Gradient descent finds global minima of deep neural networks. arXiv:1811.03804, 2018.
Allen-Zhu Z, Li Y, Song Z. A convergence theory for deep learning via over-parameterization. arXiv:1811.03962, 2018.
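A toy sketch of the kernel view: the empirical neural tangent kernel $\langle \nabla_\theta f(x;\theta_0), \nabla_\theta f(x';\theta_0)\rangle$ of a finite-width network (the two-layer net and its sizes are illustrative); as the width grows this kernel concentrates and stays nearly constant during training:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 512), nn.ReLU(), nn.Linear(512, 1))
params = list(net.parameters())

def param_grad(x):
    """Gradient of the scalar output w.r.t. all parameters, flattened."""
    out = net(x).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

def ntk(x1, x2):
    """Empirical neural tangent kernel <grad_theta f(x1), grad_theta f(x2)>."""
    return torch.dot(param_grad(x1), param_grad(x2)).item()

x1, x2 = torch.randn(1, 2), torch.randn(1, 2)
print(ntk(x1, x2))   # for very wide nets this kernel barely moves during training
```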
TWO DIFFERENTIAL EQUATIONS FOR DEEP LEARNING
[Diagram: data space → Neural ODE → feature space → classifier (dog / cat); nonlocal PDE with an evolving kernel]
WHAT FEATURES DOES THE NONLOCAL PDE CAPTURE?
[Osher et al. 2016]
Harmonic equation -> dimensionality
Dimension optimization alone is not enough
Biharmonic equation -> curvature
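A sketch of the harmonic (graph-Laplacian) interpolation behind the nonlocal view of semi-supervised learning: a few labels are extended to all points by solving the Laplace equation on the unlabeled nodes; the Gaussian weight construction is an assumed simple choice, not necessarily the one used in the cited work:

```python
import numpy as np

def harmonic_interpolation(X, labeled_idx, labels, sigma=1.0):
    """Extend labels harmonically: solve (L u)_i = 0 for every unlabeled i."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))              # nonlocal similarity weights
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W                       # graph Laplacian
    n = len(X)
    unlabeled = np.setdiff1d(np.arange(n), labeled_idx)
    u = np.zeros(n)
    u[labeled_idx] = labels
    # solve the unlabeled block: L_uu u_u = -L_ul u_l
    L_uu = L[np.ix_(unlabeled, unlabeled)]
    L_ul = L[np.ix_(unlabeled, labeled_idx)]
    u[unlabeled] = np.linalg.solve(L_uu, -L_ul @ labels)
    return u

X = np.random.rand(50, 2)
u = harmonic_interpolation(X, labeled_idx=np.array([0, 1]), labels=np.array([0.0, 1.0]))
```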
SEMI-SUPERVISED LEARNING
IMAGE INPAINTING
Weighted Nonlocal Laplacian: PSNR = 23.51 dB, SSIM = 0.62
Weighted Curvature Regularization: PSNR = 24.07 dB, SSIM = 0.68
GOING NEURAL!
LOW FREQUENCY FIRST
Multigrid Method [Xu 1996]
Inverse Scale Space [Scherzer et al. 2001] [Burger et al. 2006]
Ridge Regression [Smale et al. 2007] [Yao et al. 2007]
Experiments in Neural Networks [Rahaman et al. 2018] [Xu et al. 2019]
TEACHER-STUDENT OPTIMIZATION
[Diagram: train the teacher network, then train the student from the teacher's outputs ("dark knowledge")]
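For contrast with the no-distillation setting below, a standard "dark knowledge" distillation loss sketch; the temperature T and the mixing weight alpha are illustrative choices, not the talk's settings:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Mix hard-label cross entropy with KL to the teacher's softened outputs."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * soft + (1 - alpha) * hard
```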
NO EARLY STOPPING, NO DISTILLATION
[Plots: teacher test accuracy and student test accuracy over training]
LABEL DENOISING
[Figure: the network's prediction interpolates across noisy labels]
THEORY CAN GUARANTEE OUR METHOD!
Separable, clusterable dataset
…
A wide enough neural network ensures a margin
LABEL DENOISING
We use an extremely large network for the experiments!
FUTURE WORK: META-LEARNING
Aggregating AIR and GT labels
[Tanaka et al. 2018] [Sun et al. 2018] [Han et al. 2018] [Han et al. 2019]
Meta-learner
Aggregated signal
Supervision signal: validation-set robustness
DO NOISY LABELS LEAD TO ADVERSARIAL EXAMPLES?
Initialization
Clean-label training: good Lipschitz constant
Noisy-label training: bad Lipschitz constant
TAKE-HOME MESSAGE
[Diagram: data space → Neural ODE → feature space → classifier (dog / cat); nonlocal PDE with an evolving kernel]
Depth limit: architecture design and optimization
Width limit: geometric features and implicit regularization
RESEARCH INTERESTS
Geometry of Data
Control View of Deep Learning
Kernel Learning
PDEs on Graphs
Learning with limited data (noisy labels, semi-supervised, weakly supervised)
Computational Photography, Image Processing, Graphics
THANK YOU AND QUESTIONS?
Always looking forward to collaboration opportunities.
Contact: [email protected]
Lu Y, Zhong A, Li Q, Dong B. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. arXiv:1710.10121. ICML 2018.
Long Z, Lu Y, Ma X, Dong B. PDE-Net: Learning PDEs from Data. arXiv:1710.09668. ICML 2018.
Zhang S, Lu Y, Liu J, Dong B. Dynamically Unfolding Recurrent Restorer: A Moving Endpoint Control Method for Image Restoration. arXiv:1805.07709. ICLR 2019.
Long Z, Lu Y, Dong B. PDE-Net 2.0: Learning PDEs from Data with a Numeric-Symbolic Hybrid Deep Network. arXiv:1812.04426. Major revision at JCP.
Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, Bin Dong. You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle. arXiv:1905.00877.
Yiping Lu, Di He, Zhuohan Li, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-Yan Liu. OreoNet: Understanding NLP from a Neural ODE Viewpoint. (Submitted)
Bin Dong, Haochen Ju, Yiping Lu, Zuoqiang Shi. CURE: Curvature Regularization for Missing Data Recovery. arXiv:1901.09548, 2019.
Bin Dong, Jikai Hou, Yiping Lu, Zhihua Zhang. Distillation ≈ Early Stopping? Extracting Knowledge Utilizing Anisotropic Information Retrieval. (Submitted)
ACKNOWLEDGEMENT
Bin Dong, Quanzheng Li, Zhihua Zhang, Liwei Wang, Zhanxing Zhu, Di He, Xiaoshuai Zhang, Zhuohan Li, Jikai Hou, Tianyuan Zhang, Dinghuai Zhang, Aoxiao Zhong, Zhiqing Sun