Training Simplification and Model Simplification for Deep Learning:
A Minimal Effort Back Propagation Method
Xu Sun
Peking University
Research Background & Challenges
Introduction
Machine learning for Natural Language Processing
[Diagram: Machine Learning connecting Language Understanding and Language Generation]
Outline
Minimal Back Propagation
meProp
meSimp
Others
How to identify a taxi?
Motivation: about overfitting
Feature 1 (shape): essential feature
Feature 2 (wheel): essential feature
Feature 3 (color): non-essential feature
Feature 3 is unhelpful and could even be harmful: the model may simply "memorize" the label based on many non-essential features, so the essential features could be insufficiently trained
The question: how to identify the essential features?
This is actually quite challenging for a learning system
We try to use the back propagation information to find those essential features (and the related neurons)
In deep learning, back propagation computes the "importance" of an input feature
Thus, a feature with a larger gradient magnitude in back propagation is more essential for this sample
Motivation: about overfitting
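As a toy illustration of this point (our own PyTorch sketch, not from the talk; the weights are made up), the gradient magnitude w.r.t. each input feature can be read as a per-sample importance score:

```python
import torch

# Toy check: the gradient magnitude w.r.t. each input feature
# serves as a per-sample importance score.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
W = torch.tensor([[0.5, -1.0, 0.01]])  # the third feature barely matters
loss = (W @ x).pow(2).sum()
loss.backward()
print(x.grad.abs())  # the third feature gets a much smaller magnitude
```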
An illustration of meProp
Proposal: minimal effort backprop (meProp)
Minimal effort: only update the "essential" features/parameters
Benefit: only some neurons are related
[Illustration: the sentence "Hello World I Love Natural Language Processing"; only the neurons related to the current sample need updating]
The computational cost of back propagation can also be reduced
Task (Adam)    Model           Ov. FP time   Ov. BP time   Ov. time
POS-Tag        LSTM (h=500)    7,334s        16,522s       23,856s
Parsing        MLP (h=500)     3,906s        9,114s        13,020s
MNIST          MLP (h=500)     69s           171s          240s
Benefit: computational cost
Forward propagation (FP) time vs. Backward propagation (BP) time
Back propagation is costly in deep learning
Motivation 2: computational cost
Forward propagation (FP) time vs. backward propagation (BP) time
[Chart: overall FP time vs. overall BP time as a share of total time (0%-100%) for POS-Tag, Parsing, and MNIST]
Back propagation accounts for the major part of the computational cost
An illustration of meProp (ICML 2017)
Method
Method
Original backprop
Method
meProp: Top-k sparsified backprop
Computation cost is proportional to n
Consider a basic computation unit
Back propagation
Top-k sparsifying leads to a linear reduction in the computation cost
Method
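The following is a minimal PyTorch sketch of such a unit with top-k sparsified backprop (our own illustration under assumed shapes; the official implementation is in the lancopku/meProp GitHub repo cited in the conclusions): the forward pass stays the usual linear map, while the backward pass keeps only the k output-gradient components with the largest magnitude, so the cost of the gradient products drops from O(nm) to O(km).

```python
import torch

class MePropLinear(torch.autograd.Function):
    """Linear unit whose backward pass keeps only the top-k
    components of the output gradient (sketch of meProp)."""

    @staticmethod
    def forward(ctx, x, W, k):
        ctx.save_for_backward(x, W)
        ctx.k = k
        return x @ W.t()  # forward pass is unchanged

    @staticmethod
    def backward(ctx, grad_out):
        x, W = ctx.saved_tensors
        # keep the k entries of grad_out with the largest magnitude
        _, idx = grad_out.abs().topk(ctx.k, dim=-1)
        mask = torch.zeros_like(grad_out)
        mask.scatter_(-1, idx, 1.0)
        sparse_grad = grad_out * mask
        # both gradient products now use the sparsified vector
        return sparse_grad @ W, sparse_grad.t() @ x, None

# usage: y = MePropLinear.apply(x, W, k) with k much smaller than n
```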
An illustration of meProp on a mini-batch learning setting
Method
Results based on various models (LSTM/MLP) and different optimizers (AdaGrad/Adam)
Experiments
Overall forward propagation time vs. overall back propagation time.
Experiments
Varying the number of hidden layers
Experiments
Acceleration results on MNIST using GPU.
Acceleration results on the matrix multiplication synthetic data using GPU.
Speedup on GPU (significant for heavy models, i.e., with large h)
Experiments
Accuracy of MLP vs. meProp's backprop ratio.
Results of top-k meProp vs. random meProp.
Results of top-k meProp vs. baseline with the hidden dimension h.
Further analysis
Experiments
Propose a highly sparsified back propagation method:
Update only 1–4% of the weights at each backprop pass
Does not result in a larger number of training iterations
The accuracy is actually improved rather than degraded
Conclusions
Sun et al. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. ICML 2017.
GitHub: https://github.com/lancopku/meProp
Outline
Minimal Back Propagation
meProp
meSimp
Others
What is the proper size for a network w.r.t. a task?
Motivation 1: neural network size
The size of the neural network has a huge impact and needs to be adjusted for each task accordingly
A smaller network is faster to train, but can underfit
A bigger network is slower to train, and can overfit
Motivation: rare features
[Illustration: the sentence "Hello World I Love Natural Language Processing"]
Meaningless neurons given the current data
The question: how to automatically determine the size?
This is very important for deep learning, where tuning the size is costly
We use the back propagation information to automatically determine the network size
Eliminating the redundant neurons
If a feature is essential for most of the samples, the related neuron plays a more important role in the layer
Motivation 1: neural network size
An illustration of meSimp
Proposal: minimal effort simplification (meSimp)
Minimal effort: only keep the "essential" features/parameters
[Diagram: forward propagation (original) → back propagation (meProp); activeness is collected from multiple examples, and inactive neurons are eliminated (model simplification)]
The computational cost of forward propagation can also be reduced
Task (Adam)    Model           Ov. FP time   Ov. BP time   Ov. time
POS-Tag        LSTM (h=500)    7,334s        16,522s       23,856s
Parsing        MLP (h=500)     3,906s        9,114s        13,020s
MNIST          MLP (h=500)     69s           171s          240s
Motivation 2: computational cost
Forward propagation (FP) time vs. Backward propagation (BP) time
When a trained model is deployed, only forward propagation is executed
An illustration of meSimp
Method
Minimal effort: only keep the "essential" features/parameters
[Diagram: forward propagation (original) → back propagation (meProp); activeness is collected from multiple examples, and inactive neurons are eliminated (model simplification)]
Computation cost is proportional to n
Consider a basic computation unit
Forward propagation
Eliminating the inactive neurons leads to a linear reduction in the computation cost
Method
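A rough PyTorch sketch of the simplification step (our illustration; `activeness`, `keep_ratio`, and the two-matrix interface are assumptions, not the authors' API): neurons that rarely appear in meProp's top-k are treated as inactive, and removing one deletes a row of the incoming weight matrix and a column of the outgoing one, shrinking both matrix products linearly.

```python
import torch

def simplify_layer(W_in, W_out, activeness, keep_ratio=0.5):
    """Prune the least-active hidden neurons of a layer (meSimp sketch).

    W_in:       (hidden, in) weights producing the hidden layer
    W_out:      (out, hidden) weights consuming the hidden layer
    activeness: (hidden,) how often each neuron was in meProp's top-k
    """
    n_keep = max(1, int(keep_ratio * activeness.numel()))
    keep = activeness.topk(n_keep).indices.sort().values
    # dropping a hidden neuron removes a row of W_in and a column of W_out
    return W_in[keep, :], W_out[:, keep]
```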
Results based on various models (LSTM/MLP)
Experiments
The size of the network can be reduced by about 9x on two NLP tasks, and so is the computational cost of forward propagation; this indicates that forward propagation can be substantially accelerated.
The performance of the model can be improved, and the appropriate dimension for a task is determined automatically.
Automatically determine the appropriate dimension for each layer in the neural network.
Parsing: better than the model of the same dimension
Experiments
We test this claim using meAct: the inactive neurons w.r.t. each example are deactivated after 10 epochs
Accuracy rises sharply, and the weight updates drop suddenly
A neural network needs an appropriate size; meSimp automatically determines the size via minimal effort back propagation
Further analysis: redundant neurons are fitting to noise
Propose a model simplification method based on the activeness of the neurons
The size of the neural network can be reduced up to 9x
The accuracy of the simplified model is actually improved
Conclusions
Sun et al. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. ICML 2017.
Sun et al. Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method. arXiv 2017.
GitHub: https://github.com/lancopku/meProp
Outline
Minimal Back Propagation
meProp
meSimp
Others
How to model the complex structural dependencies in natural language?
Designing models of higher structural complexity
Typically, employing complex output structure
e.g. third-order tags
StructReg: Motivation
StructReg: Motivation
Higher complexity: more powerful, but less stable
However, models of higher structural complexity can actually hurt accuracy
What is the relation between structural complexity and generalization?
Theoretical analysis
→ Structure regularization decoding methods
StructReg: Motivation
Test F1         Chunking        Eng-NER
BLSTM-1st-tag   93.97           87.65
BLSTM-2nd-tag   93.24 (-0.73)   87.59 (-0.06)
BLSTM-3rd-tag   92.50 (-1.47)   87.16 (-0.49)
StructReg: Theoretical Analysis
[Equation: generalization risk decomposed into the empirical risk plus an overfit-bound that grows with structural complexity]
Conclusions from our analysis:
1. Complex structure: low empirical risk & high overfitting risk
2. Simple structure: high empirical risk & low overfitting risk
3. Need a balanced complexity of structures
Proposal: Structure Regularization Decoding
Use two models of different structural complexity
Use the simple-structure model to regularize the complex-structure model
Advantage
Reduce the structural overfitting bound
Maintain the low empirical risk of the complex structure model
StructReg: Method
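A hedged sketch of what SR decoding could look like for sequence tagging (our illustration only; the interpolation weight `alpha`, greedy decoding, and the score shapes are assumptions, not the paper's exact algorithm): the simple-structure model's scores regularize the complex-structure model's scores at decoding time.

```python
import torch

def sr_decode(scores_complex, scores_simple, alpha=0.5):
    """Structure-regularized decoding sketch for sequence tagging.

    scores_*: (seq_len, n_tags) log-scores from the complex- and
              simple-structure models
    alpha:    weight of the simple model (the regularizer)
    """
    mixed = (1 - alpha) * scores_complex + alpha * scores_simple
    return mixed.argmax(dim=-1)  # one predicted tag per position
```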
Apply to linear-chain structures
Neural model based on BLSTM
Complex structures often hurt accuracy
SR decoding improves accuracy
StructReg: Experiments
Apply to hierarchical structures
Joint empty category detection and dependency parsing
Complex structures often hurt accuracy
SR decoding improves accuracy
StructReg: Experiments
A structural complexity regularization framework
Reduce error rate by 36.4% for the third-order models
StructReg: Conclusions
X. Sun. Structure Regularization for Structured Prediction. NIPS 2014.
Sun et al. Complex Structure Leads to Overfitting: A Structure Regularization Decoding Method for Natural Language Processing. arXiv 2017.
Label Embedding for Soft Training for Neural Networks
Adaptively learn meaningful embeddings for labels
Use the learned embeddings to soften the training
Label Embedding
[Diagram: SoftTrain vs. SoftTrain with LabelEmb]
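One possible reading of soft training with label embeddings, as a PyTorch sketch (our illustration; `tau`, `lam`, and the exact mixing of hard and soft losses are assumptions, not the paper's formulation): the learned label embeddings produce soft targets that are combined with the usual cross-entropy.

```python
import torch
import torch.nn.functional as F

def soft_train_loss(logits, target, label_emb, tau=2.0, lam=0.5):
    """Soft training sketch: hard cross-entropy plus a KL term
    toward soft targets derived from learned label embeddings.

    logits:    (batch, n_classes) model outputs
    target:    (batch,) gold label indices
    label_emb: (n_classes, dim) learnable label embedding matrix
    """
    hard = F.cross_entropy(logits, target)
    # soft targets: similarity of the gold label's embedding to all labels
    sim = label_emb[target] @ label_emb.t()        # (batch, n_classes)
    soft_target = F.softmax(sim / tau, dim=-1)
    soft = F.kl_div(F.log_softmax(logits, dim=-1), soft_target,
                    reduction='batchmean')
    return hard + lam * soft
```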
Label Embedding for Soft Training for Neural Networks
Substantial improvements for both CV and NLP tasks
Label Embedding
[Figures: summarization; image recognition]
Label Embedding for Soft Training for Neural Networks
Learned label embeddings are very meaningful: NLP
Label Embedding
Label similarity for Summarization
Label similarity for Machine Translation
Label Embedding for Soft Training for Neural Networks
Learned label embeddings are very meaningful: CV
Label Embedding
Sun et al. Label Embedding Network: Learning Label Representation for Soft Training of Deep Networks. arXiv 2017
Chinese social media text summarization
Existing models are based on the encoder-decoder framework
The generated summaries are literally similar to the source texts
But they have low semantic relevance
SRB: Motivation
Semantic Relevance Based neural model (ACL 2017)
Encourage high semantic similarity between texts and summaries
It consists of a decoder (above), an encoder (below), and a cosine similarity function.
Text Representation
Source text representation: $V_t = h_N$
Generated summary representation: $V_s = s_M - h_N$
Semantic relevance: $\cos(V_s, V_t) = \dfrac{V_t \cdot V_s}{\lVert V_t \rVert \, \lVert V_s \rVert}$
Training objective: $L = -\log p(y \mid x; \theta) - \lambda \cos(V_s, V_t)$
SRB: Method
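Putting the objective into code, a minimal sketch (our illustration; `lam` and the batch-mean reduction are assumptions): the loss is the negative log-likelihood of the summary minus λ times the cosine similarity of the two representations.

```python
import torch
import torch.nn.functional as F

def srb_loss(log_probs, v_t, v_s, lam=0.5):
    """SRB objective sketch: L = -log p(y|x) - lambda * cos(V_s, V_t).

    log_probs: (batch,) log p(y|x; theta) of the gold summaries
    v_t:       (batch, dim) source text representations (h_N)
    v_s:       (batch, dim) summary representations (s_M - h_N)
    """
    nll = -log_probs.mean()
    cos = F.cosine_similarity(v_s, v_t, dim=-1).mean()
    return nll - lam * cos
```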
Large Scale Chinese Short Text Summarization Dataset (LCSTS)
SRB: Experiments
Our models achieve substantial improvements in all ROUGE scores over the baseline systems. (W: word level; C: character level)
Example of SRB Generated Summary
SRB: Experiments
Proposal: Semantic Relevance Based model
Transform the text and the summary into dense vectors
Encourage high similarity between their representations
The generated summary has higher semantic relevance
SRB: Conclusions
Ma et al. Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization. ACL 2017
Thanks!
Also thanks to collaborators Xuancheng Ren, Shuming Ma, and Bingzhen Wei