Optimization as a model for few-shot learning

Sachin Ravi, Hugo Larochelle (Twitter)

ICLR, 2017
Presenter: Beilun Wang

Outline

1 Introduction
    Motivation
    Previous Solutions
    Contributions

2 Proposed Methods
    gradient descent and LSTM
    The Proposed Method

3 Summary

Motivation

Deep learning has shown great success in a variety of tasks with large amounts of labeled data.

However, it performs poorly on few-shot learning tasks, where only a handful of labeled examples per class are available.

This paper uses an LSTM-based meta-learner model to learn the exact optimization algorithm used to train another learner neural network in the few-shot regime.

Problem Setting

Input: a meta-set 𝒟; each dataset D ∈ 𝒟 is split into D_train and D_test.

Target: an LSTM-based meta-learner.

Output: a neural-network learner trained for each task.
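
To make the split concrete, here is a minimal sketch, in NumPy and with hypothetical names (sample_episode, images_by_class), of drawing one N-way, k-shot episode with its D_train/D_test split; the paper's actual data pipeline is not reproduced here.

import numpy as np

def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one few-shot episode: D_train holds k_shot labeled examples per
    class and D_test holds n_query examples per class, from n_way random classes."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(len(images_by_class), size=n_way, replace=False)
    d_train, d_test = [], []
    for label, c in enumerate(classes):
        idx = rng.permutation(len(images_by_class[c]))
        support, query = idx[:k_shot], idx[k_shot:k_shot + n_query]
        d_train += [(images_by_class[c][i], label) for i in support]
        d_test  += [(images_by_class[c][i], label) for i in query]
    return d_train, d_test  # (D_train, D_test) for one dataset D in the meta-set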

Previous Solutions

gradient-based optimization
    momentum, Adagrad, Adadelta, ADAM

learning to learn
    no strong guarantees of speed of convergence

meta-learning
    quick acquisition of knowledge within each separate task presented
    slower extraction of information learned across all the tasks

Contributions

An LSTM-based meta-learner model.

Achieves better performance on few-shot learning tasks.

Gradient-based method

\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t
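
For reference, a minimal NumPy sketch of this update (names are illustrative):

import numpy as np

def sgd_step(theta_prev, grad, alpha):
    """Plain gradient descent: theta_t = theta_{t-1} - alpha_t * grad."""
    return theta_prev - alpha * grad

# Example: one step with a fixed learning rate
theta = np.array([0.5, -1.0])
theta = sgd_step(theta, grad=np.array([0.2, -0.4]), alpha=0.1)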

The update for the cell state in an LSTM

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

If f_t = 1, c_{t-1} = \theta_{t-1}, i_t = \alpha_t, and \tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t, then this cell-state update is exactly the gradient-based update above.
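
A small numerical check of this equivalence (NumPy, illustrative names): with the stated substitutions, the cell-state update reproduces the gradient-descent step exactly.

import numpy as np

def lstm_cell_update(c_prev, f_t, i_t, c_tilde):
    """c_t = f_t * c_{t-1} + i_t * c~_t (element-wise)."""
    return f_t * c_prev + i_t * c_tilde

theta_prev = np.array([0.5, -1.0])
grad = np.array([0.2, -0.4])
alpha = 0.1

# Substitute f_t = 1, c_{t-1} = theta_{t-1}, i_t = alpha_t, c~_t = -grad
as_lstm = lstm_cell_update(c_prev=theta_prev, f_t=1.0, i_t=alpha, c_tilde=-grad)
as_sgd = theta_prev - alpha * grad
assert np.allclose(as_lstm, as_sgd)  # the two updates coincide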

The formulation of the meta-learner

learning rate i_t:

i_t = \sigma(W_I \cdot [\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}] + b_I)

forget gate f_t:

f_t = \sigma(W_F \cdot [\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}] + b_F)
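
A minimal NumPy sketch of these gate equations (W_I, b_I, W_F, b_F are illustrative names for the shared weights; the full meta-learner also maintains LSTM hidden state and feeds in the preprocessed inputs described on the next slide, both omitted here):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def meta_learner_step(theta_prev, grad, loss, i_prev, f_prev, W_I, b_I, W_F, b_F):
    """One parameter update by the meta-learner, applied coordinate-wise."""
    # Gate inputs per coordinate: [gradient, loss, previous parameter, previous gate]
    x_i = np.stack([grad, np.full_like(grad, loss), theta_prev, i_prev], axis=-1)
    x_f = np.stack([grad, np.full_like(grad, loss), theta_prev, f_prev], axis=-1)
    i_t = sigmoid(x_i @ W_I + b_I)               # learning-rate-like input gate
    f_t = sigmoid(x_f @ W_F + b_F)               # forget gate: how much of theta_{t-1} to keep
    theta_t = f_t * theta_prev + i_t * (-grad)   # cell-state update = new learner parameters
    return theta_t, i_t, f_t

Because W_I and b_I act on a small per-coordinate input vector, the same weights are applied to every coordinate, which is the parameter sharing discussed on the next slide.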

Parameter sharing and Normalization

Share parameters across the coordinates of the learner gradient

Each dimension has its own hidden and cell state values, but the LSTM parameters are the same across all coordinates.

Normalize the gradients and the losses across different dimensions:

x \rightarrow \begin{cases} \left( \frac{\log(|x|)}{p}, \operatorname{sign}(x) \right) & \text{if } |x| \ge e^{-p} \\ (-1, e^{p} x) & \text{otherwise} \end{cases} \qquad (1)
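
A minimal NumPy sketch of the preprocessing in Eq. (1), which maps each gradient or loss value x to a two-dimensional representation with comparable magnitudes across coordinates; p = 10 is shown as a typical choice and is an assumption here, not necessarily the paper's exact setting.

import numpy as np

def preprocess(x, p=10.0):
    """Eq. (1): x -> (log|x|/p, sign(x)) if |x| >= e^{-p}, else (-1, e^p * x)."""
    x = np.asarray(x, dtype=float)
    big = np.abs(x) >= np.exp(-p)
    # Guard against log(0); the guarded branch is only used where big is False.
    first = np.where(big, np.log(np.maximum(np.abs(x), 1e-300)) / p, -1.0)
    second = np.where(big, np.sign(x), np.exp(p) * x)
    return np.stack([first, second], axis=-1)  # shape (..., 2)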

Summary Figure

Experiment Results: average classification accuracy

Summary

This paper proposes an LSTM-based meta-learner model.

It improves the performance of training deep neural networks in few-shot learning tasks.
