A Survey on Meta-Learning - Shawn Li's Blog
A Survey on Meta-Learning
Xiang Li
Nanyang Technological University
September 10, 2019
1 / 53
What’s Not Included in This Survey?
- Meta-Learning for Reinforcement Learning
- Bayesian-Based Meta-Learning
2 / 53
Overview
- Problem Definition
  - Few-shot Learning
- Approaches
  - Non-Parametric Methods (Metric Learning)
  - Model-Based Methods (Black-Box Adaptation)
  - Optimization-Based Methods
  - Other Methods
- Summary
3 / 53
Problem Definition
- Over a task distribution $p(\mathcal{T})$:

  $\theta^* \leftarrow \arg\min_\theta \sum_{t \sim p(\mathcal{T})} \mathcal{L}_t(\theta)$

- Common tasks:
  - Few-shot Learning
  - Regression
  - Reinforcement Learning
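The objective above can be sketched with plain stochastic gradient descent over sampled tasks. The 1-D linear task family (`y = a*x`) and every hyperparameter below are hypothetical choices for illustration only:

```python
import numpy as np

def meta_objective_sgd(steps=500, lr=0.1, seed=0):
    """Minimize sum over t ~ p(T) of L_t(theta) by SGD on sampled tasks.
    Toy task family (an assumption for illustration): 1-D linear
    regression y = a*x with a random slope a drawn per task."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(steps):
        a = rng.uniform(-2.0, 2.0)                    # sample a task t ~ p(T)
        x = rng.uniform(-1.0, 1.0, size=20)
        grad = np.mean(2 * (theta * x - a * x) * x)   # dL_t/dtheta for MSE loss
        theta -= lr * grad                            # one SGD step on the shared theta
    return theta
```

With this symmetric task family, the shared minimizer hovers near the mean slope; the point is only the loop structure: sample a task, take a gradient step on the shared parameters.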
4 / 53
Few-shot Learning: Definition
- Training set $D_{meta\text{-}train} = \{(x_1, y_1), \dots, (x_k, y_k)\}$
- Test set $D_{meta\text{-}test} = \{D_1, \dots, D_n\}$, where $D_i = \{(x_{i1}, y_{i1}), \dots, (x_{im}, y_{im})\}$
- N-way, K-shot problem (usually 5-way 5-shot or 5-way 1-shot)

(Figure from Ravi et al. '17: each task is split into a support set, whose labels are given, and a query set, whose labels are to be predicted.)
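One episode of an N-way, K-shot problem can be sampled roughly as follows; the `dataset` layout (a dict mapping each class label to its list of examples) is an assumed convention for this sketch:

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=5, seed=None):
    """Sample one N-way, K-shot episode from a dict mapping
    class label -> list of examples (hypothetical data layout).

    Returns (support, query) lists of (example, episode_label) pairs."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)          # pick N classes
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # draw K support + Q query examples without replacement
        examples = rng.sample(dataset[cls], k_shot + q_queries)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query
```

With `n_way=5, k_shot=1, q_queries=5`, the support set holds 5 labeled examples and the query set holds 25 examples to classify.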
5 / 53
Few-shot Learning: Dataset
- Omniglot dataset (Lake et al. '15): 1623 characters, 20 instances of each character; the "transpose" of MNIST (many classes, few examples per class)
- Mini-ImageNet: a subset of ImageNet
- Other datasets: CIFAR, CUB, CelebA, and more
6 / 53
Non-Parametric Methods (Metric Learning)
- Idea: simply compare query images to images in the support set in a feature space
- Challenges:
  - Which feature space to compare in?
  - Which distance metric to use?
8 / 53
Siamese Network (Koch et al. ’15)
9 / 53
- Can we make the training condition match the test condition?
- Can the feature space be conditioned on the specific task?
10 / 53
Matching Networks for One Shot Learning (Vinyals et al. '16)

- Sampling strategy: test and training conditions must match, so sample "episodes" to form each training batch
- Attention kernel:

  $\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$, where $a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}), g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}}$

- Full Context Embeddings:
  - $g_\theta$ is a bidirectional LSTM that encodes $x_i$ in the context of the support set $S$
  - $f_\theta$ is an LSTM that encodes the query $\hat{x}$ in the context of $S$
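Given pre-computed embeddings and cosine similarity as $c$ (both assumptions of this sketch), the attention kernel reduces to a softmax-weighted vote over the support labels:

```python
import numpy as np

def attention_predict(query_feat, support_feats, support_onehot):
    """Matching-Networks-style attention kernel. A sketch that assumes the
    embeddings f(x-hat) and g(x_i) are pre-computed and that c is cosine
    similarity."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = unit(support_feats) @ unit(query_feat)   # c(f(x-hat), g(x_i)) per i
    a = np.exp(sims) / np.exp(sims).sum()           # softmax attention a(x-hat, x_i)
    return a @ support_onehot                       # y-hat = sum_i a_i * y_i
```

The output is a probability distribution over episode classes, entirely feed-forward: no per-task parameter update is involved.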
11 / 53
Other Metric-Based Works
- Prototypical Networks for Few-shot Learning (Snell et al. '17)

  $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$

  $p_\phi(y = k \mid x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$

- Learning to Compare: Relation Network for Few-Shot Learning (Sung et al. '18)
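The two Prototypical-Networks equations above can be sketched as follows, assuming the embeddings $f_\phi(x)$ are pre-computed and $d$ is squared Euclidean distance:

```python
import numpy as np

def prototype_probs(query_feat, support_feats, support_labels, n_way):
    """Prototypical-Networks-style classification (a sketch with pre-computed
    embeddings f_phi and squared Euclidean distance as d)."""
    # c_k: mean embedding of each class's support examples
    protos = np.stack([support_feats[support_labels == k].mean(axis=0)
                       for k in range(n_way)])
    d = ((protos - query_feat) ** 2).sum(axis=1)   # d(f_phi(x), c_k) per class
    logits = -d
    e = np.exp(logits - logits.max())              # numerically stable softmax
    return e / e.sum()
```

Compared with the Matching-Networks kernel, each class is first collapsed to a single prototype, so the cost per query is O(N) rather than O(N·K).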
12 / 53
Performance
(Tables: accuracy comparisons on Omniglot and Mini-ImageNet.)
13 / 53
Metric-Based Methods: Summary
- Meta knowledge: feature space & distance metric
- Task knowledge: none
- Advantages:
  - Simple and computationally fast
  - Entirely feed-forward
- Disadvantages:
  - Hard to scale to very large K
  - Limited to classification problems
14 / 53
Problem with Metric-Based Methods

How can we adapt to each specific task?

- Model-Based Methods
- Optimization-Based Methods
15 / 53
Model-Based Methods (Black-Box Adaptation)
- Idea:
  - Learn an adaptation $\phi_i$ for each task $i$
  - Train a network (with parameters $\theta$) to represent $p(\phi_i \mid D_i^{support}, \theta)$
17 / 53
Model-Based Methods (Black-Box Adaptation)
How should the samples in the support set be aggregated (i.e., what form should $f_\theta$ take)?

- Average (Prototypical Network)
- LSTM (Matching Network)
- Memory-Augmented Neural Networks
18 / 53
Neural Turing Machines (NTMs) (Graves et al. ’14)
- Attention-based
- Memory matrix $M$
- Read: $r \leftarrow \sum_i w(i)\, M(i)$, with $\sum_i w(i) = 1$
- Write (erase, then add):

  $\tilde{M}_t(i) \leftarrow M_{t-1}(i)\,[1 - w_t(i)\, e_t]$

  $M_t(i) \leftarrow \tilde{M}_t(i) + w_t(i)\, a_t$
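A minimal sketch of the read and erase-then-add write operations; the controller, the addressing scheme, and the erase/add vectors $e_t$, $a_t$ are assumed to be given:

```python
import numpy as np

def ntm_read(M, w):
    """NTM read: r = sum_i w(i) * M(i), with w a normalized attention vector
    over the memory rows."""
    return w @ M

def ntm_write(M, w, erase, add):
    """NTM write, per memory row i (a sketch of the erase-then-add update):
    first scale each row down by its share of the erase vector, then add
    its share of the add vector."""
    M_erased = M * (1 - np.outer(w, erase))   # M~_t(i) = M_{t-1}(i)[1 - w(i)e]
    return M_erased + np.outer(w, add)        # M_t(i) = M~_t(i) + w(i)a
```

Because both operations are soft (weighted by $w$), the whole memory is differentiable and can be trained end to end.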
19 / 53
Neural Turing Machines (NTMs) (Graves et al. ’14)
Advantages:
- Large memory
- Addressable
20 / 53
Target: $p(\phi_i \mid D_i^{support}, \theta)$

What forms can $\phi_i$ take? It should:

- contain task-specific information
- contain only useful information

Solution 1: feature representations of the samples in $D^{support}$
21 / 53
Meta-Learning with Memory-Augmented Neural Networks (Santoro et al. '16)

- Read memory:

  $r_t = \sum_{i=1}^{N} w_t^r(i)\, M_t(i)$, where $w_t^r(i) = \mathrm{softmax}\!\left(\frac{k_t \cdot M_t(i)}{\|k_t\| \cdot \|M_t(i)\|}\right)$

- Write with Least Recently Used Access (LRUA)
- Strategy: one-time-step offset (the label for $x_t$ is presented at the next step)
22 / 53
Dynamic Few-Shot Visual Learning without Forgetting (Gidaris et al. '18)

- Use cosine similarity in the classification layer instead of the dot product
- Generate the prototypical (classification-weight) vector of each novel class using:
  - the average of the support-set features (as in Prototypical Networks)
  - attention-based inference from a memory of base-class features:
    - the base-class information acts like a dictionary, with a memory matrix $M$ and a key matrix $K$
    - $w'_{att} = \frac{1}{N'} \sum_{i=1}^{N'} \sum_{b=1}^{K_{base}} \mathrm{Att}(\phi_q z'_i, k_b) \cdot w_b$
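A sketch of the attention-based weight generator in the formula above; the exact form of Att (taken here to be a softmax over query-key dot products) and the shapes are assumptions:

```python
import numpy as np

def attention_weight_vector(z_support, keys, w_base, phi_q):
    """Attention-based classification-weight generator, sketched from
    w'_att = (1/N') sum_i sum_b Att(phi_q z'_i, k_b) * w_b.
    Assumptions: Att is a softmax over query-key dot products, and phi_q
    is a learned linear map applied to each support feature z'_i."""
    queries = z_support @ phi_q.T            # phi_q z'_i for each of the N' samples
    scores = queries @ keys.T                # similarity to each base key k_b
    att = np.exp(scores - scores.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)    # softmax over the K_base keys
    return (att @ w_base).mean(axis=0)       # average the mixed base weights over i
```

The generated vector is then combined with the support-feature average to form the novel class's classification weight.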
23 / 53
Target: $p(\phi_i \mid D_i^{support}, \theta)$

What forms can $\phi_i$ take? It should:

- contain task-specific information
- contain only useful information

Solution 2: network weights
24 / 53
Meta Networks (Munkhdalai et al. '17)

- Slow weights + fast weights
- Components: a meta feature extractor $f_m$ (key embedding) with slow weights $M$, a base feature extractor $f_b$ with slow weights $B$, a meta weight generator $g_m$, and a base weight generator $g_b$
- Meta-information: gradients
Meta Networks (Munkhdalai et al. '17)

- Generate fast weights for the meta feature extractor:
  - Sample $T$ examples $(x'_i, y'_i)$ from the support set
  - For each $i$: $\nabla_i \leftarrow \nabla_M L_{embed}(f_m(M, x'_i), y'_i)$
  - $M^* \leftarrow g_m(\{\nabla_i\})$ (the meta feature extractor $f_m$ can then use slow weights $M$ + fast weights $M^*$)
- Store support-set base fast weights in an NTM:
  - For every sample $(x'_i, y'_i)$ in the support set:
    - $\nabla_i \leftarrow \nabla_B L_{task}(f_b(B, x'_i), y'_i)$
    - $B^*_i \leftarrow g_b(\nabla_i)$ (fast weights for the base extractor)
    - Store $B^*_i$ in memory slot $U(i)$
    - $r'_i \leftarrow f_m(M, M^*, x'_i)$, stored in the key matrix $K(i)$
- Use fast weights from memory to extract features for query samples:
  - For every sample $(x_i, y_i)$ in the query set:
    - $r_i \leftarrow f_m(M, M^*, x_i)$ (key)
    - Access memory with $r_i$ to obtain $W^*_i$ (fast weights for $f_b$)
    - Extract features $f_b(W, W^*_i, x_i)$ and compute the total task loss
- Use the total loss to update the slow weights and the weight generators
26 / 53
Meta Networks (Munkhdalai et al. ’17)
- Remaining question: can we find better meta-information than gradients (the choice made in this work)?
27 / 53
Model-Based Methods: Summary
- Meta knowledge: slow weights
- Task knowledge: generated weights & support-set features
- Advantages:
  - Applicable to many learning problems
- Disadvantages:
  - Complicated models (model and architecture are intertwined)
  - Difficult to optimize
28 / 53
Target: $p(\phi_i \mid D_i^{support}, \theta)$

Since generating weights ($\phi_i$) is computationally costly, why not update the existing weights $\theta$ instead?

Fine-tune!
29 / 53
Optimization-Based Methods
- Idea:
  - Fine-tune to a specific task
  - At test time, given a task $t$: $\theta_t \leftarrow g(\theta, D^{support})$, then predict $y = f(\theta_t, x)$
- What can we improve in the fine-tuning process?
31 / 53
Take a look into Fine-Tune Process (Gradient Descent)
$\theta' = \theta - \alpha \nabla_\theta \mathcal{L}$

- Compute more effective "gradients"?
- Start from better initial parameters?
- Update only part of the weights?
32 / 53
Optimization as a Model for Few-Shot Learning (Ravi et al. '17)

- An observation:
  - Gradient descent: $\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t$
  - Cell update in an LSTM: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
  - If $f_t = 1$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$, then $c_t$ can be treated as $\theta_t$
- Learn dynamic $i_t$ and $f_t$:

  $i_t = \sigma\!\left(W_I \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}\right] + b_I\right)$

  $f_t = \sigma\!\left(W_F \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}\right] + b_F\right)$

- Weight sharing (the same update rule is applied to every coordinate of $\theta$)
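The LSTM-style learned update can be sketched coordinate-wise with shared gate weights; in the method itself $W_I$, $b_I$, $W_F$, $b_F$ would be meta-trained, not hand-set:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_lstm_step(theta, grad, loss, i_prev, f_prev, W_I, b_I, W_F, b_F):
    """One step of the LSTM-style learned update rule (a sketch of the cell
    update above, applied coordinate-wise with shared gate weights).
    W_I and W_F are 4-vectors acting on [grad, loss, theta, previous gate]."""
    feats_i = np.stack([grad, np.full_like(theta, loss), theta, i_prev])
    feats_f = np.stack([grad, np.full_like(theta, loss), theta, f_prev])
    i_t = sigmoid(W_I @ feats_i + b_I)      # learned per-coordinate learning rate
    f_t = sigmoid(W_F @ feats_f + b_F)      # learned forget/decay gate
    theta_next = f_t * theta - i_t * grad   # c_t = f_t*c_{t-1} + i_t*(-grad)
    return theta_next, i_t, f_t
```

With all gate weights at zero, both gates sit at 0.5, i.e. the update halves the parameters and subtracts half the gradient; training the gates recovers richer, loss-aware step sizes.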
33 / 53
Optimization as a Model for Few-Shot Learning (Ravi et al. '17)
34 / 53
Optimization as a Model for Few-Shot Learning (Ravi et al. '17)

Utilize the loss ($\mathcal{L}_t$), the gradient ($\nabla_{\theta_{t-1}} \mathcal{L}_t$), the network weights ($\theta_{t-1}$), and the previous state ($i_{t-1}$, $f_{t-1}$) as meta-information.
35 / 53
Take a look into Fine-Tune Process (Gradient Descent)
$\theta' = \theta - \alpha \nabla_\theta \mathcal{L}$

- Compute more effective "gradients"?
- Start from better initial parameters?
- Update only part of the weights?
36 / 53
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al. '17)

- Learn initial parameters that are easy to adapt to any task in the task distribution
  - Initial parameters: prior knowledge (meta knowledge)
  - Easy: one or only a few gradient steps
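A first-order MAML (FOMAML) sketch of this idea; full MAML additionally backpropagates through the inner step, and the toy task family (`y = a*x`) with its hyperparameters is an illustrative assumption:

```python
import numpy as np

def fomaml(steps=2000, alpha=0.1, beta=0.01, seed=0):
    """First-order MAML on a toy task family y = a*x with model y_hat = theta*x.
    Inner loop: adapt theta to a sampled task with one gradient step.
    Outer loop: update the initialization using the adapted parameters'
    gradient on held-out (query) data."""
    rng = np.random.default_rng(seed)

    def grad(theta, x, y):                      # d/dtheta of mean (theta*x - y)^2
        return np.mean(2 * (theta * x - y) * x)

    theta = rng.uniform(-1.0, 1.0)              # meta-initialization
    for _ in range(steps):
        a = rng.uniform(-2.0, 2.0)              # sample a task t ~ p(T)
        xs = rng.uniform(-1, 1, 10)             # support inputs
        xq = rng.uniform(-1, 1, 10)             # query inputs
        theta_i = theta - alpha * grad(theta, xs, a * xs)   # task-specific adaptation
        theta = theta - beta * grad(theta_i, xq, a * xq)    # meta update
    return theta
```

At test time, a new task reuses the same inner step: one gradient update from the learned initialization already moves the parameters toward that task's optimum.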
37 / 53
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al. '17)

(Figure: the inner loop performs task-specific adaptation; the outer loop performs the meta-knowledge update.)
38 / 53
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al. '17)

Does an optimal point in parameter space exist?
39 / 53
Take a look into Fine-Tune Process (Gradient Descent)
$\theta' = \theta - \alpha \nabla_\theta \mathcal{L}$

- Compute more effective "gradients"?
- Start from better initial parameters?
- Update only part of the weights?
40 / 53
Fast Context Adaptation via Meta-Learning (Zintgraf et al. '19)

- Meta-trained network parameters $\theta$ and task-specific context parameters $\phi$
- Divide the parameters into meta and task-specific parts, similar to the LSTM meta-learner (Ravi et al. '17) and Meta Networks (Munkhdalai et al. '17)
41 / 53
Performance
42 / 53
Optimization-Based Methods: Summary
- Meta knowledge: outer-loop optimization
- Task knowledge: inner-loop optimization
- Advantages:
  - Applicable to many kinds of tasks
  - Model-agnostic
- Disadvantages:
  - Requires second-order optimization (mitigated by first-order MAML and Reptile)
43 / 53
Comparison
| | Metric-Based | Model-Based | Optimization-Based |
|---|---|---|---|
| Applicable tasks | Classification or verification | All | All |
| Applicable models | All feature extractors | Designed model | All backprop-based models |
| Computational cost | Low | High | High |
| Optimization | Easy | Hard | Hard |
| Meta information | Feature space & distance metric | Slow weights | Outer-loop optimized weights |
| Task information | None | Generated features & weights | Inner-loop optimized weights |

44 / 53
44 / 53
MetaGAN: An Adversarial Approach to Few-Shot Learning (Zhang et al. '18)

$\mathcal{L}_D = \mathbb{E}_{x \sim Q_T} \log p_D(y \le N \mid x) + \mathbb{E}_{x \sim p_G^T} \log p_D(N + 1 \mid x)$

Explanation:

- Enrich task samples with generated outliers (an extra $(N{+}1)$-th "fake" class)
- Separating cat & car is easier than cat & dog; harder, closer negatives force a better decision boundary
46 / 53
A Simple Neural Attentive Meta-Learner (Mishra et al. '18)

Temporal generation:

$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$

Causal temporal convolutional layers: dilated causal convolutional layers (van den Oord et al. '16):
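A dilated causal convolution can be written directly from its definition: the output at time $t$ depends only on inputs at $t$, $t-d$, $t-2d$, ... (a naive O(T·k) sketch for 1-D signals):

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1-D dilated causal convolution (a sketch): the output at time t only
    sees x[t], x[t - dilation], x[t - 2*dilation], ...
    The kernel w is ordered oldest tap first."""
    k = len(w)
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for j in range(k):
            idx = t - (k - 1 - j) * dilation   # causal, dilated input index
            if idx >= 0:                        # taps before the start contribute 0
                y[t] += w[j] * x[idx]
    return y
```

Stacking such layers with dilations 1, 2, 4, ... doubles the receptive field per layer, which is how the temporal-convolution blocks see far back into the support sequence.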
47 / 53
A Simple Neural Attentive Meta-Learner (Mishra et al. '18)

$p(y^{test} \mid x^{test}, X^{support}, Y^{support})$

Problem: convolutions give only coarse access to the previous inputs (the support set).
48 / 53
A Simple Neural Attentive Meta-Learner (Mishra et al. '18)

Attention is all you need!

(Figure: SNAIL interleaves dilated temporal-convolution layers with attention layers.)
49 / 53
Summary: Common Ideas
- Knowledge design
  - Meta-knowledge
  - Task-specific knowledge (from the support set)
- Network design
  - Separate meta-knowledge from task-specific knowledge
  - Combine meta-knowledge with task-specific knowledge
- Training strategy
  - Joint training, or mimicking the test conditions?
  - An outer loop and an inner loop for updating the different kinds of knowledge
51 / 53
Summary: Challenges
- Overfitting
  - Sample-level overfitting
  - Meta-overfitting
- Optimization
- Ambiguity
52 / 53
Thanks for listening!
53 / 53