CS 330
The Meta-Learning Problem & Black-Box Meta-Learning
Logistics
Homework 1 posted today, due Wednesday, September 30
Project guidelines will be posted by tomorrow.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Goals for the end of lecture: - Differences between multi-task learning, transfer learning, and meta-learning problems - Basics of transfer learning via fine-tuning - Training set-up for few-shot meta-learning algorithms - How to implement black-box meta-learning techniques
Topic of Homework 1!
Multi-Task Learning vs. Transfer Learning
min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Multi-Task Learning
Solve multiple tasks 𝒯1, ⋯, 𝒯T at once.
Transfer Learning
Solve target task 𝒯b after solving source task 𝒯a, by transferring knowledge learned from 𝒯a.
Key assumption: Cannot access data 𝒟a during transfer.
Transfer learning is a valid solution to multi-task learning (but not vice versa).
Question: What are some problems/applications where transfer learning might make sense?(answer in chat or raise hand)
when 𝒟a is very large (don't want to retain & retrain on 𝒟a)
when you don't care about solving 𝒯a and 𝒯b simultaneously
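To make the multi-task objective min_θ ∑_{i=1}^T ℒi(θ, 𝒟i) concrete, here is a toy numpy sketch: three hypothetical least-squares tasks sharing a single parameter vector θ, optimized on the summed objective. All shapes and losses are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: T = 3 tasks, each L_i(theta, D_i) = mean ||X_i theta - y_i||^2,
# all sharing one parameter vector theta.
tasks = []
for _ in range(3):
    X = rng.normal(size=(20, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=20)
    tasks.append((X, y))

def multi_task_loss(theta):
    """sum_i L_i(theta, D_i): one theta evaluated on every task's data."""
    return sum(((X @ theta - y) ** 2).mean() for X, y in tasks)

# min over theta via gradient descent on the summed objective.
theta = np.zeros(5)
for _ in range(200):
    grad = sum(2 * X.T @ (X @ theta - y) / len(y) for X, y in tasks)
    theta -= 0.01 * grad
```

Transfer learning, in contrast, would first fit θ on one task's data alone and then adapt it to the target task without revisiting the source data.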
training data 𝒟^tr for new task 𝒯b; parameters θ pre-trained on 𝒟a
Fine-tuning: φ ← θ − α∇θℒ(θ, 𝒟^tr)
Where do you get the pre-trained parameters? - ImageNet classification - Models trained on large language corpora (BERT, LMs) - Other unsupervised learning techniques - Whatever large, diverse dataset you might have
(typically for many gradient steps)
Some common practices - Fine-tune with a smaller learning rate - Smaller learning rate for earlier layers - Freeze earlier layers, gradually unfreeze - Reinitialize last layer - Search over hyperparameters via cross-val - Architecture choices matter (e.g. ResNets)
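A minimal numpy sketch of a few of these practices, assuming a toy two-layer network with a new task head (all names, shapes, and gradients here are illustrative placeholders, not the course code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-trained parameters: two feature layers and a task head.
params = {
    "layer1": rng.normal(size=(4, 8)),   # earliest layer
    "layer2": rng.normal(size=(8, 8)),
    "head":   rng.normal(size=(8, 5)),
}

# Reinitialize the last layer for the new task.
params["head"] = 0.01 * rng.normal(size=(8, 5))

# Smaller learning rates for earlier layers; freeze the earliest entirely.
lrs = {"layer1": 0.0, "layer2": 1e-4, "head": 1e-3}

def sgd_step(params, grads, lrs):
    """One fine-tuning step with per-layer learning rates."""
    return {k: params[k] - lrs[k] * grads[k] for k in params}

# Placeholder gradients standing in for backprop on the target-task data.
grads = {k: np.ones_like(v) for k, v in params.items()}
new_params = sgd_step(params, grads, lrs)
```

In a real framework the same idea is expressed with optimizer parameter groups, one learning rate per group of layers.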
Pre-trained models often available online.
What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. '16
Transfer learning via fine-tuning
Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. '18
Fine-tuning doesn’t work well with small target task datasets
This is where meta-learning can help.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
The Meta-Learning Problem Statement (that we will consider in this class)
Two ways to view meta-learning algorithms
Mechanistic view
➢ Deep network that can read in an entire dataset and make predictions for new datapoints
➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task
Probabilistic view
➢ Extract prior knowledge from a set of tasks that allows efficient learning of new tasks
➢ Learning a new task uses this prior and (small) training set to infer most likely posterior parameters
Today: Focus primarily on the mechanistic view.
(Bayes will come back later)
How does meta-learning work? An example.
Given 1 example of 5 classes: Classify new examples
training data test set
How does meta-learning work? An example.
[figure: meta-training over many training classes, then meta-testing on 𝒯test; each task has its own training data and test set]
Given 1 example of 5 classes: Classify new examples
Can replace image classification with: regression, language generation, skill learning, any ML problem
The Meta-Learning Problem
Given data from 𝒯1, …, 𝒯n, quickly solve new task 𝒯test
Key assumption: meta-training tasks 𝒯1, …, 𝒯n ∼ p(𝒯) and meta-test task 𝒯test ∼ p(𝒯) are drawn i.i.d. from the same task distribution
What do the tasks correspond to? - recognizing handwritten digits from different languages (see homework 1!)
- spam filter for different users
- classifying species in different regions of the world
- a robot performing different tasks
How many tasks do you need? The more the better (analogous to more data in ML).
Like before, tasks must share structure.
Some terminology
𝒟i^tr: task training set ("support set")
𝒟i^test: task test dataset ("query set")
k-shot learning: learning with k examples per class (or k examples total for regression)
N-way classification: choosing between N classes
Question: What are k and N for the above example? (answer in chat)
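The N-way, k-shot setup can be sketched as an episode sampler. The dataset below is a random stand-in for something like Omniglot; all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class pool: 20 classes, 10 examples each, 64-dim features.
data = {c: rng.normal(size=(10, 64)) for c in range(20)}

def sample_episode(data, n_way=5, k_shot=1, k_query=5):
    """Sample one N-way, k-shot task: support set D^tr and query set D^test."""
    classes = rng.choice(sorted(data), size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):          # relabel classes 0..N-1
        idx = rng.permutation(len(data[c]))      # disjoint support/query split
        support += [(data[c][j], label) for j in idx[:k_shot]]
        query += [(data[c][j], label) for j in idx[k_shot:k_shot + k_query]]
    return support, query

support, query = sample_episode(data)  # a 5-way, 1-shot episode
```

Note the per-episode relabeling: the learner must identify classes from the support set, not memorize global labels.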
min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Multi-Task Learning
Solve multiple tasks 𝒯1, ⋯, 𝒯T at once.
Transfer Learning
Solve target task 𝒯b after solving source task 𝒯a, by transferring knowledge learned from 𝒯a.
The Meta-Learning Problem
Given data from 𝒯1, …, 𝒯n, quickly solve new task 𝒯test
In all settings: tasks must share structure.
In transfer learning and meta-learning: generally impractical to access prior tasks
Problem Settings Recap
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
General recipe
How to evaluate a meta-learning algorithm
the Omniglot dataset (Lake et al. Science 2015): the "transpose" of MNIST
many classes, few examples: 1623 characters from 50 different alphabets
20 instances of each character: statistics more reflective of the real world
Proposes both few-shot discriminative & few-shot generative problems
Initial few-shot learning approaches w/ Bayesian models, non-parametrics: Fei-Fei et al. '03, Lake et al. '11, Salakhutdinov et al. '12, Lake et al. '13
Other datasets used for few-shot image recognition: tieredImageNet, CIFAR, CUB, CelebA, others. Other benchmarks: molecular property prediction (Nguyen et al. '20), object pose prediction (Yin et al. ICLR '20)
Another View on the Meta-Learning Problem
Supervised Learning: Inputs: x; Outputs: y; Data: 𝒟 = {(x, y)_j}
Meta Supervised Learning: Inputs: 𝒟^tr, x^ts; Outputs: y^ts; Data: {𝒟i}, where 𝒟i = {(x, y)_j}
Why is this view useful? Reduces the meta-learning problem to the design & optimization of fθ.
Finn. Learning to Learn with Gradients. PhD Thesis 2018
General recipe
How to design a meta-learning algorithm
1. Choose a form of p_θ(y^ts | x^ts, 𝒟^tr)
2. Choose how to optimize meta-parameters θ w.r.t. the max-likelihood objective using meta-training data
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Black-Box Adaptation
Key idea: Train a neural network fθ (the "learner") to represent φi = fθ(𝒟i^tr).
Predict test points with y^ts = gφi(x^ts).
Train with standard supervised learning!
max_θ ∑_{𝒯i} ℒ(fθ(𝒟i^tr), 𝒟i^test), where ℒ(φi, 𝒟i^test) = ∑_{(x,y)∼𝒟i^test} log gφi(y | x)
Black-Box Adaptation
Key idea: Train a neural network to represent φi = fθ(𝒟i^tr).
1. Sample task 𝒯i (or mini-batch of tasks)
2. Sample disjoint datasets 𝒟i^tr, 𝒟i^test from 𝒟i
3. Compute φi ← fθ(𝒟i^tr)
4. Update θ using ∇θℒ(φi, 𝒟i^test)
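The four steps can be sketched end to end. This toy sketch is purely illustrative (not the homework's code): fθ is a linear embedding followed by per-class averaging, φi is the resulting set of class "prototypes", gφi is a softmax over negative squared distances, and a numerical gradient stands in for backprop.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, sx, sy, n_way):
    """f_theta: embed the support set, average per class -> phi_i."""
    z = sx @ theta
    return np.stack([z[sy == c].mean(axis=0) for c in range(n_way)])

def meta_loss(theta, task, n_way=2):
    """Negative log-likelihood of D^test under g_phi (softmax over
    negative squared distances to the prototypes phi_i)."""
    (sx, sy), (qx, qy) = task
    phi = f(theta, sx, sy, n_way)                      # step 3
    logits = -(((qx @ theta)[:, None, :] - phi[None]) ** 2).sum(-1)
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(qy)), qy].mean()

def sample_task(n_way=2, k=5, dim=4):
    """Steps 1-2: a toy task with disjoint support and query sets."""
    centers = rng.normal(size=(n_way, dim))
    noise = lambda: 0.1 * rng.normal(size=(n_way * k, dim))
    y = np.repeat(np.arange(n_way), k)
    return (np.repeat(centers, k, axis=0) + noise(), y), \
           (np.repeat(centers, k, axis=0) + noise(), y.copy())

theta, eps, lr = rng.normal(size=(4, 4)), 1e-4, 0.1
for step in range(50):
    task = sample_task()
    grad = np.zeros_like(theta)
    for i in np.ndindex(*theta.shape):                 # numerical gradient
        d = np.zeros_like(theta); d[i] = eps
        grad[i] = (meta_loss(theta + d, task) - meta_loss(theta - d, task)) / (2 * eps)
    theta -= lr * grad                                 # step 4
```

The key point is that the loss at each meta-training step is computed on 𝒟i^test after adapting on 𝒟i^tr, so θ is optimized for post-adaptation performance.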
Black-Box Adaptation
Recall key idea: Train a neural network to represent φi = fθ(𝒟i^tr).
Challenge: Outputting all neural net parameters does not seem scalable.
Idea: Do not need to output all parameters of neural net, only sufficient statistics (Santoro et al. MANN, Mishra et al. SNAIL): φi = {hi, θg}, where the low-dimensional vector hi represents contextual task information.
General form: y^ts = fθ(𝒟i^tr, x^ts)
What architecture should we use for fθ?
Black-Box Adaptation
- LSTMs or Neural Turing machine (NTM): Meta-Learning with Memory-Augmented Neural Networks. Santoro, Bartunov, Botvinick, Wierstra, Lillicrap. ICML '16
- Feedforward + average: Conditional Neural Processes. Garnelo, Rosenbaum, Maddison, Ramalho, Saxton, Shanahan, Teh, Rezende, Eslami. ICML '18
- Other external memory mechanisms: Meta Networks. Munkhdalai, Yu. ICML '17
- Convolutions & attention: A Simple Neural Attentive Meta-Learner. Mishra, Rohaninejad, Chen, Abbeel. ICLR '18
Question: Why might feedforward+average be better than a recurrent model? (raise your hand)
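One answer: the support set is a set, not a sequence. Averaging is permutation-invariant, while a recurrent reader's output depends on the (arbitrary) order of the support examples. A tiny numpy illustration, using random untrained placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(0)
support = rng.normal(size=(5, 8))     # 5 support embeddings, 8-dim each

def average_pool(z):
    """Feedforward + average: example order cannot matter."""
    return z.mean(axis=0)

# A minimal recurrent aggregator (random weights, illustration only).
Wh, Wx = 0.1 * rng.normal(size=(8, 8)), 0.1 * rng.normal(size=(8, 8))
def rnn_pool(z):
    h = np.zeros(8)
    for x in z:                        # reads examples one at a time, in order
        h = np.tanh(Wh @ h + Wx @ x)
    return h

reversed_support = support[::-1]
# Averaging gives the same task representation for any ordering...
same = np.allclose(average_pool(support), average_pool(reversed_support))
# ...while the recurrent representation changes with the ordering.
diff = not np.allclose(rnn_pool(support), rnn_pool(reversed_support))
```

Averaging also processes the whole support set in parallel and handles any set size, rather than building order-dependent state example by example.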
HW 1: - implement data processing - implement simple black-box meta-learner - train few-shot Omniglot classifier
Black-Box Adaptation
Key idea: Train a neural network to represent φi = fθ(𝒟i^tr).
+ expressive
+ easy to combine with variety of learning problems (e.g. SL, RL)
- complex model w/ complex task: challenging optimization problem
- often data-inefficient
How else can we represent φi = fθ(𝒟i^tr)?
Next time (Wednesday): What if we treat it as an optimization procedure?
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Case Study: GPT-3
May 2020
What is GPT-3?
a language model: a black-box meta-learner trained on language generation tasks
architecture: giant "Transformer" network; 175 billion parameters, 96 layers, 3.2M batch size
𝒟i^tr: sequence of characters; 𝒟i^ts: the following sequence of characters
[meta-training] dataset: crawled data from the internet, English-language Wikipedia, two books corpora
What do different tasks correspond to? spelling correction, simple math problems, translating between languages, a variety of other tasks
How can those tasks all be solved by a single architecture? Put them all in the form of text!
spelling correction, simple math problems, translating between languages
Why is that a good idea? Very easy to get a lot of meta-training data.
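"Put them all in the form of text" means the support set 𝒟i^tr is simply serialized into the model's input. A sketch of one such serialization (the "x -> y" format here is illustrative, not GPT-3's actual prompt format):

```python
def make_prompt(support, query):
    """Serialize a few-shot task as text: the (input, output) pairs of
    D^tr, then the query input, leaving the output for the model."""
    lines = [f"{x} -> {y}" for x, y in support]
    lines.append(f"{query} ->")
    return "\n".join(lines)

# Translation and math become the same text-in, text-out problem:
translation = make_prompt([("cheese", "fromage"), ("dog", "chien")], "cat")
math = make_prompt([("2+2", "4"), ("3+5", "8")], "6+1")
```

Because any such task is just "predict the next characters", the ordinary language-modeling objective doubles as a meta-training objective.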
Some Results
One-shot learning from dictionary definitions:
Few-shot language editing:
Non-few-shot learning tasks:
Other Cool Use-Cases
English to LaTeX
Source: https://twitter.com/sh_reya/status/1284746918959239168
General Notes & Takeaways
The results are extremely impressive.
The model is far from perfect. The model fails in unintuitive ways.
Source: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html
Source: https://github.com/shreyashankar/gpt3-sandbox/blob/master/docs/priming.md
The choice of 𝒟i^tr at test time is important.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Goals for the end of lecture: - Differences between multi-task learning, transfer learning, and meta-learning problems - Basics of transfer learning via fine-tuning - Training set-up for few-shot meta-learning algorithms - How to implement black-box meta-learning techniques
Topic of Homework 1!
Reminders
Homework 1 posted today, due Wednesday, September 30
Project guidelines will be posted by tomorrow.
Next time: Optimization-based meta-learning