CS 330
The Meta-Learning Problem & Black-Box Meta-Learning
Logistics
Homework 1 posted today, due Wednesday, September 30
Project guidelines will be posted by tomorrow.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Goals for the end of lecture: - Differences between multi-task learning, transfer learning, and meta-learning problems - Basics of transfer learning via fine-tuning - Training set-up for few-shot meta-learning algorithms - How to implement black-box meta-learning techniques
Topic of Homework 1!
Multi-Task Learning vs. Transfer Learning
min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Multi-Task Learning
Solve multiple tasks 𝒯1, ⋯, 𝒯T at once.
Transfer Learning
Solve target task 𝒯b after solving source task 𝒯a, by transferring knowledge learned from 𝒯a.
Key assumption: Cannot access data 𝒟a during transfer.
Transfer learning is a valid solution to multi-task learning (but not vice versa).
Question: What are some problems/applications where transfer learning might make sense?(answer in chat or raise hand)
when 𝒟a is very large (don't want to retain & retrain on 𝒟a)
when you don't care about solving 𝒯a and 𝒯b simultaneously
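To make the multi-task objective min_θ ∑_{i=1}^T ℒi(θ, 𝒟i) concrete, here is a toy numpy sketch: three hypothetical least-squares tasks sharing a single parameter vector θ, optimized on the summed objective. All shapes and losses are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: T = 3 tasks, each L_i(theta, D_i) = mean ||X_i theta - y_i||^2,
# all sharing one parameter vector theta.
tasks = []
for _ in range(3):
    X = rng.normal(size=(20, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=20)
    tasks.append((X, y))

def multi_task_loss(theta):
    """sum_i L_i(theta, D_i): one theta evaluated on every task's data."""
    return sum(((X @ theta - y) ** 2).mean() for X, y in tasks)

# min over theta via gradient descent on the summed objective.
theta = np.zeros(5)
for _ in range(200):
    grad = sum(2 * X.T @ (X @ theta - y) / len(y) for X, y in tasks)
    theta -= 0.01 * grad
```

Transfer learning, in contrast, would first fit θ on one task's data alone and then adapt it to the target task without revisiting the source data.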
training data 𝒟^tr for new task 𝒯b; parameters θ pre-trained on 𝒟a
Fine-tuning: φ ← θ − α∇θℒ(θ, 𝒟^tr)
Where do you get the pre-trained parameters? - ImageNet classification - Models trained on large language corpora (BERT, LMs) - Other unsupervised learning techniques - Whatever large, diverse dataset you might have
(typically for many gradient steps)
Some common practices - Fine-tune with a smaller learning rate - Smaller learning rate for earlier layers - Freeze earlier layers, gradually unfreeze - Reinitialize last layer - Search over hyperparameters via cross-val - Architecture choices matter (e.g. ResNets)
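A minimal numpy sketch of a few of these practices, assuming a toy two-layer network with a new task head (all names, shapes, and gradients here are illustrative placeholders, not the course code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-trained parameters: two feature layers and a task head.
params = {
    "layer1": rng.normal(size=(4, 8)),   # earliest layer
    "layer2": rng.normal(size=(8, 8)),
    "head":   rng.normal(size=(8, 5)),
}

# Reinitialize the last layer for the new task.
params["head"] = 0.01 * rng.normal(size=(8, 5))

# Smaller learning rates for earlier layers; freeze the earliest entirely.
lrs = {"layer1": 0.0, "layer2": 1e-4, "head": 1e-3}

def sgd_step(params, grads, lrs):
    """One fine-tuning step with per-layer learning rates."""
    return {k: params[k] - lrs[k] * grads[k] for k in params}

# Placeholder gradients standing in for backprop on the target-task data.
grads = {k: np.ones_like(v) for k, v in params.items()}
new_params = sgd_step(params, grads, lrs)
```

In a real framework the same idea is expressed with optimizer parameter groups, one learning rate per group of layers.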
Pre-trained models often available online.
What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. '16
Transfer learning via fine-tuning
Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. '18
Fine-tuning doesn’t work well with small target task datasets
This is where meta-learning can help.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
The Meta-Learning Problem Statement (that we will consider in this class)
Two ways to view meta-learning algorithms
Mechanistic view
➢ Deep network that can read in an entire dataset and make predictions for new datapoints
➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task
Probabilistic view
➢ Extract prior knowledge from a set of tasks that allows efficient learning of new tasks
➢ Learning a new task uses this prior and (small) training set to infer most likely posterior parameters
Today: Focus primarily on the mechanistic view.
(Bayes will come back later)
How does meta-learning work? An example.
Given 1 example of 5 classes: Classify new examples
training data test set
How does meta-learning work? An example.
[figure: meta-training over many training classes, then meta-testing on 𝒯test; each task has its own training data and test set]
Given 1 example of 5 classes: Classify new examples
Can replace image classification with: regression, language generation, skill learning, any ML problem
The Meta-Learning Problem
Given data from 𝒯1, …, 𝒯n, quickly solve new task 𝒯test
Key assumption: meta-training tasks 𝒯1, …, 𝒯n ∼ p(𝒯) and meta-test task 𝒯test ∼ p(𝒯) are drawn i.i.d. from the same task distribution
What do the tasks correspond to? - recognizing handwritten digits from different languages (see homework 1!)
- spam filter for different users
- classifying species in different regions of the world
- a robot performing different tasks
How many tasks do you need? The more the better (analogous to more data in ML).
Like before, tasks must share structure.
Some terminology
𝒟i^tr: task training set ("support set")
𝒟i^test: task test dataset ("query set")
k-shot learning: learning with k examples per class (or k examples total for regression)
N-way classification: choosing between N classes
Question: What are k and N for the above example? (answer in chat)
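The N-way, k-shot setup can be sketched as an episode sampler. The dataset below is a random stand-in for something like Omniglot; all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class pool: 20 classes, 10 examples each, 64-dim features.
data = {c: rng.normal(size=(10, 64)) for c in range(20)}

def sample_episode(data, n_way=5, k_shot=1, k_query=5):
    """Sample one N-way, k-shot task: support set D^tr and query set D^test."""
    classes = rng.choice(sorted(data), size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):          # relabel classes 0..N-1
        idx = rng.permutation(len(data[c]))      # disjoint support/query split
        support += [(data[c][j], label) for j in idx[:k_shot]]
        query += [(data[c][j], label) for j in idx[k_shot:k_shot + k_query]]
    return support, query

support, query = sample_episode(data)  # a 5-way, 1-shot episode
```

Note the per-episode relabeling: the learner must identify classes from the support set, not memorize global labels.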
min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Multi-Task Learning
Solve multiple tasks 𝒯1, ⋯, 𝒯T at once.
Transfer Learning
Solve target task 𝒯b after solving source task 𝒯a, by transferring knowledge learned from 𝒯a.
The Meta-Learning Problem
Given data from 𝒯1, …, 𝒯n, quickly solve new task 𝒯test
In all settings: tasks must share structure.
In transfer learning and meta-learning: generally impractical to access prior tasks
Problem Settings Recap
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
General recipe
How to evaluate a meta-learning algorithm
the Omniglot dataset (Lake et al. Science 2015): the "transpose" of MNIST
many classes, few examples: 1623 characters from 50 different alphabets
20 instances of each character: statistics more reflective of the real world
Proposes both few-shot discriminative & few-shot generative problems
Initial few-shot learning approaches w/ Bayesian models, non-parametrics: Fei-Fei et al. '03, Lake et al. '11, Salakhutdinov et al. '12, Lake et al. '13
Other datasets used for few-shot image recognition: tieredImageNet, CIFAR, CUB, CelebA, others. Other benchmarks: molecular property prediction (Nguyen et al. '20), object pose prediction (Yin et al. ICLR '20)
Another View on the Meta-Learning Problem
Supervised Learning: Inputs: x; Outputs: y; Data: 𝒟 = {(x, y)_j}
Meta Supervised Learning: Inputs: 𝒟^tr, x^ts; Outputs: y^ts; Data: {𝒟i}, where 𝒟i = {(x, y)_j}
Why is this view useful? Reduces the meta-learning problem to the design & optimization of fθ.
Finn. Learning to Learn with Gradients. PhD Thesis 2018
General recipe
How to design a meta-learning algorithm
1. Choose a form of p_θ(y^ts | x^ts, 𝒟^tr)
2. Choose how to optimize meta-parameters θ w.r.t. the max-likelihood objective using meta-training data
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Black-Box Adaptation
Key idea: Train a neural network fθ (the "learner") to represent φi = fθ(𝒟i^tr).
Predict test points with y^ts = gφi(x^ts).
Train with standard supervised learning!
max_θ ∑_{𝒯i} ℒ(fθ(𝒟i^tr), 𝒟i^test), where ℒ(φi, 𝒟i^test) = ∑_{(x,y)∼𝒟i^test} log gφi(y | x)
Black-Box Adaptation
Key idea: Train a neural network to represent φi = fθ(𝒟i^tr).
1. Sample task 𝒯i (or mini-batch of tasks)
2. Sample disjoint datasets 𝒟i^tr, 𝒟i^test from 𝒟i
3. Compute φi ← fθ(𝒟i^tr)
4. Update θ using ∇θℒ(φi, 𝒟i^test)
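The four steps can be sketched end to end. This toy sketch is purely illustrative (not the homework's code): fθ is a linear embedding followed by per-class averaging, φi is the resulting set of class "prototypes", gφi is a softmax over negative squared distances, and a numerical gradient stands in for backprop.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, sx, sy, n_way):
    """f_theta: embed the support set, average per class -> phi_i."""
    z = sx @ theta
    return np.stack([z[sy == c].mean(axis=0) for c in range(n_way)])

def meta_loss(theta, task, n_way=2):
    """Negative log-likelihood of D^test under g_phi (softmax over
    negative squared distances to the prototypes phi_i)."""
    (sx, sy), (qx, qy) = task
    phi = f(theta, sx, sy, n_way)                      # step 3
    logits = -(((qx @ theta)[:, None, :] - phi[None]) ** 2).sum(-1)
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(qy)), qy].mean()

def sample_task(n_way=2, k=5, dim=4):
    """Steps 1-2: a toy task with disjoint support and query sets."""
    centers = rng.normal(size=(n_way, dim))
    noise = lambda: 0.1 * rng.normal(size=(n_way * k, dim))
    y = np.repeat(np.arange(n_way), k)
    return (np.repeat(centers, k, axis=0) + noise(), y), \
           (np.repeat(centers, k, axis=0) + noise(), y.copy())

theta, eps, lr = rng.normal(size=(4, 4)), 1e-4, 0.1
for step in range(50):
    task = sample_task()
    grad = np.zeros_like(theta)
    for i in np.ndindex(*theta.shape):                 # numerical gradient
        d = np.zeros_like(theta); d[i] = eps
        grad[i] = (meta_loss(theta + d, task) - meta_loss(theta - d, task)) / (2 * eps)
    theta -= lr * grad                                 # step 4
```

The key point is that the loss at each meta-training step is computed on 𝒟i^test after adapting on 𝒟i^tr, so θ is optimized for post-adaptation performance.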
Black-Box Adaptation
Recall key idea: Train a neural network to represent φi = fθ(𝒟i^tr).
Challenge: Outputting all neural net parameters does not seem scalable.
Idea: Do not need to output all parameters of neural net, only sufficient statistics (Santoro et al. MANN, Mishra et al. SNAIL): φi = {hi, θg}, where the low-dimensional vector hi represents contextual task information.
General form: y^ts = fθ(𝒟i^tr, x^ts)
What architecture should we use for fθ?
Black-Box Adaptation
- LSTMs or Neural Turing machine (NTM): Meta-Learning with Memory-Augmented Neural Networks. Santoro, Bartunov, Botvinick, Wierstra, Lillicrap. ICML '16
- Feedforward + average: Conditional Neural Processes. Garnelo, Rosenbaum, Maddison, Ramalho, Saxton, Shanahan, Teh, Rezende, Eslami. ICML '18
- Other external memory mechanisms: Meta Networks. Munkhdalai, Yu. ICML '17
- Convolutions & attention: A Simple Neural Attentive Meta-Learner. Mishra, Rohaninejad, Chen, Abbeel. ICLR '18
Question: Why might feedforward+average be better than a recurrent model? (raise your hand)
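One answer: the support set is a set, not a sequence. Averaging is permutation-invariant, while a recurrent reader's output depends on the (arbitrary) order of the support examples. A tiny numpy illustration, using random untrained placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(0)
support = rng.normal(size=(5, 8))     # 5 support embeddings, 8-dim each

def average_pool(z):
    """Feedforward + average: example order cannot matter."""
    return z.mean(axis=0)

# A minimal recurrent aggregator (random weights, illustration only).
Wh, Wx = 0.1 * rng.normal(size=(8, 8)), 0.1 * rng.normal(size=(8, 8))
def rnn_pool(z):
    h = np.zeros(8)
    for x in z:                        # reads examples one at a time, in order
        h = np.tanh(Wh @ h + Wx @ x)
    return h

reversed_support = support[::-1]
# Averaging gives the same task representation for any ordering...
same = np.allclose(average_pool(support), average_pool(reversed_support))
# ...while the recurrent representation changes with the ordering.
diff = not np.allclose(rnn_pool(support), rnn_pool(reversed_support))
```

Averaging also processes the whole support set in parallel and handles any set size, rather than building order-dependent state example by example.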
HW 1: - implement data processing - implement simple black-box meta-learner - train few-shot Omniglot classifier
Black-Box Adaptation
Key idea: Train a neural network to represent φi = fθ(𝒟i^tr).
+ expressive
+ easy to combine with variety of learning problems (e.g. SL, RL)
- complex model w/ complex task: challenging optimization problem
- often data-inefficient
How else can we represent φi = fθ(𝒟i^tr)?
Next time (Wednesday): What if we treat it as an optimization procedure?
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Case Study: GPT-3
May 2020
What is GPT-3?
a language model: a black-box meta-learner trained on language generation tasks
architecture: giant "Transformer" network; 175 billion parameters, 96 layers, 3.2M batch size
𝒟i^tr: sequence of characters; 𝒟i^ts: the following sequence of characters
[meta-training] dataset: crawled data from the internet, English-language Wikipedia, two books corpora
What do different tasks correspond to? spelling correction, simple math problems, translating between languages, a variety of other tasks
How can those tasks all be solved by a single architecture? Put them all in the form of text!
spelling correction, simple math problems, translating between languages
Why is that a good idea? Very easy to get a lot of meta-training data.
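"Put them all in the form of text" means the support set 𝒟i^tr is simply serialized into the model's input. A sketch of one such serialization (the "x -> y" format here is illustrative, not GPT-3's actual prompt format):

```python
def make_prompt(support, query):
    """Serialize a few-shot task as text: the (input, output) pairs of
    D^tr, then the query input, leaving the output for the model."""
    lines = [f"{x} -> {y}" for x, y in support]
    lines.append(f"{query} ->")
    return "\n".join(lines)

# Translation and math become the same text-in, text-out problem:
translation = make_prompt([("cheese", "fromage"), ("dog", "chien")], "cat")
math = make_prompt([("2+2", "4"), ("3+5", "8")], "6+1")
```

Because any such task is just "predict the next characters", the ordinary language-modeling objective doubles as a meta-training objective.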
Some Results
One-shot learning from dictionary definitions:
Few-shot language editing:
Non-few-shot learning tasks:
Other Cool Use-Cases
English to LaTeX
Source: https://twitter.com/sh_reya/status/1284746918959239168
General Notes & Takeaways
The results are extremely impressive.
The model is far from perfect. The model fails in unintuitive ways.
Source: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html
Source: https://github.com/shreyashankar/gpt3-sandbox/blob/master/docs/priming.md
The choice of 𝒟i^tr at test time is important.
Plan for Today
Transfer Learning - Problem formulation - Fine-tuning
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)
Goals for the end of lecture: - Differences between multi-task learning, transfer learning, and meta-learning problems - Basics of transfer learning via fine-tuning - Training set-up for few-shot meta-learning algorithms - How to implement black-box meta-learning techniques
Topic of Homework 1!
Reminders
Homework 1 posted today, due Wednesday, September 30
Project guidelines will be posted by tomorrow.
Next time: Optimization-based meta-learning