A Simple Framework for Contrastive Learning of Visual Representations


Transcript of A Simple Framework for Contrastive Learning of Visual Representations

Page 1:

A Simple Framework for

Contrastive Learning of Visual Representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton

Presented by Yien Xu and Lichengxi Huang

Page 2:

Previously…

Learning without human supervision is a long-standing problem

Two mainstream approaches:

Generative
+ Learns to generate or otherwise model pixels in the input space
- Computationally expensive
- Pixel-level generation may not be necessary for representation learning

Discriminative
+ Learns representations using objective functions similar to those used for supervised learning
+ Trains networks to perform pretext tasks
+ Both the inputs and labels are derived from an unlabeled dataset

Page 3:

SimCLR

Page 4:

Experiment Results

Evaluated on ImageNet: SimCLR achieves 76.5% top-1 accuracy, a 7% relative improvement over the previous state-of-the-art

When fine-tuned with only 1% of the ImageNet labels, SimCLR achieves 85.8% top-5 accuracy, a 10% relative improvement over the previous state-of-the-art

When fine-tuned on other natural image classification datasets, SimCLR performs on par with or better than a strong supervised baseline on 10 out of 12 datasets

Page 5:

Outline

Motivation
Framework
Evaluation
Conclusion

Page 6:

Framework

Data Augmentation

Base Encoder Network

Projection Head

Contrastive Loss Function

Batch Size

Page 7:

Framework

Data Augmentation

Base Encoder Network

Projection Head

Contrastive Loss Function

Batch Size

Page 8:

Data Augmentation

Transforms any given data example randomly, resulting in two correlated views of the same example

Three augmentations are applied sequentially:
Random cropping (with resize)
Random color distortions
Random Gaussian blur
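For concreteness, here is a minimal sketch of such an augmentation pipeline using torchvision transforms; the crop size, color-jitter strengths, and blur parameters below are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of a SimCLR-style augmentation pipeline (torchvision).
# Parameter values are illustrative assumptions, not the paper's exact settings.
from torchvision import transforms

def simclr_augmentation(image_size=224, strength=1.0):
    # Color distortion = color jitter + random grayscale, applied stochastically.
    color_jitter = transforms.ColorJitter(
        0.8 * strength, 0.8 * strength, 0.8 * strength, 0.2 * strength)
    return transforms.Compose([
        transforms.RandomResizedCrop(image_size),                     # random crop + resize
        transforms.RandomApply([color_jitter], p=0.8),                # random color distortion
        transforms.RandomGrayscale(p=0.2),
        transforms.RandomApply(
            [transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),  # random blur
        transforms.ToTensor(),
    ])

# Each image is augmented twice to produce two correlated views:
# view1, view2 = augment(img), augment(img)
```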

Page 9:

Data Augmentation

Why? Data augmentation defines the predictive tasks.

Previous approaches define contrastive prediction tasks by changing the architecture:
Global-to-local view prediction via constraining the receptive field in the network architecture
Neighboring view prediction via a fixed image splitting procedure and a context aggregation network

This can be avoided by performing simple random cropping (with resizing).

Broader contrastive prediction tasks can be defined:
By extending the family of augmentations
By composing them stochastically

Page 10:

Data Augmentation

Page 11:

Composition of Data Augmentation

To investigate which data augmentations to perform, apply augmentations individually or in pairs.

Always apply crop-and-resize (since ImageNet images are of different sizes).

On one branch, apply the targeted transformation(s).

On the other branch, leave it as the identity: $t(x_i) = x_i$.
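A minimal sketch of this asymmetric ablation setup; the helper name make_views and the specific transforms are assumptions for illustration.

```python
# Sketch of the pairwise-ablation setup: crop-and-resize is always applied;
# one branch additionally applies the transform(s) under test, the other
# branch is left as identity beyond the crop. Names are illustrative.
from torchvision import transforms

base_crop = transforms.RandomResizedCrop(224)  # always applied (ImageNet images differ in size)

def make_views(image, tested_transforms):
    """Return (transformed view, identity view) for one ablation cell."""
    branch_a = transforms.Compose([base_crop, *tested_transforms])
    branch_b = base_crop                       # identity beyond crop-and-resize: t(x) = x
    return branch_a(image), branch_b(image)

# Example cell: crop + color distortion vs. crop only
# v1, v2 = make_views(img, [transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)])
```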

Page 12:

Composition of Data Augmentation

Page 13:

Composition of Data Augmentation

No single transformation suffices to learn good representations

When composing augmentations, the quality of the representation improves

Note one composition of augmentations in particular: random cropping + random color distortion

Page 14:

Composition of Data Augmentation

Why Color Distortion?

Before: random crops of the same image share similar color distributions, so the color distribution alone suffices to distinguish images and the network may take this shortcut.

After color distortion: this shortcut is removed.

Page 15:

Stronger Data Augmentation

Unsupervised contrastive learning benefits from stronger (color) data augmentation than supervised learning

Page 16:

Framework

Data Augmentation

Base Encoder Network

Projection Head

Contrastive Loss Function

Batch Size

Page 17:

Base Encoder

Extracts representation vectors from augmented data examples

Framework allows various choices of the network architecture

SimCLR chooses ResNet: $h_i = f(\tilde{x}_i) = \mathrm{ResNet}(\tilde{x}_i)$
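A minimal sketch of such an encoder, assuming a torchvision ResNet-50 whose classification layer is replaced by an identity so that f returns the pooled representation h; the class name and dimensions are illustrative.

```python
# Sketch of the base encoder f: a ResNet with its classifier removed, so
# f(x) returns the representation h after global average pooling.
import torch
import torch.nn as nn
from torchvision import models

class BaseEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)    # architecture only, no pretrained weights
        self.feature_dim = resnet.fc.in_features  # 2048 for ResNet-50
        resnet.fc = nn.Identity()                 # drop the supervised classification head
        self.f = resnet

    def forward(self, x):
        return self.f(x)                          # h_i = f(x_i)

# h = BaseEncoder()(torch.randn(2, 3, 224, 224))  # h.shape == (2, 2048)
```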

Page 18:

Base Encoder

The performance gap shrinks as model size increases

Unsupervised learning benefits more from bigger models

Page 19:

Framework

Data Augmentation

Base Encoder Network

Projection Head

Contrastive Loss Function

Batch Size

Page 20:

Projection Head

Maps representations to the space where contrastive loss is applied

A small neural network: a multilayer perceptron (MLP) with one hidden layer

$z_i = g(h_i) = W^{(2)}\,\sigma(W^{(1)} h_i)$, where $\sigma$ is ReLU (non-linearity)
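A minimal sketch of this projection head; the hidden and output dimensions below (2048 and 128) are illustrative.

```python
# Sketch of the projection head g: a 2-layer MLP with ReLU,
# z = g(h) = W2 * relu(W1 * h). Dimensions are illustrative.
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # W(1)
            nn.ReLU(inplace=True),           # sigma
            nn.Linear(hidden_dim, out_dim),  # W(2)
        )

    def forward(self, h):
        return self.g(h)                     # z_i = g(h_i)
```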

Page 21:

Projection Head

Non-Linear > Linear >> None

Page 22:

h vs g(h)

h > g(h)

Conjecture: due to loss of information induced by the contrastive loss, g can remove information that may be useful for the downstream task

Page 23:

Framework

Data Augmentation

Base Encoder Network

Projection Head

Contrastive Loss Function

Batch Size

Page 24:

Contrastive Loss Function

Given a set $\{\tilde{x}_k\}$, the contrastive prediction task aims to identify $\tilde{x}_j$ in $\{\tilde{x}_k\}_{k \neq i}$ for a given $\tilde{x}_i$

Randomly sample a minibatch of N examples
Define the task on pairs of augmented examples derived from the minibatch (2N images)
Pick out one pair (2 images) as positives
Treat the other 2(N − 1) augmented examples as negatives

NT-Xent (the normalized temperature-scaled cross entropy loss)

SimCLR computes the loss from all positive pairs, both (i, j) and (j, i), in a mini-batch
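For reference, the paper's per-pair NT-Xent loss is $\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$, where sim is cosine similarity and $\tau$ is a temperature. Below is a minimal batched sketch of this loss; the function name and the convention of stacking the two views as [z1; z2] are assumptions for illustration.

```python
# Minimal sketch of NT-Xent for a batch: z1 and z2 hold the projections of the
# two augmented views of the same N examples, so rows i and i+N are positives.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    z = torch.cat([z1, z2], dim=0)                    # (2N, d)
    z = F.normalize(z, dim=1)                         # cosine similarity via dot products
    sim = z @ z.t() / temperature                     # (2N, 2N) similarity matrix
    n = z1.size(0)
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))   # exclude k == i from the denominator
    # The positive for row i is row i+N, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)              # averaged over all 2N positive pairs

# Example: loss = nt_xent_loss(g(f(view1)), g(f(view2)))
```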

Page 25:

Framework

Data Augmentation

Base Encoder Network

Projection Head

Contrastive Loss Function

Batch Size

Page 26:

Batch Size

Vary the training batch size N from 256 to 8192

Use the LARS optimizer (You et al., 2017)

Page 27:

Batch Size vs Training

Contrastive learning benefits from larger batch sizes and longer training

Page 28:

Page 29:

Outline

Motivation
Framework
Evaluation
Conclusion

Page 30:

Evaluation Protocol

Dataset: ImageNet ILSVRC-2012

Also evaluated on CIFAR-10 and others (for transfer learning)

Protocol: linear evaluation. A linear classifier is trained on top of the frozen base network, and test accuracy is used as a proxy for representation quality.

Data augmentation: crop & resize, color distortion, and Gaussian blur
Optimizer: LARS with LR = 4.8
Batch size: 4096
Epochs: 100
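A minimal sketch of the linear-evaluation protocol, assuming an already-trained encoder that outputs h; the probe's optimizer, learning rate, epoch count, and feature dimension below are illustrative assumptions rather than the paper's exact linear-probe settings.

```python
# Sketch of linear evaluation: freeze the pretrained encoder, train only a
# linear classifier on top of h, and report test accuracy as a proxy for
# representation quality. Hyperparameters below are illustrative.
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, num_classes=1000, feature_dim=2048, epochs=90):
    encoder.eval()                                   # frozen base network
    for p in encoder.parameters():
        p.requires_grad = False
    classifier = nn.Linear(feature_dim, num_classes)
    opt = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                h = encoder(images)                  # representations, no gradient
            loss = loss_fn(classifier(h), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```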

Page 31:

Comparison with State-of-the-art

Page 32:

Semi-Supervised Learning

Sample 1% or 10% of the labeled ImageNet training set in a class-balanced way: roughly 12.8 and 128 images per class, respectively (ImageNet has about 1.28M training images across 1,000 classes)

Page 33:

Semi-Supervised Learning

Page 34:

Transfer Learning

Page 35:

Outline

Motivation
Framework
Evaluation
Conclusion

Page 36:

Conclusion

Composition of multiple data augmentation operations is crucial

A nonlinear transformation (the projection head g) between the representation and the contrastive loss substantially improves the quality of the learned representation

Larger batch sizes and more training steps bring more benefits

Page 37:

Quiz Questions

Page 38:

Which of the following statements are true about effective visual representation learning?

• Heuristics may limit generality of representations.

• Generative approaches train networks using pretext tasks with unlabeled data.

• Discriminative approaches are more widely used than generative approaches.

• Pixel-level generation is expensive in computation.

Page 39:

Which of the following statements are true about SimCLR?

• Given batch size N, SimCLR’s learning algorithm requires computing similarities between all pairs of 2N projections.

• SimCLR’s learning algorithm computes contrastive loss across all positive and all negative pairs.

• The representation h is achieved after max pooling.

• It is beneficial to define the loss on z instead of h, because a nonlinear projection head can improve the representation quality of the layer before it.

Page 40:

Which of the following statements are true about projection heads?

• Nonlinear projection is better than linear projection.

• Nonlinear projection is worse than linear projection.

• Linear projection is better than no projection.

• Linear projection is worse than no projection.
