Self-supervised Learning for Visual Recognition


Self-supervised Learning for Visual Recognition

Hamed Pirsiavash

University of Maryland, Baltimore County

1

Significant progress in recognition due to large annotated datasets:

• 14 million images

• 10 million images

• 450 hours of video

• 1.7 million question/answers

Self-supervised learning

3

Zhang et al. ECCV’16

Input → Output

Chair: 0, Dog: 1, Car: 0, …

Supervised Learning (classification)

Input image

4

Label

Supervised Learning (classification)

Input image

5

Chair: 0, Dog: 1, Car: 0, …

Label

Chair: 1, Dog: 0, Car: 0, …

Supervised Learning (classification)

Input image

Label

6

Chair: 1, Dog: 0, Car: 0, …

Supervised Learning (classification)

Input image

Label

7

Transfer to another task

Supervised Learning (counting)

Input image

8

Chair: 0, Dog: 2, Car: 0, …

Label

9

Inference on counting network

10

Constraint in the output

11


Two constraints in learning

Annotation...

14

[Figure: the counting network. The input image x is downsampled (D ◦ x) and split into four tiles T1 ◦ x, …, T4 ◦ x; all five crops go through the same feature network φ. Writing t = φ(T1 ◦ x) + φ(T2 ◦ x) + φ(T3 ◦ x) + φ(T4 ◦ x) for the summed tile counts, d = φ(D ◦ x) for the whole-image count, and c = φ(D ◦ y) for the count of a different image y, training minimizes |d − t|² + max{0, M − |c − t|²}: the tile counts must add up to the whole-image count, while the margin M pushes the counts of two different images apart.]
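The same objective is easy to write down concretely. Below is a minimal PyTorch sketch of the counting loss above, assuming a feature network `phi` that maps an image batch to a non-negative count vector; the 2×2 tiling, the bilinear downsampling, and the margin value are illustrative choices, not the talk's exact implementation.

```python
import torch
import torch.nn.functional as F

def counting_loss(phi, x, y, M=10.0):
    """x, y: two batches of images (B, 3, H, W); y provides the contrastive term."""
    B, C, H, W = x.shape
    # D o x: downsample the whole image to tile resolution so phi sees equal-sized inputs
    d = phi(F.interpolate(x, size=(H // 2, W // 2), mode="bilinear", align_corners=False))
    # T1 o x ... T4 o x: the four non-overlapping tiles of x
    tiles = [x[:, :, i * H // 2:(i + 1) * H // 2, j * W // 2:(j + 1) * W // 2]
             for i in range(2) for j in range(2)]
    t = sum(phi(tile) for tile in tiles)          # summed tile counts
    # c: count of a different image y (downsampled the same way)
    c = phi(F.interpolate(y, size=(H // 2, W // 2), mode="bilinear", align_corners=False))
    pos = ((d - t) ** 2).sum(dim=1)                              # |d - t|^2
    neg = torch.clamp(M - ((c - t) ** 2).sum(dim=1), min=0.0)    # max{0, M - |c - t|^2}
    return (pos + neg).mean()
```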


Self-supervised learning

Trained on ImageNet without annotation

22

Units 1-3: images with largest activation

Trained on COCO without annotation

23

Units 1-3: images with largest activation

Trained on ImageNet without annotation

24

Nearest neighbor search: query → retrieved images

Trained on COCO without annotation

25

Nearest neighbor search: query → retrieved images

26

Feature network (e.g., AlexNet)

Pretext task (e.g., counting)

Dataset (no labels)

27

Fine-tuning

Feature network (e.g., AlexNet)

Pretext task (e.g., counting)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)
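The fine-tuning step in this diagram follows the standard recipe; a hedged PyTorch sketch is shown below. The checkpoint file name and the number of target classes are placeholders, and loading with strict=False simply drops the pretext head.

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.alexnet(weights=None)
state = torch.load("pretext_alexnet.pth")        # hypothetical checkpoint from counting/jigsaw pre-training
backbone.load_state_dict(state, strict=False)    # keep the conv features, ignore the pretext head
backbone.classifier[6] = nn.Linear(4096, 20)     # new head, e.g. 20 PASCAL VOC classes

optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# ...a standard supervised training loop over the labeled dataset follows
```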

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Fine-tuning on PASCAL VOC07

28

Results on transfer learning

Doersch et al. ICCV’15, Noroozi and Favaro ECCV’16

Zhang et al. ECCV’16, Pathak et al. CVPR’16

Wang and Gupta ICCV’15, Pathak et al. CVPR’17

Jayaraman and Grauman ICCV’15

Agrawal et al. ICCV’15

Owens et al. ECCV’16

Misra et al. ECCV’16

29


Agenda

• Self-supervised learning by counting

• Boosting self-supervised learning by knowledge transfer

35

36

Fine-tuning

Feature network (e.g., AlexNet)

Pretext task (e.g., counting)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)

37

Fine-tuning

Feature network (e.g., AlexNet)

Target task (e.g., object detection)

Dataset (with labels)

Feature network (e.g., AlexNet)

More complicated Pretext task

Larger Dataset (no labels)

38

More complicated Feature network (e.g., VGG)

Target task (e.g., object detection)

Larger Dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)

More complicated Pretext task

39

Transferring

More complicated Feature network (e.g., VGG)

Target task (e.g., object detection)

Larger Dataset (no labels)

Dataset (with labels)

Feature network (e.g., AlexNet)

More complicated Pretext task

40


41

More complicated Feature network (e.g., VGG)

More complicated Pretext task

Target task (e.g., object detection)

Larger Dataset (no labels)

Dataset (with labels)

42

More complicated Feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

More complicated Pretext task

Dataset (with labels)

43

More complicated Feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Pseudo labels

More complicated Pretext task

44

More complicated Feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Fine-tuning

Pseudo labels

More complicated Pretext task

45

Jigsaw


Permute and then predict the permutation

Noroozi, Mehdi, and Paolo Favaro. "Unsupervised learning of visual representations by solving jigsaw puzzles." ECCV 2016.

46
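A minimal sketch of this pretext task is given below, assuming a shared (Siamese) tile encoder. The permutation subset, tile size, and head are simplified stand-ins; the original paper selects about 1,000 maximally distant permutations and samples smaller random tiles within each grid cell.

```python
import itertools
import random
import torch
import torch.nn as nn

# A small fixed permutation set (the paper chooses ~1,000 by maximal Hamming distance).
PERMUTATIONS = list(itertools.islice(itertools.permutations(range(9)), 100))

def make_puzzle(image):
    """image: (3, 225, 225) tensor -> shuffled tiles (9, 3, 75, 75) and the permutation index."""
    tiles = [image[:, i * 75:(i + 1) * 75, j * 75:(j + 1) * 75]
             for i in range(3) for j in range(3)]
    label = random.randrange(len(PERMUTATIONS))
    shuffled = torch.stack([tiles[p] for p in PERMUTATIONS[label]])
    return shuffled, label

class JigsawNet(nn.Module):
    def __init__(self, tile_encoder, feat_dim, n_perms=len(PERMUTATIONS)):
        super().__init__()
        self.encoder = tile_encoder                   # shared tile feature network
        self.head = nn.Linear(9 * feat_dim, n_perms)  # predicts which permutation was applied

    def forward(self, tiles):                         # tiles: (B, 9, 3, 75, 75)
        B = tiles.shape[0]
        feats = self.encoder(tiles.flatten(0, 1))     # (B * 9, feat_dim)
        return self.head(feats.reshape(B, -1))        # permutation logits

# Training minimizes nn.CrossEntropyLoss()(model(tiles), labels).
```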

Jigsaw++


Excerpt from the CVPR 2018 submission:

(a) Self-Supervised Learning Pre-Training. Suppose that we are given a pretext task, a model and a dataset. Our first step in SSL is to train our model on the pretext task with the given dataset (see Fig. 2 (a)). Typically, the models of choice are convolutional neural networks, and one considers as feature the output of some intermediate layer (shown as a grey rectangle in Fig. 2 (a)).

(b) Clustering. Our next step is to compute feature vectors for all the images in our dataset. Then, we use the k-means algorithm with the Euclidean distance to cluster the features (see Fig. 2 (b)). Ideally, when performing this clustering on ImageNet images, we want the cluster centers to be aligned with object categories. In the experiments, we typically use 2,000 clusters.

(c) Extracting Pseudo-Labels. The cluster centers computed in the previous section can be considered as virtual categories. Indeed, we can assign feature vectors to the closest cluster center to determine a pseudo-label associated with the chosen cluster. This operation is illustrated in Fig. 2 (c). Notice that the dataset used in this operation might be different from that used in the clustering step or in the SSL pre-training.

(d) Cluster Classification. Finally, we train a simple classifier using the architecture of the target task so that, given an input image (from the dataset used to extract the pseudo-labels), it predicts the corresponding pseudo-label (see Fig. 2 (d)). This classifier learns a new representation in the target architecture that maps images that were originally close to each other in the pre-trained feature space to close points.

4. The Jigsaw++ Pretext Task

Recent work [7, 31] has shown that deeper architectures (e.g., ResNet) can help in SSL with PASCAL recognition tasks. However, those methods use the same deep architecture for both SSL and fine-tuning. Hence, they are not comparable with previous methods that use a simpler AlexNet architecture in fine-tuning. We are interested in knowing how far one can improve the SSL pre-training of AlexNet for PASCAL tasks. Since in our framework the SSL task is not restricted to use the same architecture as in the final supervised task, we can increase the difficulty of the SSL task along with the capacity of the architecture and still use AlexNet at the fine-tuning stage.

To this aim, we extend the jigsaw [20] task and call it the jigsaw++ task. The original pretext task [20] is to find a reordering of tiles from a 3×3 grid of a square region cropped from an image. In jigsaw++, we replace a random number of tiles in the grid (up to 2) with (occluding) tiles from another random image (see Fig. 3). The number of tiles (0, 1 or 2 in our experiments) as well as their location are randomly selected. The occluding tiles make the task remarkably more complex. First, the model needs to detect the occluding tiles and second, it needs to solve the jigsaw problem by using only the remaining patches. To make sure we are not adding ambiguities to the task, we remove similar permutations so that the minimum Hamming distance between any two permutations is at least 3. In this way, there is a unique solution to the jigsaw task for any number of occlusions in our training setting. Our final training permutation set includes 701 permutations, in which the average and minimum Hamming distance is .86 and 3 respectively. In addition to the mean and std normalization of each patch independently, as it was done in the original paper, we train the network 70% of the time on the grayscale images. In this way, we prevent the network from using low-level statistics to detect occlusions and solve the jigsaw task.

Figure 3: The jigsaw++ task. (a) the main image. (b) a random image. (c) a puzzle from the original formulation of [20], where all tiles come from the same image. (d) a puzzle in the jigsaw++ task, where at most 2 tiles can come from a random image.

We train the jigsaw++ task on both VGG16 and AlexNet architectures. By having a larger capacity with VGG16, the network is better equipped to handle the increased complexity of the jigsaw++ task and is capable of extracting better representations from the data.
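A short sketch of steps (b)-(d) above in Python is given below, assuming the pretext-trained (e.g., VGG) features have already been dumped to disk; scikit-learn's MiniBatchKMeans stands in for the k-means step, and the file name is hypothetical.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

features = np.load("ssl_features.npy")            # (N, D) features from the pretext-trained network (hypothetical file)

# (b) Cluster the features (the excerpt above uses ~2,000 clusters).
kmeans = MiniBatchKMeans(n_clusters=2000, batch_size=10000, random_state=0)
kmeans.fit(features)

# (c) Pseudo-label = index of the nearest cluster center for each image.
pseudo_labels = kmeans.predict(features)          # (N,) integers in [0, 2000)

# (d) Train the smaller target architecture (e.g., AlexNet) with cross-entropy
#     to predict these pseudo-labels; the resulting weights are then fine-tuned
#     on the labeled target task.
```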

• Add distracting patches

• Increase number of permutations

47
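The occluding-tile modification is small in code; a minimal sketch is below. It replaces up to two tiles with tiles from another image before the puzzle is shuffled, and omits the Hamming-distance filtering of the permutation set described in the excerpt above.

```python
import random

def occlude_tiles(tiles, other_tiles, max_occluded=2):
    """tiles, other_tiles: lists of 9 tile tensors from two different images."""
    n = random.randint(0, max_occluded)            # 0, 1, or 2 occluding tiles
    for idx in random.sample(range(9), n):         # random positions to occlude
        tiles[idx] = other_tiles[idx]
    return tiles
```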

Clusters on Jigsaw++


Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Jigsaw++ (Ours) 72.5 56.5 42.6

RotNet (ICLR’18) 72.9 54.4 39.1

Deep clustering (ECCV’18) 73.7 55.4 45.1

Fine-tuning on PASCAL VOC07

52

Results on transfer learning

53

More complicated Feature network (e.g., VGG)

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Fine-tuning

Pseudo labels

More complicated Pretext task

54

Target task (e.g., object detection)

Dataset (no labels)

Dataset (with labels)

Fine-tuning

Pseudo labels

HOG
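A sketch of how the HOG variant of this pipeline can be set up is shown below, assuming images are already loaded as 2-D grayscale arrays; the HOG parameters are common defaults, not necessarily those used here.

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import MiniBatchKMeans

def hog_descriptor(gray_image):
    """gray_image: 2-D numpy array -> flat HOG feature vector."""
    return hog(gray_image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

# descriptors = np.stack([hog_descriptor(img) for img in images])
# pseudo_labels = MiniBatchKMeans(n_clusters=2000).fit_predict(descriptors)
```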

Method Class. Det. Segm.

Supervised 79.9 57.1 48.0

Random 53.3 43.4 19.8

Sound 54.4 44.0 -

Video 63.1 47.2 -

Split-Brain 67.1 46.7 36.0

Watching-Objects 61.0 52.2 -

Jigsaw (new version) 67.6 53.2 37.6

Counting (Ours) 67.7 52.4 36.6

Jigsaw++ (Ours) 72.5 56.5 42.6

HOG (Ours) 70.2 53.2 39.2

Fine-tuning on PASCAL VOC07

55

Results on transfer learning

Kaiming He, Ross Girshick, Piotr Dollár, "Rethinking ImageNet Pre-training", arXiv, Nov 2018.

Visualization of conv1 filters

56

From scratch

CC on VGG-Jigsaw++

CC on HOG

57
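For reference, conv1 filters like these can be rendered with a few lines of matplotlib; the sketch below assumes a torchvision-style AlexNet whose first conv layer sits at features[0].

```python
import numpy as np
import matplotlib.pyplot as plt

def show_conv1_filters(model, cols=12):
    w = model.features[0].weight.detach().cpu().numpy()   # (64, 3, 11, 11) for AlexNet
    w = (w - w.min()) / (w.max() - w.min())                # rescale to [0, 1] for display
    rows = int(np.ceil(w.shape[0] / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for ax in axes.flat:
        ax.axis("off")
    for ax, filt in zip(axes.flat, w):
        ax.imshow(filt.transpose(1, 2, 0))                 # CHW -> HWC
    plt.show()
```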

Thanks to

Mehdi Noroozi, Paolo Favaro, Ananth Kavalkazhani

58

Thanks!