Self-supervised Learning for Visual Recognition
Hamed Pirsiavash
University of Maryland, Baltimore County
Significant progress in recognition is due to large annotated datasets:
• 14 million images
• 10 million images
• 450 hours of video
• 1.7 million question/answer pairs
Self-supervised learning
Zhang et al. ECCV’16
[Figure: input and output images]
Supervised learning (classification)
Given an input image, the network is trained to predict a label vector, e.g. Chair: 0, Dog: 1, Car: 0, …
Transfer to another task
Supervised learning (counting)
Given an input image, the network is trained to predict how many instances of each category it contains, e.g. Chair: 0, Dog: 2, Car: 0, …
Inference on the counting network
Constraint in the output: the count over the whole image should equal the sum of the counts over its tiles, and the counts for different images should differ.
Two constraints in learning
Annotation...
[Figure: the counting network. The input image x is downsampled (D ∘ x) and also split into four tiles (T1 ∘ x, …, T4 ∘ x); a shared network φ maps each to a count vector. With d = φ(D ∘ x), t = Σj φ(Tj ∘ x), and c = φ(D ∘ y) for a different image y, the training loss is
|d − t|² + max{0, M − |c − t|²}.]
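As a sanity check, the two-term counting loss (|d − t|² plus the contrastive term max{0, M − |c − t|²}) can be written down directly. The sketch below is illustrative only: `phi` is a toy stand-in for the counting network and `M` is the contrastive margin, not the authors' implementation.

```python
import numpy as np

def counting_loss(phi, x, y, tiles, downsample, M=10.0):
    """Counting loss: the count of the downsampled image x must match the
    sum of counts of its four tiles (|d - t|^2), while the count of a
    different image y must stay away from that sum (contrastive term)."""
    d = phi(downsample(x))                         # d = phi(D o x)
    t = sum(phi(T(x)) for T in tiles)              # t = sum_j phi(T_j o x)
    c = phi(downsample(y))                         # c = phi(D o y)
    equiv = float(np.sum((d - t) ** 2))            # |d - t|^2
    contrast = max(0.0, M - float(np.sum((c - t) ** 2)))
    return equiv + contrast

# Toy stand-ins: phi "counts" total intensity, D is the identity,
# and the tiles are the four quadrants of a 4x4 image.
phi = lambda im: np.array([im.sum()])
downsample = lambda im: im
tiles = [lambda im, r=r, c=c: im[r:r + 2, c:c + 2]
         for r in (0, 2) for c in (0, 2)]
x = np.ones((4, 4))   # counts agree, so the first term is 0
y = np.zeros((4, 4))  # far from t, so the contrastive term is also 0
loss = counting_loss(phi, x, y, tiles, downsample)
```

With these stand-ins both terms vanish, which is exactly the behavior the two constraints ask for.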
Self-supervised learning
Self-supervised learning
[Figure: images with the largest activation for three units of a network trained on ImageNet without annotation]
[Figure: images with the largest activation for three units of a network trained on COCO without annotation]
[Figure: nearest-neighbor search (query and retrieved images) with features trained on ImageNet without annotation]
[Figure: nearest-neighbor search (query and retrieved images) with features trained on COCO without annotation]
Pretraining: a feature network (e.g., AlexNet) is trained on a pretext task (e.g., counting) using a dataset with no labels.
Fine-tuning: the pretrained feature network is then fine-tuned on the target task (e.g., object detection) using a dataset with labels.
Results on transfer learning (fine-tuning on PASCAL VOC07):

Method                  Class.  Det.   Segm.
Supervised              79.9    57.1   48.0
Random                  53.3    43.4   19.8
Sound                   54.4    44.0   -
Video                   63.1    47.2   -
Split-Brain             67.1    46.7   36.0
Watching-Objects        61.0    52.2   -
Jigsaw (new version)    67.6    53.2   37.6
Counting (ours)         67.7    52.4   36.6
Doersch et al. ICCV’15; Noroozi and Favaro ECCV’16; Zhang et al. ECCV’16; Pathak et al. CVPR’16; Wang and Gupta ICCV’15; Pathak et al. CVPR’17; Jayaraman and Grauman ICCV’15; Agrawal et al. ICCV’15; Owens et al. ECCV’16; Misra et al. ECCV’16
Agenda
• Self-supervised learning by counting
• Boosting self-supervised learning by knowledge transfer
Boosting self-supervised learning by knowledge transfer:
• Baseline: train a feature network (e.g., AlexNet) on a pretext task using a dataset with no labels, then fine-tune it on the target task (e.g., object detection) using a dataset with labels.
• Idea: use a more complicated pretext task and a more complicated feature network (e.g., VGG) on a larger dataset with no labels.
• Problem: the target task still uses AlexNet, so the knowledge learned by the larger network must be transferred.
• Solution: cluster the features of the larger network to obtain pseudo-labels on the unlabeled dataset, train AlexNet to predict those pseudo-labels, then fine-tune AlexNet on the labeled target task.
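The knowledge-transfer recipe (pretext training → feature clustering → pseudo-labels → small-network classification → fine-tuning) can be sketched as a composition of stages. Every name below is illustrative and supplied by the caller; the toy instantiation uses plain Python stand-ins, not real networks.

```python
def nearest_center(f, centers):
    """Index of the closest center (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centers]
    return dists.index(min(dists))

def knowledge_transfer_pipeline(unlabeled, labeled, train_pretext,
                                extract_features, cluster,
                                train_classifier, fine_tune):
    """High-level flow: train a big model on the pretext task, cluster its
    features into pseudo-labels, train the small target-architecture
    classifier on them, then fine-tune on the labeled dataset."""
    big_model = train_pretext(unlabeled)               # e.g., VGG on jigsaw++
    feats = extract_features(big_model, unlabeled)
    centers = cluster(feats)                           # e.g., k-means centers
    pseudo = [nearest_center(f, centers) for f in feats]
    small_model = train_classifier(unlabeled, pseudo)  # e.g., AlexNet
    return fine_tune(small_model, labeled)

# Toy instantiation: 1-D "features", fixed cluster centers, and
# "training" stages that just pass data through.
unlabeled = [0.0, 0.1, 0.2, 9.8, 9.9, 10.0]
result = knowledge_transfer_pipeline(
    unlabeled, labeled={"task": "toy"},
    train_pretext=lambda data: None,
    extract_features=lambda model, data: [[v] for v in data],
    cluster=lambda feats: [[0.1], [9.9]],
    train_classifier=lambda data, pseudo: pseudo,
    fine_tune=lambda model, labeled: model,
)
```

Here the returned value is simply the pseudo-label assignment, which splits the two blobs of toy features into two virtual categories.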
Jigsaw
CVPR 2018 Submission #1024. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.
(a) Self-Supervised Learning Pre-Training. Suppose that we are given a pretext task, a model and a dataset. Our first step in SSL is to train our model on the pretext task with the given dataset (see Fig. 2 (a)). Typically, the models of choice are convolutional neural networks, and one considers as feature the output of some intermediate layer (shown as a grey rectangle in Fig. 2 (a)).
(b) Clustering. Our next step is to compute feature vectors for all the images in our dataset. Then, we use the k-means algorithm with the Euclidean distance to cluster the features (see Fig. 2 (b)). Ideally, when performing this clustering on ImageNet images, we want the cluster centers to be aligned with object categories. In the experiments, we typically use 2,000 clusters.
(c) Extracting Pseudo-Labels. The cluster centers computed in the previous section can be considered as virtual categories. Indeed, we can assign feature vectors to the closest cluster center to determine a pseudo-label associated to the chosen cluster. This operation is illustrated in Fig. 2 (c). Notice that the dataset used in this operation might be different from that used in the clustering step or in the SSL pre-training.
(d) Cluster Classification. Finally, we train a simple classifier using the architecture of the target task so that, given an input image (from the dataset used to extract the pseudo-labels), it predicts the corresponding pseudo-label (see Fig. 2 (d)). This classifier learns a new representation in the target architecture that maps images that were originally close to each other in the pre-trained feature space to close points.
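Steps (b) and (c) — Euclidean k-means over the features, then nearest-center pseudo-labels — can be sketched in a few lines of numpy. This is a toy 2-D example with deterministic initialization, not the paper's setup (which clusters deep features into 2,000 clusters):

```python
import numpy as np

def kmeans(feats, k, iters=20):
    """Plain k-means with Euclidean distance (step (b)); the initial
    centers are spread over the dataset for determinism."""
    idx = np.linspace(0, len(feats) - 1, k).astype(int)
    centers = feats[idx].astype(float)
    for _ in range(iters):
        # assign each feature vector to its nearest center
        dist = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        # recompute each center as the mean of its assigned features
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(0)
    return centers

def pseudo_labels(feats, centers):
    """Step (c): the index of the nearest center is the pseudo-label."""
    dist = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return dist.argmin(1)

# Toy 2-D "features": two well-separated blobs stand in for SSL features.
feats = np.vstack([np.zeros((5, 2)), 10.0 + np.zeros((5, 2))])
centers = kmeans(feats, k=2)
labels = pseudo_labels(feats, centers)
```

The pseudo-labels may then be computed on a different dataset than the one used for clustering, exactly as the excerpt notes.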
4. The Jigsaw++ Pretext Task
Recent work [7, 31] has shown that deeper architectures can help in SSL with PASCAL recognition tasks (e.g., ResNet). However, those methods use the same deep architecture for both SSL and fine-tuning. Hence, they are not comparable with previous methods that use a simpler AlexNet architecture in fine-tuning. We are interested in knowing how far one can improve the SSL pre-training of AlexNet for PASCAL tasks. Since in our framework the SSL task is not restricted to use the same architecture as in the final supervised task, we can increase the difficulty of the SSL task along with the capacity of the architecture and still use AlexNet at the fine-tuning stage.
To this aim, we extend the jigsaw [20] task and call it the jigsaw++ task. The original pretext task [20] is to find a reordering of tiles from a 3×3 grid of a square region cropped from an image. In jigsaw++, we replace a random number of tiles in the grid (up to 2) with (occluding) tiles from another random image (see Fig. 3). The number of tiles (0, 1 or 2 in our experiments) as well as their location are randomly selected. The occluding tiles make the task remarkably more complex. First, the model needs to detect the occluding tiles and second, it needs to solve the jigsaw problem by using only the remaining patches. To make sure we are not adding ambiguities to the task, we remove similar permutations so that the minimum Hamming distance between any two permutations is at least 3. In this way, there is a unique solution to the jigsaw task for any number of occlusions in our training setting. Our final training permutation set includes 701 permutations, in which the average and minimum Hamming distance is .86 and 3 respectively. In addition to the mean and std normalization of each patch independently, as it was done in the original paper, we train the network 70% of the time on the grayscale images. In this way, we prevent the network from using low-level statistics to detect occlusions and solve the jigsaw task.
Figure 3: The jigsaw++ task. (a) the main image. (b) a random image. (c) a puzzle from the original formulation of [20], where all tiles come from the same image. (d) a puzzle in the jigsaw++ task, where at most 2 tiles can come from a random image.
We train the jigsaw++ task on both VGG16 and AlexNet architectures. By having a larger capacity with VGG16, the network is better equipped to handle the increased complexity of the jigsaw++ task and is capable of extracting better representations from the data.
Following our pipeline in Fig. 2, we train our models
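The constraint that any two permutations in the training set be at Hamming distance at least 3 can be enforced greedily. This is only a sketch of the idea, not the authors' exact construction, and it runs on 4 tiles rather than the 3×3 grid for speed:

```python
from itertools import permutations

def hamming(p, q):
    """Number of positions where two permutations disagree."""
    return sum(a != b for a, b in zip(p, q))

def select_permutations(n_tiles, min_dist, limit):
    """Greedily keep permutations whose Hamming distance to every
    previously kept permutation is at least min_dist."""
    kept = []
    for p in permutations(range(n_tiles)):
        if all(hamming(p, q) >= min_dist for q in kept):
            kept.append(p)
            if len(kept) == limit:
                break
    return kept

# Toy run on 4 tiles (the paper uses 9 tiles and min distance 3).
perms = select_permutations(n_tiles=4, min_dist=3, limit=10)
```

Because no two kept permutations agree in more than one position, a puzzle with up to two occluded tiles still has a unique answer within the set.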
Permute the tiles and then predict the permutation
Noroozi, Mehdi, and Paolo Favaro. "Unsupervised learning of visual representations by solving jigsaw puzzles." ECCV 2016.
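A toy numpy sketch of the data side of this pretext task: cut the image into a grid of tiles and reorder them by a chosen permutation; the network (omitted here) is then trained to predict the permutation's index in a fixed set. All names are illustrative.

```python
import numpy as np

def make_jigsaw_example(image, permutation, grid=3):
    """Cut `image` into grid x grid tiles and reorder them by
    `permutation`; the classification target for the network is the
    index of `permutation` in a fixed permutation set."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    tiles = [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
             for r in range(grid) for c in range(grid)]
    return [tiles[i] for i in permutation]

# Toy 3x3 "image" in which each tile is one pixel holding its own index,
# so the shuffled tile values read off the permutation directly.
img = np.arange(9).reshape(3, 3)
perm = (2, 0, 1, 4, 3, 5, 8, 6, 7)
shuffled = make_jigsaw_example(img, perm)
```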
Jigsaw++
• Add distracting patches
• Increase the number of permutations
Clusters on Jigsaw++
Results on transfer learning (fine-tuning on PASCAL VOC07):

Method                      Class.  Det.   Segm.
Supervised                  79.9    57.1   48.0
Random                      53.3    43.4   19.8
Sound                       54.4    44.0   -
Video                       63.1    47.2   -
Split-Brain                 67.1    46.7   36.0
Watching-Objects            61.0    52.2   -
Jigsaw (new version)        67.6    53.2   37.6
Counting (ours)             67.7    52.4   36.6
Jigsaw++ (ours)             72.5    56.5   42.6
RotNet (ICLR’18)            72.9    54.4   39.1
Deep clustering (ECCV’18)   73.7    55.4   45.1
The same transfer pipeline also works without a learned feature network: replace the self-supervised features with HOG descriptors, cluster them to obtain pseudo-labels on the unlabeled dataset, train the target network on those pseudo-labels, then fine-tune on the labeled dataset.
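A toy numpy sketch of a HOG-style descriptor that could feed the same clustering step. This is heavily simplified relative to real HOG (no cells, blocks, or block normalization) and is only meant to show that hand-crafted gradient statistics already separate differently oriented images:

```python
import numpy as np

def tiny_hog(image, n_bins=9):
    """Toy HOG-style descriptor: one histogram of unsigned gradient
    orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # fold into [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    s = hist.sum()
    return hist / s if s > 0 else hist

# A vertical ramp and a horizontal ramp get peaks in different
# orientation bins, so clustering such descriptors separates them.
horiz = np.tile(np.arange(8), (8, 1)).T   # intensity varies along rows
vert = np.tile(np.arange(8), (8, 1))      # intensity varies along columns
f1, f2 = tiny_hog(horiz), tiny_hog(vert)
```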
Results on transfer learning (fine-tuning on PASCAL VOC07):

Method                  Class.  Det.   Segm.
Supervised              79.9    57.1   48.0
Random                  53.3    43.4   19.8
Sound                   54.4    44.0   -
Video                   63.1    47.2   -
Split-Brain             67.1    46.7   36.0
Watching-Objects        61.0    52.2   -
Jigsaw (new version)    67.6    53.2   37.6
Counting (ours)         67.7    52.4   36.6
Jigsaw++ (ours)         72.5    56.5   42.6
HOG (ours)              70.2    53.2   39.2
Kaiming He, Ross Girshick, Piotr Dollár, “Rethinking ImageNet Pre-training”, arXiv, Nov 2018.
[Figure: visualization of conv1 filters — from scratch, CC on VGG-Jigsaw++, CC on HOG]
Thanks to Mehdi Noroozi, Paolo Favaro, and Ananth Kavalkazhani.
Thanks!