Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2...

97
Object Recognition II Linda Shapiro ECE/CSE 576 with CNN slides from Ross Girshick 1

Transcript of Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2...

Page 1: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Object Recognition II

Linda Shapiro

ECE/CSE 576with CNN slides from Ross Girshick

1

Page 2: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Outline

• Object detection• the task, evaluation, datasets

• Convolutional Neural Networks (CNNs)• overview and history

• Region-based Convolutional Networks (R-CNNs)

• You Only Look Once (YOLO)

2

Page 3: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Image classification

• 𝐾 classes

• Task: assign correct class label to the whole image

Digit classification (MNIST) Object recognition (Caltech-101)

3

Page 4: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Classification vs. Detection

Dog

DogDog

4

Page 5: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Problem formulation

person

motorbike

Input Desired output

{ airplane, bird, motorbike, person, sofa }

5

Page 6: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Evaluating a detector

Test image (previously unseen)

Page 7: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

First detection ...

‘person’ detector predictions

0.9

Page 8: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Second detection ...

0.9

0.6

‘person’ detector predictions

Page 9: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Third detection ...

0.9

0.6

0.2

‘person’ detector predictions

Page 10: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Compare to ground truth

ground truth ‘person’ boxes

0.9

0.6

0.2

‘person’ detector predictions

Page 11: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Sort by confidence

... ... ... ... ...

✓ ✓ ✓

0.9 0.8 0.6 0.5 0.2 0.1

truepositive(high overlap)

falsepositive(no overlap,low overlap, or duplicate)

X X X

Page 12: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Evaluation metric

... ... ... ... ...

0.9 0.8 0.6 0.5 0.2 0.1

✓ ✓ ✓X X X

𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛@𝑡 =#𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠@𝑡

#𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠@𝑡 + #𝑓𝑎𝑙𝑠𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠@𝑡

𝑟𝑒𝑐𝑎𝑙𝑙@𝑡 =#𝑡𝑟𝑢𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠@𝑡

#𝑔𝑟𝑜𝑢𝑛𝑑 𝑡𝑟𝑢𝑡ℎ 𝑜𝑏𝑗𝑒𝑐𝑡𝑠

𝑡

✓✓ + X

Page 13: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Evaluation metric

Average Precision (AP)0% is worst100% is best

mean AP over classes(mAP)

... ... ... ... ...

0.9 0.8 0.6 0.5 0.2 0.1

✓ ✓ ✓X X X

Page 14: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Pedestrians

Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005

AP ~77%More sophisticated methods: AP ~90%

(a) average gradient image over training examples(b) each “pixel” shows max positive SVM weight in the block centered on that pixel(c) same as (b) for negative SVM weights(d) test image(e) its R-HOG descriptor(f) R-HOG descriptor weighted by positive SVM weights(g) R-HOG descriptor weighted by negative SVM weights

14

Page 15: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Overview of HOG Method

1. Compute gradients in the region to be described

2. Put them in bins according to orientation

3. Group the cells into large blocks

4. Normalize each block

5. Train classifiers to decide if these are parts of a human

15

Page 16: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Details

• Gradients[-1 0 1] and [-1 0 1]T were good enough filters.

• Cell HistogramsEach pixel within the cell casts a weighted vote for an orientation-based histogram channel based on the values found in the gradient computation. (9 channels worked)

• BlocksGroup the cells together into larger blocks, either R-HOGblocks (rectangular) or C-HOG blocks (circular).

16

Page 17: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

More Details

• Block Normalization

• If you think of the block as a vector v, then the normalized block is v/norm(v)

They tried 4 different kinds of normalization.

• L1-norm

• sqrt of L1-norm

• L2 norm

• L2-norm followed by clipping

17

Page 18: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Example: Dalal-Triggs pedestrian detector

1. Extract fixed-sized (64x128 pixel) window at each position and scale

2. Compute HOG (histogram of gradient) features within each window

3. Score the window with a linear SVM classifier

4. Perform non-maxima suppression to remove overlapping detections with lower scores

Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR0518

Page 19: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR0519

Page 20: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

uncentered

centered

cubic-corrected

diagonal

Sobel

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

Outperforms

20

Page 21: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

• Histogram of gradient orientations

• Votes weighted by magnitude

• Bilinear interpolation between cells

Orientation: 9 bins (for unsigned angles)

Histograms in 8x8 pixel cells

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR0521

Page 22: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Normalize with respect to surrounding cells

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR0522

Page 23: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

X=

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

# features = 15 x 7 x 9 x 4 = 3780

# cells

# orientations

# normalizations by neighboring cells

23

Page 24: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Training set

24

Page 25: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05

pos w neg w

25

Page 26: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

pedestrian

Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR0526

Page 27: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Detection examples

27

Page 28: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

30

Page 29: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Deformable Parts Model

• Takes the idea a little further

• Instead of one rigid HOG model, we have multiple HOG models in a spatial arrangement

• One root part to find first and multiple other parts in a tree structure.

31

Page 30: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

The Idea

Articulated parts model• Object is configuration of parts

• Each part is detectable

Images from Felzenszwalb32

Page 31: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Deformable objects

Images from Caltech-256

Slide Credit: Duan Tran 33

Page 32: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Deformable objects

Images from D. Ramanan’s datasetSlide Credit: Duan Tran

34

Page 33: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

How to model spatial relations?• Tree-shaped model

35

Page 34: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

36

Page 35: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Hybrid template/parts model

Detections

Template Visualization

Felzenszwalb et al. 200837

Page 36: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Pictorial Structures Model

Appearance likelihood Geometry likelihood38

Page 37: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Results for person matching

39

Page 38: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Results for person matching

40

Page 39: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

BMVC 200941

Page 40: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

2012 State-of-the-art Detector:Deformable Parts Model (DPM)

42Felzenszwalb et al., 2008, 2010, 2011, 2012

Lifetime

Achievement

1. Strong low-level features based on HOG2. Efficient matching algorithms for deformable part-based

models (pictorial structures)3. Discriminative learning with latent variables (latent SVM)

Page 41: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Why did gradient-based models work?

Average gradient image

43

Page 42: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Generic categories

Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …?PASCAL Visual Object Categories (VOC) dataset

44

Page 43: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Generic categoriesWhy doesn’t this work (as well)?

Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …?PASCAL Visual Object Categories (VOC) dataset

45

Page 44: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Quiz time(Back to Girshick)

46

Page 45: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Warm up

This is an average image of which object class?

47

Page 46: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Warm up

pedestrian

48

Page 47: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

A little harder

?

49

Page 48: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

A little harder

?Hint: airplane, bicycle, bus, car, cat, chair, cow, dog, dining table

50

Page 49: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

A little harder

bicycle (PASCAL)

51

Page 50: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

A little harder, yet

?

52

Page 51: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

A little harder, yet

?Hint: white blob on a green background

53

Page 52: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

A little harder, yet

sheep (PASCAL)

54

Page 53: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Impossible?

?

55

Page 54: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Impossible?

dog (PASCAL)

56

Page 55: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Impossible?

dog (PASCAL)Why does the mean look like this?

There’s no alignment between the examples!How do we combat this?

57

Page 56: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

PASCAL VOC detection history

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

DPM

DPM,HOG+BOW

DPM,MKL

DPM++DPM++,MKL,SelectiveSearch

Selective Search,DPM++,MKL

41%41%37%

28%23%

17%

Page 57: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Part-based models & multiple features (MKL)

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

DPM

DPM,HOG+BOW

DPM,MKL

DPM++DPM++,MKL,SelectiveSearch

Selective Search,DPM++,MKL

41%41%37%

28%23%

17%

Page 58: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Kitchen-sink approaches

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

DPM

DPM,HOG+BOW

DPM,MKL

DPM++DPM++,MKL,SelectiveSearch

Selective Search,DPM++,MKL

41%41%37%

28%23%

17%

increasing complexity & plateau

Page 59: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Region-based Convolutional Networks (R-CNNs)

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

DPM

DPM,HOG+BOW

DPM,MKL

DPM++DPM++,MKL,SelectiveSearch

Selective Search,DPM++,MKL

41%41%37%

28%23%

17%

53%

62%

R-CNN v1

R-CNN v2

[R-CNN. Girshick et al. CVPR 2014]

Page 60: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

0%

10%

20%

30%

40%

50%

60%

70%

2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

mea

n A

vera

ge P

reci

sio

n (

mA

P)

year

~1 year

~5 years

Region-based Convolutional Networks (R-CNNs)

[R-CNN. Girshick et al. CVPR 2014]

Page 61: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional Neural Networks

• Overview

63

Page 62: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Standard Neural Networks

𝒙 = 𝑥1, … , 𝑥784𝑇 𝑧𝑗 = 𝑔(𝒘𝑗

𝑇𝒙) 𝑔 𝑡 =1

1 + 𝑒−𝑡

“Fully connected”

64

Page 63: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

From NNs to Convolutional NNs

• Local connectivity

• Shared (“tied”) weights

• Multiple feature maps

• Pooling

65

Page 64: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional NNs

• Local connectivity

• Each green unit is only connected to (3)neighboring blue units

compare

66

Page 65: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional NNs

• Shared (“tied”) weights

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

𝑤1

𝑤2

𝑤3

67

Page 66: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

68

Page 67: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

69

Page 68: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

70

Page 69: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window

𝑤1

𝑤2

𝑤3

71

Page 70: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional NNs

• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]

• All green units share the same parameters 𝒘

• Each green unit computes the same function,but with a different input window𝑤1

𝑤2

𝑤3

72

Page 71: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional NNs

• Multiple feature maps

• All orange units compute the same functionbut with a different input windows

• Orange and green units compute different functions

𝑤1

𝑤2

𝑤3

𝑤′1𝑤′2𝑤′3

Feature map 1(array of greenunits)

Feature map 2(array of orangeunits)

73

Page 72: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Convolutional NNs

• Pooling (max, average)

1

4

0

3

4

3

• Pooling area (how much to pool): 2 units

• Pooling stride (how far to move): 2 units

• Subsamples feature maps

74

Page 73: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Image

Pooling

Convolution

2D input

75

Page 74: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Historical perspective – 1980Hubel and Wiesel1962

Included basic ingredients of ConvNets, but no supervised learning algorithm76

Page 75: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Supervised learning – 1986

Early demonstration that error backpropagation can be usedfor supervised training of neural nets (including ConvNets)

Gradient descent training with error backpropagation

77

Page 76: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

1989

Backpropagation applied to handwritten zip code recognition, Lecun et al., 1989 78

Page 77: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Practical ConvNets

Gradient-Based Learning Applied to Document Recognition, Lecun et al., 1998

79

Page 78: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Core idea of “deep learning”

• Input: the “raw” signal (image, waveform, …)

• Features: hierarchy of features is learned from the raw input

80

Page 79: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

What’s new since the 1980s?

• More layers• LeNet-3 and LeNet-5 had 3 and 5 learnable layers

• Current models have 8 – 20+

• “ReLU” non-linearities (Rectified Linear Unit)• 𝑔 𝑥 = max 0, 𝑥

• Gradient doesn’t vanish

• “Dropout” regularization (randomly selects neurons to remove during training epochs, reduces overfitting)

• Fast GPU implementations

• More data

𝑥

𝑔(𝑥)

81

Page 80: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Demo

• http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

• ConvNetJS by Andrej Karpathy (Ph.D. student at Stanford)

Software libraries

• Caffe (C++, python, matlab)

• Torch7 (C++, lua)

• Theano (python)

82

Page 81: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

What else? Object Proposals• Sliding window based object detection

• Object proposals• Fast execution

• High recall with low # of candidate boxes

ImageFeature

ExtractionClassificaiton

Iterate over window size, aspect ratio, and location

ImageFeature

ExtractionClassificaiton

Object Proposal

83

Page 82: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

The number of contours wholly enclosed by a bounding box is indicative of the likelihood of the box containing an object.

84

Page 83: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Ross’s Own System: Region CNNs

Page 84: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Competitive Results

Page 85: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Top Regions for Six Object Classes

Page 86: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

What came Next?

• Faster R-CNN (Girshick, 2016)

• YOLO (Redmon, Girshick, Divvala, Farhadi, 2016)

You Only Look Once (People’s Choice Award)

• YOLO9000: Better, Faster, Stronger (Redmon, Farhadi, CVPR 2017) Runner up for Best Paper

• YOLOv3: more improvements

88

Page 87: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

89

Page 88: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

90

Page 89: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

91

Page 90: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

92

Page 91: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

93

Page 92: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

94

Page 93: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

95

Page 94: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

96

Page 95: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

97

Page 96: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

More YOLO (second paper)

• At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007.

• At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art methods like Faster R-CNN with ResNet and SSD while still running significantly faster

• A new methodology that jointly trains for detection and classification produced YOLO9000.

• YOLO9000 can detect more than 9000 object categories in real time.

• And YOLOv3 has come out.

98

Page 97: Object Recognition II - courses.cs.washington.eduMore YOLO (second paper) •At 67 FPS, YOLOv2 gets76.8 mAP on VOC 2007. •At 40 FPS, YOLOv2 gets 78.6mAP, outperforming state-of-the-art

Finale

• Object recognition has moved rapidly in the last 12 years to becoming very appearance based.

• The HOG descriptor lead to fast recognition of specific views of generic objects, starting with pedestrians and using SVMs.

• Deformable parts models extended that to allow more objects with articulated limbs, but still specific views.

• CNNs have become the method of choice; they learn from huge amounts of data and can learn multiple views of each object class.

• YOLO is the current winner (I think). CVPR 19 is coming up in June.

99