Lecture 03 internet video search


Transcript of Lecture 03 internet video search

Page 1: Lecture 03 internet video search

6: Location and context

Page 2: Lecture 03 internet video search

What makes a cow a cow?

Google knows because other people know

We think we know

“Because it has four legs.” But in fact not all cows show four legs, nor are they all brown … not all…

How do you know?

Page 3: Lecture 03 internet video search

What is the object in the middle?

No segmentation … Not even the pixel values of the object …

Page 4: Lecture 03 internet video search

Where is evidence for an object?

Uijlings IJCV 2011

Page 5: Lecture 03 internet video search

Where is evidence for an object?

Uijlings IJCV 2011

Page 6: Lecture 03 internet video search

What is the visual extent of an object?

Uijlings IJCV 2012

Page 7: Lecture 03 internet video search

Where: exhaustive search

Look everywhere for the object window. Imposes computational constraints on:

• Very many locations and windows (coarse grid/fixed aspect ratio)

• Evaluation cost per location (weak features/classifiers)

Impressive but takes long.

Viola IJCV 2004, Dalal CVPR 2005, Felzenszwalb PAMI 2010, Vedaldi ICCV 2009
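To make the cost of exhaustive search concrete, here is a minimal sketch that counts sliding-window positions. The image size, window sizes and strides are hypothetical values chosen for illustration, not figures from the cited papers.

```python
def count_windows(img_w, img_h, win_sizes, stride):
    """Count sliding-window positions over all (width, height) window sizes."""
    total = 0
    for w, h in win_sizes:
        if w > img_w or h > img_h:
            continue  # window does not fit in the image
        nx = (img_w - w) // stride + 1  # horizontal positions
        ny = (img_h - h) // stride + 1  # vertical positions
        total += nx * ny
    return total


# Hypothetical setup: a 500x375 image and square windows (one fixed aspect ratio).
sizes = [(s, s) for s in (32, 64, 128, 256)]
dense = count_windows(500, 375, sizes, stride=1)   # every pixel position
coarse = count_windows(500, 375, sizes, stride=8)  # coarse grid
print("dense grid:", dense, "windows; coarse grid:", coarse, "windows")
```

Even with a single aspect ratio, the dense grid yields far more windows than the coarse one, which is why exhaustive methods fall back on coarse grids and cheap per-window classifiers.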

Page 8: Lecture 03 internet video search

Where: the need for a hierarchy

An image is intrinsically hierarchical.

Gu CVPR 2009

Page 9: Lecture 03 internet video search

Selective search

Van de Sande ICCV 2011

Windows formed by hierarchical grouping. Adjacent grouping on color/texture/shape cues. Felzenszwalb 2004
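The grouping idea can be sketched as a greedy merge of adjacent regions, where every intermediate region contributes a window. This is a simplified illustration, not the published implementation: it assumes regions arrive from an initial over-segmentation as dictionaries with hypothetical `size`, `hist` and `box` fields, and it uses only a colour-histogram similarity instead of the full colour/texture/shape cues.

```python
def hist_intersection(a, b):
    """Similarity of two normalised histograms (1.0 means identical)."""
    return sum(min(x, y) for x, y in zip(a, b))


def merge(r1, r2):
    """Merge two regions: size-weighted histogram, union bounding box."""
    size = r1["size"] + r2["size"]
    hist = [(x * r1["size"] + y * r2["size"]) / size
            for x, y in zip(r1["hist"], r2["hist"])]
    b1, b2 = r1["box"], r2["box"]
    box = (min(b1[0], b2[0]), min(b1[1], b2[1]),
           max(b1[2], b2[2]), max(b1[3], b2[3]))
    return {"size": size, "hist": hist, "box": box}


def hierarchical_grouping(regions, adjacency):
    """regions: id -> region dict; adjacency: set of frozenset({id_a, id_b}).

    Returns the boxes of all regions seen at every level of the hierarchy.
    """
    regions = dict(regions)
    adjacency = set(adjacency)
    proposals = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1
    while adjacency:
        # Pick the most similar pair of adjacent regions.
        pair = max(adjacency, key=lambda p: hist_intersection(
            *(regions[i]["hist"] for i in p)))
        a, b = tuple(pair)
        regions[next_id] = merge(regions[a], regions[b])
        proposals.append(regions[next_id]["box"])
        # Neighbours of a or b become neighbours of the merged region.
        adjacency = {frozenset(next_id if i in pair else i for i in p)
                     for p in adjacency if p != pair}
        del regions[a], regions[b]
        next_id += 1
    return proposals


# Toy input: three regions in a row, two reddish and one bluish.
regions = {
    0: {"size": 100, "hist": [1.0, 0.0], "box": (0, 0, 10, 10)},
    1: {"size": 100, "hist": [0.9, 0.1], "box": (10, 0, 20, 10)},
    2: {"size": 100, "hist": [0.0, 1.0], "box": (20, 0, 30, 10)},
}
adjacency = {frozenset({0, 1}), frozenset({1, 2})}
proposals = hierarchical_grouping(regions, adjacency)
print(proposals)
```

The two reddish regions merge first, then the result merges with the bluish one, so the windows cover all scales from the initial segments up to the whole image.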

Page 10: Lecture 03 internet video search

Selective search example

Page 11: Lecture 03 internet video search


Selective search example

Page 12: Lecture 03 internet video search

Average best overlap ~88%

… looks like this

High recall cat
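A minimal sketch of how an average best overlap figure could be computed, assuming "overlap" means intersection over union between a ground-truth box and a proposed window (the PASCAL-style criterion). The boxes below are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def average_best_overlap(gt_boxes, proposals):
    """For each ground-truth box take its best-overlapping proposal, then average."""
    return sum(max(iou(g, p) for p in proposals) for g in gt_boxes) / len(gt_boxes)


# Hypothetical ground truth and proposals.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
windows = [(0, 0, 10, 10), (20, 20, 30, 40), (50, 50, 60, 60)]
abo = average_best_overlap(gt, windows)
print("ABO:", abo)
```

A high value means that for most objects at least one proposed window fits tightly, which is the property a later classifier depends on.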

Page 13: Lecture 03 internet video search

Pairs of concepts

Uijlings ICCV demo 2012

Page 14: Lecture 03 internet video search

6. Conclusion

Selective search gives good localization. Localization needed to understand pairs of concepts.

Page 15: Lecture 03 internet video search

7 Data and metadata

http://bit.ly/visualsearchengines

Page 16: Lecture 03 internet video search

How many concepts?

Fei-Fei Li slide. Biederman, Psychological Review 1987

Page 17: Lecture 03 internet video search

How many examples?

Once you are over 100 – 1000 examples, success is there.

Page 18: Lecture 03 internet video search

Russell IJCV 2008

LabelMe 290,000 object annotations

Amateur labeling

Page 19: Lecture 03 internet video search

Amateur labeling

Page 20: Lecture 03 internet video search

Amateur labeling

Page 21: Lecture 03 internet video search

Xirong Li, TMM 2009

Tag relevance by social annotation

Consistency in tagging between users on similar images.
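One way to turn tagging consistency into a score is neighbour voting: count how many visually similar images carry the tag and subtract the count expected by chance. The sketch below assumes a Euclidean distance on toy 2-D features and a frequency-based prior; the feature values and the `collection` layout are hypothetical.

```python
import math


def tag_relevance(query_feat, tag, collection, k):
    """collection: list of (feature_vector, set_of_tags) pairs.

    Returns votes from the k visual neighbours minus the expected
    number of votes if the tag were spread at random.
    """
    neighbours = sorted(collection,
                        key=lambda item: math.dist(query_feat, item[0]))[:k]
    votes = sum(1 for _, tags in neighbours if tag in tags)
    tag_freq = sum(1 for _, tags in collection if tag in tags) / len(collection)
    return votes - k * tag_freq


# Toy collection: a "snow" cluster and a "beach" cluster with one noisy tag.
collection = [
    ((0.0, 0.0), {"snow"}),
    ((0.1, 0.0), {"snow", "ski"}),
    ((0.0, 0.1), {"snow"}),
    ((5.0, 5.0), {"beach"}),
    ((5.1, 5.0), {"beach", "snow"}),
    ((5.0, 5.2), {"beach"}),
]
print(tag_relevance((0.05, 0.05), "snow", collection, k=3))  # image in the snow cluster
print(tag_relevance((5.0, 5.1), "snow", collection, k=3))    # image in the beach cluster
```

A positive score means the neighbours agree on the tag more often than chance would predict; a negative score suggests the tag is probably noise for that image.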

Page 22: Lecture 03 internet video search

Tag relevance by social annotation

Pretty good for snow not so good for rainbow.

Page 23: Lecture 03 internet video search

Social negative bootstrapping

Xirong Li ACM MM 2009

Negative images are as important as positive images for learning. Not just random negative images, but close ones.

• We want to learn positive examples from an expert, and obtain as many negative samples as we like for free from the web.

• We iteratively aim for the hardest negatives.
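The iteration over hardest negatives can be sketched as follows. This is a simplified illustration under stated assumptions: a toy nearest-centroid scorer stands in for the real classifier, selected negatives are simply accumulated across rounds, and all data points are made up.

```python
import math
import random


def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def train(positives, negatives):
    """Toy classifier: returns a scoring function, higher means 'more positive'."""
    cp, cn = centroid(positives), centroid(negatives)
    return lambda x: math.dist(x, cn) - math.dist(x, cp)


def negative_bootstrap(positives, pool, rounds=3, per_round=5, seed=0):
    """Start from random negatives, then repeatedly add the hardest ones."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    negatives, pool = pool[:per_round], pool[per_round:]
    score = train(positives, negatives)
    for _ in range(rounds):
        if not pool:
            break
        # Hardest negatives: pool items the current model scores as most positive.
        pool.sort(key=score, reverse=True)
        negatives += pool[:per_round]
        pool = pool[per_round:]
        score = train(positives, negatives)
    return score, negatives


# Hypothetical data: expert positives near (1, 1); web negatives far away and close by.
positives = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
far = [(10.0 + i, 10.0) for i in range(10)]
hard = [(2.0, 2.0 + 0.1 * i) for i in range(5)]
score, negs = negative_bootstrap(positives, far + hard, rounds=1, per_round=5)
print(sorted(negs))
```

After one round every close negative has been pulled into the training set, while most of the far, uninformative ones are left in the pool.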

Page 24: Lecture 03 internet video search

Social negative bootstrapping

Xirong Li ICMR 2011

Page 25: Lecture 03 internet video search

Knowledge ontology ImageNet

Page 26: Lecture 03 internet video search

Acknowledgement

Kai Li (Princeton), Alex Berg (Columbia), Jia Deng (Princeton/Stanford), Hao Su (Stanford)

WordNet friends: Christiane Fellbaum and Dan Osherson (Princeton)

Page 27: Lecture 03 internet video search

PASCAL VOC

The PASCAL Visual Object Classes (VOC).

500,000 images downloaded from Flickr.

Queries like “car”, “vehicle”, “street”, “downtown”.

10,000 objects, 25,000 labels.

Mark Everingham, Luc Van Gool, Chris Williams, John Winn, Andrew Zisserman

Page 28: Lecture 03 internet video search

7. Conclusion

Data is king. The data are beginning to reflect human cognitive capacity [at a basic level]. Harvesting social data requires advanced computer vision control.

Page 29: Lecture 03 internet video search

8 Performance

Page 30: Lecture 03 internet video search

PASCAL 2010

Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow

Page 31: Lecture 03 internet video search

True Positives - Person

UOCTTI_LSVM_MDPM

NLPR_HOGLBP_MC_LCEGCHLC

NUS_HOGLBP_CTX_CLS_RESCORE_V2

Page 32: Lecture 03 internet video search

False Positives - Person

UOCTTI_LSVM_MDPM

NLPR_HOGLBP_MC_LCEGCHLC

NUS_HOGLBP_CTX_CLS_RESCORE_V2

Page 33: Lecture 03 internet video search

Non-birds & non-boats

Non-bird images: Highest ranked

Non-boat images: Highest ranked

Water texture and scene composition?

Page 34: Lecture 03 internet video search

Non-chair

Page 35: Lecture 03 internet video search

True Positives - Motorbike

MITUCLA_HIERARCHY

NLPR_HOGLBP_MC_LCEGCHLC

NUS_HOGLBP_CTX_CLS_RESCORE_V2

Page 36: Lecture 03 internet video search

False Positives - Motorbike

MITUCLA_HIERARCHY

NLPR_HOGLBP_MC_LCEGCHLC

NUS_HOGLBP_CTX_CLS_RESCORE_V2

Page 37: Lecture 03 internet video search

Object localization 2008-2010

Results on 2008 data improve for 2010 methods for all categories, by over 100% for some categories.

[Figure: bar chart of max AP (%) per category for 2008, 2009 and 2010, axis from 0 to 60. Categories: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tvmonitor]

Page 38: Lecture 03 internet video search

TRECvid evaluation standard

Page 39: Lecture 03 internet video search

Concept detection

Aircraft

Beach

Mountain

People marching

Police/Security

Flower

Page 40: Lecture 03 internet video search

Measuring performance

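• Precision: relevant retrieved items as a fraction of the set of retrieved items

• Recall: relevant retrieved items as a fraction of the set of relevant items

Precision and recall have an inverse relationship.

[Figure: the set of retrieved items and the set of relevant items overlap in the set of relevant retrieved items; ranked results list 1 to 5]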
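A small worked example of these measures on a ranked result list. The item names are hypothetical, and the average precision shown is the plain non-interpolated form, which may differ from the exact variants used in the PASCAL and TRECVID benchmarks.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant retrieved items
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked list of five results; three shots are truly relevant.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
p, r = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
print(p, r, ap)
```

Retrieving more results can only raise recall but tends to lower precision, which is the inverse relationship; average precision summarises the whole ranking in one number.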

Page 41: Lecture 03 internet video search

UvA-MediaMill@TRECVID

• other systems

Snoek et al, TRECVID 04-10

Page 42: Lecture 03 internet video search

Performance doubled in just 3 years

• 36 concept detectors

Snoek & Smeulders, IEEE Computer 2010

Even when using training data of different origin, great progress. But the number of concepts is still limited.

Page 43: Lecture 03 internet video search

8. Conclusion

Impressive results, improving quickly every year. Very valuable competition. Best non-classes start to make sense!

Page 44: Lecture 03 internet video search

9 Speed

Page 45: Lecture 03 internet video search

SURF based on integral images

Integral images were introduced by Viola & Jones in the context of face detection: sliding windows are evaluated on integral images accumulated left to right and top to bottom.
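A minimal sketch of an integral image and the four-lookup box sum it enables. The tiny 3x3 image is only an illustration; the padded layout with an extra zero row and column is a common convention assumed here.

```python
def integral_image(img):
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):            # top to bottom
        row_sum = 0
        for x in range(w):        # left to right
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 0, 0, 3, 3), box_sum(ii, 1, 1, 3, 3))
```

The cost of `box_sum` does not depend on the box size, so a box filter can be evaluated in constant time at any scale.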


Page 46: Lecture 03 internet video search

SURF principle

LREC 2004, 26 May 2004, Lisbon

Approximate the Gaussian second-order derivatives Lxx, Lyy and Lxy with box filters.

[Figure: panels labelled Lyy and Lxy]

Page 47: Lecture 03 internet video search

SURF speed


Computation time: 6 times faster than DoG (~100msec). Independent of filter scale.

[Figure: axis label "Scale"]

Page 48: Lecture 03 internet video search

Dense descriptor extraction

Pixel-wise Responses Final Descriptor

Factor 16 speed improvement; another factor 2 from the use of matrix libraries.

Page 49: Lecture 03 internet video search

Projection: Random Forest

Binary decision trees

Moosmann et al. 2008
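A minimal sketch of projecting descriptors onto visual words with binary decision trees: each descriptor walks down every tree, and the leaf it reaches is its word. The trees here use untrained random splits and made-up descriptors, purely to show the mechanism; the cited forests are learned from data.

```python
import random


def build_tree(depth, dim, rng):
    """Complete binary tree as a list of (feature index, threshold) per internal node."""
    return [(rng.randrange(dim), rng.random()) for _ in range(2 ** depth - 1)]


def leaf_index(tree, depth, x):
    """Walk from the root to a leaf; costs only `depth` comparisons."""
    node = 0
    for _ in range(depth):
        feature, threshold = tree[node]
        node = 2 * node + (1 if x[feature] < threshold else 2)  # heap-style children
    return node - (2 ** depth - 1)  # leaf number in 0 .. 2**depth - 1


def project(descriptors, forest, depth):
    """Bag-of-words histogram: one bin per leaf of every tree."""
    n_leaves = 2 ** depth
    hist = [0] * (n_leaves * len(forest))
    for x in descriptors:
        for t, tree in enumerate(forest):
            hist[t * n_leaves + leaf_index(tree, depth, x)] += 1
    return hist


# Hypothetical setup: 3 trees of depth 2 over 4-dimensional descriptors in [0, 1).
rng = random.Random(0)
depth, dim = 2, 4
forest = [build_tree(depth, dim, rng) for _ in range(3)]
descriptors = [[rng.random() for _ in range(dim)] for _ in range(10)]
hist = project(descriptors, forest, depth)
print(hist)
```

Assigning a descriptor takes a handful of threshold tests per tree rather than a distance computation against every codebook entry, which is where the speed advantage over nearest-neighbour assignment comes from.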

Page 50: Lecture 03 internet video search

Real-time bag of words

Descriptor extraction: D-SURF 2x2

Projection: pre-projection <empty>, actual projection Random Forest

Classification: SVM kernel RBF

Time per stage (milliseconds): 15, 10, 13

MAP: 0.370

Total computation time is 38 milliseconds per image.

26 frames per second on a normal PC for any 20 concepts.

Page 51: Lecture 03 internet video search

9. Conclusion

SURF is scale and rotation invariant.

Fast due to the use of integral images.

Download: http://www.vision.ee.ethz.ch/~surf/

D-SURF extraction is 6x faster than Dense-SIFT.

Projection using Random Forest is 50x faster than NN.

Page 52: Lecture 03 internet video search

Internet Video Search: the beginning

[Diagram labels: measuring features, lexicon learning, concept detection, telling stories, browsing, video]