Video Google: A Text Retrieval Approach to Object Matching in Videos

Josef Sivic and Andrew Zisserman
Robotics Research Group, Department of Engineering Science
University of Oxford, United Kingdom

Transcript of Video Google: A Text Retrieval Approach to Object Matching in Videos

Page 1: Video Google: A Text Retrieval Approach to Object Matching in Videos


Video Google: A Text Retrieval Approach to Object Matching in Videos

Josef Sivic and Andrew Zisserman
Robotics Research Group, Department of Engineering Science
University of Oxford, United Kingdom

Page 2

Goal

• To retrieve the key frames and shots of a video that contain a particular object.
• To do so with the ease, speed and accuracy with which Google retrieves text documents.

Page 3

Outline

• Introduction
  – Object query
  – Scene query
• Challenging problem
• Text retrieval overview
• Viewpoint invariant description
  – Building the Descriptors
  – Building the Visual Word
  – The Visual Analogy
• Visual indexing using text retrieval methods
• Experimental evaluation of scene matching using visual words
• Object retrieval
  – Stop list
  – Spatial consistency
• Summary and conclusions
• Video Google Demo

Page 4

Introduction – Object query (1/2)

Page 5

Introduction – Scene query (2/2)

Page 6

Challenging problem (1/2)

• Changes in viewpoint, illumination and partial occlusion
• Large amounts of data
• Real-world data

Page 7

Challenging problem (2/2)

Page 8

Text retrieval overview (1/2)

• The documents are parsed into words.
• Words are represented by their stems
  – e.g. ‘walk’, ‘walking’, ‘walks’ -> ‘walk’
• A stop list filters out common words (‘the’, ‘an’, …)
• The remaining words are represented as a vector, weighted by word frequency
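The parse / stem / stop-list / count steps above can be sketched as follows. This is a minimal illustration only: the suffix-stripping `stem` and the `STOP_WORDS` set are toy stand-ins, not the stemmer or stop list used in any real retrieval system.

```python
# Sketch of the text-retrieval preprocessing pipeline described above.
from collections import Counter

# Illustrative stop list (real systems use a much larger one).
STOP_WORDS = {"the", "an", "a", "of", "and"}

def stem(word):
    # Crude suffix stripping stands in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_term_vector(document):
    """Parse into words, drop stop words, stem, and count frequencies."""
    words = document.lower().split()
    stems = [stem(w) for w in words if w not in STOP_WORDS]
    return Counter(stems)  # the word-frequency vector

vec = to_term_vector("the man walks and the dog walks")
# 'walks' is stemmed to 'walk'; 'the' and 'and' are stopped
```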

Page 9

Text retrieval overview (2/2)

• An inverted file facilitates efficient retrieval.
  – An inverted file is structured like an ideal book index.
• Text is retrieved by computing its vector of word frequencies and returning the documents with the closest vectors.
• The returned documents are ranked.
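A minimal sketch of such an inverted file, with integer frame ids standing in for documents (`build_inverted_index` and `query` are illustrative names, not from the paper):

```python
# Minimal inverted file: each word maps to the set of documents
# (here, frame ids) in which it occurs -- like an ideal book index.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping doc_id -> iterable of words."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for w in words:
            index[w].add(doc_id)
    return index

def query(index, words):
    """Return the documents containing all query words."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

index = build_inverted_index({
    1: ["car", "tree"],
    2: ["car", "house"],
    3: ["tree"],
})
# query(index, ["car"]) returns {1, 2}
```

Only the postings lists of the query words are touched, which is what makes retrieval fast even over a large corpus.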

Page 10

Viewpoint invariant description (1/2)

• Two types of viewpoint covariant regions are computed for each frame:
  1. SA – Shape Adapted: regions centred on corner-like features
  2. MS – Maximally Stable: blobs of high contrast with respect to their surroundings
• Regions are computed in grayscale

Page 11

Viewpoint invariant description (2/2)

The MS regions are in yellow. The SA regions are in cyan.

Page 12

Building the Descriptors (1/2)

• SIFT – Scale Invariant Feature Transform
  – Each elliptical region is represented by a 128-dimensional vector
  – SIFT is invariant to a shift of a few pixels

Page 13

Building the Descriptors (2/2)

• Removing noise – tracking & averaging
  – Regions are tracked across a sequence of frames using a constant-velocity dynamical model
  – Any region that does not survive for more than three frames is rejected
  – The descriptors are averaged throughout the track
  – Descriptors with large covariance are rejected

Page 14

Building the Visual Word (1/2)

• Descriptors are clustered into K groups using the K-means clustering algorithm
• Each cluster represents a “visual word” in the “visual vocabulary”
• MS and SA regions are clustered separately
  – giving two different vocabularies for describing the same scene
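The clustering step can be illustrated with a toy K-means over synthetic 2-D "descriptors". This is a simplification: the paper clusters 128-dimensional SIFT descriptors (and uses a Mahalanobis distance), whereas this sketch uses plain Euclidean distance on made-up data.

```python
# Toy K-means: each resulting centroid is one "visual word",
# and the set of centroids is the "visual vocabulary".
import numpy as np

def kmeans(descriptors, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids from k randomly chosen descriptors.
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest centroid (its visual word).
        dists = np.linalg.norm(descriptors[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned descriptors.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = descriptors[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated synthetic clusters -> two visual words.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
vocab, words = kmeans(data, k=2)
```

Quantising a new descriptor to its nearest centroid is then what turns a raw region into a discrete visual word.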

Page 15

Building the Visual Word (2/2)

SA

MS

Page 16

The Visual Analogy

  Text       Visual
  ----       ------
  Word       Descriptor
  Stem       Centroid
  Document   Frame
  Corpus     Film

Page 17

Visual indexing using text retrieval methods (1/2)

• tf-idf – ‘Term Frequency – Inverse Document Frequency’
• With a vocabulary of k words, each document is represented by a k-vector
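In the standard tf-idf scheme, the weight of word i in document d is t_i = (n_id / n_d) · log(N / n_i), where n_id is the number of occurrences of word i in d, n_d the total number of words in d, n_i the number of documents containing word i, and N the number of documents. A minimal sketch of that weighting (function name and sparse-dict representation are illustrative):

```python
# tf-idf weighting: t_i = (n_id / n_d) * log(N / n_i)
import math

def tfidf_vectors(docs):
    """docs: list of word lists. Returns one sparse {word: weight} per doc."""
    N = len(docs)
    df = {}  # n_i: number of documents containing each word
    for words in docs:
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    vectors = []
    for words in docs:
        n_d = len(words)
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        vectors.append({w: (c / n_d) * math.log(N / df[w])
                        for w, c in counts.items()})
    return vectors

vecs = tfidf_vectors([["a", "a", "b"], ["b", "c"], ["c", "c"]])
# "b" occurs in 2 of the 3 documents, so its idf factor is log(3/2)
```

Words that occur in every document get weight zero, which is the mechanism by which tf-idf de-emphasises uninformative (visual) words.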

Page 18

Visual indexing using text retrieval methods (2/2)

• The query vector is given by the visual words contained in a user-specified sub-part of a frame
• The other frames are ranked according to the similarity of their weighted vectors to this query vector
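A standard choice for this similarity is the normalized scalar product (cosine of the angle) between the weighted vectors. A sketch over sparse dict vectors (function names and data are illustrative):

```python
# Rank frames by cosine similarity of their weighted word vectors
# to the query vector.
import math

def cosine(u, v):
    """Normalized scalar product of two sparse {word: weight} vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_frames(query_vec, frame_vecs):
    """frame_vecs: {frame_id: vector}. Returns frame ids, best first."""
    scores = [(cosine(query_vec, v), fid) for fid, v in frame_vecs.items()]
    return [fid for _, fid in sorted(scores, reverse=True)]

order = rank_frames({"a": 1.0, "b": 1.0},
                    {10: {"a": 1.0, "b": 1.0},
                     20: {"c": 1.0},
                     30: {"a": 1.0}})
# frame 10 shares both words with the query and ranks first
```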

Page 19

Experimental evaluation of scene matching using visual words (1/5)

• Goal
  – Evaluate the method by matching scene locations within a closed world of shots (‘ground truth set’)
• Ground truth set
  – 164 frames, from 48 shots, taken at 19 3D locations in the movie ‘Run Lola Run’ (4–9 frames from each location)
  – There are significant viewpoint changes in the frames for the same location

Page 20

Experimental evaluation of scene matching using visual words (2/5)

Page 21

Experimental evaluation of scene matching using visual words (3/5)

• The entire frame is used as the query region
• Performance is measured over all 164 frames
• The correct results were determined by hand
• Rank calculation
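The paper scores each query with the average normalized rank of the relevant frames: 0 when all relevant frames are returned first, rising toward 1 as they fall to the bottom of the list. A minimal sketch of that measure (function name is illustrative):

```python
# Average normalized rank:
#   Rank = (sum(R_i) - N_rel * (N_rel + 1) / 2) / (N_rel * N)
# where R_i are the 1-based ranks of the N_rel relevant frames
# among the N returned frames.

def normalized_rank(ranks, n_rel, n_total):
    """ranks: 1-based ranks of the relevant frames in the result list."""
    return (sum(ranks) - n_rel * (n_rel + 1) / 2) / (n_rel * n_total)

# Perfect retrieval: the 3 relevant frames occupy ranks 1..3 of 164.
score = normalized_rank([1, 2, 3], 3, 164)  # 0.0 for perfect retrieval
```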

Page 22

Experimental evaluation of scene matching using visual words (4/5)

Page 23

Experimental evaluation of scene matching using visual words (5/5)

Page 24

Object retrieval (1/7)

• Goal
  – Searching for objects throughout the entire movie
  – The object of interest is specified by the user as a sub-part of any frame

Page 25

Object retrieval – Stop list (2/7)

• Very common visual words are stopped, reducing the number of mismatches and the size of the inverted file while keeping a sufficient visual vocabulary.

Page 26

Object retrieval – Spatial Consistency (3/7)

• Objects are queried by a sub-part of the image; matched covariant regions in the retrieved frames should have a spatial arrangement similar to that of the outlined region in the query image.
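One way to implement such a check is to let each match vote for every other match whose regions lie nearby in both the query and the retrieved frame, then re-rank frames by the total votes. This is a simplification: the paper searches an area defined by the spatial nearest neighbours of each match, while this sketch uses a fixed radius, and all names here are illustrative.

```python
# Spatial-consistency votes: a match is supported when another match
# is close to it in BOTH the query image and the retrieved frame.

def consistency_votes(matches, radius=50.0):
    """matches: list of ((xq, yq), (xf, yf)) query/frame position pairs.
    Returns one vote count per match."""
    def near(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= radius ** 2
    votes = []
    for i, (qi, fi) in enumerate(matches):
        v = sum(1 for j, (qj, fj) in enumerate(matches)
                if j != i and near(qi, qj) and near(fi, fj))
        votes.append(v)
    return votes

# Two matches with a consistent layout support each other;
# the third match is spatially inconsistent and collects no votes.
matches = [((0, 0), (100, 100)),
           ((10, 0), (110, 100)),
           ((20, 0), (400, 400))]
votes = consistency_votes(matches)  # [1, 1, 0]
```

Matches with no support can then be discarded as mismatches, which both re-ranks the results and localises the object in the retrieved frame.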

Page 27

Object retrieval (4/7)

Page 28

Object retrieval (5/7)

Page 29

Object retrieval (6/7)

Page 30

Object retrieval (7/7)

Page 31

Summary and conclusions

• Visual word and vocabulary analogy
• Immediate run-time object retrieval
• Future work
  – Automatic ways of building the vocabulary are needed
• Intriguing possibilities
  – Latent semantic indexing to find content
  – Automatic clustering to find the principal objects that occur throughout the movie

Page 32

Video Google Demo

• http://www.robots.ox.ac.uk/~vgg/research/vgoogle/