Video Google: A Text Retrieval Approach to Object Matching in Videos

Josef Sivic and Andrew Zisserman
Robotics Research Group, Department of Engineering Science
University of Oxford, United Kingdom

Transcript of Video Google: A Text Retrieval Approach to Object Matching in Videos

Page 1: Video Google: A Text Retrieval Approach to Object Matching in Videos


Video Google: A Text Retrieval Approach to Object Matching in Videos

Josef Sivic and Andrew Zisserman
Robotics Research Group, Department of Engineering Science
University of Oxford, United Kingdom

Page 2

Goal

• To retrieve the key frames and shots of a video that contain a particular object.
• To do so with the ease, speed and accuracy with which Google retrieves text documents.

Page 3

Outline

• Introduction
  – Object query
  – Scene query
• Challenging problem
• Text retrieval overview
• Viewpoint invariant description
  – Building the Descriptors
  – Building the Visual Word
  – The Visual Analogy
• Visual indexing using text retrieval methods
• Experimental evaluation of scene matching using visual words
• Object retrieval
  – Stop list
  – Spatial consistency
• Summary and conclusions
• Video Google Demo

Page 4

Introduction – Object query (1/2)

Page 5

Introduction – Scene query (2/2)

Page 6

Challenging problem (1/2)

• Changes in viewpoint, illumination and partial occlusion
• Large amounts of data
• Real-world data

Page 7

Challenging problem (2/2)

Page 8

Text retrieval overview (1/2)

• The documents are parsed into words.
• Words are represented by their stems
  – e.g. ‘walk’, ‘walking’, ‘walks’ -> ‘walk’
• A stop list filters out common words (‘the’, ‘an’, …)
• The remaining words are represented as a vector, weighted by word frequency
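The parse / stem / stop-list / count steps above can be sketched as follows. This is a minimal illustration only: the suffix-stripping `stem` and the `STOP_WORDS` set are toy stand-ins, not the stemmer or stop list used in any real retrieval system.

```python
# Sketch of the text-retrieval preprocessing pipeline described above.
from collections import Counter

# Illustrative stop list (real systems use a much larger one).
STOP_WORDS = {"the", "an", "a", "of", "and"}

def stem(word):
    # Crude suffix stripping stands in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_term_vector(document):
    """Parse into words, drop stop words, stem, and count frequencies."""
    words = document.lower().split()
    stems = [stem(w) for w in words if w not in STOP_WORDS]
    return Counter(stems)  # the word-frequency vector

vec = to_term_vector("the man walks and the dog walks")
# 'walks' is stemmed to 'walk'; 'the' and 'and' are stopped
```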

Page 9

Text retrieval overview (2/2)

• An inverted file facilitates efficient retrieval.
  – An inverted file is structured like an ideal book index.
• Text is retrieved by computing its vector of word frequencies and returning the documents with the closest vectors.
• The returned documents are ranked.
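A minimal sketch of such an inverted file, with integer frame ids standing in for documents (`build_inverted_index` and `query` are illustrative names, not from the paper):

```python
# Minimal inverted file: each word maps to the set of documents
# (here, frame ids) in which it occurs -- like an ideal book index.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping doc_id -> iterable of words."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for w in words:
            index[w].add(doc_id)
    return index

def query(index, words):
    """Return the documents containing all query words."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

index = build_inverted_index({
    1: ["car", "tree"],
    2: ["car", "house"],
    3: ["tree"],
})
# query(index, ["car"]) returns {1, 2}
```

Only the postings lists of the query words are touched, which is what makes retrieval fast even over a large corpus.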

Page 10

Viewpoint invariant description (1/2)

• Two types of viewpoint covariant regions are computed for each frame:
  1. SA – Shape Adapted: regions centred on corner-like features
  2. MS – Maximally Stable: blobs of high contrast with respect to their surroundings
• Regions are computed in grayscale

Page 11

Viewpoint invariant description (2/2)

The MS regions are in yellow. The SA regions are in cyan.

Page 12

Building the Descriptors (1/2)

• SIFT – Scale Invariant Feature Transform
  – Each elliptical region is represented by a 128-dimensional vector
  – SIFT is invariant to a shift of a few pixels

Page 13

Building the Descriptors (2/2)

• Removing noise – tracking & averaging
  – Regions are tracked across a sequence of frames using a constant-velocity dynamical model
  – Any region that does not survive for more than three frames is rejected
  – The descriptors are averaged throughout the track
  – Descriptors with large covariance are rejected

Page 14

Building the Visual Word (1/2)

• Descriptors are clustered into K groups using the K-means clustering algorithm
• Each cluster represents a “visual word” in the “visual vocabulary”
• MS and SA regions are clustered separately
  – giving two different vocabularies for describing the same scene
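The clustering step can be illustrated with a toy K-means over synthetic 2-D "descriptors". This is a simplification: the paper clusters 128-dimensional SIFT descriptors (and uses a Mahalanobis distance), whereas this sketch uses plain Euclidean distance on made-up data.

```python
# Toy K-means: each resulting centroid is one "visual word",
# and the set of centroids is the "visual vocabulary".
import numpy as np

def kmeans(descriptors, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids from k randomly chosen descriptors.
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest centroid (its visual word).
        dists = np.linalg.norm(descriptors[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned descriptors.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = descriptors[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated synthetic clusters -> two visual words.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
vocab, words = kmeans(data, k=2)
```

Quantising a new descriptor to its nearest centroid is then what turns a raw region into a discrete visual word.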

Page 15

Building the Visual Word (2/2)

SA

MS

Page 16

The Visual Analogy

  Text       Visual
  ----       ------
  Word       Descriptor
  Stem       Centroid
  Document   Frame
  Corpus     Film

Page 17

Visual indexing using text retrieval methods (1/2)

• tf-idf – ‘Term Frequency – Inverse Document Frequency’
• With a vocabulary of k words, each document is represented by a k-vector
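In the standard tf-idf scheme, the weight of word i in document d is t_i = (n_id / n_d) · log(N / n_i), where n_id is the number of occurrences of word i in d, n_d the total number of words in d, n_i the number of documents containing word i, and N the number of documents. A minimal sketch of that weighting (function name and sparse-dict representation are illustrative):

```python
# tf-idf weighting: t_i = (n_id / n_d) * log(N / n_i)
import math

def tfidf_vectors(docs):
    """docs: list of word lists. Returns one sparse {word: weight} per doc."""
    N = len(docs)
    df = {}  # n_i: number of documents containing each word
    for words in docs:
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    vectors = []
    for words in docs:
        n_d = len(words)
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        vectors.append({w: (c / n_d) * math.log(N / df[w])
                        for w, c in counts.items()})
    return vectors

vecs = tfidf_vectors([["a", "a", "b"], ["b", "c"], ["c", "c"]])
# "b" occurs in 2 of the 3 documents, so its idf factor is log(3/2)
```

Words that occur in every document get weight zero, which is the mechanism by which tf-idf de-emphasises uninformative (visual) words.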

Page 18

Visual indexing using text retrieval methods (2/2)

• The query vector is given by the visual words contained in a user-specified sub-part of a frame
• The other frames are ranked according to the similarity of their weighted vectors to this query vector
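A standard choice for this similarity is the normalized scalar product (cosine of the angle) between the weighted vectors. A sketch over sparse dict vectors (function names and data are illustrative):

```python
# Rank frames by cosine similarity of their weighted word vectors
# to the query vector.
import math

def cosine(u, v):
    """Normalized scalar product of two sparse {word: weight} vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_frames(query_vec, frame_vecs):
    """frame_vecs: {frame_id: vector}. Returns frame ids, best first."""
    scores = [(cosine(query_vec, v), fid) for fid, v in frame_vecs.items()]
    return [fid for _, fid in sorted(scores, reverse=True)]

order = rank_frames({"a": 1.0, "b": 1.0},
                    {10: {"a": 1.0, "b": 1.0},
                     20: {"c": 1.0},
                     30: {"a": 1.0}})
# frame 10 shares both words with the query and ranks first
```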

Page 19

Experimental evaluation of scene matching using visual words (1/5)

• Goal
  – Evaluate the method by matching scene locations within a closed world of shots (‘ground truth set’)
• Ground truth set
  – 164 frames, from 48 shots, taken at 19 3D locations in the movie ‘Run Lola Run’ (4–9 frames from each location)
  – There are significant viewpoint changes in the frames for the same location

Page 20

Experimental evaluation of scene matching using visual words (2/5)

Page 21

Experimental evaluation of scene matching using visual words (3/5)

• The entire frame is used as the query region
• Performance is measured over all 164 frames
• The correct results were determined by hand
• Rank calculation
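The paper scores each query with the average normalized rank of the relevant frames: 0 when all relevant frames are returned first, rising toward 1 as they fall to the bottom of the list. A minimal sketch of that measure (function name is illustrative):

```python
# Average normalized rank:
#   Rank = (sum(R_i) - N_rel * (N_rel + 1) / 2) / (N_rel * N)
# where R_i are the 1-based ranks of the N_rel relevant frames
# among the N returned frames.

def normalized_rank(ranks, n_rel, n_total):
    """ranks: 1-based ranks of the relevant frames in the result list."""
    return (sum(ranks) - n_rel * (n_rel + 1) / 2) / (n_rel * n_total)

# Perfect retrieval: the 3 relevant frames occupy ranks 1..3 of 164.
score = normalized_rank([1, 2, 3], 3, 164)  # 0.0 for perfect retrieval
```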

Page 22

Experimental evaluation of scene matching using visual words (4/5)

Page 23

Experimental evaluation of scene matching using visual words (5/5)

Page 24

Object retrieval (1/7)

• Goal
  – Searching for objects throughout the entire movie
  – The object of interest is specified by the user as a sub-part of any frame

Page 25

Object retrieval – Stop list (2/7)

• Very common visual words are stopped, reducing the number of mismatches and the size of the inverted file while keeping a sufficient visual vocabulary.

Page 26

Object retrieval – Spatial Consistency (3/7)

• Objects are queried by a sub-part of the image; matched covariant regions in the retrieved frames should have a spatial arrangement similar to that of the outlined region in the query image.
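One way to implement such a check is to let each match vote for every other match whose regions lie nearby in both the query and the retrieved frame, then re-rank frames by the total votes. This is a simplification: the paper searches an area defined by the spatial nearest neighbours of each match, while this sketch uses a fixed radius, and all names here are illustrative.

```python
# Spatial-consistency votes: a match is supported when another match
# is close to it in BOTH the query image and the retrieved frame.

def consistency_votes(matches, radius=50.0):
    """matches: list of ((xq, yq), (xf, yf)) query/frame position pairs.
    Returns one vote count per match."""
    def near(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= radius ** 2
    votes = []
    for i, (qi, fi) in enumerate(matches):
        v = sum(1 for j, (qj, fj) in enumerate(matches)
                if j != i and near(qi, qj) and near(fi, fj))
        votes.append(v)
    return votes

# Two matches with a consistent layout support each other;
# the third match is spatially inconsistent and collects no votes.
matches = [((0, 0), (100, 100)),
           ((10, 0), (110, 100)),
           ((20, 0), (400, 400))]
votes = consistency_votes(matches)  # [1, 1, 0]
```

Matches with no support can then be discarded as mismatches, which both re-ranks the results and localises the object in the retrieved frame.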

Page 27

Object retrieval (4/7)

Page 28

Object retrieval (5/7)

Page 29

Object retrieval (6/7)

Page 30

Object retrieval (7/7)

Page 31

Summary and conclusions

• Visual word and vocabulary analogy
• Immediate run-time object retrieval
• Future work
  – Automatic ways of building the vocabulary are needed
• Intriguing possibilities
  – Latent semantic indexing to find content
  – Automatic clustering to find the principal objects that occur throughout the movie

Page 32

Video Google Demo

• http://www.robots.ox.ac.uk/~vgg/research/vgoogle/