Efficient Visual Search for Objects in Videos


Transcript of Efficient Visual Search for Objects in Videos

Page 1: Efficient Visual Search for Objects in Videos

JOSEF SIVIC AND ANDREW ZISSERMAN

PRESENTERS: ILGE AKKAYA & JEANNETTE CHANG

MARCH 1, 2011

Efficient Visual Search for Objects in Videos

Page 2: Efficient Visual Search for Objects in Videos

Introduction

Text Query → Results: Documents

Image Query → Results: Frames

Generalize text retrieval methods to non-textual information

Page 3: Efficient Visual Search for Objects in Videos

State-of-the-Art Before This Paper

Text-based search for images (Google Images)

Object recognition:

Barnard et al. (2003): "Matching words and pictures"
Sivic et al. (2005): "Discovering objects and their location in images"
Sudderth et al. (2005): "Learning hierarchical models of scenes, objects, and parts"

Scene classification:

Fei-Fei and Perona (2005): "A Bayesian hierarchical model for learning natural scene categories"
Quelhas et al. (2005): "Modeling scenes with local descriptors and latent aspects"
Lazebnik et al. (2006): "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories"

Page 4: Efficient Visual Search for Objects in Videos

Introduction (cont.)

Retrieve specific objects vs. categories of objects/scenes (a "Camry" logo vs. cars in general)

Employ text retrieval techniques for visual search, with images as both queries and results

Why a text retrieval approach?

Matches are essentially precomputed, so there is no delay at run time.

Any object in the video can be retrieved without modifying the descriptors originally built for the video.

Page 5: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 6: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 7: Efficient Visual Search for Objects in Videos

Pre-Processing (Offline)

1. For each frame, detect affine covariant regions.

2. Track the regions through the video and reject unstable regions.

3. Build the visual vocabulary.

4. Remove stop-listed visual words.

5. Compute tf-idf weighted document frequency vectors.

6. Build the inverted file-indexing structure.

Page 8: Efficient Visual Search for Objects in Videos

Detection of Affine Covariant Regions

Typically ~1200 regions per frame (720×576).

Regions are elliptical.

Each region is represented by a 128-dimensional SIFT descriptor.

SIFT features provide invariance to affine transformations.
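To make this step concrete, here is a rough sketch using OpenCV, whose MSER detector plays the role of the Maximally Stable regions (introduced on the next slide) and whose SIFT implementation supplies the 128-dimensional descriptors. This is an illustrative stand-in for the paper's detector pair, not the authors' implementation, and it omits the Shape-Adapted regions:

```python
# Illustrative sketch only: OpenCV's MSER stands in for the MS detector,
# and SIFT computes a 128-dimensional descriptor per detected region.
import cv2

def detect_regions(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()          # MS-style stable blob detector
    sift = cv2.SIFT_create()          # 128-dim SIFT descriptors
    keypoints = mser.detect(gray)     # candidate stable regions
    keypoints, descriptors = sift.compute(gray, keypoints)
    return keypoints, descriptors     # descriptors: (n_regions, 128)
```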

Page 9: Efficient Visual Search for Objects in Videos

Two types of affine covariant regions:

1. Shape-Adapted (SA): proposed by Mikolajczyk et al.; elliptical shape adaptation about a Harris interest point; often centered on corner-like features.

2. Maximally Stable (MS): proposed by Matas et al.; intensity watershed image segmentation; high-contrast blobs.

Page 10: Efficient Visual Search for Objects in Videos

Pre-Processing (Offline)

1. For each frame, detect affine covariant regions.

2. Track the regions through the video and reject unstable regions.

3. Build the visual vocabulary.

4. Remove stop-listed visual words.

5. Compute tf-idf weighted document frequency vectors.

6. Build the inverted file-indexing structure.

Page 11: Efficient Visual Search for Objects in Videos

Tracking Regions Through the Video and Rejecting Unstable Regions

Any region that does not survive for 3+ frames is rejected; such short-lived regions are unlikely to be interesting.

This reduces the number of regions per frame by approximately 50% (to ~600 per frame).
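A minimal sketch of this stability filter, assuming region tracks have already been linked across frames; the `tracks` structure below is a hypothetical dict from track id to the frame indices where the region was detected:

```python
# Keep only region tracks that survive 3 or more frames.
MIN_TRACK_LENGTH = 3

def reject_unstable(tracks):
    """tracks: hypothetical dict {track_id: [frame indices]}."""
    return {tid: frames for tid, frames in tracks.items()
            if len(frames) >= MIN_TRACK_LENGTH}
```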

Page 12: Efficient Visual Search for Objects in Videos

Pre-Processing (Offline)

1. For each frame, detect affine covariant regions.

2. Track the regions through the video and reject unstable regions.

3. Build the visual vocabulary.

4. Remove stop-listed visual words.

5. Compute tf-idf weighted document frequency vectors.

6. Build the inverted file-indexing structure.

Page 13: Efficient Visual Search for Objects in Videos

Visual Indexing Using Text-Retrieval Methods

TEXT: Represent words by their stems ('write', 'writing', and 'written' all map to 'write').
IMAGE: Cluster similar regions into 'visual words'.

TEXT: Stop-list common words ('a', 'an', 'the').
IMAGE: Stop-list common visual words.

TEXT: Rank search results by how close together the query words occur within the retrieved document.
IMAGE: Use spatial information to check retrieval consistency.

Page 14: Efficient Visual Search for Objects in Videos

Visual Vocabulary

Purpose: cluster regions from multiple frames into fewer groups called 'visual words'.

Each descriptor is a 128-vector.

K-means clustering is used (details on the next slides).

~300K descriptors (600 regions/frame × ~500 frames) are mapped into 16K visual words (6,000 SA and 10,000 MS clusters).

Page 15: Efficient Visual Search for Objects in Videos

K-Means Clustering

Purpose: Cluster N data points (SIFT descriptors) into K clusters (visual words)

K = desired number of cluster centers (mean points)

Step 1: Randomly guess K mean points

Page 16: Efficient Visual Search for Objects in Videos

Step 2: Calculate nearest mean point to assign each data point to a cluster center

In this paper, the Mahalanobis distance is used to determine the nearest cluster center:

$d(x_1, x_2) = \sqrt{(x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2)}$

where $\Sigma$ is the covariance matrix computed over all descriptors, $x_2$ is the 128-dimensional mean (cluster center) vector, and the $x_1$'s are the descriptor vectors (i.e., the data points).

Page 17: Efficient Visual Search for Objects in Videos

Step 3: Recalculate cluster centers and distances, repeat until stationarity
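The three steps can be sketched compactly. Since the Mahalanobis distance under a fixed covariance Σ equals the Euclidean distance after whitening the data by Σ^(-1/2), the sketch below whitens the descriptors once and then runs plain k-means; it is written for small inputs, not for the paper's ~300K descriptors:

```python
import numpy as np

def kmeans_mahalanobis(descriptors, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Whiten: Mahalanobis distance under Sigma == Euclidean distance
    # after multiplying the data by Sigma^(-1/2).
    sigma = np.cov(descriptors, rowvar=False)
    vals, vecs = np.linalg.eigh(sigma)
    whiten = vecs @ np.diag((vals + 1e-8) ** -0.5) @ vecs.T
    x = descriptors @ whiten

    # Step 1: randomly choose k initial mean points.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to its nearest cluster center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the centers; stop at stationarity.
        new_centers = np.array([x[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```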

Page 18: Efficient Visual Search for Objects in Videos

Examples of Clusters of Regions

Samples of normalized affine covariant regions

Page 19: Efficient Visual Search for Objects in Videos

Pre-Processing (Offline)

1. For each frame, detect affine covariant regions.

2. Track the regions through the video and reject unstable regions.

3. Build the visual vocabulary.

4. Remove stop-listed visual words.

5. Compute tf-idf weighted document frequency vectors.

6. Build the inverted file-indexing structure.

Page 20: Efficient Visual Search for Objects in Videos

Remove Stop-Listed Words

An analogy to text retrieval: 'a', 'and', 'the', ... are not distinctive words, and common words cause mismatches.

The 5-10% most common visual words are stopped (800-1600 of the 16,000 words).

(Upper row) Matches before stop-listing. (Lower row) Matches after stop-listing.
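A minimal sketch of stop-list construction, assuming each keyframe has already been reduced to a list of visual-word ids:

```python
from collections import Counter

def build_stoplist(frames, fraction=0.05):
    """frames: list of lists of visual-word ids (one list per keyframe).
    Returns the most frequent 5% (by default) of words, to be discarded;
    at fraction=0.05 with a 16,000-word vocabulary this stops ~800 words."""
    counts = Counter(w for frame in frames for w in frame)
    n_stop = int(len(counts) * fraction)
    return {w for w, _ in counts.most_common(n_stop)}

def apply_stoplist(frames, stoplist):
    return [[w for w in frame if w not in stoplist] for frame in frames]
```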

Page 21: Efficient Visual Search for Objects in Videos

Pre-Processing (Offline)

1. For each frame, detect affine covariant regions.

2. Track the regions through the video and reject unstable regions.

3. Build the visual vocabulary.

4. Remove stop-listed visual words.

5. Compute tf-idf weighted document frequency vectors.

6. Build the inverted file-indexing structure.

Page 22: Efficient Visual Search for Objects in Videos

tf-idf Weighting (term frequency-inverse document frequency weighting)

$t_i = \frac{n_{id}}{n_d} \log \frac{N}{N_i}$

where:
$n_{id}$ = number of occurrences of (visual) word $i$ in document (frame) $d$
$n_d$ = total number of words in document $d$
$N_i$ = number of documents containing word $i$
$N$ = total number of documents in the database
$t_i$ = weighted word frequency

Page 23: Efficient Visual Search for Objects in Videos

Each document (frame) $d$ is represented by the vector

$v_d = (t_1, \ldots, t_v)^T$

where $v$ is the number of visual words in the vocabulary, and $v_d$ is the tf-idf vector of the particular frame $d$.
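Putting the last two slides together, a minimal sketch of computing one tf-idf vector per keyframe, following the slide's definitions (frames are again lists of visual-word ids, after stop-listing):

```python
import math
from collections import Counter

def tfidf_vectors(frames, vocab_size):
    n_docs = len(frames)                        # N
    doc_freq = Counter()                        # N_i for each word i
    for frame in frames:
        doc_freq.update(set(frame))
    vectors = []
    for frame in frames:
        n_d = len(frame)                        # total words in frame d
        v_d = [0.0] * vocab_size
        if n_d:                                 # guard against empty frames
            for word, n_id in Counter(frame).items():
                v_d[word] = (n_id / n_d) * math.log(n_docs / doc_freq[word])
        vectors.append(v_d)
    return vectors                              # one tf-idf vector per frame
```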

Page 24: Efficient Visual Search for Objects in Videos

Inverted File Indexing

Visual Word Index | Found in Frames
1                 | 1, 4, 5
2                 | 1, 2, 10
...               | ...
N                 | ...
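A minimal sketch of the inverted file: a map from each visual word to the frames that contain it, so that a query touches only its candidate frames instead of scanning the whole database:

```python
from collections import defaultdict

def build_inverted_index(frames):
    """frames: list of lists of visual-word ids; returns word -> frame ids."""
    index = defaultdict(set)
    for frame_id, words in enumerate(frames):
        for w in words:
            index[w].add(frame_id)
    return {w: sorted(ids) for w, ids in index.items()}

# Usage: the candidate set for a query is the union of the posting
# lists of the query's visual words.
# candidates = set().union(*(index.get(w, []) for w in query_words))
```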

Page 25: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 26: Efficient Visual Search for Objects in Videos

Real-Time Query

1. Determine the set of visual words found within the query region.

2. Retrieve keyframes ranked by visual word frequencies (a short list of N_s = 500).

3. Re-rank the retrieved keyframes using spatial consistency.

Page 27: Efficient Visual Search for Objects in Videos

Retrieving Keyframes Based on Visual Word Frequencies

$v_q$, the vector of visual word frequencies corresponding to the query region, is computed.

The normalized scalar product of $v_q$ with each frame's $v_d$ is then computed:

$\text{sim}(v_q, v_d) = \frac{v_q \cdot v_d}{\|v_q\| \, \|v_d\|}$
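A minimal sketch of this ranking step, reusing the per-frame tf-idf vectors computed offline; `n_s` mirrors the short-list size N_s = 500 from the previous slide:

```python
import numpy as np

def rank_keyframes(v_q, frame_vectors, n_s=500):
    """v_q: tf-idf vector of the query region; frame_vectors: one tf-idf
    vector per keyframe. Returns the top n_s (frame_id, score) pairs."""
    v_q = np.asarray(v_q, dtype=float)
    q_norm = np.linalg.norm(v_q)
    scores = []
    for frame_id, v_d in enumerate(frame_vectors):
        v_d = np.asarray(v_d, dtype=float)
        denom = q_norm * np.linalg.norm(v_d)
        # Normalized scalar product (cosine similarity).
        scores.append((frame_id, v_q @ v_d / denom if denom else 0.0))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n_s]
```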

Page 28: Efficient Visual Search for Objects in Videos

Spatial Consistency Voting

Analogy: Google's text document retrieval, which ranks documents by how close together the query words occur.

Matched covariant regions in the retrieved frames should have a similar spatial arrangement.

Search area: the 15 nearest spatial neighbors of each match.

Each neighboring region that also matches in the retrieved frame casts a vote for that frame.

Page 29: Efficient Visual Search for Objects in Videos

Spatial Consistency Voting

For a matched pair of words (A, B), each region in the defined search area in both frames casts a vote for the match (A, B).

(Upper row) Matches after stop-listing. (Lower row) Remaining matches after spatial consistency voting.
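A minimal sketch of the voting scheme, assuming region centers and the tentative word matches are given; it also assumes each query region matches at most one region in the retrieved frame:

```python
import numpy as np

def _nearest(points, idx, k):
    # Indices of the k spatially nearest regions to region `idx`.
    d = np.linalg.norm(points - points[idx], axis=1)
    return set(np.argsort(d)[1:k + 1])        # skip the region itself

def spatial_score(matches, query_pts, frame_pts, k=15):
    """matches: (query_region, frame_region) index pairs; *_pts: (n, 2)
    arrays of region centers. Returns the total vote count for the frame."""
    match_map = dict(matches)                  # query region -> frame region
    votes = 0
    for qi, fi in matches:
        q_nn = _nearest(query_pts, qi, k)      # search area in the query
        f_nn = _nearest(frame_pts, fi, k)      # search area in the frame
        # A neighbor votes if it is itself matched and its match lies
        # in the corresponding search area of the retrieved frame.
        votes += sum(1 for qn in q_nn
                     if qn in match_map and match_map[qn] in f_nn)
    return votes
```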

Page 30: Efficient Visual Search for Objects in Videos

Query frame and sample retrieved frame, shown in eight panels:

1: Query region
2: Close-up version of 1
3-4: Initial matches
5-6: Matches after stop-listing
7-8: Matches after spatial consistency voting

Page 31: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 32: Efficient Visual Search for Objects in Videos

Implementation Details

Offline processing:

A typical feature-length film has 100-150K frames, refined to 4,000-6,000 keyframes.

Descriptors are computed for the stable regions in each keyframe.

Each region is assigned to a visual word.

The visual words over all keyframes are assembled into an inverted file structure.

Page 33: Efficient Visual Search for Objects in Videos

Algorithm Implementation

Real-time process:

The user selects a query region.

The visual words within the query region are identified.

A short list of N_s = 500 keyframes is retrieved based on tf-idf vector similarity.

Similarity is then recomputed using spatial consistency voting.

Page 34: Efficient Visual Search for Objects in Videos

Example Visual Search

Page 35: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 36: Efficient Visual Search for Objects in Videos

Retrieval Examples

Query Image

A Few Retrieved Matches

Page 37: Efficient Visual Search for Objects in Videos

Retrieval Examples (cont.)

Query Image

A Few Retrieved Matches

Page 38: Efficient Visual Search for Objects in Videos

Performance of the Algorithm

Six object queries were tried:

(1) Red clock
(2) Black clock
(3) "Frame's" sign
(4) Digital clock
(5) "Phil" sign
(6) Microphone

Page 39: Efficient Visual Search for Objects in Videos

Performance of the Algorithm (cont.)

Evaluated on the level of shots rather than keyframes

Measured using precision-recall plots

Precision is a measure of fidelity or exactness (the fraction of retrieved shots that are relevant); recall is a measure of completeness (the fraction of relevant shots that are retrieved).

Page 40: Efficient Visual Search for Objects in Videos

Performance of the Algorithm (cont.)

Ideally, precision = 1 for all recall values. The precision-recall curve is summarized by the Average Precision (AP); ideally, AP = 1.
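A minimal sketch of computing AP from a ranked shot list, where `relevant` is a hypothetical set of ground-truth shot ids for the query object:

```python
def average_precision(ranked_shots, relevant):
    """AP = mean of the precision values at each rank where a relevant
    shot is retrieved; AP = 1 iff all relevant shots are ranked first."""
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0
```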

Page 41: Efficient Visual Search for Objects in Videos

Examples of Missed Shots

Extreme viewing angles

(Left) Original query object. (Right) Low-ranked shot.

Page 42: Efficient Visual Search for Objects in Videos

Examples of Missed Shots (cont.)

Significant changes in scale and motion blurring

(Left) Original query object. (Right) Low-ranked shot.

Page 43: Efficient Visual Search for Objects in Videos

Qualitative Assessment of Performance

General trends:

Higher precision at low recall levels.
Bias towards lightly textured regions detectable by the SA/MS detectors.

These challenges could be addressed by adding further types of covariant regions.

Other difficulties:

Textureless regions (e.g., a mug).
Thin or wiry objects (e.g., a bike).
Highly deformable objects (e.g., clothing).

Page 44: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 45: Efficient Visual Search for Objects in Videos

Quality of Individual Visual Words

Using a single visual word as the query tests the expressiveness of the visual vocabulary.

Sample query procedure:

Given an object of interest, select one of the visual words from that object.
Retrieve all frames that contain the visual word (no ranking).
A retrieval is considered correct if the frame contains the object of interest.

Page 46: Efficient Visual Search for Objects in Videos

Examples of Individual Visual Words

Top row: Scale-normalized close-ups of elliptical regions overlaid on query image

Bottom row: Corresponding normalized regions

Page 47: Efficient Visual Search for Objects in Videos

Results of Individual Word Searches

Individual words are "noisy": intuitively, this is because a word occurs in multiple objects, and no single word covers all occurrences of an object.

Page 48: Efficient Visual Search for Objects in Videos

Quality of Individual Visual Words

Unrealistic: require each word to occur on only one object (high precision); a growing number of objects would then require a growing number of words.

Realistic: visual words are shared across objects, with each object represented by a combination of words.

Page 49: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 50: Efficient Visual Search for Objects in Videos

Searching for Objects From Outside of the Movie

External query images from the Internet were used, and all occurrences of the external queries in the movies were manually labeled.

Results:

External Query Image | No. of Occurrences | Rankings of Retrieved Occurrences | AP (Average Precision)
Sony logo            | 3                  | 1st, 4th, 35th                    | 0.53
Hollywood sign       | 1                  | 1st                               | 1
Notre Dame           | 1                  | 1st                               | 1

Page 51: Efficient Visual Search for Objects in Videos

Sample External Query Results

Potential Applications

Page 52: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 53: Efficient Visual Search for Objects in Videos

Challenge I: Visual Vocabularies for Very Large Scale Retrieval

Current progress: a feature-length movie of ~150,000 frames is reduced to ~6,000 keyframes and then processed.

Ultimate goal: indexing billions of online images to build a visual search engine

Page 54: Efficient Visual Search for Objects in Videos

Should the vocabulary increase in size as the image archive grows?

How discriminative should the words be?

Do visual words learned from one movie generalize to an outside database of images?

Learning a universal visual vocabulary remains a challenge.

(a), (c): External images downloaded from the Internet.

(b): Correct retrieval of a frame from the movie 'Pretty Woman'.

(d): Correct retrieval from the movie 'Charade'.

Page 55: Efficient Visual Search for Objects in Videos

Challenge II: Retrieval of 3D Objects

The current algorithm detects objects successfully despite slight changes in viewpoint, illumination, and partial occlusion, thanks to its SIFT features.

However, retrieval of full 3D objects is a fundamentally harder challenge.

Page 56: Efficient Visual Search for Objects in Videos

Proposed approach 1: automatic association of images using temporal information.

Example: grouping the front, side, and back views of a car in a video. Grouping is possible on the query side and/or the database side.

Query-side matching: associated query frames are computed and used for the 3D object search.

(Figure) Query-side matching of associated frames.

Page 57: Efficient Visual Search for Objects in Videos

Proposed approach 1 (cont.)

Grouping on the database side: a query on a single aspect is expected to retrieve pre-grouped frames associated with the 3D object.

(Top row) Query image. (Bottom rows) Matching frames.

Page 58: Efficient Visual Search for Objects in Videos

Proposed approach 2: building an explicit 3D model for each 3D object in the video.

The focus is more on model building than on detection.

Only rigid objects are considered.

Page 59: Efficient Visual Search for Objects in Videos

Challenge III: Verification using Spatial Structure

Spatial consistency was helpful, but could be improved

A few suggestions:

Use caution with measures that assume rigid geometry.
Reduce cost using a hierarchical approach.

Two complementary methods:

Ferrari et al. (2004): matching deformable objects.
Rothganger et al. (2003): matching 3D objects.

Page 60: Efficient Visual Search for Objects in Videos

Verification Using Spatial Structure (cont.)

Method 1 (Ferrari):

Based on the spatial overlap of local regions.
Requires regions to match individually and the pattern of intersections between neighboring regions to be preserved.

Performance:

Pro: works well with deformations.
Con: computationally expensive.

Page 61: Efficient Visual Search for Objects in Videos

Verification Using Spatial Structure (cont.)

Method 2 (Rothganger):

Based on a 3D object model.
Requires consistency of the local appearance descriptors and geometric consistency.

Performance:

Pro: the object can be matched in diverse (even novel) poses.
Con: the 3D model is built offline and requires up to 20 images of the object taken from different viewpoints.

Page 62: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 63: Efficient Visual Search for Objects in Videos

Conclusion

Demonstrated a scalable object-retrieval architecture that uses:

A visual vocabulary based on vector-quantized, viewpoint-invariant descriptors.
Efficient indexing techniques from text retrieval.

A few notable differences between document and image bag-of-words retrieval:

Spatial information.
The number of "words" in a query.
Matching requirements.

Page 64: Efficient Visual Search for Objects in Videos

Looking forward…

TinEye (May 2008): an image-based search engine. Given a query image, it searches for altered versions of that image (e.g., scaled or cropped). 1.86 billion images indexed.

Google Goggles (2009): take a photo with a phone and get results from the Internet. Limited to certain categories.

Page 65: Efficient Visual Search for Objects in Videos

Overview of the Talk

Visual Search Algorithm: Offline Pre-Processing; Real-Time Query; A Few Implementation Details

Performance: General Results; Testing Individual Words; Using External Images as Queries

A Few Challenges and Future Directions

Concluding Remarks

Demo of the Algorithm

Page 67: Efficient Visual Search for Objects in Videos

Main References

D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

J. Sivic and A. Zisserman. Efficient visual search for objects in videos. Proc. IEEE, 96(4):548–566, 2008.

W. Qian. "Video Google: A Text Retrieval Approach to Object Matching in Videos." www.mriedel.ece.umn.edu/wiki/index.php/Weikang_Qian