Image Retrieval (Matching at Large Scale)

Transcript of Image Retrieval (Matching at Large Scale) · Video Google visual analogy

Page 1:

Image  Retrieval  (Matching  at  Large  Scale)  

   

Page 2:

• At a large scale, the problem of matching between similar images translates into the problem of retrieving similar images given a query image.

• Effective solutions to this problem require the capability of designing an indexing structure that records where to find all images in which a feature occurs (the same matched local features can be present in many images).

Image  Retrieval  (matching  at  large  scale)  

Page 3:

• In text documents, where the problem is to find all pages on which a word occurs, inverted indexes are commonly used as a solution…

• Following the analogy, visual vocabularies offer a simple but effective way to index images efficiently with an inverted file.
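As a concrete illustration, here is a minimal inverted-file sketch in Python (the dictionaries, image ids, and helper names are hypothetical, not the data structures of any particular system): each visual word maps to the set of images that contain it, so a query only has to touch images sharing at least one word with it.

```python
from collections import defaultdict

def build_inverted_index(db_words):
    """db_words: {image_id: iterable of visual-word ids occurring in that image}."""
    index = defaultdict(set)
    for image_id, words in db_words.items():
        for w in set(words):
            index[w].add(image_id)
    return index

def candidate_images(index, query_words):
    """Images sharing a word with the query, scored by how many words they share."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

index = build_inverted_index({1: [7, 91], 2: [7], 3: [1, 8]})
print(candidate_images(index, [7, 91]))   # -> [(1, 2), (2, 1)]
```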

K. Grauman, B. Leibe

Indexing  images  with  local  features  

[Figure 5.5 (Grauman & Leibe): Main idea of an inverted file index for images represented by visual words. (a) All database images are loaded into the index, mapping words to image numbers. (b) A new query image is mapped to the indices of database images that share a word.]

Retrieval via the inverted file is faster than searching every image, assuming that not all images contain every word. In practice, an image's distribution of words is indeed sparse. Since the index maintains no information about the relative spatial layout of the words per image, typically a spatial verification step is performed on the images retrieved for a given query (e.g., see [PCI+07]).

5.2.3 Image Representation with a Bag of Visual Words

As briefly mentioned above, the visual vocabulary also enables a compact summarization of all an image's words. The common text description of a "bag-of-words" can be mapped over to the visual domain: the image's empirical distribution of words is captured with a histogram counting how many times each word in the visual vocabulary occurs within it (see Figure 5.2 (d)).

What is convenient about this representation is that it translates a (usually very large) set of high-dimensional local descriptors into a single sparse vector of fixed dimensionality across all images. This in turn allows one to use many machine learning algorithms that by default assume the input space is vectorial, whether for supervised classification, feature selection, or unsupervised image clustering. Csurka et al. [CBDF04] first showed this connection for recognition by using the bag-of-words descriptors for discriminative categorization. Since then, many supervised methods exploit the bag-of-words histogram as a simple but effective representation. In fact, many of the most accurate results in recent object recognition challenges…
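As a small worked example of the bag-of-visual-words step (a minimal sketch assuming only NumPy; the sizes in the usage lines are illustrative): each local descriptor is assigned to its nearest vocabulary centre, and the image becomes a fixed-length histogram of word counts.

```python
import numpy as np

def bag_of_words(descriptors, vocabulary):
    """descriptors: (n, d) local descriptors; vocabulary: (k, d) cluster centres.
    Returns a length-k histogram of visual-word occurrences."""
    d2 = ((descriptors ** 2).sum(axis=1)[:, None]
          + (vocabulary ** 2).sum(axis=1)[None, :]
          - 2.0 * descriptors @ vocabulary.T)          # squared distances, shape (n, k)
    words = d2.argmin(axis=1)                          # nearest centre per descriptor
    return np.bincount(words, minlength=len(vocabulary)).astype(float)

# Toy usage: 1200 fake 128-D descriptors against a 1000-word vocabulary.
rng = np.random.default_rng(0)
hist = bag_of_words(rng.normal(size=(1200, 128)), rng.normal(size=(1000, 128)))
```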

Text inverted index

Page 4:

A  seminal  system:  Video  Google  

• Video Google is a seminal system [Sivic and Zisserman, ICCV 2003] that maps visual features into visual words using k-means clustering and supports effective content-based retrieval of visual data. The fundamental idea of the paper: treat each frame as a document, then try to find "visual words"…

• With image features represented as visual words, an inverted file index is used to index the visual words efficiently for content-based retrieval. Video Google retrieves key frames and shots of a video containing a particular object with the same ease, speed, and accuracy with which Google retrieves text documents (web pages) containing particular words.

                   

             

[Figure: analogy between a text inverted index and a visual one. Text side: each word ("retrieval", "vector", "feature", "book", …) maps to the document IDs (1, 2, 3, …, N) in which it occurs. Image side: each visual word number maps to a list of frame numbers.]

Page 5:

Video Google visual analogy

Text Retrieval:
1. Assume a vocabulary
2. Parse documents into words
3. Perform stemming: "walk" = { "walk", "walking", "walks", … }
4. Use a stop list to reject very common words
5. For each document, define a vector with components given by the frequency of occurrence of the words the document contains
6. Store the vector in an inverted file

The Video Google algorithm:
1. Detect affine covariant regions in each key frame of the video
2. Reject unstable regions
3. Build the visual vocabulary
4. Remove stop-listed words
5. For each image, compute the weighted document frequency based on the occurrence of the visual words
6. Build the inverted file index

The analogy (text ↔ visual):
- Word ↔ visual descriptor
- Stemming ↔ using centroids of visual feature clusters
- Text document (page) ↔ video frame
- Document corpus (book) ↔ video

Page 6:

   


The Video Google algorithm: pre-processing (off-line)

Step 1. Calculate viewpoint invariant regions and region descriptors:

- Shape Adapted (SA) region: elliptical shape adaptation about an interest point, centered on corner-like features, using the Harris-affine operator

- Maximally Stable (MS) region: MSER segmentation to extract blobs of high contrast with respect to their surroundings

Each region is represented by a 128-dimensional vector using the SIFT descriptor. A 720 × 576 pixel video frame yields ≈ 1200 regions.

Step 2. Reject unstable regions: any region that does not survive for more than 3 frames is rejected. This stability check significantly reduces the number of regions to about 600 per frame.
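A minimal sketch of the Step 1 region extraction, assuming OpenCV (cv2) is available. Stock OpenCV has no Harris-affine detector, so only the Maximally Stable half of the detector pair is shown, and the helper name is hypothetical.

```python
import cv2
import numpy as np

def ms_regions_with_sift(frame_bgr):
    """Detect MSER (Maximally Stable) regions and describe them with 128-D SIFT vectors."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    regions, _boxes = cv2.MSER_create().detectRegions(gray)
    # Turn each region into a keypoint (centre + size) so SIFT can describe it.
    keypoints = []
    for pts in regions:
        (cx, cy), radius = cv2.minEnclosingCircle(pts.astype(np.float32))
        keypoints.append(cv2.KeyPoint(float(cx), float(cy), float(2.0 * radius)))
    keypoints, descriptors = cv2.SIFT_create().compute(gray, keypoints)
    return keypoints, descriptors        # descriptors: one 128-D row per surviving region
```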

Page 7:

Video  Google  


‘Maximally Stable’ (MS) regions are shown in yellow; ‘Shape Adapted’ (SA) regions are shown in blue/cyan.

Page 8:

   

MS – yellow, SA – cyan

Zoomed  view

Page 9:

Step 3. Build the visual vocabulary: use k-means clustering to vector-quantize the descriptors into clusters. The Mahalanobis distance is used as the distance function for k-means clustering:

d(x1, x2) = sqrt( (x1 − x2)ᵀ Σ⁻¹ (x1 − x2) )

where Σ is the covariance matrix of the descriptors.

Step 4. Remove stop-listed visual words:

The  most  frequent  visual  words  that  occur  in  almost  all  images,  such  as  highlights  which  occur  in  many  frames,  are  rejected.  
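Steps 3 and 4 can be sketched as follows (a minimal sketch assuming NumPy and scikit-learn; the function names, the cluster count, and the 0.9 document-frequency threshold are illustrative, not values from the paper). A global Mahalanobis metric is emulated by whitening the descriptors with Σ^(-1/2) and then running ordinary Euclidean k-means, which is equivalent when a single covariance matrix is shared by all descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, k=10000, seed=0):
    """Step 3 sketch: k-means on whitened descriptors (global Mahalanobis metric)."""
    X = np.asarray(descriptors, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov + 1e-6 * np.eye(cov.shape[0]))
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T       # whitening = Sigma^(-1/2)
    km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit((X - mean) @ W)
    return km.cluster_centers_, mean, W                    # quantize new data as (x - mean) @ W

def stop_list(word_counts, max_df=0.9):
    """Step 4 sketch: ids of visual words occurring in more than max_df of all frames."""
    df = (np.asarray(word_counts) > 0).mean(axis=0)        # document frequency per word
    return np.flatnonzero(df > max_df)
```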

 

 

Step 5. Compute the tf-idf weighted document frequency vector: each frame d is represented by a vector of word weights with components

t_i = (n_id / n_d) · log(N / N_i)

where
n_id = number of times word i appears in document (frame) d
n_d = total number of words in document d
N = number of documents in the whole collection
N_i = number of documents in which word i appears
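A sketch of this weighting, assuming NumPy and a per-frame visual-word count matrix (names illustrative):

```python
import numpy as np

def tfidf_vectors(counts):
    """counts: (N_docs, V) matrix of visual-word counts, one row per frame.
    Returns the tf-idf weighted vectors with components (n_id / n_d) * log(N / N_i)."""
    counts = np.asarray(counts, dtype=float)
    n_d = np.maximum(counts.sum(axis=1, keepdims=True), 1)   # words per document
    tf = counts / n_d                                        # term frequency n_id / n_d
    N = counts.shape[0]                                      # number of documents
    N_i = np.maximum((counts > 0).sum(axis=0), 1)            # documents containing word i
    return tf * np.log(N / N_i)                              # one weighted vector per frame
```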

Page 10:

Vocabulary building (pipeline):
- Subset of 48 shots selected (10k frames = 10% of the movie)
- Region construction (SA + MS): 10k frames × 1200 ≈ 1.2 × 10⁶ regions
- SIFT descriptor representation
- Frame tracking and rejection of unstable regions
- Clustering of the remaining descriptors (≈ 200k regions) using k-means

Page 11:

Step 6. Build the inverted-file indexing structure. An inverted file is structured like an ideal book index: it has an entry for each word in the corpus, followed by a list of all the documents (and the position in that document) in which the word occurs.

[Figure 5.5 from Grauman & Leibe, repeated from above: an inverted file index for images represented by visual words. (a) All database images are loaded into the index, mapping words to image numbers. (b) A new query image is mapped to the indices of database images that share a word.]


Page 12:

The Video Google algorithm for content-based retrieval. Run-time (on-line):

Take a query image region:
1. Determine the set of visual words within the query region
2. Retrieve keyframes based on visual word frequencies
3. Re-rank the top keyframes using spatial consistency

Pipeline: generate query descriptors → use nearest neighbour (to the cluster centres) to build the query vector → use the inverted index to find relevant frames → calculate the distance to the relevant frames → rank the results.
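A minimal run-time sketch, assuming NumPy, a precomputed idf vector, and L2-normalised tf-idf vectors for the key frames (all names hypothetical); in the actual system only the frames returned by the inverted file would need to be scored.

```python
import numpy as np

def score_query(query_words, idf, frame_vectors):
    """query_words: visual-word ids inside the query region; idf: (V,) idf weights;
    frame_vectors: (N, V) L2-normalised tf-idf vectors of the key frames."""
    V = idf.shape[0]
    q = np.bincount(np.asarray(query_words), minlength=V).astype(float)
    q = (q / max(q.sum(), 1.0)) * idf              # tf-idf weighting of the query vector
    q /= max(np.linalg.norm(q), 1e-12)             # normalise so the dot product is cosine
    scores = frame_vectors @ q                     # one similarity score per key frame
    return np.argsort(-scores)                     # key frames ranked best-first
```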

Page 13:

Spatial consistency:
- Matched covariant regions in the retrieved frames should have a similar spatial arrangement to those of the outlined region in the query image
- To verify a pair of matching regions (A, B), a circular search area is defined by the k (5 in the figure) spatial nearest neighbours in both frames
- Each match that lies within the search areas in both frames casts a vote in support of the match (in the example, three supporting matches are found)
- Matches with no support are rejected
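The voting scheme above can be sketched as follows (NumPy assumed; the point arrays and helper name are hypothetical). Each tentative match is kept only if at least one other match falls within its k spatial nearest neighbours in both frames.

```python
import numpy as np

def spatially_consistent(pts_a, pts_b, k=5):
    """pts_a, pts_b: (M, 2) positions of the M tentatively matched regions in the
    query frame and in the retrieved frame (row i of each array is one match)."""
    pts_a, pts_b = np.asarray(pts_a, float), np.asarray(pts_b, float)
    def knn(pts):
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)                # a match cannot support itself
        return np.argsort(d, axis=1)[:, :k]        # indices of the k nearest neighbours
    na, nb = knn(pts_a), knn(pts_b)
    votes = np.array([len(set(na[i]) & set(nb[i])) for i in range(len(pts_a))])
    return votes > 0                               # keep only matches with some support
```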

Page 14:

   How  Video  Google  works  

Query  region  and  its  close-­‐up.  

Page 15:

   

Original  matches  based  on  visual  words  

Page 16:

   

Original  matches  based  on  visual  words  

Page 17:

   

   

Matches after using the stop-list

Page 18:

   

   

Final set of matches after filtering on spatial consistency

Page 19:

Video  Google  


Page 20:

Video  Google  


Page 21:

Video  Google  Performance  Analysis

• Q – number of query descriptors (~10²)
• M – number of descriptors per frame (~10³)
• N – number of key frames per movie (~10⁴)
• D – descriptor dimension (128 ≈ 10²)
• K – number of "words" in the vocabulary (16 × 10³ ≈ 10⁴)
• α – ratio of documents that contain at least one of the Q "words" (~0.1)

• Computational cost:
  - Nearest neighbour matching: Q·M·N·D ≈ 10¹¹
  - Video Google: query vector quantization + distance computation = Q·K·D + Q·(α·N) ≈ 10⁸ + 10⁵

• Improvement factor ≈ 10³ to 10⁶ (≈ 10³ when the query quantization cost is included, up to ≈ 10⁶ for the inverted-file scoring alone)
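A quick sanity check of these order-of-magnitude figures, using the symbol values quoted above (plain Python, purely illustrative):

```python
Q, M, N, D, K, alpha = 1e2, 1e3, 1e4, 1.28e2, 1.6e4, 0.1

nn_cost = Q * M * N * D                 # exhaustive nearest-neighbour matching
vg_cost = Q * K * D + Q * alpha * N     # query quantization + inverted-file scoring
print(f"NN ~ {nn_cost:.0e}, Video Google ~ {vg_cost:.0e}, speed-up ~ {nn_cost / vg_cost:.0e}")
```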