Image Retrieval (Matching at Large Scale)
• At a large scale, the problem of matching between similar images translates into the problem of retrieving similar images given a query image.
• Effective solutions to this problem require an indexing structure that records where to find all images in which a feature occurs (the same matched local features can be present in many images).
• In text documents, where the problem is to find all pages on which a word occurs, inverted indexes are commonly used as a solution.
• Following the analogy, visual vocabularies offer a simple but effective way to index images efficiently with an inverted file.
K. Grauman, B. Leibe
Indexing images with local features
[Figure 5.5 (from Chapter 5, "Indexing and Visual Vocabularies"): Main idea of an inverted file index for images represented by visual words. (a) All database images are loaded into the index, mapping words to image numbers. (b) A new query image is mapped to indices of database images that share a word.]
Retrieval via the inverted file is faster than searching every image, assuming that not all images contain every word. In practice, an image's distribution of words is indeed sparse. Since the index maintains no information about the relative spatial layout of the words per image, typically a spatial verification step is performed on the images retrieved for a given query (e.g., see [PCI+07]).
5.2.3 Image Representation with a Bag of Visual Words
As briefly mentioned above, the visual vocabulary also enables a compact summarization of all an image's words. The common text description of a "bag-of-words" can be mapped over to the visual domain: the image's empirical distribution of words is captured with a histogram counting how many times each word in the visual vocabulary occurs within it (see Figure 5.2 (d)).
What is convenient about this representation is that it translates a (usually very large) set of high-dimensional local descriptors into a single sparse vector of fixed dimensionality across all images. This in turn allows one to use many machine learning algorithms that by default assume the input space is vectorial, whether for supervised classification, feature selection, or unsupervised image clustering. Csurka et al. [CBDF04] first showed this connection for recognition by using the bag-of-words descriptors for discriminative categorization. Since then, many supervised methods exploit the bag-of-words histogram as a simple but effective representation. In fact, many of the most accurate results in recent object recognition challenges …
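The histogram described above can be sketched in a few lines. This is only an illustration: the function name, the toy 2-D data, and the use of plain Euclidean distance for assigning descriptors to words are assumptions, not something the excerpt specifies.

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocabulary):
    """Count how often each visual word occurs among an image's descriptors."""
    # Squared Euclidean distance from every descriptor to every word centre
    # (assumption: the excerpt does not fix the distance used for assignment).
    diffs = descriptors[:, None, :] - vocabulary[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Each descriptor is assigned to its nearest visual word.
    word_ids = np.argmin(sq_dists, axis=1)
    # Fixed-length histogram: one bin per word in the vocabulary.
    return np.bincount(word_ids, minlength=len(vocabulary))

# Toy data: a 3-word vocabulary in a 2-D descriptor space (real SIFT is 128-D).
vocabulary = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.array([[0.5, 0.2], [9.0, 1.0], [9.5, -0.5], [0.1, 0.3]])
histogram = bag_of_words_histogram(descriptors, vocabulary)
print(histogram)
```

Whatever the number of descriptors in an image, the result always has one entry per vocabulary word, which is what makes it usable as input to vector-based learning methods.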
Text inverted index
A seminal system: Video Google
• Video Google is a seminal system [Sivic and Zisserman ICCV 2003] that maps visual features into visual words using k-means clustering, and supports effective retrieval by content of visual data. Fundamental idea of the paper: treat each frame as a document, then try to find "visual words".
• With image features represented as visual words, an inverted file index is used to efficiently index visual words for content-based retrieval. Video Google retrieves key frames and shots of a video containing a particular object with the same ease, speed, and accuracy with which Google retrieves text documents (web pages) containing particular words.
[Figure: for text, an inverted index maps each word (retrieval, vector, feature, …, book) to a list of document IDs (1, 2, 3, …, N). For images, each word number maps to a list of frame numbers.]
Video Google visual analogy
Text Retrieval
1. Assume a vocabulary
2. Parse documents into words
3. Perform stemming: "walk" = { "walk", "walking", "walks", … }
4. Stop list to reject very common words
5. For each document define a vector with components given by the frequency of occurrence of the words the document contains
6. Store vector in an inverted file
The Video Google algorithm
1. Detect affine covariant regions in each key frame of video
2. Reject unstable regions
3. Build visual vocabulary
4. Remove stop listed words
5. For each image compute weighted document frequency based on the occurrence of the visual words
6. Build the inverted file index
Text retrieval → Video Google
• Word → Visual descriptor
• Stemming → Using centroids of visual feature clusters
• Text document (page) → Video frame
• Document corpus (book) → Video
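The text-retrieval steps listed above (parse, stem, stop list, frequency vector) can be sketched as follows. The stop list and the suffix-stripping stemmer are deliberately crude, hypothetical stand-ins chosen for illustration; a real system would use a proper stemmer and a corpus-derived stop list.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of"}   # hypothetical stop list
SUFFIXES = ("ing", "s")                  # hypothetical, very crude stemmer

def stem(word):
    """Strip a known suffix so that walk / walks / walking share one stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_vector(text):
    """Parse, stem, drop stop-listed words, and count word occurrences."""
    words = text.lower().split()                        # step 2: parse
    stems = [stem(w) for w in words]                    # step 3: stemming
    kept = [s for s in stems if s not in STOP_WORDS]    # step 4: stop list
    return Counter(kept)                                # step 5: frequencies

vector = document_vector("The dog walks and the dog is walking")
print(vector)
```

In the visual analogy, quantizing a descriptor to its nearest cluster centroid plays the role of `stem`, and a video frame plays the role of the document.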
The Video Google algorithm
Pre-processing (off-line):
Step 1. Calculate viewpoint invariant regions and region descriptors:
- Shape Adapted (SA) region: elliptical shape adaptation about an interest point, centered on corner-like features, using the Harris-affine operator
- Maximally Stable (MS) region: MSER segmentation to extract blobs of high contrast with respect to their surroundings
Each region is represented by a 128-dimensional vector using the SIFT descriptor. A 720 × 576 pixel video frame yields ≈ 1200 regions.
Step 2. Reject unstable regions: any region that does not survive for more than 3 frames is rejected. This stability check reduces the number of regions to about 600 per frame.
[Figure: example video frame with detected regions and a zoomed view. 'Maximally Stable' (MS) regions are in yellow; 'Shape Adapted' (SA) regions are in blue/cyan.]
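Step 3. Build Visual Vocabulary: use k-means clustering to vector quantize descriptors into clusters. Mahalanobis distance is used as the distance function for k-means clustering:
$$d(x_1, x_2) = \sqrt{(x_1 - x_2)^\top \Sigma^{-1} (x_1 - x_2)}$$
where $\Sigma$ is the covariance matrix of the descriptors.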
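One way to sketch Step 3 is to whiten the descriptors first, because the Mahalanobis distance under a fixed covariance matrix equals the Euclidean distance after whitening, and then to run ordinary k-means. The sketch below assumes a single covariance matrix estimated from all descriptors and uses a plain Lloyd's iteration with random initialization; the function names, toy data, and parameters are illustrative, not the authors' implementation.

```python
import numpy as np

def whiten(descriptors):
    """Map descriptors so Euclidean distance equals Mahalanobis distance."""
    cov = np.cov(descriptors, rowvar=False)
    # Sigma = L L^T, so (x-y)^T Sigma^-1 (x-y) = ||L^-1 (x-y)||^2.
    chol = np.linalg.cholesky(cov)
    return np.linalg.solve(chol, descriptors.T).T

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
    return np.argmin(dists, axis=1)

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:        # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(points, centroids)

# Toy correlated 4-D descriptors (real SIFT descriptors are 128-D).
rng = np.random.default_rng(1)
mixing = np.array([[2.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0, 0.0],
                   [0.0, 0.0, 1.0, 0.5]])
descriptors = rng.normal(size=(200, 4)) @ mixing
whitened = whiten(descriptors)
centroids, labels = kmeans(whitened, k=8)   # centroids act as visual words
```

The centroids live in the whitened space, so a new descriptor must be whitened with the same Cholesky factor before it is assigned to a word.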
Step 4. Remove stop-listed visual words: the most frequent visual words that occur in almost all images, such as highlights which occur in many frames, are rejected.
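A stop list of this kind can be sketched by counting in how many frames each visual word occurs and dropping the most widespread ones. The `top_fraction` parameter, the rule of always stopping at least one word, and the toy word IDs are assumptions made for illustration.

```python
from collections import Counter

def build_stop_list(frames, top_fraction=0.05):
    """Return the set of visual words that occur in the most frames."""
    doc_freq = Counter()
    for words in frames:
        doc_freq.update(set(words))     # count each word once per frame
    # Assumption: always stop at least one word, even for tiny vocabularies.
    n_stop = max(1, int(len(doc_freq) * top_fraction))
    return {word for word, _ in doc_freq.most_common(n_stop)}

def apply_stop_list(frames, stop_list):
    """Remove stop-listed words from every frame's word list."""
    return [[w for w in words if w not in stop_list] for words in frames]

# Toy frames, each given as a list of visual word IDs.
frames = [[7, 7, 23, 1], [7, 62, 1], [7, 76, 91], [7, 8, 76]]
stop_list = build_stop_list(frames, top_fraction=0.2)
filtered = apply_stop_list(frames, stop_list)
print(stop_list, filtered)
```

Here word 7 appears in every frame, so it carries no discriminative information and is the one removed.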
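Step 5. Compute tf-idf weighted document frequency vector:
$$t_i = \frac{n_{id}}{n_d} \log \frac{N}{N_i}$$
where
$n_{id}$ = number of times term i appears in document d
$n_d$ = number of terms in document d
$N$ = number of documents in the collection
$N_i$ = number of documents in which term i appears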
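With the quantities defined above, the standard tf-idf weight is t_i = (n_id / n_d) · log(N / N_i). A minimal sketch, assuming every document is a non-empty list of visual word IDs and using sparse dictionaries for the vectors (both choices are illustrative):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict word -> weight) per document."""
    n_docs = len(docs)                          # N
    doc_freq = Counter()                        # N_i for every word i
    for words in docs:
        doc_freq.update(set(words))
    vectors = []
    for words in docs:
        counts = Counter(words)                 # n_id for every word i
        n_d = len(words)                        # assumed to be non-zero
        vectors.append({
            word: (n_id / n_d) * math.log(n_docs / doc_freq[word])
            for word, n_id in counts.items()
        })
    return vectors

docs = [[7, 7, 23], [7, 62], [76, 91, 76, 8]]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A word that occurs in every document gets weight log(1) = 0, so the weighting down-weights common words in the same spirit as the stop list.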
Vocabulary building
• Subset of 48 shots selected: 10k frames = 10% of movie
• Region construction (SA + MS): 10k frames × 1200 = 1.2E7 regions
• SIFT descriptor representation
• Frame tracking, rejecting unstable regions: ~200k regions
• Clustering descriptors using k-means
Step 6. Build inverted-file indexing structure. An inverted file is structured like an ideal book index: it has an entry for each word in the corpus followed by a list of all the documents (and position in that document) in which the word occurs.
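The inverted file described in Step 6 can be sketched as a dictionary from each visual word to a posting list of (frame, position) pairs. The toy frames and the use of the list index as "position" are assumptions; a real system would store the image location of each region.

```python
from collections import defaultdict

def build_inverted_file(frames):
    """Map each visual word to a list of (frame_id, position) postings."""
    index = defaultdict(list)
    for frame_id, words in frames.items():
        for position, word in enumerate(words):
            index[word].append((frame_id, position))
    return index

def candidate_frames(index, query_words):
    """Frames that share at least one visual word with the query."""
    return sorted({frame_id
                   for word in query_words
                   for frame_id, _ in index.get(word, [])})

# Toy database: frame ID -> visual word IDs found in that frame.
frames = {1: [23, 7], 2: [7, 62, 91], 3: [1, 8, 76]}
index = build_inverted_file(frames)
print(index[7])                             # postings for word 7
print(candidate_frames(index, [8, 91]))     # frames sharing a query word
```

Only the posting lists of the query's words are touched, so frames that share no word with the query are never visited.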
The Video Google algorithm for content-based retrieval
Run-time (on-line): take a query image region
1. Determine the set of visual words within the query region
2. Retrieve keyframes based on visual word frequencies
3. Re-rank the top keyframes using spatial consistency
Processing flow: generate query descriptors → use nearest neighbor to build query vector → use inverted index to find relevant frames → calculate distance to relevant frames → rank results
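The last part of the run-time flow (look up relevant frames, score them, rank) can be sketched as below. The slides only say "distance"; this sketch assumes cosine similarity between sparse word-weight vectors as the score, and the data structures and names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict word -> weight)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_frames(query_vec, frame_vecs, inverted_index):
    """Score only the frames that share a word with the query, best first."""
    candidates = set()
    for word in query_vec:
        candidates.update(inverted_index.get(word, ()))
    scored = [(cosine(query_vec, frame_vecs[f]), f) for f in candidates]
    # Sort by descending score; break ties by frame ID for determinism.
    return [f for _, f in sorted(scored, key=lambda t: (-t[0], t[1]))]

# Toy database of per-frame vectors and the matching word -> frames index.
frame_vecs = {1: {7: 1.0, 23: 1.0}, 2: {7: 1.0}, 3: {8: 1.0, 76: 1.0}}
inverted_index = {7: [1, 2], 23: [1], 8: [3], 76: [3]}
ranking = rank_frames({7: 1.0}, frame_vecs, inverted_index)
print(ranking)
```

Frame 3 shares no word with the query, so it is never scored; this is where the inverted index saves work compared with exhaustive comparison.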
Spatial consistency:
- Matched covariant regions in the retrieved frames should have a similar spatial arrangement to those of the outlined region in the query image
- To verify a pair of matching regions (A, B), a circular search area is defined by the k (5 in figure) spatial nearest neighbors in both frames
- Each match that lies within the search areas in both frames casts a vote in support of the match (in the example, three supporting matches are found)
- Matches with no support are rejected
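The voting scheme for spatial consistency can be sketched as follows. It assumes region positions are 2-D points, approximates each circular search area by the set of k nearest regions, and represents a match as a pair of region indices (query index, frame index); these representations and the toy coordinates are illustrative assumptions.

```python
import numpy as np

def k_nearest(points, i, k):
    """Indices of the k regions spatially closest to region i."""
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                   # a region is not its own neighbour
    return set(np.argsort(dists)[:k].tolist())

def spatial_votes(query_pts, frame_pts, matches, k=5):
    """For each match (a, b), count other matches inside both search areas."""
    votes = []
    for a, b in matches:
        near_a = k_nearest(query_pts, a, k)
        near_b = k_nearest(frame_pts, b, k)
        votes.append(sum(1 for a2, b2 in matches
                         if a2 in near_a and b2 in near_b))
    return votes

# Two clusters of regions in each image; the frame is a shifted copy.
query_pts = np.array([[0, 0], [1, 0], [0, 1.2],
                      [10, 10], [11, 10], [10, 11.2]], dtype=float)
frame_pts = np.array([[5, 5], [6, 5], [5, 6.2],
                      [20, 20], [21, 20], [20, 21.2]], dtype=float)
# Two consistent matches and one that jumps between clusters.
matches = [(0, 0), (1, 1), (4, 2)]
votes = spatial_votes(query_pts, frame_pts, matches, k=2)
kept = [m for m, v in zip(matches, votes) if v > 0]   # reject unsupported matches
print(votes, kept)
```

The match (4, 2) links regions whose neighbourhoods contain no other matched pair, so it receives no votes and is rejected.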
How Video Google works
[Figure sequence: (1) query region and its close-up; (2) original matches based on visual words; (3) matches after using the stop-list; (4) final set of matches after filtering on spatial consistency.]
Video Google Performance Analysis
• Q – number of queried descriptors ($\sim 10^2$)
• M – number of descriptors per frame ($\sim 10^3$)
• N – number of key frames per movie ($\sim 10^4$)
• D – descriptor dimension ($128 \sim 10^2$)
• K – number of "words" in the vocabulary ($16 \times 10^3 \sim 10^3$)
• α – ratio of documents that does not contain any of the Q "words" ($\sim 0.1$)
• Computational cost:
• Nearest neighbor = $QMND \approx 10^{11}$
• Video Google: query vector quantization + distance = $QKD + Q(\alpha N) \approx 10^7 + 10^5$
• Improvement factor $\approx 10^4$ to $10^6$
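The arithmetic behind these estimates can be checked directly. The sketch plugs in the slide's order-of-magnitude values (so K is taken as 10^3 and D as 10^2) and follows the cost formula in treating αN as the number of frames that are scored.

```python
# Order-of-magnitude values taken from the slide.
Q = 10**2      # queried descriptors
M = 10**3      # descriptors per frame
N = 10**4      # key frames per movie
D = 10**2      # descriptor dimension (128, rounded)
K = 10**3      # vocabulary size as rounded on the slide
alpha = 0.1    # fraction used in the Q(alpha N) term

nearest_neighbor_cost = Q * M * N * D   # exhaustive descriptor matching
quantization_cost = Q * K * D           # assign query descriptors to words
scoring_cost = Q * alpha * N            # Q(alpha N) term from the slide

print(f"nearest neighbor: {nearest_neighbor_cost:.0e}")
print(f"quantization:     {quantization_cost:.0e}")
print(f"scoring:          {scoring_cost:.0e}")
print(f"improvement vs quantization: {nearest_neighbor_cost / quantization_cost:.0e}")
print(f"improvement vs scoring:      {nearest_neighbor_cost / scoring_cost:.0e}")
```

The two ratios give the ends of the quoted improvement range: 10^4 relative to the quantization term and 10^6 relative to the scoring term.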