Image Retrieval (Matching at Large Scale)

Transcript of Image Retrieval (Matching at Large Scale) · Video Google visual analogy

Page 1:

Image  Retrieval  (Matching  at  Large  Scale)  

   

Page 2:

• At a large scale, the problem of matching between similar images translates into the problem of retrieving similar images given a query image.

• Effective solutions to this problem require the capability of designing an indexing structure that records where to find all images in which a feature occurs (the same matched local features can be present in many images).

Image  Retrieval  (matching  at  large  scale)  

Page 3:

• In text documents, where the problem is to find all pages on which a word occurs, inverted indexes are commonly used as a solution…

• Following the analogy, visual vocabularies offer a simple but effective way to index images efficiently with an inverted file.
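As a concrete illustration, here is a minimal inverted-file sketch in Python (the dictionaries, image ids, and helper names are hypothetical, not the data structures of any particular system): each visual word maps to the set of images that contain it, so a query only has to touch images sharing at least one word with it.

```python
from collections import defaultdict

def build_inverted_index(db_words):
    """db_words: {image_id: iterable of visual-word ids occurring in that image}."""
    index = defaultdict(set)
    for image_id, words in db_words.items():
        for w in set(words):
            index[w].add(image_id)
    return index

def candidate_images(index, query_words):
    """Images sharing a word with the query, scored by how many words they share."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

index = build_inverted_index({1: [7, 91], 2: [7], 3: [1, 8]})
print(candidate_images(index, [7, 91]))   # -> [(1, 2), (2, 1)]
```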

K. Grauman, B. Leibe

Indexing  images  with  local  features  

[Figure 5.5 (Grauman & Leibe): Main idea of an inverted file index for images represented by visual words. (a) All database images are loaded into the index, mapping words to image numbers. (b) A new query image is mapped to the indices of database images that share a word.]

Retrieval via the inverted file is faster than searching every image, assuming that not all images contain every word. In practice, an image's distribution of words is indeed sparse. Since the index maintains no information about the relative spatial layout of the words per image, typically a spatial verification step is performed on the images retrieved for a given query (e.g., see [PCI+07]).

5.2.3 Image Representation with a Bag of Visual Words

As briefly mentioned above, the visual vocabulary also enables a compact summarization of all an image's words. The common text description of a "bag-of-words" can be mapped over to the visual domain: the image's empirical distribution of words is captured with a histogram counting how many times each word in the visual vocabulary occurs within it (see Figure 5.2 (d)).

What is convenient about this representation is that it translates a (usually very large) set of high-dimensional local descriptors into a single sparse vector of fixed dimensionality across all images. This in turn allows one to use many machine learning algorithms that by default assume the input space is vectorial, whether for supervised classification, feature selection, or unsupervised image clustering. Csurka et al. [CBDF04] first showed this connection for recognition by using the bag-of-words descriptors for discriminative categorization. Since then, many supervised methods exploit the bag-of-words histogram as a simple but effective representation. In fact, many of the most accurate results in recent object recognition challenges…
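As a small worked example of the bag-of-visual-words step (a minimal sketch assuming only NumPy; the sizes in the usage lines are illustrative): each local descriptor is assigned to its nearest vocabulary centre, and the image becomes a fixed-length histogram of word counts.

```python
import numpy as np

def bag_of_words(descriptors, vocabulary):
    """descriptors: (n, d) local descriptors; vocabulary: (k, d) cluster centres.
    Returns a length-k histogram of visual-word occurrences."""
    d2 = ((descriptors ** 2).sum(axis=1)[:, None]
          + (vocabulary ** 2).sum(axis=1)[None, :]
          - 2.0 * descriptors @ vocabulary.T)          # squared distances, shape (n, k)
    words = d2.argmin(axis=1)                          # nearest centre per descriptor
    return np.bincount(words, minlength=len(vocabulary)).astype(float)

# Toy usage: 1200 fake 128-D descriptors against a 1000-word vocabulary.
rng = np.random.default_rng(0)
hist = bag_of_words(rng.normal(size=(1200, 128)), rng.normal(size=(1000, 128)))
```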

Text inverted index

Page 4:

A  seminal  system:  Video  Google  

• Video Google is a seminal system [Sivic and Zisserman, ICCV 2003] that maps visual features into visual words using k-means clustering and supports effective content-based retrieval of visual data. The fundamental idea of the paper: treat each frame as a document, then try to find "visual words"…

• With image features represented as visual words, an inverted file index is used to index the visual words efficiently for content-based retrieval. Video Google retrieves key frames and shots of a video containing a particular object with the same ease, speed, and accuracy with which Google retrieves text documents (web pages) containing particular words.

                   

             

[Figure: analogy between a text inverted index and a visual one. Text side: each word ("retrieval", "vector", "feature", "book", …) maps to the document IDs (1, 2, 3, …, N) in which it occurs. Image side: each visual word number maps to a list of frame numbers.]

Page 5:

Video Google visual analogy

Text Retrieval:
1. Assume a vocabulary
2. Parse documents into words
3. Perform stemming: "walk" = { "walk", "walking", "walks", … }
4. Use a stop list to reject very common words
5. For each document, define a vector with components given by the frequency of occurrence of the words the document contains
6. Store the vector in an inverted file

The Video Google algorithm:
1. Detect affine covariant regions in each key frame of the video
2. Reject unstable regions
3. Build the visual vocabulary
4. Remove stop-listed words
5. For each image, compute the weighted document frequency based on the occurrence of the visual words
6. Build the inverted file index

The analogy (text ↔ visual):
- Word ↔ visual descriptor
- Stemming ↔ using centroids of visual feature clusters
- Text document (page) ↔ video frame
- Document corpus (book) ↔ video

Page 6:

   


The Video Google algorithm: pre-processing (off-line)

Step 1. Calculate viewpoint invariant regions and region descriptors:

- Shape Adapted (SA) region: elliptical shape adaptation about an interest point, centered on corner-like features, using the Harris-affine operator

- Maximally Stable (MS) region: MSER segmentation to extract blobs of high contrast with respect to their surroundings

Each region is represented by a 128-dimensional vector using the SIFT descriptor. A 720 × 576 pixel video frame yields ≈ 1200 regions.

Step 2. Reject unstable regions: any region that does not survive for more than 3 frames is rejected. This stability check significantly reduces the number of regions to about 600 per frame.
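A minimal sketch of the Step 1 region extraction, assuming OpenCV (cv2) is available. Stock OpenCV has no Harris-affine detector, so only the Maximally Stable half of the detector pair is shown, and the helper name is hypothetical.

```python
import cv2
import numpy as np

def ms_regions_with_sift(frame_bgr):
    """Detect MSER (Maximally Stable) regions and describe them with 128-D SIFT vectors."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    regions, _boxes = cv2.MSER_create().detectRegions(gray)
    # Turn each region into a keypoint (centre + size) so SIFT can describe it.
    keypoints = []
    for pts in regions:
        (cx, cy), radius = cv2.minEnclosingCircle(pts.astype(np.float32))
        keypoints.append(cv2.KeyPoint(float(cx), float(cy), float(2.0 * radius)))
    keypoints, descriptors = cv2.SIFT_create().compute(gray, keypoints)
    return keypoints, descriptors        # descriptors: one 128-D row per surviving region
```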

Page 7:

Video  Google  


‘Maximally Stable’ (MS) regions are shown in yellow; ‘Shape Adapted’ (SA) regions are shown in blue/cyan.

Page 8:

   

MS – yellow, SA – cyan

Zoomed  view

Page 9:

Step 3. Build the visual vocabulary: use k-means clustering to vector-quantize the descriptors into clusters. The Mahalanobis distance is used as the distance function for k-means clustering:

d(x1, x2) = sqrt( (x1 − x2)ᵀ Σ⁻¹ (x1 − x2) )

where Σ is the covariance matrix of the descriptors.

Step 4. Remove stop-listed visual words:

The  most  frequent  visual  words  that  occur  in  almost  all  images,  such  as  highlights  which  occur  in  many  frames,  are  rejected.  
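Steps 3 and 4 can be sketched as follows (a minimal sketch assuming NumPy and scikit-learn; the function names, the cluster count, and the 0.9 document-frequency threshold are illustrative, not values from the paper). A global Mahalanobis metric is emulated by whitening the descriptors with Σ^(-1/2) and then running ordinary Euclidean k-means, which is equivalent when a single covariance matrix is shared by all descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, k=10000, seed=0):
    """Step 3 sketch: k-means on whitened descriptors (global Mahalanobis metric)."""
    X = np.asarray(descriptors, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov + 1e-6 * np.eye(cov.shape[0]))
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T       # whitening = Sigma^(-1/2)
    km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit((X - mean) @ W)
    return km.cluster_centers_, mean, W                    # quantize new data as (x - mean) @ W

def stop_list(word_counts, max_df=0.9):
    """Step 4 sketch: ids of visual words occurring in more than max_df of all frames."""
    df = (np.asarray(word_counts) > 0).mean(axis=0)        # document frequency per word
    return np.flatnonzero(df > max_df)
```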

 

 

Step 5. Compute the tf-idf weighted document frequency vector: each frame d is represented by a vector of word weights with components

t_i = (n_id / n_d) · log(N / N_i)

where
n_id = number of times word i appears in document (frame) d
n_d = total number of words in document d
N = number of documents in the whole collection
N_i = number of documents in which word i appears
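A sketch of this weighting, assuming NumPy and a per-frame visual-word count matrix (names illustrative):

```python
import numpy as np

def tfidf_vectors(counts):
    """counts: (N_docs, V) matrix of visual-word counts, one row per frame.
    Returns the tf-idf weighted vectors with components (n_id / n_d) * log(N / N_i)."""
    counts = np.asarray(counts, dtype=float)
    n_d = np.maximum(counts.sum(axis=1, keepdims=True), 1)   # words per document
    tf = counts / n_d                                        # term frequency n_id / n_d
    N = counts.shape[0]                                      # number of documents
    N_i = np.maximum((counts > 0).sum(axis=0), 1)            # documents containing word i
    return tf * np.log(N / N_i)                              # one weighted vector per frame
```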

Page 10:

Vocabulary building (pipeline):
- Subset of 48 shots selected (10k frames = 10% of the movie)
- Region construction (SA + MS): 10k frames × 1200 ≈ 1.2 × 10⁶ regions
- SIFT descriptor representation
- Frame tracking and rejection of unstable regions
- Clustering of the remaining descriptors (≈ 200k regions) using k-means

Page 11:

Step 6. Build the inverted-file indexing structure. An inverted file is structured like an ideal book index: it has an entry for each word in the corpus, followed by a list of all the documents (and the position in that document) in which the word occurs.

[Figure 5.5 from Grauman & Leibe, repeated from above: an inverted file index for images represented by visual words. (a) All database images are loaded into the index, mapping words to image numbers. (b) A new query image is mapped to the indices of database images that share a word.]


Page 12:

The Video Google algorithm for content-based retrieval. Run-time (on-line):

Take a query image region:
1. Determine the set of visual words within the query region
2. Retrieve keyframes based on visual word frequencies
3. Re-rank the top keyframes using spatial consistency

Pipeline: generate query descriptors → use nearest neighbour (to the cluster centres) to build the query vector → use the inverted index to find relevant frames → calculate the distance to the relevant frames → rank the results.
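A minimal run-time sketch, assuming NumPy, a precomputed idf vector, and L2-normalised tf-idf vectors for the key frames (all names hypothetical); in the actual system only the frames returned by the inverted file would need to be scored.

```python
import numpy as np

def score_query(query_words, idf, frame_vectors):
    """query_words: visual-word ids inside the query region; idf: (V,) idf weights;
    frame_vectors: (N, V) L2-normalised tf-idf vectors of the key frames."""
    V = idf.shape[0]
    q = np.bincount(np.asarray(query_words), minlength=V).astype(float)
    q = (q / max(q.sum(), 1.0)) * idf              # tf-idf weighting of the query vector
    q /= max(np.linalg.norm(q), 1e-12)             # normalise so the dot product is cosine
    scores = frame_vectors @ q                     # one similarity score per key frame
    return np.argsort(-scores)                     # key frames ranked best-first
```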

Page 13:

Spatial consistency:
- Matched covariant regions in the retrieved frames should have a similar spatial arrangement to those of the outlined region in the query image
- To verify a pair of matching regions (A, B), a circular search area is defined by the k (5 in the figure) spatial nearest neighbours in both frames
- Each match that lies within the search areas in both frames casts a vote in support of the match (in the example, three supporting matches are found)
- Matches with no support are rejected
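The voting scheme above can be sketched as follows (NumPy assumed; the point arrays and helper name are hypothetical). Each tentative match is kept only if at least one other match falls within its k spatial nearest neighbours in both frames.

```python
import numpy as np

def spatially_consistent(pts_a, pts_b, k=5):
    """pts_a, pts_b: (M, 2) positions of the M tentatively matched regions in the
    query frame and in the retrieved frame (row i of each array is one match)."""
    pts_a, pts_b = np.asarray(pts_a, float), np.asarray(pts_b, float)
    def knn(pts):
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)                # a match cannot support itself
        return np.argsort(d, axis=1)[:, :k]        # indices of the k nearest neighbours
    na, nb = knn(pts_a), knn(pts_b)
    votes = np.array([len(set(na[i]) & set(nb[i])) for i in range(len(pts_a))])
    return votes > 0                               # keep only matches with some support
```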

Page 14:

   How  Video  Google  works  

Query  region  and  its  close-­‐up.  

Page 15:

   

Original  matches  based  on  visual  words  

Page 16:

   

Original  matches  based  on  visual  words  

Page 17:

   

   

Matches after using the stop-list

Page 18:

   

   

Final set of matches after filtering on spatial consistency

Page 19:

Video  Google  


Page 20:

Video  Google  


Page 21:

Video  Google  Performance  Analysis

• Q – number of query descriptors (~10²)
• M – number of descriptors per frame (~10³)
• N – number of key frames per movie (~10⁴)
• D – descriptor dimension (128 ≈ 10²)
• K – number of "words" in the vocabulary (16 × 10³ ≈ 10⁴)
• α – ratio of documents that contain at least one of the Q "words" (~0.1)

• Computational cost:
  - Nearest neighbour matching: Q·M·N·D ≈ 10¹¹
  - Video Google: query vector quantization + distance computation = Q·K·D + Q·(α·N) ≈ 10⁸ + 10⁵

• Improvement factor ≈ 10³ to 10⁶ (≈ 10³ when the query quantization cost is included, up to ≈ 10⁶ for the inverted-file scoring alone)
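A quick sanity check of these order-of-magnitude figures, using the symbol values quoted above (plain Python, purely illustrative):

```python
Q, M, N, D, K, alpha = 1e2, 1e3, 1e4, 1.28e2, 1.6e4, 0.1

nn_cost = Q * M * N * D                 # exhaustive nearest-neighbour matching
vg_cost = Q * K * D + Q * alpha * N     # query quantization + inverted-file scoring
print(f"NN ~ {nn_cost:.0e}, Video Google ~ {vg_cost:.0e}, speed-up ~ {nn_cost / vg_cost:.0e}")
```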