
Why run analysis?
Reason #1: Surprisingly strong performance of the winning entry.
Reason #2: The scale of 1000 object categories allows for an unprecedented look at how object properties affect the accuracy of leading algorithms.


PASCAL VOC 2005-2012
Classification: person, motorcycle
Detection / Segmentation: person, motorcycle (example images)
Action: riding bicycle
20 object classes, 22,591 images

…  

Detecting avocados to zucchinis: what have we done, and where are we going?

Olga Russakovsky¹   Jia Deng¹   Zhiheng Huang¹   Alexander C. Berg²   Li Fei-Fei¹      ¹Stanford University   ²UNC Chapel Hill

Analysis setup

Bibliography  

Introduction

[1] SV details at http://image-net.org/challenges/LSVRC/2012/supervision.pdf and in Krizhevsky et al., NIPS 2012.
[2] VGG details at http://image-net.org/challenges/LSVRC/2012/oxford_vgg.pdf and in Sánchez et al., CVPR 2011 and PRL 2012; Arandjelović et al., CVPR 2012; Felzenszwalb et al., PAMI 2012.
[3] Alexe, Deselaers, Ferrari. Measuring the objectness of image windows. PAMI 2012.

Dataset: The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012 is much larger and more diverse than previous datasets.

[Example image: Dalmatian]

http://image-net.org/challenges/LSVRC/{2010,2011,2012,2013}

What images are difficult?

ILSVRC: Classification
[Chart: accuracy (5 predictions/image) of the winning entry, and number of submissions, by year]
2010: 0.72    2011: 0.74    2012: 0.85

Motivation: Large-scale recognition is a grand goal of computer vision. Benchmarking and analysis measure progress and inform future directions.

Goal: To analyze and compare the performance of state-of-the-art systems on large-scale recognition.

1000 object classes, 1,431,167 images

Classification + localization challenge (ILSVRC 2012). Task: to determine the presence and location of an object class.

Accuracy = (1 / 100,000) × Σ_{i=1}^{100,000} 1[correct on image i]   (over the 100,000 test images)

[Figure: three example outputs for image i (ground truth: steel drum), each showing 5 guesses with bounding boxes]
✔ Output: folding chair, Persian cat, loud speaker, steel drum, picket fence (steel drum correctly localized)
✗ Output (bad localization by IOU measure): folding chair, Persian cat, loud speaker, steel drum, picket fence (steel drum box overlaps too little)
✗ Output (bad classification): folding chair, Persian cat, loud speaker, king penguin, picket fence (steel drum missing from the 5 guesses)

ILSVRC 2012: Classification + Localization
[Chart: accuracy (5 predictions) of the top entries ISI, OXFORD_VGG, and SuperVision]

State-of-the-art large-scale object localization algorithms

SuperVision (SV) by A. Krizhevsky, I. Sutskever, G. Hinton [1]
Classification: deep convolutional neural network; 7 hidden layers, rectified linear units, max pooling, dropout trick, trained with SGD on two GPUs for a week.
Localization: regression on (x, y, w, h).

OxfordVGG (VGG) by K. Simonyan, Y. Aytar, A. Vedaldi, A. Zisserman [2]
Classification: RootSIFT, color statistics, Fisher vector (1024 Gaussians), product quantization; one-vs-rest linear SVM trained with Pegasos SGD.
Localization: deformable parts model, root-only.

 

Protocol: For every one of the 1000 object categories:
- Compute an average measure of difficulty on validation images (x)
- Compute accuracy of algorithms on test images (y)

Level of clutter: For every image, generate generic object location hypotheses using the method of [3] until the target object is localized.

Clutter = log2 (average number of windows required)

Low clutter => target object is most salient in the image
High clutter => object is in a complex image (hard)

Both methods are significantly less accurate on cluttered images.

SV’s accuracy is more affected by the number of object instances per image than VGG’s.

What objects are difficult?
Protocol: For every one of the 1000 classes:
- Ask humans to annotate different properties, e.g., is this object deformable? (x)
- Compute accuracy of algorithms on test images (y)

Highly textured objects are much easier for current algorithms to localize (especially for SV).

Deformable objects are much easier for current algorithms to localize, but when considering just man-made objects the effect disappears.

Where are we going?
• Cluttered images remain very challenging for object localization.
• The proposed measure of clutter can be used for creating and evaluating datasets.
• Untextured and man-made objects are still challenging even for the best algorithms.
• Complementary advantages of SV and VGG can be used to design the next generation of detectors: SV is very strong at learning object texture, and VGG is less sensitive to the number of instances and object scale.
• The ILSVRC dataset is a promising benchmark for detection algorithms.

ILSVRC 2013: 200 object classes, fully annotated on 60K images
[Example detection image: person, car, motorcycle, helmet]
http://image-net.org/challenges/LSVRC/2013

Only one object class is annotated per image (due to the high cost of annotation at this scale), so an algorithm is allowed to produce multiple (up to 5) guesses without penalty.

SV’s accuracy is more affected by object scale than VGG’s.

SV outperforms VGG on the 562 object classes with the same average CPL (0.087) as the PASCAL VOC classes. However, VGG outperforms SV on subsets of ≤ 225 classes with the smallest CPL.

Chance Performance of Localization (CPL)
Take all instances of a class across all images: B1, B2, ..., BN.
High CPL => object at the same location/scale in all images
Low CPL => object at varied locations/scales (hard)
[Figure: example instances B1-B5 of the class “steel drum”]

Upper bound (UB): Optimally combines the outputs of SV and VGG (using an oracle) to demonstrate the current limit of object localization accuracy.

 

[Chart: ILSVRC 2012 Classification + Localization accuracy vs. number of guesses for the winning and second entries; white bars are classification-only accuracy]