
Introduction to Research in Mobile Robotics: Visual Place Recognition

Luis Gomez Camara

Intelligent and Mobile Robotics (IMR) Group Czech Institute of Informatics, Robotics and Cybernetics

Czech Technical University in Prague

Motivation: Lifelong Autonomy

Long-term autonomy of mobile robots is a highly relevant research topic (also at IMR)

Requires navigation over extended periods of time

Long-term navigation is challenging due to:

Accumulation of errors (drift)

Dynamic environments

Visual Place Recognition (VPR) is a valuable tool

Long-term navigation: Applications

Self-driving cars

Planetary rovers

Autonomous underwater vehicles

Injectable nanorobots

UAVs

Domestic robots

. . .

Robot navigation

The ability of a robot to

Determine its own position in its frame of reference

Plan a path towards some goal location

Answering the questions:

1. Where am I?

2. How do I get to other places?

3. Where are other places relative to me?

Navigation consists of:

1. Self-localization

2. Path planning

3. Map building and map interpretation

Visual SLAM

Simultaneous Localization And Mapping using optical sensors

Image: www.dragonfly.com

Problem: Drifting over time

Loop closure to correct the drift

Visual Place Recognition to solve loop closure

Visual SLAM: Drift

"SLAM-Loop Closing with Visually Salient Features", P. Newman et al. 2005

Red arrows: camera pose

Grey ellipses: global uncertainty

Images: views used in loop closure

Note angular error at the bottom right

Visual SLAM: Loop closure

Definition: the task of recognising a previously-visited location and updating beliefs accordingly

"Deformation-based loop closure for large scale dense RGB-D SLAM", Thomas Whelan et al. 2013

Basic component of SLAM systems

Used to correct drift that accumulates over time

Reduction in the uncertainty of the map estimate

Necessary for long-term navigation

Visual Place Recognition (VPR)

Definition: given a query image of a place, find its location by comparison with a database of previously visited places

Query image

Database of places

?

Fundamental and challenging problem in computer vision

VPR: Applications

Navigation

Autonomous driving

Geolocalization

Image retrieval

AR, VR, etc.

VPR: Challenges

Day-night cycles and illumination changes

Weather and season-related changes

Viewpoint changes

Dynamic objects, occlusions, etc.

Image Retrieval: Pipeline

Offline stage: Database creation

Places image dataset → Feature extraction → Image feature representation (f1, f2, …, fn) → Places database

Online stage: Place recognition

Query image → Feature extraction → Image feature representation (f1, f2, …, fn) → Feature matching against the places database (exhaustive search) → Ranked list of candidate images → Re-ranked list of images

Place recognition = best candidate
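The online stage can be illustrated with a minimal sketch. It assumes each place is already represented by a single feature vector and uses cosine similarity as the matching score; the place names, the vectors and the choice of similarity measure are illustrative assumptions, not taken from the slides.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_places(query_vec, database):
    """Exhaustive search: score every stored place and return ids, best match first."""
    scored = [(cosine_similarity(query_vec, vec), place_id)
              for place_id, vec in database.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [place_id for _, place_id in scored]


# Hypothetical places database built in the offline stage.
database = {
    "corridor": [0.9, 0.1, 0.0],
    "lobby": [0.1, 0.8, 0.3],
    "lab": [0.0, 0.2, 0.9],
}

# Online stage: rank the database against a query feature vector.
ranking = rank_places([0.2, 0.7, 0.4], database)
best_candidate = ranking[0]
```

A re-ranking step would reorder only the top entries of `ranking` with a more expensive comparison before the best candidate is chosen.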

Image Retrieval: Milestones

Dominated by Bag of Visual Words (BoVW) model

Pre-trained and fine-tuned models

Spatial pyramid matching (Lazebnik et al.)

Image Feature Extraction

Dense sampling

• Patches of fixed size and shape

• Regular grid, possibly overlapping and over range of scales

• Simpler than keypoints (but heavier)

• Optimal for high-level representations (e.g. scene classification)

Interest points (keypoints)

• Salient locations that are likely to match in other images

• Edges, corners, blobs, etc.

• Optimal for image correspondences

Image Feature Descriptors

Created from regions around points of interest

Should be stable (robust) to orientation, illumination, etc.

Can be matched against descriptors in other images

Handcrafted: SIFT, SURF, ORB, etc.

Learned: CNN features

SIFT

Scale Invariant Feature Transform

Hand-crafted (engineered)

Used as both detector and descriptor

There are faster alternatives such as SURF, ORB, BRISK

SIFT is still one of the most accurate hand-crafted descriptors

Scale affects detection

"Distinctive Image Features from Scale-Invariant Keypoints", David Lowe 2004

SIFT

1. Scale-space extrema detector

• Laplacian of Gaussian (LoG) approximated by Difference of Gaussians (DoG)

• Successively blur with Gaussian filter

• Scale parameter: standard deviation

[Figure: first, second, third and fourth octaves]

Maxima/minima detection

• Find local extrema

• Over both scale and space
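For reference, the Difference-of-Gaussians approximation can be written as below, following Lowe (2004). Here G is a Gaussian kernel with standard deviation σ, I is the input image, * denotes convolution and k is the constant factor between adjacent blur levels; these symbols come from the paper, not from the slides.

```latex
L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)

D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)
                \approx (k - 1)\,\sigma^{2}\,\nabla^{2} G * I
```

Local extrema of D are then searched over position (x, y) and scale σ.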

SIFT

2. Keypoint localisation

• Remove low-contrast keypoints

• Remove keypoints lying on edges

• Only strong interest points remain

[Figure: keypoints before and after filtering]

3. Orientation assignment

• Based on local image properties, find a consistent orientation for each keypoint and scale

• Provides invariance to image rotation

• Orientation histogram (36 bins) around keypoint:

• gradient magnitude and direction computed from pixel differences

• The highest bin, and any bin above 80% of the highest, each create a keypoint with that orientation
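The orientation histogram can be sketched as follows. This is a simplified illustration that assumes the pixel differences around a keypoint are already available as (dx, dy) pairs; the Gaussian weighting and peak interpolation used in the full SIFT algorithm are left out, and the function name is hypothetical.

```python
import math


def orientation_peaks(gradients, num_bins=36, peak_ratio=0.8):
    """Build a magnitude-weighted orientation histogram and return its dominant bins.

    gradients: iterable of (dx, dy) pixel differences around a keypoint.
    Returns (histogram, list of bin indices at or above peak_ratio * highest bin).
    """
    hist = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    for dx, dy in gradients:
        magnitude = math.hypot(dx, dy)
        angle = math.degrees(math.atan2(dy, dx)) % 360.0
        hist[int(angle // bin_width) % num_bins] += magnitude
    strongest = max(hist)
    if strongest == 0.0:
        return hist, []
    peaks = [b for b, value in enumerate(hist) if value >= peak_ratio * strongest]
    return hist, peaks


# Toy neighbourhood: a dominant direction near 6 degrees, a second one near 84 degrees
# and a weak one near 174 degrees.
gradients = [(1.0, 0.1)] * 10 + [(0.1, 1.0)] * 9 + [(-1.0, 0.1)] * 2
hist, peaks = orientation_peaks(gradients)
```

With 36 bins each bin covers 10 degrees, so the two strong directions in the toy data fall into bins 0 and 8 and both yield an orientation.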

SIFT

4. Keypoint descriptor

• 16x16 neighborhood

• 16 sub-blocks of 4x4 pixels

• 8 bin orientation histogram per sub-block

• Total of 128 bin values

• Everything relative to keypoint orientation

• Normalization for contrast changes

• Thresholding large gradients for brightness changes
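The descriptor length and the two normalisation steps can be checked with a short sketch. The clipping threshold of 0.2 is the value given in Lowe (2004), not in the slides, and the function name is hypothetical.

```python
import math

SUB_BLOCKS = 16          # the 16x16 neighbourhood split into 4x4-pixel sub-blocks
BINS_PER_SUB_BLOCK = 8   # orientation bins per sub-block
DESCRIPTOR_LENGTH = SUB_BLOCKS * BINS_PER_SUB_BLOCK


def l2_normalize(values):
    """Scale a vector to unit Euclidean length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0.0 else list(values)


def normalize_descriptor(raw, clip=0.2):
    """Normalise for contrast, clip large entries, then normalise again."""
    unit = l2_normalize(raw)
    clipped = [min(v, clip) for v in unit]
    return l2_normalize(clipped)


# A raw histogram vector dominated by one very large gradient bin.
raw = [10.0] + [1.0] * (DESCRIPTOR_LENGTH - 1)
descriptor = normalize_descriptor(raw)
```

Clipping limits the influence of a few very large gradient bins, so the dominant entry ends up with a smaller share of the final vector than plain normalisation would give it.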

Bag of (visual) Words (BoW)

Traditional approach in VPR

Borrowed from Natural Language Processing

"Video Google: A Text Retrieval Approach to Object Matching in Videos”,

Sivic and Zisserman 2003

Stores zero-order information (word repetitions)

Uses hand-crafted descriptors (SIFT, SURF, ORB, etc.)

Bag of (visual) Words

Steps:

1. Extract descriptors from collection of images

2. Learn visual dictionary by clustering descriptors (e.g. k-means)

3. Represent query image by

• Quantizing descriptors to closest word (centroid)

• Histogram of word repetitions

4. Image is represented as a vector

Vector quantization

Images: www.mathworks.com
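The quantization and histogram steps can be sketched in a few lines. The vocabulary is assumed to have been learned already (for example with k-means); the two-dimensional toy descriptors and the function names are illustrative assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (centroid) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def bow_histogram(descriptors, vocabulary):
    """Count how many descriptors fall on each visual word."""
    hist = [0] * len(vocabulary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, vocabulary)] += 1
    return hist


# Toy vocabulary of three visual words and four descriptors from one image.
vocabulary = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1], [0.1, 0.9]]
histogram = bow_histogram(descriptors, vocabulary)
```

The histogram has one entry per visual word, so every image maps to a vector of the same length regardless of how many descriptors were extracted.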

Bag of (visual) Words

Pros:

• Largely unaffected by object positions, scale and orientation

• Good for classifying images according to content

• Fast search thanks to inverted indices

(requires sparsity of words in images)

Cons:

• Spatial information is discarded

• Information loss due to quantization

• High dimensionality

Inverted file index
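A minimal sketch of an inverted file index: each visual word points to the images that contain it, so a query only touches images sharing at least one word with it. The image ids and histograms are made-up examples.

```python
from collections import defaultdict


def build_inverted_index(image_histograms):
    """Map each visual word id to the set of images in which it occurs."""
    index = defaultdict(set)
    for image_id, hist in image_histograms.items():
        for word_id, count in enumerate(hist):
            if count > 0:
                index[word_id].add(image_id)
    return index


def candidate_images(query_hist, index):
    """Images sharing at least one visual word with the query."""
    candidates = set()
    for word_id, count in enumerate(query_hist):
        if count > 0:
            candidates |= index.get(word_id, set())
    return candidates


# Hypothetical BoW histograms over a four-word vocabulary.
image_histograms = {
    "img_a": [2, 0, 0, 1],
    "img_b": [0, 3, 0, 0],
    "img_c": [0, 0, 1, 1],
}
index = build_inverted_index(image_histograms)
shortlist = candidate_images([0, 0, 0, 2], index)
```

The speed-up depends on sparsity: if most words occurred in most images, nearly every image would end up in the shortlist.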

BoW improvements


• Spatial pyramid

• Fisher Vectors:

• Uses Gaussian mixture model (GMM) as vocabulary

• Statistical measure of descriptors with respect to the GMM

• Derivative of likelihood with respect to GMM parameters

• Stores second order information (covariances)

• VLAD: Vector of Locally Aggregated Descriptors

• Similar to Fisher Vectors but only first order information (distances)

Spatial pyramid

level 0, level 1, level 2

Based on approximate global geometric correspondence

Image divided into increasingly fine sub-regions

Histograms of local features found inside each sub-region

Extension of an orderless bag-of-features

"Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories”,

Lazebnik et al. 2006
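The pyramid construction can be sketched as follows, assuming each local feature is already quantized to a visual word and carries image coordinates scaled to [0, 1). The per-level weighting used by Lazebnik et al. is omitted to keep the sketch short, and the function name is hypothetical.

```python
def spatial_pyramid(features, num_words, levels=3):
    """Concatenate per-cell word histograms over increasingly fine grids.

    features: list of (x, y, word_id) with x and y scaled to [0, 1).
    Level l splits the image into 2**l x 2**l cells.
    """
    pyramid = []
    for level in range(levels):
        cells = 2 ** level
        hists = [[0] * num_words for _ in range(cells * cells)]
        for x, y, word in features:
            col = min(int(x * cells), cells - 1)
            row = min(int(y * cells), cells - 1)
            hists[row * cells + col][word] += 1
        for hist in hists:
            pyramid.extend(hist)
    return pyramid


# Three quantized features in an image, vocabulary of two words.
features = [(0.1, 0.1, 0), (0.9, 0.9, 1), (0.6, 0.2, 0)]
vector = spatial_pyramid(features, num_words=2)
```

Level 0 is the plain bag-of-features histogram; levels 1 and 2 add 4 and 16 cells, so the vector has (1 + 4 + 16) x num_words entries.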

Spatial Pyramid


Weak features: oriented edge points (gradient in a given direction above minimum threshold)

Strong features: SIFT

VLAD (Vector of Locally Aggregated Descriptors)

"Aggregating local descriptors into a compact image representation”,

Jégou et al. 2010

0. Train a visual dictionary C with k words (k-means): $C = \{\mu_1, \mu_2, \ldots, \mu_k\}$

1. For an image with m descriptors of dimension d, $X = \{x_1, x_2, \ldots, x_m\}$, assign each descriptor to its closest cell centroid

2. Compute residuals $x - \mu_i$

3. Accumulate residuals for each cell:

$v_i = \sum_{x : \mathrm{NN}(x) = \mu_i} (x - \mu_i)$

4. Concatenate accumulated residuals in vector $v \in \mathbb{R}^{kd}$

$v = [v_1, v_2, \ldots, v_k]$
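The steps above can be sketched directly. The sketch assumes k centroids and descriptors of dimension d (a symbol introduced here for the descriptor length), so the result has k x d entries whatever the number of descriptors. The L2 normalisation applied in Jégou et al. is left out, and the toy data is illustrative.

```python
def vlad(descriptors, centroids):
    """Accumulate residuals per nearest centroid and concatenate them."""
    k, d = len(centroids), len(centroids[0])
    accumulated = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Step 1: assign the descriptor to its closest centroid.
        nearest = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
        # Steps 2 and 3: add the residual to that cell's accumulator.
        for j in range(d):
            accumulated[nearest][j] += x[j] - centroids[nearest][j]
    # Step 4: concatenate the k accumulated residuals into one vector.
    return [value for cell in accumulated for value in cell]


centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
v = vlad(descriptors, centroids)
```

Unlike a BoW histogram, which only counts assignments, each cell here keeps the direction and size of the offset from its centroid.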

VLAD

Advantages

Fast to compute

Adds more discriminative power than BoW

Good results with small dimensionality

Fixed-length vector irrespective of the number of detected features

"Aggregating local descriptors into a compact image representation”,

Jégou et al. 2010

CNN Features

Extracted from Convolutional Neural Networks

Pre-trained vs. end-to-end

Early layers learn features similar to Gabor filters

Later layers learn more semantic features

Semantic features are robust (a car is always a car)

Spatial information can also be exploited

Gabor filters

Semantic features

CNN Features

CNN: mathematical model with huge number of parameters

Automatically learned during training with massive labelled datasets

Number of CNN features depends on architecture

CNN Features

Important concepts:

Parameter sharing: weights in kernels are used at all locations

Pooling: used to subsample feature maps and obtain translation invariance
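Both concepts can be made concrete with a small sketch. The 2x2 pooling window and the example layer size (3x3 kernels, 512 input and output channels) are illustrative choices, not values stated on this slide.

```python
def max_pool_2x2(feature_map):
    """Subsample a 2-D feature map by keeping the maximum of each 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def conv_layer_parameters(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output channel; independent of the image size
    because the same kernels are reused at every location."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


feature_map = [
    [1, 2, 0, 0],
    [3, 4, 0, 1],
    [0, 0, 5, 6],
    [0, 0, 7, 8],
]
pooled = max_pool_2x2(feature_map)
params = conv_layer_parameters(3, 3, 512, 512)
```

Moving a strong activation by one pixel inside its 2x2 block leaves the pooled output unchanged, which is the small amount of translation invariance that pooling provides.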

CNN Features

"Visualizing and Understanding Convolutional Networks", Zeiler and Fergus 2013

CNN: Some recent approaches

"Bag of Local Convolutional Features for Scalable Instance Search" (Mohedano et al. 2016)

• Instance retrieval based on CNN features and the BoW model

• Activations of pre-trained CNN as local features

• High dimensional, sparse representation (N=512, 20k visual words)

• Each local CNN feature is assigned its closest visual word (assignment map)

• Performance comparable to other CNN-based approaches but more scalable


"On the performance of ConvNet features for place recognition" (Sünderhauf et al. 2015)

• Systematic analysis on the performance of pre-trained CNN layers

• Tested on the AlexNet architecture trained on ImageNet

• Nearest Neighbor search of extracted feature vectors

• Layer Conv3 best performing for place recognition

AlexNet architecture: "ImageNet Classification with Deep Convolutional Neural Networks", Krizhevsky et al. 2012

Example of CNN image features


"NetVLAD: CNN architecture for weakly supervised place recognition" (Arandjelović et al. 2016)

• Learns image representation in an end-to-end manner for the VPR task

• Steps:

1. Crop the CNN at the last convolutional layer (H x W x D)

2. Each spatial location generates one descriptor

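3. Express VLAD image representation as a matrix

$V(j, k) = \sum_{i=1}^{N} a_k(x_i) \, (x_i(j) - c_k(j))$

conv3, conv4, conv5: N = 13 x 13 = 169 descriptors (one per spatial location)

$x_i(j)$: j-th dimension of the i-th descriptor

$c_k(j)$: j-th dimension of the k-th cluster

$a_k(x_i)$: membership of descriptor $x_i$ to the k-th word (cluster). Value: 0 or 1

4. Make the membership term differentiable

$\bar{a}_k(x_i) = \frac{e^{-\alpha \|x_i - c_k\|^2}}{\sum_{k'} e^{-\alpha \|x_i - c_{k'}\|^2}} = \frac{e^{w_k^T x_i + b_k}}{\sum_{k'} e^{w_{k'}^T x_i + b_{k'}}}$: soft membership assignment, with $w_k = 2\alpha c_k$ and $b_k = -\alpha \|c_k\|^2$

$\{w_k\}$, $\{b_k\}$ and $\{c_k\}$ are sets of trainable parameters

$\alpha$: response attenuation constant (positive)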
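The soft membership can be sketched as a softmax over negative squared distances to the cluster centres, scaled by the positive constant alpha. In NetVLAD the softmax weights, biases and cluster centres are separate trainable parameters; this sketch ties all three to fixed centres, which corresponds to the untrained starting point only. The function names and toy data are hypothetical.

```python
import math


def soft_assignment(x, centres, alpha):
    """Softmax of -alpha * squared distance from x to each cluster centre."""
    logits = [-alpha * sum((a - b) ** 2 for a, b in zip(x, c)) for c in centres]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vlad(descriptors, centres, alpha):
    """K x D matrix of residuals, each weighted by its soft membership."""
    k, d = len(centres), len(centres[0])
    matrix = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        weights = soft_assignment(x, centres, alpha)
        for cluster in range(k):
            for j in range(d):
                matrix[cluster][j] += weights[cluster] * (x[j] - centres[cluster][j])
    return matrix


centres = [[0.0, 0.0], [10.0, 10.0]]
weights = soft_assignment([1.0, 0.0], centres, alpha=1.0)
V = soft_vlad([[1.0, 0.0]], centres, alpha=1.0)
uniform = soft_assignment([1.0, 0.0], centres, alpha=0.0)
```

A large alpha pushes the weights towards the hard 0 or 1 membership of the original VLAD, while alpha = 0 spreads each descriptor evenly over all clusters.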


"NetVLAD: CNN architecture for weakly supervised place recognition" (Arandjelović et al. 2016)

• More flexible than original VLAD thanks to extra trainable parameters

The two descriptors shown are known to belong to images that should not match

Supervised VLAD makes it possible to learn a better anchor (cluster center) that minimizes the scalar product between the residuals


"Levelling the Playing Field: A Comprehensive Comparison of Visual Place Recognition Approaches under Changing Conditions" (Zaffar et al. 2019)

Berlin Kudamm

Gardens Point

Nordland

Neuroscience of place recognition

"The cognitive map in humans: Spatial navigation and beyond"

Russell A. Epstein et al. 2017

Hippocampus (HPC) and Entorhinal cortex (EC)

Stores map-like spatial codes (cognitive maps)

Supports memory during navigation

A cognitive map is an internal neural representation of one's surrounding physical environment

Entorhinal cortex

Neuroscience of place recognition

Parahippocampal Place Area (PPA) and Retrosplenial cortex (RSC)

PPA → perception of landmarks and visuospatial structure of the scene

RSC → cognitive map retrieval

PPA + RSC → Place Recognition

Landmarks:

• Spatial layout (very important)

• Discrete landmarks: buildings, statues, etc.

• Extended topographical landmarks: arrangement of buildings, valleys, ridges, etc.

Entorhinal cortex

"Where am I now? Distinct Roles for Parahippocampal and Retrosplenial Cortices in Place Recognition", Russell A. Epstein et al. 2007

Visual Cortex Hierarchy

"Bio-inspired computer vision: Towards a synergistic approach of artificial and biological vision", Medathati et al. 2016

Our approach

Use CNNs to extract semantic (robust) features from images

Store them along with their spatial arrangement

Compare images by simultaneously matching features and locations

VGG16 architecture

"Very Deep Convolutional Networks for Large-Scale Image Recognition", Simonyan and Zisserman 2014

Increased depth (16 layers) compared to AlexNet

Smaller convolutional filters (3x3)

Trained on Places205 for scene recognition

Layer conv4_2 for spatial consistency check

We use conv5_2 for quick retrieval of candidates

SSM-VPR (Semantic and Spatial Matching Visual Place Recognition)

Two-stage system

"Spatio-Semantic ConvNet-based Visual Place Recognition" Camara et al. 2019

STAGE 1 Image Filtering

Input: Query image of a place

Process: Fast search of images similar to query in large database of places

Output: N top-ranked candidates

STAGE 2 Spatial Matching

Input: Query + candidates

Process: Semantic and geometric comparison of query and candidate using CNNs

Output: Recognized place (best match)

SSM-VPR

VGG16 CNN pre-trained on Places205 dataset

Image filtering stage:

• Layer conv5_2

• 14x14x512 feature maps

• 16 sliding 7x7x512 cubes per image

• Store into image filtering database (IFDB)

Spatial matching stage

• Layer conv4_2

• 56x56x512 feature maps

• 729 sliding 3x3x512 cubes per image

• Store into spatial matching database (SMDB)

• Also store locations
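The cube counts can be reproduced with simple sliding-window arithmetic. The stride is not stated in the slides; the sketch below assumes a stride of 2 without padding, which is one setting that yields both 16 and 729 cubes, so treat it as an assumption rather than the authors' configuration.

```python
def sliding_positions(map_size, window, stride):
    """Number of window positions along one axis, without padding."""
    return (map_size - window) // stride + 1


def cube_count(map_size, window, stride):
    """Number of window x window x depth cubes over a square feature map."""
    per_axis = sliding_positions(map_size, window, stride)
    return per_axis * per_axis


ASSUMED_STRIDE = 2  # assumption: not given in the slides

# Image filtering stage: 7x7x512 cubes over a 14x14x512 feature map.
filtering_cubes = cube_count(14, 7, ASSUMED_STRIDE)
filtering_vector_length = 7 * 7 * 512

# Spatial matching stage: 3x3x512 cubes over a 56x56x512 feature map.
matching_cubes = cube_count(56, 3, ASSUMED_STRIDE)
matching_vector_length = 3 * 3 * 512
```

Each cube flattens to one vector (25088 values for filtering, 4608 for spatial matching), and in the spatial matching stage the grid position of each cube is kept alongside it.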

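SSM-VPR Pipeline

[Figure: the image retrieval pipeline split into two stages. Image filtering: query image features (f1, f2, …, fn) are matched against the IFDB by exhaustive search to obtain candidate images. Spatial matching: query and candidates are compared using the SMDB.]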

SSM-VPR: Image filtering

Ground truth candidate: 0



SSM-VPR: Spatial matching

1. For each location in candidate, find location of closest match in query

2. Set the pair of locations as anchor points

3. Look at the spatial consistency between the locations of matched pairs of vectors

4. Location consistency: check all cells around candidate anchor point

5. Accumulate consistent matches for all locations in candidate

6. Select candidate with largest score
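The six steps can be illustrated with a simplified sketch. It is not the implementation from Camara et al. 2019: it assumes each image is a dictionary from grid location (row, col) to a feature vector, uses squared Euclidean distance for matching, and treats a neighbouring cell as consistent only if its match sits at the same offset from the anchor in the query. All names and the toy data are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matches(candidate, query):
    """Steps 1 and 2: for each candidate location, the query location of its closest vector."""
    matches = {}
    for location, vector in candidate.items():
        matches[location] = min(
            query, key=lambda q: squared_distance(vector, query[q])
        )
    return matches


def spatial_score(candidate, query, radius=1):
    """Steps 3 to 5: count neighbours whose matches keep the same offset from the anchor."""
    matches = best_matches(candidate, query)
    score = 0
    for (row, col), (q_row, q_col) in matches.items():
        for d_row in range(-radius, radius + 1):
            for d_col in range(-radius, radius + 1):
                if d_row == 0 and d_col == 0:
                    continue
                neighbour = (row + d_row, col + d_col)
                expected = (q_row + d_row, q_col + d_col)
                if matches.get(neighbour) == expected:
                    score += 1
    return score


def recognise(query, candidates):
    """Step 6: the candidate with the largest spatial consistency score."""
    return max(candidates, key=lambda name: spatial_score(candidates[name], query))


# Toy 3x3 grids: one candidate equals the query, the other is mirrored left to right.
query = {(r, c): [float(r), float(c)] for r in range(3) for c in range(3)}
same_place = dict(query)
mirrored = {(r, c): query[(r, 2 - c)] for r in range(3) for c in range(3)}
best = recognise(query, {"same_place": same_place, "mirrored": mirrored})
```

The mirrored candidate contains exactly the same feature vectors as the query, so a purely appearance-based comparison could not separate the two; only the location check does.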

SSM-VPR: Parameter optimization

Berlin Kudamm

Gardens Point

Nordland

Same datasets as in Zaffar et al. 2019

SSM-VPR: Recognition results


$\mathrm{Recall} = \frac{TP}{TP + FN}$

$\mathrm{Precision} = \frac{TP}{TP + FP}$

TP = True Positives

FP = False Positives

FN = False Negatives

[Figure: result plots for the Gardens Point, Kudamm and Nordland datasets]
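The two metrics, recall = TP / (TP + FN) and precision = TP / (TP + FP), can be computed with a few lines. The counts below are made-up numbers for illustration, not results from the slides.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall); a ratio with an empty denominator is reported as 0.0."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical outcome: 8 correct recognitions, 2 wrong ones, 4 missed places.
precision, recall = precision_recall(8, 2, 4)
```

Sweeping the acceptance threshold on the matching score and recomputing both values at each setting produces a precision-recall curve.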

SSM-VPR: Teach-and-Replay navigation

Conclusions

Separating recognition into two stages is a highly successful approach

High-level CNN features are very robust to changes

Considering the spatial location of features is key to high-performance recognition

Substantial improvement over the state of the art

Interesting applications in autonomous navigation
