SIFT Guest Lecture by Jiwon Kim

SIFT

• Guest Lecture by Jiwon Kim
• http://www.cs.washington.edu/homes/jwkim/

SIFT Features and Its Applications

Autostitch Demo

Autostitch

• Fully automatic panorama generation
– Input: set of images
– Output: panorama(s)

• Uses SIFT (Scale-Invariant Feature Transform) to find/align images

1. Solve for homography

2. Find connected sets of images

3. Solve for camera parameters

• New images initialised with rotation, focal length of best matching image

4. Blending the panorama

• Burt & Adelson 1983
– Blend each frequency band over a range proportional to its wavelength

Low frequency (wavelength > 2 pixels)

High frequency (wavelength < 2 pixels)

2-band Blending

Linear Blending

2-band Blending
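The difference between linear and 2-band blending can be illustrated with a minimal one-dimensional sketch. It assumes two already-aligned signals of equal length, uses a box blur as a stand-in for a proper low-pass filter, and the names `box_blur` and `two_band_blend` are illustrative rather than taken from Autostitch.

```python
def box_blur(signal, radius):
    """Low-pass filter: mean over 2*radius+1 samples, clamping at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def two_band_blend(a, b, radius=2):
    """Blend signal a (left) into signal b (right) using two frequency bands."""
    n = len(a)
    low_a, low_b = box_blur(a, radius), box_blur(b, radius)
    # High band is whatever the low-pass filter removed.
    high_a = [x - l for x, l in zip(a, low_a)]
    high_b = [x - l for x, l in zip(b, low_b)]
    out = []
    for i in range(n):
        w = 1.0 - i / (n - 1)  # weight of a: 1 at the left end, 0 at the right
        # Low frequencies: smooth linear blend over the whole overlap.
        low = w * low_a[i] + (1.0 - w) * low_b[i]
        # High frequencies: hard switch to whichever signal has more weight.
        high = high_a[i] if w >= 0.5 else high_b[i]
        out.append(low + high)
    return out
```

Blending the high band with a hard switch keeps fine detail from being averaged into ghosting, while the smooth low-band ramp hides exposure differences.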

So, what is SIFT?

• Scale-Invariant Feature Transform
• David Lowe at UBC
• Scale/rotation invariant
• Currently best known feature descriptor
• Many real-world applications
– Object recognition
– Panorama stitching
– Robot localization
– Video indexing
– …

Example: object recognition

SIFT properties

• Locality: features are local, so robust to occlusion and clutter

• Distinctiveness: individual features can be matched to a large database of objects

• Quantity: many features can be generated for even small objects

• Efficiency: close to real-time performance

SIFT algorithm overview

1. Feature detection
– Detect points that can be repeatably selected under location/scale change

2. Feature description
– Assign orientation to detected feature points
– Construct a descriptor for image patch around each feature point

3. Feature matching

1. Feature detection

• Detect points stable under location/scale change
– Build continuous space (x, y, scale)
– Approximated by multi-scale Difference-of-Gaussian pyramid
– Select maxima/minima in (x, y, scale)
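The selection of maxima/minima in (x, y, scale) can be sketched as a comparison of each sample against its 26 neighbors in a 3x3x3 block. This is a minimal sketch that assumes the Difference-of-Gaussian stack is already computed and stored as nested lists `dog[scale][y][x]`; the function names are illustrative.

```python
def is_scale_space_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or below all 26 neighbors."""
    center = dog[s][y][x]
    neighbors = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return center > max(neighbors) or center < min(neighbors)


def find_extrema(dog):
    """Return (x, y, scale_index) for every interior extremum of the stack."""
    found = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[s]) - 1):
            for x in range(1, len(dog[s][y]) - 1):
                if is_scale_space_extremum(dog, s, y, x):
                    found.append((x, y, s))
    return found
```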


• Localize extrema by fitting a quadratic

1) Sub-pixel/sub-scale interpolation using Taylor expansion

2) Take derivative and set to zero
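The two steps above correspond to the following expressions from Lowe's 2004 paper (listed in the references), where D is the Difference-of-Gaussian function and x = (x, y, scale) is the offset from the sampled extremum. The notation is taken from that paper, not from the slide itself.

```latex
% Second-order Taylor expansion of D around the sampled extremum
D(\mathbf{x}) = D
  + \frac{\partial D}{\partial \mathbf{x}}^{T} \mathbf{x}
  + \frac{1}{2}\, \mathbf{x}^{T} \frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\, \mathbf{x}

% Setting the derivative to zero gives the sub-pixel/sub-scale offset
\hat{\mathbf{x}} = -\left( \frac{\partial^{2} D}{\partial \mathbf{x}^{2}} \right)^{-1}
  \frac{\partial D}{\partial \mathbf{x}}

% Value of D at the interpolated extremum
D(\hat{\mathbf{x}}) = D + \frac{1}{2}\, \frac{\partial D}{\partial \mathbf{x}}^{T} \hat{\mathbf{x}}
```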

1. Feature detection
• Discard low-contrast/edge points

1) Low contrast: discard keypoints with $|D(\hat{x})|$ < threshold

2) Edge points: high contrast in one direction, low in the other; compute principal curvatures from the eigenvalues of the 2x2 Hessian matrix, and limit their ratio
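Both rejection tests can be sketched in a few lines. The default values (contrast threshold 0.03 for intensities in [0, 1], curvature ratio r = 10) come from Lowe's 2004 paper rather than from the slide, and the function names are illustrative. The edge test avoids computing eigenvalues by comparing the trace and determinant of the Hessian.

```python
def passes_contrast_test(d_hat, threshold=0.03):
    """Keep the keypoint only if |D(x_hat)| reaches the contrast threshold."""
    return abs(d_hat) >= threshold


def passes_edge_test(dxx, dyy, dxy, r=10.0):
    """Keep the keypoint only if the principal curvature ratio is below r.

    dxx, dyy, dxy are the second derivatives of D at the keypoint, i.e. the
    entries of the 2x2 Hessian. For eigenvalue ratio r the quantity
    trace^2 / det equals (r + 1)^2 / r, so no eigen-decomposition is needed.
    """
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign (or a flat direction): not a stable peak.
        return False
    return trace * trace / det < (r + 1) ** 2 / r
```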

1. Feature detection
• Example

(a) 233x189 image
(b) 832 DOG extrema
(c) 729 left after peak value threshold
(d) 536 left after testing ratio of principal curvatures

2. Feature description

• Assign orientation to keypoints
– Create histogram of local gradient directions computed at selected scale
– Assign canonical orientation at peak of smoothed histogram

[Figure: orientation histogram over 0 to 2π]
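A minimal sketch of orientation assignment follows. It assumes the patch is a small grid of intensities already sampled at the keypoint's scale, uses 36 bins as in Lowe's 2004 paper, and omits the Gaussian weighting and peak interpolation that the paper also applies. The function name is illustrative.

```python
import math


def dominant_orientation(patch, num_bins=36):
    """Return the canonical orientation (radians, bin center) of a patch."""
    hist = [0.0] * num_bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            # Central-difference gradient at this pixel.
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            bin_index = int(angle / (2 * math.pi) * num_bins) % num_bins
            hist[bin_index] += magnitude  # vote weighted by gradient magnitude
    # Circular smoothing with a small [0.25, 0.5, 0.25] kernel.
    smoothed = [
        0.25 * hist[i - 1] + 0.5 * hist[i] + 0.25 * hist[(i + 1) % num_bins]
        for i in range(num_bins)
    ]
    peak = max(range(num_bins), key=smoothed.__getitem__)
    return (peak + 0.5) * 2 * math.pi / num_bins
```

Rotating the descriptor's coordinate frame by this angle is what makes the later descriptor rotation invariant.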

2. Feature description
• Construct SIFT descriptor
– Create array of orientation histograms
– 8 orientations x 4x4 histogram array = 128 dimensions
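The 4x4x8 layout can be made concrete with a simplified sketch. It assumes an 18x18 intensity patch (so that central differences give a 16x16 gradient field) that has already been rotated to the canonical orientation. It leaves out the Gaussian weighting, trilinear interpolation, and value clamping described in Lowe's paper, so it is a SIFT-like descriptor rather than SIFT itself. The function name is illustrative.

```python
import math


def sift_like_descriptor(patch):
    """Build a 128-D vector: 4x4 spatial cells, 8 orientation bins per cell."""
    desc = [0.0] * 128
    for y in range(1, 17):
        for x in range(1, 17):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            orientation_bin = int(angle / (2 * math.pi) * 8) % 8
            # Each cell covers a 4x4 block of the 16x16 gradient field.
            cell_y, cell_x = (y - 1) // 4, (x - 1) // 4
            desc[(cell_y * 4 + cell_x) * 8 + orientation_bin] += magnitude
    # Normalize to unit length to reduce the effect of contrast changes.
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc] if norm > 0 else desc
```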

2. Feature description
• Advantage over simple correlation
– Gradients less sensitive to illumination change
– Gradients may shift: robust to deformation, viewpoint change

Performance: stability to noise

• Match features after random change in image scale & orientation, with differing levels of image noise

• Find nearest neighbor in database of 30,000 features

Performance: stability to affine change

• Match features after random change in image scale & orientation, with 2% image noise, and affine distortion

• Find nearest neighbor in database of 30,000 features

Performance: distinctiveness

• Vary size of database of features, with 30 degree affine change, 2% image noise

• Measure % correct for single nearest neighbor match

3. Feature matching

• For each feature in A, find nearest neighbor in B

A B

3. Feature matching

• Nearest neighbor search too slow for large database of 128-dimensional data

• Approximate nearest neighbor search:
– Best-bin-first [Beis et al. 97]: modification to k-d tree algorithm
– Use heap data structure to identify bins in order by their distance from query point

• Result: Can give speedup by factor of 1000 while finding nearest neighbor (of interest) 95% of the time
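The idea of visiting bins in order of distance with a heap can be sketched as below. This is a simplified illustration of best-bin-first, not the implementation of Beis et al.: the tree layout, the `max_checks` cutoff value, and all function names are assumptions made for the example.

```python
import heapq
import itertools


def build_kd_tree(points, depth=0):
    """Recursively split on the median, cycling through the coordinates."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_bin_first(tree, query, max_checks=200):
    """Approximate nearest neighbor: stop after roughly max_checks nodes."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(0.0, next(counter), tree)]
    best, best_dist = None, float("inf")
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_dist:
            break  # every remaining branch is at least this far away
        while node is not None:
            checks += 1
            dist = squared_distance(node["point"], query)
            if dist < best_dist:
                best, best_dist = node["point"], dist
            axis = node["axis"]
            diff = query[axis] - node["point"][axis]
            if diff < 0:
                near, far = node["left"], node["right"]
            else:
                near, far = node["right"], node["left"]
            if far is not None:
                # Distance to the splitting plane bounds the far branch.
                heapq.heappush(heap, (diff * diff, next(counter), far))
            node = near
    return best
```

With a large `max_checks` the search is exact; lowering it trades accuracy for speed, which is the trade-off the slide quantifies.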

3. Feature matching
• Reject false matches
– Compare distance of nearest neighbor to second nearest neighbor
– Common features aren’t distinctive, therefore bad
– Threshold of 0.8 on the distance ratio provides excellent separation
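The ratio test can be sketched with a brute-force nearest neighbor search. This minimal sketch assumes descriptors are plain sequences of numbers and uses exhaustive search for clarity; the function names are illustrative.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_with_ratio_test(features_a, features_b, ratio=0.8):
    """Return (index_in_a, index_in_b) pairs that pass the distance ratio test."""
    matches = []
    for i, fa in enumerate(features_a):
        ranked = sorted((euclidean(fa, fb), j) for j, fb in enumerate(features_b))
        if len(ranked) < 2:
            continue  # the test needs a second nearest neighbor
        (d1, j1), (d2, _) = ranked[0], ranked[1]
        # Accept only if the best match is clearly closer than the runner-up.
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches
```

A feature with two almost equally close candidates is ambiguous, so it is dropped rather than matched to either.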

3. Feature matching

• Now, given feature matches…
– Find an object in the scene
– Solve for homography (panorama)
– …

3. Feature matching

• Example: 3D object recognition

3. Feature matching

• 3D object recognition
– Assume affine transform: clusters of size >=3
– Looking for 3 matches out of 3000 that agree on same object and pose: too many outliers for RANSAC or LMS
– Use Hough Transform
• Each match votes for a hypothesis for object ID/pose
• Voting for multiple bins & large bin size allow for error due to similarity approximation
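The voting step can be sketched with a dictionary of pose bins. This is a simplified illustration: each match votes for a single bin rather than multiple neighboring bins, the match record fields (`object_id`, `d_angle`, `scale_ratio`, `dx`, `dy`) and the bin sizes are assumptions made for the example, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def hough_pose_clusters(matches, angle_bin=30.0, location_bin=64.0, min_votes=3):
    """Group matches that agree on object ID and coarse similarity pose.

    Each match is a dict with the predicted pose of the object:
    object_id, d_angle (degrees), scale_ratio, dx and dy (pixels).
    Returns {bin_key: [match indices]} for bins with at least min_votes.
    """
    bins = defaultdict(list)
    for index, m in enumerate(matches):
        key = (
            m["object_id"],
            math.floor((m["d_angle"] % 360.0) / angle_bin),
            math.floor(math.log2(m["scale_ratio"])),  # one bin per factor of 2
            math.floor(m["dx"] / location_bin),
            math.floor(m["dy"] / location_bin),
        )
        bins[key].append(index)
    return {key: votes for key, votes in bins.items() if len(votes) >= min_votes}
```

Clusters that survive the vote threshold are the candidates passed on to the affine pose solution.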

3. Feature matching
• 3D object recognition: solve for pose

– Affine transform of [x,y] to [u,v]:

– Rewrite to solve for transform parameters:
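The equations for these two steps did not survive in the transcript. As given in Lowe's 2004 paper (listed in the references), with symbol names taken from that paper, they are:

```latex
% Affine transform of a model point [x, y] to an image point [u, v]
\begin{bmatrix} u \\ v \end{bmatrix}
=
\begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
+
\begin{bmatrix} t_x \\ t_y \end{bmatrix}

% Rewritten with the transform parameters as the unknown vector;
% each match contributes two rows
\begin{bmatrix}
x & y & 0 & 0 & 1 & 0 \\
0 & 0 & x & y & 0 & 1 \\
  &   & \cdots &   &   &
\end{bmatrix}
\begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{bmatrix}
=
\begin{bmatrix} u \\ v \\ \vdots \end{bmatrix}

% Written as A p = b, the least-squares solution is
\mathbf{p} = \left( A^{T} A \right)^{-1} A^{T} \mathbf{b}
```

There are six unknowns and each match supplies two equations, which is why at least 3 matches are needed.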

3. Feature matching

• 3D object recognition: verify model
1) Discard outliers for pose solution in prev step
2) Perform top-down check for additional features
3) Evaluate probability that match is correct
a) Use Bayesian model, with probability that features would arise by chance if object was not present
b) Takes account of object size in image, textured regions, model feature count in database, accuracy of fit [Lowe 01]

Planar recognition
• Training images

Planar recognition

• Reliably recognized at a rotation of 60° away from the camera

• Affine fit approximates perspective projection

• Only 3 points are needed for recognition

3D object recognition
• Training images

3D object recognition

• Only 3 keys are needed for recognition, so extra keys provide robustness

• Affine model is no longer as accurate

Recognition under occlusion

Illumination invariance

Applications of SIFT

• Object recognition
• Panoramic image stitching
• Robot localization
• Video indexing
• …
• The Office of the Past
– Document tracking and recognition

Location recognition

Robot Localization

Map continuously built over time

Locations of map features in 3D

Sony Aibo

SIFT usage:

Recognize charging station

Communicate with visual cards

Teach object recognition

The Office of the Past
• Paper everywhere

Unify physical and electronic desktops

• Recognize video of paper on physical desktop
– Tracking
– Recognition
– Linking

• Applications
– Find lost documents
– Browse remote desktop
– Find electronic version
– History-based queries

[Figure labels: Video camera, Desktop]

Example input video

Demo – Remote desktop

System overview

[Figure labels: Video camera, Desk, User, Computer]

• Input: video of desk, images from PDF
• Track & recognize documents between desk states at times T and T+1
• Internal representation: scene graph
• Query: Where is my W-2?
• Output: answer

Assumptions

• Document
– Corresponding electronic copy exists
– No duplicates of same document

• Motion
– 3 event types: move/entry/exit
– One document at a time
– Only topmost document can move

Non-assumptions

• Desk need not be initially empty
• Stacks may overlap

Algorithm overview

• Input: frames
• Event detection: before and after frames
• Event interpretation: “A document moved from (x1,y1) to (x2,y2)”
• Document recognition: File1.pdf, File2.pdf, File3.pdf
• Scene graph update: desk before and after

[Figure label: SIFT]

Document tracking example

Motion: (x,y,θ)

[Figure: before and after frames]
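The slides do not say how the motion (x, y, θ) is computed. One common way to recover a rigid 2D motion from matched feature locations is a closed-form least-squares fit, sketched below under that assumption. The function name and the input format (two equal-length lists of matched (x, y) points) are illustrative.

```python
import math


def estimate_rigid_motion(before, after):
    """Least-squares rotation + translation mapping `before` points to `after`.

    Returns (tx, ty, theta) such that after ~= R(theta) * before + (tx, ty).
    """
    n = len(before)
    cx0 = sum(p[0] for p in before) / n
    cy0 = sum(p[1] for p in before) / n
    cx1 = sum(p[0] for p in after) / n
    cy1 = sum(p[1] for p in after) / n
    sum_cos = sum_sin = 0.0
    for (x, y), (u, v) in zip(before, after):
        # Work with coordinates relative to each point set's centroid.
        x, y, u, v = x - cx0, y - cy0, u - cx1, v - cy1
        sum_cos += x * u + y * v
        sum_sin += x * v - y * u
    theta = math.atan2(sum_sin, sum_cos)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Translation carries the rotated first centroid onto the second.
    tx = cx1 - (cos_t * cx0 - sin_t * cy0)
    ty = cy1 - (sin_t * cx0 + cos_t * cy0)
    return tx, ty, theta
```

In practice the matches would first be filtered for outliers, since a single wrong correspondence shifts a least-squares fit.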


• Match against PDF image database

[Figure: File1.pdf, File2.pdf, File3.pdf, File4.pdf, File5.pdf, File6.pdf, …]

Document Recognition
• Performance analysis
– Tested 20 pages against database of 162 pages
– ~200x300 pixels per document for reliable match

[Plot: Recognition Rate vs. Document Resolution; marked values: 300 and 0.9]

Results

• Input video
– ~40 minutes
– 1024x768 @ 15 fps
– 22 documents, 49 events

• Running time
– Video processed offline
– No optimization
– A few hours for entire video

Demo – Paper tracking

Photo sorting example

Demo – Photo sorting

Future work

• Enhance realism
– Handle more realistic desktops
– Real-time performance

• More applications
– Support other document tasks
• E.g., attach reminder, cluster documents
– Beyond documents
• Other 3D desktop objects, books/CD’s

Summary

• SIFT is:
– Scale/rotation invariant local feature
– Highly distinctive
– Robust to occlusion, illumination change, 3D viewpoint change
– Efficient (close to real-time performance)
– Suitable for many useful applications

References

• Distinctive image features from scale-invariant keypoints – David G. Lowe, International Journal of Computer Vision, 60, 2 (2004), pp. 91-110. http://www.cs.ubc.ca/~lowe/keypoints/

• Recognising panoramas – Matthew Brown and David G. Lowe, International Conference on Computer Vision (ICCV 2003), Nice, France (October 2003), pp. 1218-25.

• Video-Based Document Tracking: Unifying Your Physical and Electronic Desktops – Jiwon Kim, Steven M. Seitz and Maneesh Agrawala, ACM Symposium on User Interface Software and Technology (UIST 2004), pp. 99-107. http://grail.cs.washington.edu/projects/office/doc/kim04unifying.pdf
