video descriptors
-
Upload
vasutammina -
Category
Documents
-
view
21 -
download
0
description
Transcript of video descriptors
Performance Evaluation
of
Video Interest Point Detectors
A Thesis submitted in partial fulfillment of the requirements
for the award of the degree
Master of Technology
by
Tammina Siva Naga Srinivasu
Roll No: 124102011
under supervision of
Dr. Prabin Kumar Bora and Dr. Tony Jacob
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati
July 2014
D E C L A R A T I O N
I hereby declare that the work which is being presented in the thesis entitled, “Perfor-
mance Evaluation of Video Interest Point Detectors”, in partial fulfillment of
the requirements for the award of degree of Master of Technology at Indian Institute
of Technology Guwahati, is an authentic record of my own work carried out under the
supervision of Dr. Prabin Kumar Bora and Dr. Tony Jacob and duly listed
other researchers works in the reference section. The contents of this thesis, in full or
in parts, have not been submitted to any other Institute or University for the award of
any degree or diploma.
Tammina Siva Naga Srinivasu
July 2014 Dept. of Electronics & Electrical Engineering,
Guwahati IIT Guwahati,
Guwahati, Assam,
India - 781039.
C E R T I F I C A T E
This is to certify that the work contained in the thesis entitled Performance Evalu-
ation of Video Interest Point Detectors is a bonafide work of Tammina Siva
Naga Srinivasu, Roll No. 124102011, which has been carried out in the department
of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati under
my supervision and this work has not been submitted elsewhere for a Degree.
Dr. Prabin Kumar Bora and Dr. Tony Jacob
July 2014 Dept. of Electronics & Electrical Engineering,
Guwahati IIT Guwahati,
Guwahati, Assam,
India - 781039.
“Difficulties in your life do not come to destroy you, but to help you to realize your
potential and power, let difficulties know that you too are difficult.”
Dr. A. P. J. Abdul Kalam
Abstract
This thesis presents an evaluation and comparison of various spatial and spatio temporal
interest points detectors. The popular spatial interest point detectors such as SUSAN
detector, Harris corner detector, Harris-Laplace corner detector and Scale Invariant
Feature Transform (SIFT) are studied. Their performance is evaluated under affine
transformations and JPEG compression. It is found that SIFT performs better than
other spatial interest point detectors.
Extension of these spatial interest point detectors into spatio temporal domain for videos
like n-SIFT, MoSIFT, ESURF,STIP and LIFT are studied. LIFT detects visually similar
interest points called feature tracks. We propose extension for LIFT which call ELIFT
to detect interest points on feature tacks, by this we obtain interest points with motion.
Performance of the interest point detectors is evaluated under geometric, phototrophic
transforms and MPEG compression. Repeatability and the time of execution are con-
sidered for evaluation. n-SIFT performed better when compared to other detectors but
it has background points. MoSIFT and ELIFT detects lower number of interest points
which are stable compared to the other algorithms.
Acknowledgements
I would like to thank my project supervisors Dr. Prabin Kumar Bora and Dr.
Tony Jacob sincerely for giving me the opportunity to work on the project Perfor-
mance Evaluation of Video Descriptors. I must say that they always came up with new
innovative ideas to help me when I was stuck at some point. I am very thankful to them
for spending their valuable time for clarifying my doubts. I would also like to thank the
Head of the Department and the other faculty members for their kind help in carrying
out this work.I am always indebted to Indian Institute of Technology Guwahati for
giving me a wonderful opportunity to study my in a world class university.
My heartfelt gratitude to my family for being on my side at all times. I will always
remember my beloved friends who made my stay at IIT Guwahati, a pleasant and
memorable period of my life.
Tammina Siva Naga Srinivasu
Indian Institute of Technology Guwahati,
July 2014.
v
Contents
Abstract iv
Acknowledgements v
Contents vi
List of Figures viii
List of Tables ix
1 Introduction 1
1.1 Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scope of Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Image Interest Point Detectors 5
2.1 SUSAN Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Harris Corner Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Harris Laplace Corner Detector . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Scale Invariant Feature Transform . . . . . . . . . . . . . . . . . . . . . . 8
2.4.1 Basic Algorithmic Principle . . . . . . . . . . . . . . . . . . . . . . 8
2.4.2 Key-Point Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.3 Key Point Localization . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.4 Key Point Orientation . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.5 Key Point Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.6 Image Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 Video Interest Point Detectors 20
3.1 n-SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Interest Point Detection . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 n-SIFT Descriptor Computation . . . . . . . . . . . . . . . . . . . 21
3.2 MoSIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Interest Point Detection . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Descriptor Computation . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 STIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
vi
Contents vii
3.3.1 Interest Points in the Spatio-Temporal Domain . . . . . . . . . . . 26
3.3.2 Scale Selection in Space-Time . . . . . . . . . . . . . . . . . . . . . 27
3.4 ESURF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.1 Hessian based Interest Point Detection and Scale Selection . . . . 29
3.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 ELIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5.1 Feature Track Extraction . . . . . . . . . . . . . . . . . . . . . . . 31
3.5.2 Interest Points on Feature Track . . . . . . . . . . . . . . . . . . . 32
3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6.1 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Conclusions and Future Scope 38
4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
List of Figures
2.1 Results of Harris corner detector on (a) Synthetic image (b) Texture image 7
2.2 SIFT DoG Pyramid and Extrema Detection . . . . . . . . . . . . . . . . 11
2.3 SIFT descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Image Matching Using SIFT Features (a) shows matching with 0.3 thresh-old and (b) shows matching with 0.6 threshold . . . . . . . . . . . . . . . 14
2.5 Three Image Classes from SIPI Data set . . . . . . . . . . . . . . . . . . 15
2.6 Average Repeatability with Intensity . . . . . . . . . . . . . . . . . . . . . 16
2.7 Average Repeatability with Rotation . . . . . . . . . . . . . . . . . . . . . 17
2.8 Average Repeatability with Scale . . . . . . . . . . . . . . . . . . . . . . 18
2.9 Average Repeatability with Compression . . . . . . . . . . . . . . . . . . . 19
3.1 n-SIFT Image pyramid and DoG images [1] . . . . . . . . . . . . . . . . . 21
3.2 n-SIFT each of the 4x4x4 regions for Descriptor computation [1]. . . . . 22
3.3 n-SIFT Example: Frames with interest points marked . . . . . . . . . . . 23
3.4 MoSIFT Flow Chart [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 MoSIFT Example: Frames with interest points marked . . . . . . . . . . 26
3.6 STIP Example: Frames with interest points marked . . . . . . . . . . . . 28
3.7 The two types of box filter approximations for the 2 + 1D second orderpartial derivatives in one direction and in two directions [3] . . . . . . . . 30
3.8 ESURF Example: Frames with interest points marked . . . . . . . . . . . 31
3.9 Figure (a) Track with no motion and * Figure (b) Track with motion anddeviation in path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.10 Figure (a) shows the traced features(LIFT) and * Figure (b) shows themodified Interest points(ELIFT) . . . . . . . . . . . . . . . . . . . . . . . 34
3.11 Brightness change Vs Repeatability . . . . . . . . . . . . . . . . . . . . . 36
3.12 Scale change Vs Repeatability . . . . . . . . . . . . . . . . . . . . . . . . 36
3.13 Compression change Vs Repeatability . . . . . . . . . . . . . . . . . . . . 37
viii
List of Tables
3.1 Execution Time Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 37
ix
Chapter 1
Introduction
Digital Videos are important source of information and entertainment. Analyzing video
and extracting information from it is important for applications like event detection in a
surveillance video, piracy video detection and video classification to manage large data
bases . These tasks can be achieved by using local features or interest points. Many
robust techniques are proposed for interest point detection in images can be used in
frame by frame basis treating video as a group of frames. The main difficulty is that the
content in video is in megabits or gigabytes with temporally covariant information which
is redundant. Hence we need to find true spatio temporal interest points. In capturing a
scene video camera is free to rotate, change zooming factor or can move. During Video
capturing illumination can vary and camera and environment can add noise to video
captured. Videos are compressed for Storing and transmission. The challenging task is
to obtain a stable, robust detector which addresses the above problems. This task can
made a bit easier by fixing camera during capturing.
Many methods have been proposed for detection of interest points in video for specific
applications. There is need for comparative study of the interest point detector’s stability
under geometric and photometric variations. Their performance on other applications
need to be evaluated.
1.1 Literature Survey
Local features are image patterns which are different from the immediate neighborhood
[4]. They are marked by changes in image properties such as intensity, colour, and
texture. Examples of local features are points, edge points and blobs. A detector is used
to extract features from image. Two examples are the Harris corner detector [5] which
1
Chapter 1. Introduction 2
detects corners and the SIFT [6] which is a blob detector. It is good if local features
represent meaningful object parts. It requires a high level knowledge of image content
which is not available in early stages. Therefore detectors use underlying image intensity
variations.
A feature detector should have good repeatability, informativeness and accuracy. It
should be invariant or covariant for common image deformations such as rotations,
translations, intensity variations including brightness and contrast and so on. Invari-
ance means that detected features should not change with image transformations. This
property of detector is very important for image processing applications such as im-
age registration, object recognition where images can be taken at different scales, view
point, or in the presence of occlusion. For example, the SIFT [6] descriptor for an inter-
est point will not change with rotations or brightness and contrast variations. Detectors
can also be covariant under certain transforms which means that it can vary location
but it holds it structure. For example, an edge at a particular orientation for an image
changes its orientation direction with rotation of the image. So most of the interest point
detectors are mainly covariant the descriptors derived from them are invariant due to
normalization.
Corners are interest points whose locations are invariant under rotations and translations
of the image. Harris and Stevens [5] detected corner points from an image using a
Gaussian smoothed intensity gradient popularly known as Harris corner detector. Smith
and Brady [7] had considered points which are extreme in their circular neighborhood as
SUSAN corners . Both of these corner detectors are rotation and translation invariant
but vary with scale changes. Harris corner detector was extended by Mikolajczyk and
Schmid [8] to find interest points and their scales, by using the scale space theory [9]
proposed by Lindbergh. In 1999, David Lowe presented a new algorithm called Scale
Invariant Feature Transform or SIFT [6] which was based on scale the space theory
and used the scale space representation of an image to find interest points and describe
them using a robust descriptor based on the local gradient structure around the interest
point. The algorithm was refined by 2004 to account for sub-pixel accuracy in key point
location and also described ways to match the descriptors between images for object
matching, registration and recognition under affine transformations. This thesis deals
with implementation and evaluation of Harris corner detector, Harris-Laplace corner
Detector, SUSAN corner detector and SIFT performances under intensity and geometric
transforms.
Similar to image interest point detector, video interest point detectors detects spatio
temporal interest points present in a video. The existing image detector can be ex-
tended to spatio temporal domain by changing image saliency measure. On Space Time
Chapter 1. Introduction 3
Interest Points (STIP) [10] proposed by Laptev is an extension of the Harris Laplace
corner detector. He extended 2D second moment matrix to a 3D second moment ma-
trix using gradients in spatio temporal domain and he determined spatial and temporal
scales of the detected points by maximizing the output of a normalized spatio tem-
poral Laplacian operator. In a similar fashion, SIFT which very robust technique in
image processing was developed into an arbitrary n-dimensional extension by Cheung
and Hamarneh as n-SIFT [1] towards applications in medical images including 3D MRI
and dynamic 3D+time CT data. These techniques treat time as a third extension of
space while detecting interest points. Other extension to SIFT was proposed by Chen
and Hauptmann [2] which uses SIFT to find spatial interest points which are then fil-
tered in the temporal domain using their optical flow. The motion in temporal domain
is captured by optical flow and interest points with sufficient optical flow are consid-
ered as spatio temporal interest points. Hence this technique is called Motion SIFT
(MoSIFT). SURF [11] is extend to into spatio temporal domain by Tuytelaars [3] by
using 3DHessian as saliency parameter and improved the computation performance by
approximation with box filter on integral video structure. Vasileios and Anastasios [12]
proposed a method LIFT which tracks SIFT features present in each frame to from
a feature tracks which are similarly moving, visually similar local regions. We extend
the LIFT to find interest points on feature tracks by considering points with significant
motion and rejecting most of background points which are redundant.
1.2 Scope of Project
The thesis work mainly deals with the study of image and video interest point detectors
and their evaluation. The image interest point detectors Harris corner detector, SUSAN
detector, Harris-Laplacian corner detector and SIFT are studied in detail and their
performance is evaluated. Video interest point detectors n-SIFT, MoSIFT, ESURF,
STIP and LIFT are studied. We proposed a method to extract interest points on the
tracks of LIFT. All the above video detectors are evaluated with respect to repeatability
and time of execution.
1.3 Thesis Organization
The Thesis is organized as follows:
• Chapter 2 gives detailed study of image interest point detectors and their perfor-
mance.
Chapter 1. Introduction 4
• Chapter 3 presents a overview of video interest point detectors and their perfor-
mance is evaluated.
• Chapter 4 concludes the thesis with a summary of work done and suggestions for
future work.
Chapter 2
Image Interest Point Detectors
Image Interest points which are invariant to rotation and translation as a part of affine
class of transformations along with scaling, brightness and contrast variations are used in
problems like image registration, object recognition etc. The literature on interest point
detection in images is vast and different type of detectors are present. This chapter gives
study and evaluation of corner and blob detectors. SUSAN [7], Harris corner detector
[5] and Harris Laplace Corner Detector [8] are corner detectors and SIFT [6] is blob
detector. SIFT is used for image matching and results are discussed.
2.1 SUSAN Detector
Corners are important local features in images. Generally a corner can be defined as
the intersection of two edges. They are high curvature points and lie in junctions of
different brightness.
SUSAN (Smallest Univalue Segment Assimilating Nucleus) detector [7] is proposed by
SMITH for detecting corners using. The SUSAN detector considers a circular mask
with the center of the mask as the nucleus. Intensity of every pixel is compared with the
nucleus value. Pixels having similar intensity are considered as USAN (Univalve Segment
Assimilating Nucleus) and if the area of USAN is less than half of the total mask area
then the nucleus is considered as a corner point. SUSAN detector’s computational
complexity is low but it is sensitive to image deformations.
5
Chapter 2. Image Interest Point Detectors 6
2.2 Harris Corner Detector
The Harris corner detector [5] is based on the fact that the eigen values of second
moment matrix represent the signal changes in orthogonal directions. The corner has
large variation in intensity in both directions so, they have large Eigen values. For an
image I(x,y), consider a window which is centered at (u, v) and the change E produced
by a shift (x,y) is given by Equation 2.1
E(x, y) ≈∑
u,v
w(u, v)[I(x + u, y + v)− I(x, y)]2 (2.1)
To maximize E(x,y) the term [I(x+ u, y + v)− I(x, y)]2 is expanded in Taylor series
and by neglecting the higher order terms we get
E(x, y) =∑
u,v
u2I2x + 2uvIxIy + v2I2y
Matrix equation form of this equation is
E(x, y) =[x y
]M
[u
v
]
Where
M =
[I2x IxIy
IxIy I2y
]
The derivatives are averaged by a Gaussian kernel of standard deviation σD and the
image is smoothed with a Gaussian mask of standard deviation σ1 to remove noise.
Matrix M is changed as given below
M = g(σ1)σD2
[I2x(x, σD) Ix(x, σD)Iy(y, σD)
Ix(x, σD)Iy(y, σD) I2y (y, σD)
]
Eigen values of the matrix M are required to determine the existence of corner. Instead
of actually calculating eigen values, a new quantity cornerness is defined as
cornerness = det(M)− k ∗ trace(M)2 (2.2)
Where k =0.04. Cornerness is difference of the product of eigen values and the sum of
eigen values. When there is change in both x,y direction the eigen values will be large
and positive. So cornerness should be positive and large for a corner point.
Chapter 2. Image Interest Point Detectors 7
Harris corner detector on texture images
(a)
Harris corner detector on texture images
(b)
Figure 2.1: Results of Harris corner detector on(a) Synthetic image (b) Texture image
Example 2.1. Consider a synthetically generated checkerboard image shown in Fig-
ure 2.1(a) we can see that almost all corners have been detected in that image. The
response of the Harris corner detector on a texture image gives the output given in Fig-
ure 2.1(b). We can notice that it performs better on this image except that it can’t detect
some of the corner with less intensity variation. If we reduce the threshold to detect low
contrast corners there are false detections.
The main draw back of Harris corner detector is that it is variant to scale changes. It is
invariant to intensity changes since it considers difference of intensity values not direct
value.
Chapter 2. Image Interest Point Detectors 8
2.3 Harris Laplace Corner Detector
Corners can be detected by the Harris corner detector [8], but they are not scale invariant.
Harris Laplace Detector provides a scale invariant features. It was developed based on
Harris corner detector principle. This gives a scale adapted local features i.e. interest
point along with the scale of occurrence.
Harris Laplace Corner Detector is developed by Constructing scale space L(x, y;σ) which
is a set of Gaussian smoothed images. The Laplacian detector detects interest points
whose location and characteristic scale is given by the maximum of laplacian operator.
The Laplacian of Gaussian scale space is constructed for finding scale of interest point.
At every scale of image find corner points by second moment matrix and Harris corner
principle (2.2). A corner is said to be a interest point if it attains an extrema in 8
neighbors in current scale and 18 neighbors present in scale above and below the current
scale.
2.4 Scale Invariant Feature Transform
The scale Invariant Feature Transform [SIFT] was developed by David Lowe [6]. The
SIFT interest points are scale invariant and are robust under affine transformations.
SIFT is based on concept of Scale-Space theory developed by Lindberg’s [9] .
2.4.1 Basic Algorithmic Principle
From pyramid of images (images with increasing blur), SIFT detects a series of key
points which are mostly blob like structures and accurately locates their locations and
scales. Then it calculates dominant orientation θref over a region surrounding the key
point. The knowledge of location, scale and orientation is used to find a descriptor
for the key point.Descritpor is computed from the normalized patch around key point.
The descriptor encodes the spatial gradient distribution around the key point by a 128
dimensional vector. This descriptor makes SIFT more robust to rotation, scale and
translation variations. The main steps involved in SIFT as explained by Lowe are:
• Key-Point Detection: A pyramid of Gaussian blurred images of different scales is
formed and the extremas are located in the Difference of Gaussian (DoG) images.
• Key-Point Localization: Each key point is fine-tuned by the Taylor series model
to get subpixel accuracy and they are selected based on their stability.
Chapter 2. Image Interest Point Detectors 9
• Key-Point Orientation Computation: One or more orientations are found for a
key point based on local gradients. This orientation is used in future which makes
SIFT rotation invariant.
• Key-Point Descriptor Computation: From orientation normalized patch around a
key point, descriptor is formed by local image gradients from selected scale and
orientation.
2.4.2 Key-Point Detection
Gaussian Scale Space
Lindberg in his detailed mathematical analysis [9] of the multi-scale representation of
images, showed that the Gaussian kernel was the unique scale-space kernel. Gaussian
scale space is simply a stack of increasing blurred images. This blurring
simulates the loss produced when photograph is taken with different zoom out factor.
For an image I(x,y), the Gaussian blurring image L(x, y;σ) is defined as
L(x, y;σ) = g(x, y;σ) ∗ I (2.3)
Where g(x,y;σ) is the Gaussian kernel which is parameterized by standard deviation σ is
given by g(x, y;σ) = 12πσ2 e
−(x2+y2)/2σ2. the gaussian smoothing operator satisfies semi
closed relation given by Equation 2.4. We can construct Gaussian scale space by using
the Gaussian kernel and its semi closed relation.
g(;σ1 + σ2) = g(;σ1) ∗ g(;σ2) (2.4)
Digital Gaussian Smoothing
the two dimensional Gaussian kernel is obtained by sampling and truncating the Gaus-
sian function given by
g(x, y;σ) = Ke−(x2+y2)/2σ2∀ |x| , |y| ≤ [4σ] (2.5)
Where [.] is the floor function which truncates a fraction to an integer and K is used
for normalization. The scale space of images is constructed by 2-D convolution of image
with digital Gaussian kernel.
In his scale-space theory, Lindbergh [9] proposed scale normalized Laplacian for detecting
scale. Lowe proposed Difference of Gaussian (DoG) function as an approximation for
Chapter 2. Image Interest Point Detectors 10
scale normalized Laplacian and obtained stable key points of the image with scaling.
The approximation of Laplacian can be derived from solution of diffusion Equation 2.6
∂g
∂σ= σ∇2g (2.6)
∂g
∂σ≈g(x, y;Kσ) − g(x, y;σ)
Kσ − σ
From diffusion Equation 2.6, we can approximate above equation as
g(x, y;Kσ) − g(x, y;σ) ≈ (K − 1)xσ2∇2g
The Difference of Gaussian DoG, D(x,y;σ) as given in Equation 2.7 is the difference of
Gaussian smoothed images of standard deviation separated by a constant factor k.
D(x, y;σ) = L(x, y;Kσ)− L(x, y;σ) (2.7)
The scale space is constructed by Lowe by step value k = 21s where s is the number scales
in an octave within which σ of the Gaussian kernel doubles. The scheme of constructing
scale space is shown in Figure 2.2. Key points are extremas in DoG image which are
extracted as follows:
1. Linearly interpolate the input image to twice its size.
2. Pre-smooth the image with a Gaussian filter of σ = 1.6 before giving it was given
input to the first octave level. The input blur is assumed as 0.5. This must be
considered getting initial blur 1.6. Use semi closed property of Gaussian scale
space and calculate the extra blur to be added.
3. Blur the image with a Gaussian filter of variance σ, kσ, k2σ and so on to create
Gaussian smoothed images in the octave, where Lowe recommends s=3. The blur
of in sth level in o octave is given by σ = 1.6 × 2o× s
3
4. Subtract the Gaussian convolved images as shown in Figure 2.2 to get s+2 DoG
images.
5. Detect all extrema in DoG image as shown in Figure 2.2. The extrema can be
either the maxima or minima with respect to its 8-neighbours in the same scale
and 9 neighbors in the scale above and below the present scale.
6. Once the octave is processed, sub-sample the image to half its size in both dimen-
sions by taking every second pixel in row and column.The sub-sampled image is
then processed in the next octave level with steps 3 to 6 and the process continues
all octaves in the pyramid.
Chapter 2. Image Interest Point Detectors 11
Number of octaves o = log2(min(row, col)), where row and col represent the width
and length of the image.
Figure 2.2: SIFT DoG Pyramid and Extrema Detection [6]
2.4.3 Key Point Localization
SIFT aims for a descriptor invariant to affine transformations which is necessary for
image registration, object detection. Sub-pixel accuracy of the key-point is a desirable
feature for addressing such problems. Till now location of the key point is confined to
sampling grid Such coarse location is not suitable to reach full scale and translation
invariance. SIFT refines the position of key point by using Taylor series expansion of
the DoG, D(x,y;σ) given by
D(X) = D +∂DT
∂XX +
1
2XT ∂
2D
∂X2X (2.8)
where D and its derivatives are evaluated at the key-point location X = (x, y, σ)T . Then
the local extremum can be obtained by setting the derivative of D(x) with respect to x
to zero which yields displacement vector from the key-point position, X ,given by
X = −∂2D
∂X2
−1∂D
∂X(2.9)
If X < 0.5 in any of the dimension then X value is updated, else the key point is skipped
and processes is continued with next point. Substituting the value of X in D(X) (2.8)
the function value at extrema is calculated. Lowe used this value to eliminate weak
unstable points. If function values is greater than threshold of 0.3, where the image
intensity is in [0 1], key point is considered as interest point.
Chapter 2. Image Interest Point Detectors 12
Eliminating Points on Edges
The DoG extrema were found to contain strong response along edges which are not good
interest points with respect to rotations or affine transformations. To detect edges, Lowe
used the Hessian matrix (2.10). Since eigenvalues of Hessian matrix correspond to the
principal curvatures, their ratio will be deviating from unity by a large factor if the point
is an edge. If the eigenvalues are having opposite signs, then the point is a saddle point
and hence the point is not an extremum.
H =
[Dxx Dxy
Dxy Dyy
](2.10)
Lowe considered the ratio of largest eigenvalue to smaller one α/β = r. For an edge key
point it has a large value r > 10, Tr(H)2
Det(H) =(α+β)2
(αβ) = (rβ+β)2
(rβ2)= (1+r)2
r
By considering cedge = 10 as threshold we decide whether it is an edge point or not.
If Tr(H)2
Det(H) >(cedge+1)2
(cedge)key point is discarded.
2.4.4 Key Point Orientation
Rotation invariant descriptors can classified into two categories. One them is based on
image properties which are already rotation invariant. The other is that descriptors
which are normalized with respect to reference orientation. SIFT belongs to second
category.
The Key-point orientation corresponds to the dominant gradient orientation in the neigh-
borhood of the key point.Orientation can be found by the following procedure.
1. At each (x, y;σ) , compute the gradient magnitude and orientation using pixel
difference as given below
m(x, y) =
√(L(x+ 1, y;σ) − L(x− 1, y;σ))2 + (L(x, y + 1;σ)− L(x, y − 1;σ))2
θ(x, y) = tan−1(L(x,y+1;σ)−L(x,y−1;σ)L(x+1,y;σ)−L(x−1,y;σ) )
(2.11)
Weigh the gradient magnitudes by a Gaussian window with σ = 1.6 times the scale
of the key point. This will give more importance to the gradients closer to the key
point.
2. Compute an orientation histogram of 36 bins covering the 360 degrees of gradi-
ent orientations with each sample contributing its gradient magnitude to the bin
corresponding to its gradient orientation bin.
Chapter 2. Image Interest Point Detectors 13
3. Fit a parabola to the 3 bins with the highest values to interpolate the peak position
and thus determine the key point orientation.If there are any bins with 80% or
more magnitude of the highest value bin, then they should also be used to create
new key points at the same (x, y, σ), but with different orientations corresponding
to these bins.
Thus the key point(x, y, σ, θ) is localized and now the descriptor for this key point has
to be computed from the gradients.
2.4.5 Key Point Descriptor
The local patch of 16x16 is taken in L(x, y, σ) around the key point. Split the region into
sixteen 4x4 regions and in each region, an 8 bin histogram is computed by the same way
as the orientation histogram with each sample contributing its gradient magnitude to
the corresponding orientation bin as shown in Figure 2.3. The sixteen 8-bin histograms
are then concatenated to get the 128-bin key point descriptor.In addition to these steps,
Lowe introduced 2 additional steps to minimize the effects of lighting on the descriptor.
Figure 2.3: SIFT descriptor [6]
1. Normalize the descriptor to unit length. This reduces effect of contrast changes
which adds a constant value to the image gradients in the neighborhood.
2. Threshold any bin value above 0.2 to 0.2 and normalize the descriptor again. This
reduces the effect of directional illumination changes which can cause large gradient
Chapter 2. Image Interest Point Detectors 14
magnitude changes in particular direction. By limiting the bin values to 0.2, effect
of large gradient values on the descriptor is reduced.
Thus each SIFT key point (x, y, σ, θ) is described by 128-bin histogram descriptor.
2.4.6 Image Matching
SIFT has many applications like object recognition, image registration, and image
matching of them we considered image matching. It has been done by following steps
1. Compute SIFT descriptors for both images
2. Pair the similar key points of two images with Euclidian distance between the
descriptors of two key points is below a threshold. Lowe suggested 0.6 as threshold.
3. If ratio of distance of second best match to first best match is more than 80 then
key point is discarded.
sift features matching under scale variation with 0.3 as threshold
(a)
sift features matching under scale variation with 0.6 as threshold
(b)
Figure 2.4: Image Matching Using SIFT Features(a) shows matching with 0.3 threshold and (b) shows matching with 0.6 threshold
Example 2.2. Two images with different scales are taken and applied for matching
algorithm. SIFT detected 1022 and 939 key points in two images. Out of which for 0.6
as threshold we got 405 points as matched key points between the two images it is shown
in Figure 2.4(b). If threshold is 0.3, it showed 122 points only. It shown in Figure 2.4(a).
Depending on the application and nature of images we can vary threshold to get better
performance in matching.
Chapter 2. Image Interest Point Detectors 15
2.5 Evaluation
This section deals with the performance evaluation of image interest point detectors.
We considered repeatability as performance factor, which is used to test the detector’s
performance in image deformations like compression, intensity variation, scale changes
and rotation. The SUSAN detector, the Harris corner detector, the Harris-Laplace
corner detector and the SIFT are evaluated. Since only SIFT has descriptor, only
interest point detection part of the algorithms are compared. The images used for the
evaluation were taken from the SIPI [13] data set. It contains three classes of images ,
namely textures, aerials and miscellaneous images. One image from each class is shown
in Figure 2.5. This offers a wider range of data to test the algorithms.
Figure 2.5: Three Image Classes from SIPI Data set
Repeatability Criterion
Repeatability Criterion test whether the detector is invariant to changes of imaging
conditions and data storage. It is defined by Hendaoui and Abdellaoui in [14] as Equa-
tion 2.12
rdi (Ii, Idi ) =
C(Ii, Idi )
mean(mi,mdi )
(2.12)
Where Ii, Idi are original image and deformed image respectively. C(Ii, I
di ) is num-
ber of matched interest points and mi, mdi are the number of interest points in Ii, I
di
respectively. For performance evaluation , six images from each class of the data set
were considered and the repeatability was averaged over the six images from each class
according Equation 2.13.
average repeatability, rdavg =
6∑i=1
rdi
6(2.13)
Chapter 2. Image Interest Point Detectors 16
Intensity Variations
The overall brightness of the image is varied from -40 to 40 by using relation
Inew = Iimage ± d and image intensity is limited to [0 250]. The interest point positions
ideally should be unchanged from the original image under intensity variations. But
as shown from the plot, repeatability of the detectors falls with changes in intensity.
Gradients being the difference of pixel intensities should ideally be invariant under ad-
ditive intensity variations. However pixel intensities close to white (255) or dark (0)
may be lost in the process and hence the neighborhood gradients get affected. With
increasing additive values, more pixels become white or dark and so detectors perfor-
mance comes down. All detectors response is affected by brightness variations in all
three image classes. SUSAN gives a better response since it considers a 7 circular neigh-
borhood to detect a point and majority the pixels turning into black or white due to
intensity variation is less often. SIFT and SUSAN response is better than Harris and
Harris Laplace corner detector response. Results of intensity variation are shown in
Figure 2.6(a) for aerial images ,Figure 2.6(c) for texture images and Figure 2.6(b) for
miscellaneous images.
−40 −30 −20 −10 0 10 20 30 400
10
20
30
40
50
60
70
80
90
100brightness variation of AERIAL images
brightness variation
Ave
rage
Rep
eata
bilit
y
SIFTHrs−lpHarrissusan
(a)
−40 −30 −20 −10 0 10 20 30 400
10
20
30
40
50
60
70
80
90
100brightness variation of MISC images
brightness variation
Ave
rage
Rep
eata
bilit
y
SIFTHars−lptHarisSUSAN
(b)
−40 −30 −20 −10 0 10 20 30 4010
20
30
40
50
60
70
80
90
100brightness variation of TEXTURE images
brightness variation
Ave
rage
Rep
eata
bilit
y
SIFThrs−lpHarrissusan
(c)
Figure 2.6: Average Repeatability with Intensity
Chapter 2. Image Interest Point Detectors 17
Image Rotations
The images and their rotated versions have been taken from SIPI [13] Image database.
There are texture images with rotations spanning from 0 to 200 degrees. SIFT has shown
better repeatability with descriptor since we normalize the orientation of interest point
before calculating the descriptor which gives rotation invariance to it. In the absence of
descriptor SIFT has shown less repeatability and it is almost constant for a any angle.
Harris and Harris Laplacian detectors show almost no repeatability since they consider
1D gradients for detection which are sensitive to rotation. SUSAN detector has shown a
better repeatability than Harris since it considers a circular neighborhood for detection,
rotation of image mostly doesn’t change circular neighborhood. Results are shown in
Figure 2.7
−80 −60 −40 −20 0 20 40 60 800
10
20
30
40
50
60
70
80
90
100rotation variation
rotation variation
Ave
rage
Rep
eata
bilit
y
SIFTSUSAN
Figure 2.7: Average Repeatability with Rotation
Scale Variations
The images and their scaled versions from 50% to 200%, are used to evaluate detectors
response to scaling. It may be noted that SIFT and Harris Laplacian detectors account
for scale invariance while SUSAN and Harris corners are not scale invariant. All detectors
show poor repeatability with scale except SIFT. Harris-Laplace has a better performance
than Harris but lower than SIFT. Results of scale variation are shown in Figure 2.8(a)
for aerial images , Figure 2.8(b) for miscellaneous images and Figure 2.8(c) for texture
images.
Compression Variations
Images were compressed in JPEG format in a scale of [1 99] where scale represents
inverse of compression ratio. As we performed lossy compression it changes values of
pixels which in tern changes the value of gradients. Hence every detector is expected
to show decrease in repeatability as compression increases. Detectors performed better
in compression of aerial and texture images which have redundant information. SIFT
Chapter 2. Image Interest Point Detectors 18
0 0.5 1 1.5 2 2.5 3 3.5 40
10
20
30
40
50
60
70
80
90
100scale variation of AERIAL images
scale variation
Ave
rage
Rep
eata
bilit
y
SIFTHarrs−laptHarrisSUSAN
(a)
0 0.5 1 1.5 2 2.5 3 3.5 40
10
20
30
40
50
60
70
80
90
100scale variation of misc images
scale variation
Ave
rage
Rep
eata
bilit
y
SIFTHaris−laptHarrisSUSAN
(b)
0 0.5 1 1.5 2 2.5 3 3.5 40
10
20
30
40
50
60
70
80
90
100scale variation of texture images
scale variation
Ave
rage
Rep
eata
bilit
y
SIFTHaris−laptHarrisSUSAN
(c)
Figure 2.8: Average Repeatability with Scale
and Harris Laplace detectors performed better than others as they consider scale space
of images so they detect interest points in blurred or approximated scales also.Results
of compression variation are shown in Figure 2.9(a) for aerial images, Figure 2.9(b) for
miscellaneous images and Figure 2.9(c) for texture images.
2.6 Conclusion
The investigations in this chapter shows that SIFT exhibits better repeatability than
other detectors in most image deformations. In scale and rotation changes, SIFT also
has poor repeatability. The descriptor is responsible for rotation invariance of SIFT,
with it SIFT became more popular. The number of interest points detected is more for
SIFT. The SUSAN detector has better response to rotation. Harris Laplace detector
has better response than Harris corner detector in scale variations.
Chapter 2. Image Interest Point Detectors 19
0 20 40 60 80 1000
10
20
30
40
50
60
70
80
90
100compression of aerial image
compression scale
aver
age
repe
atab
ility
SIFTHars−lapHarrisSUSAN
(a)
0 20 40 60 80 1000
10
20
30
40
50
60
70
80
90
100compression of misc image
compression scale
aver
age
repe
atab
ility
SIFTHaris−lapHarrisSUSAN
(b)
0 20 40 60 80 1000
10
20
30
40
50
60
70
80
90
100compression of texture image
compression scale
aver
age
repe
atab
ility
SIFTHars−lapHarrisSUSAN
(c)
Figure 2.9: Average Repeatability with Compression
Chapter 3
Video Interest Point Detectors
A number of video interest point detectors have been proposed for detection of spatio
temporal interest points. some use robust images detectors frame by frame basis for
applications like duplicate video detection [15] and video stitching [16]. These are not
true video descriptors as they do not consider time dimension which is very critical
for videos. Image detectors have been extended for videos in many different forms.
SIFT [6] was extended to n-SIFT [1] by considering time gradients for saliency measure.
MoSIFT [2] considers optical flow in time domain and SIFT for spatial domain. STIP
[10] is extension of Harris Laplacian detector [8] which use 3D gradients. ESURF [3] is
direct extension of SURF to videos which use integral video concept and box filters. The
LIFT [12] is trajectory based techniques which utilizes the matching image descriptors
between frames. Local Invariant Feature Tracks (LIFT) form another class of video
which gives similarly moving and visibility moving features. we propose an extension of
LIFT to ELIFT by picking interest points on feature tracks. This chapter deals with
study of all these video descriptors and gives a comparative study of their performances
under deformations like scaling,change in brightness and compression. Video alignment
was done by using these descriptors and results were discussed.
3.1 n-SIFT
n-SIFT [1] was proposed by Cheung and Ghassan Hamarneh for the detection and
matching of interest points in medical images which includes 3D MRI images as well as
3D+time CT images. It was direct extension of SIFT from 2D to arbitrary nD images.
20
Chapter 3. Video Interest Point Detectors 21
Figure 3.1: n-SIFT Image pyramid and DoG images [1]
3.1.1 Interest Point Detection
The nD image is progressively convolved with nD Gaussian and sub-sampled for next
octave. Difference of Gaussian (DoG), is formed by difference between successive levels
in an octave. Each points is an extrema when compared to neighbors in n-dimensions
it is a interest point. The interest point detection algorithm as described by Cheung [1]
is given in Algorithm 3.1.
The pyramid construction is similar to SIFT and is shown in Figure 3.1. Steps shown in
Algorithm 3.1. is described as follows: We convolve nD input images with nD gaussian
of variance σ, kσ, k2σ and so on to create s+3 Gaussian smoothed images in an octave.
Their difference give s+2 DoG images. Interest points are the extrema in DoG images.
A voxel pj(x1, x2, ...xn) is defined as local extrema iff it satisfies the conditions
|pj(x1, x2, ...xn)| ≥ |pj+i0(x1+i1, x2+i2 , ...xn+in)| or
|pj(x1, x2, ...xn)| ≤ |pj+i0(x1+i1, x2+i2 , ...xn+in)|
∀i0 = −1, 0, 1,∀i1 = −1, 0, 1∀i2 = −1, 0, 1.....and∀in = −1, 0, 1
(3.1)
Subsample the image by half in all dimensions and is input to next octave. For a video
number of octaves is o = log2(min(m,n, t)) where m,n are frame dimensions and t is
number of frames. However, it has been noted that no meaningful interest points are
detected by n-SIFT after 3-4 octaves. So we take the number of octaves as min(4, o).
Initial blur is taken as 1.6 and s=3 as suggested by lowe [2].
3.1.2 n-SIFT Descriptor Computation
The n-SIFT uses the nD gradients to create 25n−3 dimensional descriptor for every
feature point which is similar to the descriptor of SIFT. The steps followed to calculate
Chapter 3. Video Interest Point Detectors 22
Algorithm 1 n-SIFT Interest Point Detection [1]
1: I0 = I2: F = φ3: for i=0 to maxImageLevel do4: Ii,j=GaussianFilter
(Ii, k
jσ)Gaussian Filter Image Ii with sigma =kjσ
5: dIi,j = Ii,j − Ii,j−1
6: end for7: for j=1 to S+2 do8: PI=Find Extrema(dIi,j−1, dIi,j , dIi,j+1)9: for p∈PI do
10: F=F∪GenegateFeatures(p,Ii)11: end for12: Ii+1=scale(GaussianFilter((Ii, σ), 0.5
i))13: end for14: return F
Figure 3.2: n-SIFT each of the 4x4x4 regions for Descriptor computation [1].
the descriptor are as follows:
1. Select a 16n hypercube around the interest point as shown in Figure 3.2. Then
calculate nD gradients and express them in terms of the magnitude and the orien-
tation direction.
2. The magnitudes of gradients are weighted by a Gaussian centered at the interest
point location, it gives more weight to pixels near to the interest point.
3. Compute the (n-1) orientation bin histogram with each voxel gradient magnitude
added to the bin corresponding to its orientation.
4. Find the bin with highest value. This gives dominant orientation and it is sub-
tracted from neighboring voxel orientation to make it rotation invariant.
5. Divide the hypercube into 4n sub parts as shown in Figure 3.2, each part is de-
scribed by 8 bins for each of the (n-1) directions. Thus each sub region is rep-
resented by 8n−1 bins and in total, 4n8n−1 = 25n−3 dimensional feature vector is
created.
Chapter 3. Video Interest Point Detectors 23
Figure 3.3: n-SIFT Example: Frames with interest points marked
6. Normalize the feature vector as in the case of SIFT to account for scaling from
Brightness variations.
Example 3.1. We applied n-SIFT algorithm on a sample of walking sequence. As shown
in Figure 3.3 the algorithm detects points around head, hands and legs where there is
motion. As seen from Figure 3.3, the algorithm detects interest points with no motion
also intensity variations in between the frames is the reason for such redundant points.
The most noticeable trait for n-SIFT is that the feature detection detects a large number
of spatial interest points without motion. Another problem detected with n-SIFT is its
large memory usage. Since it treats a video as 3D image and builds octaves from it, it
requires a large amount of memory.
3.2 MoSIFT
The MoSIFT also known as Motion-SIFT [2] was proposed by Chen and Hauptmann to
detect interest points which are used to describe motion and action recognition in videos.
Chapter 3. Video Interest Point Detectors 24
The algorithm unlike the previous algorithms use optical flow for time motion instead
of temporal gradient. A detail view of this method is given in subsequent sections.
3.2.1 Interest Point Detection
MoSIFT uses SIFT for the detection of interest points in spatial domain on frame by
frame basis and optical flow is used for temporal domain. the flow chart of this algorithm
is given in Figure 3.4. We calculate interest points in spatial domain by using the SIFT
by considering each frame at a time. We construct Gaussian pyramid of octaves and
then taking their difference to get DoG. The interest points are the Extremas in DoG
as explained in Chapter 2. These interest points must be refined to get actual MoSIFT
interest points. The optical flow is considered as saliency measure for the time dimension.
Figure 3.4: MoSIFT Flow Chart [2].
The optical flow [17] is used to find motion interest points in the video. Optical flow
is based on brightness and smoothness. Let the image brightness at point (x,y) in an
image plane at time t is denoted by I(x,y,t). The brightness constraint assumes that the
brightness of particular point in the pattern is constant, which gives following:
dIdt = 0
⇒ ∂I∂x
∂x∂t +
∂I∂y
∂y∂t +
∂I∂t = 0
Ixux + Iyvy + It = 0
(3.2)
Chapter 3. Video Interest Point Detectors 25
where vx,vy are the optical flow velocities in x and y directions in constitutive frames.
There are many techniques for solving this equation. Here, the least-squares method
proposed by Lucas-Kanade [17] is used.
Figure 3.4 shows the block diagram to compute optical flow between the successive
frames. Consider a pair of pyramids from the a successive frames to compute the optical
flow at the same image pyramid level at which the SIFT interest point is detected. The
magnitude of optical flow value at the SIFT interest point is above the threshold it
is considered as spatio-temporal interest point in video. We can match SIFT interest
points between frames and apply optical flow threshold on the matched points. This
reduces number of interest points and also their stability across the frames.
3.2.2 Descriptor Computation
MoSIFT uses the DoG and optical flow. So it is easier to acquire both spatial gradients
and optical flow. MoSIFT uses the histogram of gradient (HoG) and also the histogram
of optical flow (HoF) for describing an interest point. MoSIFT descriptor is obtained
by concatenating HoG and HoF to get a 256 bin vector.
SIFT descriptors is computed as described in Chapter2. For describing motion around
the interest point, the descriptor is computed from the local optical flow in the same
way SIFT descriptor is computed from the image gradients. The dominant orientation
normalization is not present here since the actual direction of motion is important. The
gradient optical flow has also both direction and magnitude of motion. Take a local
patch of 16x16 around the interest point and divide it into 4x4 patches. Each of 4x4
patch is described with 8 bin orientation histogram. Each optical flow value contributes
to the bin of its orientation direction with weight given by its magnitude. The sixteen
8-bin histograms are concatenated into a 128-bin optical flow histogram.
Example 3.2. We applied MoSIFT algorithm on a sample of walking sequence . As
shown in Figure 3.5 the algorithm detects points around head, hands and legs where
there is motion between frames, along with some points near them were detected due to
Gaussian weighing in the neighborhood .
However the main noticeable trait of this technique is that the uniform motion points
were captured in multiple frames which is redundant and the quantity of interest points
detected is less .
Chapter 3. Video Interest Point Detectors 26
Frame 1 Frame 2
Frame 3 Frame 4
Figure 3.5: MoSIFT Example: Frames with interest points marked
3.3 STIP
3.3.1 Interest Points in the Spatio-Temporal Domain
The Space-Time Interest Points (STIP) was developed by Laptev [10] using
3D gradient structure along with the scale space representation to find interest points
in video. The video f(x, y, t) is convolved with 3D Gaussian g(.;σ2l , τ2l ) to give the scale
space representation:
L(x, y;σ2l , τ2l ) = g(.;σ2l , τ
2l ) ∗ f
where
g(.;σ2l , τ2l ) =
1√(2π)3σl4τl2
e−(x2+y2)
2σl2 −
t2
2τl2
Similar to the Harris corner detector we consider the second moment matrix given by
Equation 3.3. First spatial and temporal gradients are smoothed by Gaussian function.
Chapter 3. Video Interest Point Detectors 27
The second moment matrix is given by
µ = g(;σi2, τi
2)
L2x LxLy LxLt
LxLy L2y LyLt
LxLt LyLt L2t
(3.3)
Where Lx,Ly,Lt are the first order derivatives of L(x, y;σl2, τl
2) and σi, τi are the inte-
gration scales related to σl, τl by σi = sσl, τi = sτl where s=2 is the preferred value.
Interest points in video are the one which have large eigen values of the matrix µ. Instead
of finding eigen values to find the interest point. we can apply indirect method followed
by Harris to detect spatial interest points. The Harris corner function for videos is
defined as Equation 3.4
H = det (µ)− k ∗ trace3 (µ)
= λ1λ2λ3 − k(λ1 + λ2 + λ3)3
(3.4)
With k=0.005 and if H is greater than a positive threshold, we say it is a spatio temporal
interest point. The spatial and temporal scales of the interest detected are important
for capturing events with different spatial and temporal scales.
3.3.2 Scale Selection in Space-Time
Interest points are determined as the maxima of scale-normalized Laplacian defined by
∇2normL = σ2τ
12 (Lxx + Lyy) + στ
32Ltt (3.5)
Hence the interest points are the ones which attain maximum in normalized Lapla-
cian Equation 3.5 corresponding to Harris corner function H given by Equation 3.4.The
processes to find such points is given in Algorithm 2.
Corner points are found in spatio temporal scale space by using Equation 3.4 and these
points are tracked in scale space for maxima of ∇2normL.
Example 3.3. We applied STIP algorithm on a sample of walking sequence. As shown
in Figure 3.6, algorithm detects points around head, hands and legs where there are
motion between frames, along with some points near them. In addition to those points
there other background pointa with no motion are being detected. This is due to intensity
variation between frames.
Chapter 3. Video Interest Point Detectors 28
Algorithm 2 Detection Scale adapted Spatio temporal Interest Points
Require: Video f and spatial scales σ2l = σ2l,1, σ2l,2....σ
2l,n and temporal scales
τ2l = τ2l,1, τ2l,2....τ
2l,n
1: Construct scale space for initial spatial and temporal scales and in integration scales.
2: Find interest points pj =(xj, yj , tj , σ
2l,j, τ
2l,j
), for j = 1, 2, ...., N over scale space
using maxima of Equation 3.4.3: for each interest point j=1 to N do4: Compute ∇2
normL by using Equation 3.5 in present scale σ2l,j, τ2l,j and scale below
and above the present scale.5: Chose the scale which maximizes ∇2
normL6: if σ2i,j 6= σ2i,jorτ
2i,j 6= τ2i,j then
7: Redirect interest point pi to pj =(xj , yj, tj , σ
2l,j, τ
2l,j
)where τ2l,j,σ
2l,j are the
adapted scales and(xj , yj, tj
)is new position of interest point close to (xj, yj , tj)
8: set pj = pj and got o step 49: end if
10: end for
Figure 3.6: STIP Example: Frames with interest points marked
Chapter 3. Video Interest Point Detectors 29
There are cases where spatial corners have also been detected as spatio temporal interest
points even though there is no motion around the point. This could be attributed to
intensity variations between the frames which can cause a spatial interest point with
high variation in both spatial domains to become a spatio temporal interest point with
high gradient in temporal domain resulting form brightness variation or noise rather
than any motion or event of interest around the same. Points surrounding corner are
also being detected due to gaussian blurring of different scales.
3.4 ESURF
Speeded Up Robust Features (SURF) [11] is extended to detect spatio-temporal interest
points by Willems and Tuytelaars in [3]. It uses 3D Hessian as the saliency measure.
Box filters and integral image concepts in SURF are extended to integral video to get
better execution performance.
3.4.1 Hessian based Interest Point Detection and Scale Selection
Spatio temporal interest points were detected by using the 3D Hessian matrix :
H(.;σ2, τ2) =
Lxx Lxy Lxt
Lyx Lyy Lyt
Ltx Lty Ltt
If det(H) > 0 it is treated as interest point. This doest imply that all eigen values are
positive as in 2D case, so in addition to blobs saddle points are also being detected.
Scale Selection
We use γ normalization to find correct scales σ0 and τ0 at the center of Gaussian blob.At
the center of blob only first term in determinant exists and remaining vanish. Thus
Lγnormxx Lγnorm
yy Lγnormtt = σ5τ
52LxxLyyLtt (3.6)
For scale selection using Equation 3.6 we need to optimize two parameters σ and τ .
Iterative method is used find the scales at which det(H) attains maxima over the scales.
3.4.2 Implementation Details
Integral Video SURF uses integral image which simplifies computation of sum of
values in a rectangular region to addition of 4 terms. similarly ESURF uses integral
Chapter 3. Video Interest Point Detectors 30
video concept to compute sum of values within rectangular volume using 8 additions.
For a video of MxNxT an integral video at location (x,y,t) is equal to sum of all pixel
values over rectangular region spanned by (0,0) , (x,y) and summed over all frames
[0,t].For a video V, integral video IV is defined as Equation 3.7.
IV (x, y, t) =
t∑
i=1
y∑
j=1
x∑
k=1
V (i,j, k) (3.7)
ESURF uses determinant of 3D Hessian matrix as saliency measure, which have second
order derivatives in spatio temporal domain Dxx,Dyy ,Dtt,Dxt,Dxy,Dyt. They are com-
puted by rotated versions of box filters shown in Figure 3.7(a) and Figure 3.7(b) with 8
additions by using integral video.
(a) (b)
Figure 3.7: The two types of box filter approximations for the 2 + 1D second orderpartial derivatives in one direction and in two directions [3]
We build a pyramid of 3 octaves each octave consisting of 5 scales and ratio of scaling
between consecutive scales is 1.2 . This task can be parallelized due to box filter and
integral video each scale is independent of other. Determinant Hessian matrix is com-
puted over each scale in temporal and spatial domain. Combination of spatial and
temporal octaves oσandoτ results in pair of scales (σi, τi) gives a cube structure with
Hessian determinants. once cube is filled apply non maximum suppression to obtain
extrema in five spatio temporal scale space(x, y, t, σ, τ). The points at which extrema
occurred is the location of interest point and scale of interest point is the scale at which
it is detected .
Example 3.4. We applied ESURF algorithm on a sample of walking sequence. As
shown in Figure 3.8 algorithm detects points around head and jacket where there is
motion between frames, along with some points in the background.
ESURF detects blobs which are both in motion and are in background which gives
redundant information for some of applications.This algorithm execution time is less
Chapter 3. Video Interest Point Detectors 31
Figure 3.8: ESURF Example: Frames with interest points marked
due to the integral video concept and approximation of derivatives with box filters .
since each octave is independent of others it can be paralleled in execution which is not
possible for all other detector which work in hierarchial way.
3.5 ELIFT
Local Invariant Feature Tracks (LIFT) [12] was proposed by Mezaris,Ioannis and Dimou.
Feature tracks are the set of visually similar interest points in video having temporal
continuity.So, tracks are set point which are visually similar and similarly moving points.
We extended this notion of tracks to get interest points in video of fixed camera by
considering magnitude and direction of motion.
3.5.1 Feature Track Extraction
Let S be video of T frames, S = {It}Tt=1. We apply SIFT [6] on each frame It to get
feature points and feature descriptors Φ = {φm}Mm=1 where M is number of detected
Chapter 3. Video Interest Point Detectors 32
features and the descriptor φm is defined as φm = [φxm, φym, φdm]. φxm, φ
ym represents the
location of the interest point and φdm describes the interest point with 128 bin vector.
The interest point φm from It can have temporal correspondence with the point in It−1.
It is found by a local search in the neighbors of interest point by taking a square patch
of dimension 2δ + 1. A point φn ∈ Φt−1 is said to be tracked in It as φm if it satisfy
Equation 3.8
|φxm − φxn| ≤ δ,
|φym − φyn| ≤ δ,
d(φdm, φ
dn
)≤ dmin
(3.8)
δ = 7 and dmin = 0.6 are the values taken for evaluation.If there are more than one point
being matched to current point, we consider the one with having less distance. When
such an interest point φn exists, the interest point φm ∈ Φt is appended to the track
where the φn is present else φm is considered as the first element of the new feature track.
A set of feature tracks Ψ = {ψk}Kk=1 is formed by finding the temporal correspondences of
all interest points in all frames of S. The feature track ψk is defined as ψk = [ψxk , ψ
yk , ψ
dk],
where ψdk is average descriptor for feature track which is obtained by element wise
averaging of descriptors of elements of feature track and ψxk = [ψx,tk1
k , ψx,tk1+1k , ....ψx,tk2
k ]
, tk1 < tk2. ψyk is defined in the similar way as ψx
k . The trajectory of the feature is
ξk = [ψxk , ψ
yk ]. If the length of trajectory, tk2 − tk1, is more than 5 it is considered as
valid else it is discarded.
3.5.2 Interest Points on Feature Track
Interest points are the points on feature track with special features like points where track
changes its direction or motion. Select a feature track ψk with trajectory ξk. Algorithm
3 shows shows steps followed to detect interest points on feature track. Tracks with
average displacement greater than threshold are only considered for detection, because
background points which are stationary will have no motion and are not truly spatio
temporal interest points as shown in Figure 3.9(a). Points which have nonuniform
displacement are interest points on tracks and they become local maxima or minima
in track.The points where a track changes it direction or where it deviates from normal
path more than a threshold are said to be interest points as shown in Figure 3.9(b).
Example 3.5. We applied ELIFT algorithm on a sample of walking sequence. As shown
in Figure 3.10(b) algorithm detects points around head and jacket where there is motion
between frames, along with some points in the background due to round off errors.Points
which were tracked are shown in Figure 3.10(a) There are 1782 feature tracks with 49577
tracked points which were reduced to 2046 points.
Chapter 3. Video Interest Point Detectors 33
010
2030
40
93
93.5
94
94.5
95286
286.5
287
287.5
288
y
tx
(a)
4950
5152
5354
95
95.5
96
96.5
9798
99
100
101
102
103
(b)
Figure 3.9: Figure (a) Track with no motion andFigure (b) Track with motion and deviation in path
Algorithm 3 Detection Interest Points on Feature Tracks
Require: Feature tracks ΨEnsure: Ψ 6= φ1: Interest points P = φ2: for Feature Track ψk=1 to K do3: Calculate displacement in X and Y directions xd and yd4: Calculate average displacement xdavg and ydavg5: if average displacement xdavg > 1 && ydavg > 1 then
6: calculate direction of displacement d = tan−1(ydxd
)
7: for Every point pi ∈ ψk do8: if pi is local maxima or minima in x or y then9: P = P ∪ pi
10: break;11: end if12: if Direction di is opposite to di−1 then13: P = P ∪ pi14: break;15: end if16: if |di−di−1|> 40 then17: P = P ∪ pi18: end if19: end for20: end if21: end for22: RETURN P
The main advantage of this algorithm is that it detects points with motion without
optical flow which is computationally expensive. There are some background points
still persisting. This is due jumping of features between tracks which has arisen due
to distance between them spatially is very small. They have almost same 16x16 patch
considered for descriptor computation in SIFT. Their distance between their descriptors
Chapter 3. Video Interest Point Detectors 34
(a)
(b)
Figure 3.10: Figure (a) shows the traced features(LIFT) andFigure (b) shows the modified Interest points(ELIFT)
Chapter 3. Video Interest Point Detectors 35
is very small. This problem can be addressed by increasing dmin given Equation 3.8 but
the actual interest points are being filtered so there should a tradeoff.
3.6 Evaluation
3.6.1 Evaluation Criteria
We considered repeatability and execution time as in [14] for performance measuring.
Repeatability
It determines how much percentage of spatio temporal interest points are being retained
in spite of geometric transforms and protomorphic deformations. If this rate is higher
detector is robust to the transformations. It is defined as the ratio of number of matched
points to the mean of number of detected interest points of two videos given by
Equation 3.9
r1,2 =c(V1, V2)
mean(n1, n2)(3.9)
where c (V1, V2) is the number of matched pair of points and n1 and n2 represent number
detected interest points in V1 andV2 respectively.
Execution Time
Execution time is time taken for detector for detection of interest points in the video.
It measures the computational complexity of the detector. we also measure the number
of points detected and determine if they are sparse or dense.
3.6.2 Experimental Results
All programs are being executed on Intel Core2duo processor computer with 16 GB
RAM, under Windows 7 professional operating system. Table 3.1 shows the amount of
time taken for execution and number of interest points detected. A video of 1000 frames
is taken with 1280x720 YouTube720p with frame rate 25fps is considered for evaluation.
Brightness change
video is subjected to uniform change in intensity by Inew = Iframe + d, where d is a
constant varying from -40 to 40 image intensity is limited to [0 255]. The corresponding
performance characteristics of the video descriptors is shown in Figure 3.11. n-SIFT
,ESURF and STIP performed better than MoSIFT and ELIFT. The number of interest
points are less for ELIFT and MoSIFT and due to saturation of intensity at lower or
upper level may remove some of the points so they are sensitive to brightness change.
Chapter 3. Video Interest Point Detectors 36
−40 −30 −20 −10 0 10 20 30 400
10
20
30
40
50
60
70
80
90
100
brightness variation
Rep
eata
bilit
y
Brightness variation
ESURFnSIFTSTIPMoSIFTELIFT
Figure 3.11: Brightness change Vs Repeatability
0 0.5 1 1.5 2 2.5 3 3.5 40
10
20
30
40
50
60
70
80
90
100
scale variation
Rep
eata
bilit
y
Scale Variation Vs Repeatability
ESURFn−SIFTSTIPMoSIFTELIFT
Figure 3.12: Scale change Vs Repeatability
Scale changes
The videos and their spatially scaled versions from 50% to 200%, are used for evalu-
ating robustness of the descriptors to scaling. The result is shown in Figure 3.12.The
performance of all the algorithms is quite bad with down scaling of the video but with
up scaling there is some improvement in performance. This does not augur well for
matching or aligning images of different scales.
Compression
Video is compressed in MPEG format in a scale of [1 50] and results are presented in
Figure 3.13. As we performed lossy compression it changes values of pixels which changes
Chapter 3. Video Interest Point Detectors 37
1 2 3 4 5 6 70
10
20
30
40
50
60
70
80
90
100 Compression Vs Repeatability
compression variation
repe
atab
ility
ESURFn−SIFTSTIPMoSIFTELIFT
Figure 3.13: Compression change Vs Repeatability
the value of gradients hence every detector is expected to show decrease in repeatability
as compression increases. n-SIFT and STIP has shown better repeatability and ESURF
which uses box filters and integral video concept to give additive losses of information
and thus it shows low repeatability.
Execution Time
A comparison table of time of execution and number of interest points detected is given
in Table 3.1.It is observed that ESURF has better performance in time of execution
which has derived from integral video usage. MoSIFT and n-SIFT require more time of
execution. MoSIFT and ELIFT detect less number of interest points with less number
of background points. n-SIFT and ESURF detect a large quantity of points which are
dense. n-SIFT, STIP and ESURF detect background points along with points with
motion.
Time ofExecution(sec)
No.pointsdetected
n-SIFT 1404.954570 31288
STIP 1127.339653 130261
ESURF 113.326833 261960
MoSIFT 1829.52802 72723
ELIFT 737.707349 12046
Table 3.1: Execution Time Comparison
Chapter 4
Conclusions and Future Scope
4.1 Conclusion
This thesis mainly dealt with spatio temporal interest point detectors of video and their
stability in geometric and photometric deformations. SUSAN detector, Harris corner
detector, Harris Laplace detector and SIFT spatial interest points detectors have been
studied and their performance has been evaluated. SIFT exhibits better repeatability
compared to other detectors under image deformation. SIFT is a better package with
robust interest point detection and descriptor along with suitable matching techniques.
The spatio temporal extension of these image interest point detectors has been studied
and their performance is evaluated in terms of repeatability and time of execution. n-
SIFT has better repeatability than others under video deformations. The proposed
ELIFT has least number of interest points and less number of points without motion.
MoSIFT detects second least number of points, while ESURF detects the maximum
number of interest points. ESURF have less execution time and next is ELIFT. While
ESURF has problems in detecting spatio temporal interest points with motion, n-SIFT
detects spatial interest points with and without motion.
4.2 Future Scope
The probable extensions of the work done in this thesis are following
• Further study is required for analyzing stability temporal scaling i.e having differ-
ent frame rates.
• Performance evaluation of algorithms with nonuniform brightness variation need
to be evaluated.
38
Bibliography 39
• Detectors performance along with corresponding descriptor need to be analyzed
for different application like video classification, video alignment and action recog-
nition.
• ELIFT need to be extended for moving camera video applications.
• Camera motion estimation and evaluation of the video descriptors on video se-
quences with a moving camera along with alignment of such sequences present
another challenging research topic.
Bibliography
[1] W. Cheung and G. Hamarneh. N-sift: N-dimensional scale invariant feature trans-
form for matching medical images. In Biomedical Imaging: From Nano to Macro,
2007. ISBI 2007. 4th IEEE International Symposium on, pages 720–723, April 2007.
doi: 10.1109/ISBI.2007.356953.
[2] Ming-yu Chen and Alexander Hauptmann. Mosift: Recognizing human actions in
surveillance videos. 2009.
[3] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-
invariant spatio-temporal interest point detector. In Computer Vision–ECCV 2008,
pages 650–663. Springer, 2008.
[4] Tinne Tuytelaars and Krystian Mikolajczyk. Local invariant feature detectors: a
survey. Foundations and Trends R© in Computer Graphics and Vision, 3(3):177–280,
2008.
[5] Chris Harris and Mike Stephens. A combined corner and edge detector. In Alvey
vision conference, volume 15, page 50. Manchester, UK, 1988.
[6] DavidG. Lowe. Distinctive image features from scale-invariant key-
points. International Journal of Computer Vision, 60(2):91–110, 2004.
ISSN 0920-5691. doi: 10.1023/B:VISI.0000029664.99615.94. URL
http://dx.doi.org/10.1023/B%3AVISI.0000029664.99615.94.
[7] StephenM. Smith and J.Michael Brady. Susan a new approach to low
level image processing. International Journal of Computer Vision, 23(1):
45–78, 1997. ISSN 0920-5691. doi: 10.1023/A:1007963824710. URL
http://dx.doi.org/10.1023/A%3A1007963824710.
[8] International Journal of Computer Vision, 60(1), 2004. ISSN 0920 5691.
[9] Tony Lindeberg. Linear spatio-temporal scale-space. In Scale-Space Theory in
Computer Vision, pages 113–127. Springer, 1997.
40
Bibliography 41
[10] Ivan Laptev. On space-time interest points. International Journal of Computer
Vision, 64(2-3):107–123, 2005.
[11] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up ro-
bust features. In Ales Leonardis, Horst Bischof, and Axel Pinz, editors, Com-
puter Vision ECCV 2006, volume 3951 of Lecture Notes in Computer Science,
pages 404–417. Springer Berlin Heidelberg, 2006. ISBN 978-3-540-33832-1. doi:
10.1007/11744023 32. URL http://dx.doi.org/10.1007/11744023_32.
[12] Vasileios Mezaris, Anastasios Dimou, and Ioannis Kompatsiaris. Local invariant
feature tracks for high-level video feature extraction. In Analysis, Retrieval and
Delivery of Multimedia Content, pages 165–180. Springer, 2013.
[13] University of Southern California. Sipi image database. URL
http://sipi.usc.edu/database/database.php.
[14] R Hendaoui, M Abdellaoui, and A Douik. Synthesis of spatio-temporal interest
point detectors: Harris 3d, mosift and surf-mhi. In Advanced Technologies for Signal
and Image Processing (ATSIP), 2014 1st International Conference on, pages 89–94.
IEEE, 2014.
[15] Zhijie Zhang, Chongxiao Cao, Ruijie Zhang, and Jianhua Zou. Video copy detection
based on speeded up robust features and locality sensitive hashing. In Automation
and Logistics (ICAL), 2010 IEEE International Conference on, pages 13–18. IEEE,
2010.
[16] Lin-na Li and Nan Geng. Algorithm for sequence image automatic mosaic based on
sift feature. In Information Engineering (ICIE), 2010 WASE International Con-
ference on, volume 1, pages 203–206. IEEE, 2010.
[17] J. Barron and N. Thacker. Tutorial: Computing 2d and 3d optical flow. 2005.