video descriptors

Performance Evaluation

of

Video Interest Point Detectors

A Thesis submitted in partial fulfillment of the requirements

for the award of the degree

Master of Technology

by

Tammina Siva Naga Srinivasu

Roll No: 124102011

under supervision of

Dr. Prabin Kumar Bora and Dr. Tony Jacob

Department of Electronics and Electrical Engineering

Indian Institute of Technology Guwahati

July 2014

D E C L A R A T I O N

I hereby declare that the work presented in the thesis entitled “Performance Evaluation of Video Interest Point Detectors”, in partial fulfillment of the requirements for the award of the degree of Master of Technology at Indian Institute of Technology Guwahati, is an authentic record of my own work carried out under the supervision of Dr. Prabin Kumar Bora and Dr. Tony Jacob, and that the works of other researchers are duly listed in the reference section. The contents of this thesis, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma.


July 2014 Dept. of Electronics & Electrical Engineering,

Guwahati IIT Guwahati,

Guwahati, Assam,

India - 781039.

C E R T I F I C A T E

This is to certify that the work contained in the thesis entitled Performance Evaluation of Video Interest Point Detectors is a bona fide work of Tammina Siva Naga Srinivasu, Roll No. 124102011, which has been carried out in the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati under our supervision, and that this work has not been submitted elsewhere for a degree.

Dr. Prabin Kumar Bora and Dr. Tony Jacob

July 2014 Dept. of Electronics & Electrical Engineering,

Guwahati IIT Guwahati,

Guwahati, Assam,

India - 781039.


http://www.iitg.ac.in

“Difficulties in your life do not come to destroy you, but to help you to realize your

potential and power, let difficulties know that you too are difficult.”

Dr. A. P. J. Abdul Kalam

Abstract

This thesis presents an evaluation and comparison of various spatial and spatio-temporal interest point detectors. Popular spatial interest point detectors, namely the SUSAN detector, the Harris corner detector, the Harris-Laplace corner detector and the Scale Invariant Feature Transform (SIFT), are studied. Their performance is evaluated under affine transformations and JPEG compression. It is found that SIFT performs better than the other spatial interest point detectors.

Extensions of these spatial interest point detectors into the spatio-temporal domain for videos, namely n-SIFT, MoSIFT, ESURF, STIP and LIFT, are studied. LIFT tracks visually similar interest points across frames to form feature tracks. We propose an extension of LIFT, which we call ELIFT, that detects interest points on the feature tracks; in this way we obtain interest points with motion. The performance of the interest point detectors is evaluated under geometric and photometric transformations and MPEG compression. Repeatability and execution time are considered for evaluation. n-SIFT performed better than the other detectors, but it also detects background points. MoSIFT and ELIFT detect fewer interest points, which are stable compared to those of the other algorithms.

Acknowledgements

I would like to sincerely thank my project supervisors, Dr. Prabin Kumar Bora and Dr. Tony Jacob, for giving me the opportunity to work on the project Performance Evaluation of Video Descriptors. They always came up with new, innovative ideas to help me whenever I was stuck. I am very thankful to them for spending their valuable time clarifying my doubts. I would also like to thank the Head of the Department and the other faculty members for their kind help in carrying out this work. I am indebted to Indian Institute of Technology Guwahati for giving me a wonderful opportunity to study in a world-class university.

My heartfelt gratitude goes to my family for being on my side at all times. I will always remember my beloved friends, who made my stay at IIT Guwahati a pleasant and memorable period of my life.


Indian Institute of Technology Guwahati,

July 2014.




Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
1.1 Literature Survey
1.2 Scope of Project
1.3 Thesis Organization

2 Image Interest Point Detectors
2.1 SUSAN Detector
2.2 Harris Corner Detector
2.3 Harris Laplace Corner Detector
2.4 Scale Invariant Feature Transform
2.4.1 Basic Algorithmic Principle
2.4.2 Key-Point Detection
2.4.3 Key Point Localization
2.4.4 Key Point Orientation
2.4.5 Key Point Descriptor
2.4.6 Image Matching
2.5 Evaluation
2.6 Conclusion

3 Video Interest Point Detectors
3.1 n-SIFT
3.1.1 Interest Point Detection
3.1.2 n-SIFT Descriptor Computation
3.2 MoSIFT
3.2.1 Interest Point Detection
3.2.2 Descriptor Computation
3.3 STIP
3.3.1 Interest Points in the Spatio-Temporal Domain
3.3.2 Scale Selection in Space-Time
3.4 ESURF
3.4.1 Hessian based Interest Point Detection and Scale Selection
3.4.2 Implementation Details
3.5 ELIFT
3.5.1 Feature Track Extraction
3.5.2 Interest Points on Feature Track
3.6 Evaluation
3.6.1 Evaluation Criteria
3.6.2 Experimental Results

4 Conclusions and Future Scope
4.1 Conclusion
4.2 Future Scope

List of Figures

2.1 Results of Harris corner detector on (a) Synthetic image (b) Texture image
2.2 SIFT DoG Pyramid and Extrema Detection
2.3 SIFT descriptor
2.4 Image Matching Using SIFT Features: (a) shows matching with 0.3 threshold and (b) shows matching with 0.6 threshold
2.5 Three Image Classes from SIPI Data set
2.6 Average Repeatability with Intensity
2.7 Average Repeatability with Rotation
2.8 Average Repeatability with Scale
2.9 Average Repeatability with Compression
3.1 n-SIFT Image pyramid and DoG images [1]
3.2 n-SIFT each of the 4x4x4 regions for Descriptor computation [1]
3.3 n-SIFT Example: Frames with interest points marked
3.4 MoSIFT Flow Chart [2]
3.5 MoSIFT Example: Frames with interest points marked
3.6 STIP Example: Frames with interest points marked
3.7 The two types of box filter approximations for the 2 + 1D second order partial derivatives in one direction and in two directions [3]
3.8 ESURF Example: Frames with interest points marked
3.9 (a) Track with no motion and (b) Track with motion and deviation in path
3.10 (a) shows the traced features (LIFT) and (b) shows the modified interest points (ELIFT)
3.11 Brightness change Vs Repeatability
3.12 Scale change Vs Repeatability
3.13 Compression change Vs Repeatability

List of Tables

3.1 Execution Time Comparison

Chapter 1

Introduction

Digital videos are an important source of information and entertainment. Analyzing video and extracting information from it is important for applications such as event detection in surveillance video, pirated video detection and video classification for managing large databases. These tasks can be achieved by using local features or interest points. Many robust techniques proposed for interest point detection in images can be applied frame by frame, treating the video as a group of frames. The main difficulty is that the content of a video runs to megabytes or gigabytes, and much of this information is temporally redundant. Hence we need to find true spatio-temporal interest points. While capturing a scene, the video camera is free to rotate, zoom or move. During capture the illumination can vary, and the camera and the environment can add noise to the video. Videos are also compressed for storage and transmission. The challenging task is to obtain a stable, robust detector which addresses the above problems. This task can be made somewhat easier by keeping the camera fixed during capture.

Many methods have been proposed for the detection of interest points in video for specific applications. There is a need for a comparative study of the stability of these interest point detectors under geometric and photometric variations. Their performance in other applications also needs to be evaluated.

1.1 Literature Survey

Local features are image patterns which differ from their immediate neighborhood [4]. They are marked by changes in image properties such as intensity, colour and texture. Examples of local features are points, edge points and blobs. A detector is used to extract features from an image. Two examples are the Harris corner detector [5], which detects corners, and SIFT [6], which is a blob detector. Ideally, local features would represent meaningful object parts. However, this requires high-level knowledge of the image content, which is not available in the early stages of processing. Therefore detectors rely on the underlying image intensity variations.

A feature detector should have good repeatability, informativeness and accuracy. It should be invariant or covariant to common image deformations such as rotations, translations and intensity variations, including brightness and contrast changes. Invariance means that the detected features do not change under image transformations. This property of a detector is very important for image processing applications such as image registration and object recognition, where images can be taken at different scales or viewpoints, or in the presence of occlusion. For example, the SIFT [6] descriptor of an interest point does not change with rotations or with brightness and contrast variations. Detectors can also be covariant under certain transformations, which means that a feature changes its location but retains its structure. For example, an edge at a particular orientation in an image changes its orientation when the image is rotated. Most interest point detectors are therefore covariant, while the descriptors derived from them are invariant due to normalization.

Corners are interest points whose locations are invariant under rotations and translations of the image. Harris and Stephens [5] detected corner points in an image using Gaussian-smoothed intensity gradients; their method is popularly known as the Harris corner detector. Smith and Brady [7] considered points which are extreme in their circular neighborhood as SUSAN corners. Both of these corner detectors are rotation and translation invariant but vary with scale changes. The Harris corner detector was extended by Mikolajczyk and Schmid [8] to find interest points and their scales by using the scale-space theory [9] proposed by Lindeberg. In 1999, David Lowe presented a new algorithm called the Scale Invariant Feature Transform, or SIFT [6], which is based on scale-space theory and uses the scale-space representation of an image to find interest points and describe them using a robust descriptor based on the local gradient structure around the interest point. The algorithm was refined by 2004 to provide sub-pixel accuracy in key-point location, and Lowe also described ways to match the descriptors between images for object matching, registration and recognition under affine transformations. This thesis deals with the implementation of the Harris corner detector, the Harris-Laplace corner detector, the SUSAN corner detector and SIFT, and with the evaluation of their performance under intensity and geometric transformations.

Similar to image interest point detectors, video interest point detectors detect the spatio-temporal interest points present in a video. An existing image detector can be extended to the spatio-temporal domain by changing the image saliency measure. Space-Time Interest Points (STIP) [10], proposed by Laptev, is an extension of the Harris-Laplace corner detector. Laptev extended the 2D second moment matrix to a 3D second moment matrix using gradients in the spatio-temporal domain, and determined the spatial and temporal scales of the detected points by maximizing the output of a normalized spatio-temporal Laplacian operator. In a similar fashion, SIFT, which is a very robust technique in image processing, was developed into an arbitrary n-dimensional extension by Cheung and Hamarneh as n-SIFT [1], aimed at applications in medical imaging including 3D MRI and dynamic 3D+time CT data. These techniques treat time as a third dimension of space while detecting interest points. Another extension of SIFT was proposed by Chen and Hauptmann [2]; it uses SIFT to find spatial interest points, which are then filtered in the temporal domain using their optical flow. The motion in the temporal domain is captured by optical flow, and interest points with sufficient optical flow are considered spatio-temporal interest points. Hence this technique is called Motion SIFT (MoSIFT). SURF [11] was extended into the spatio-temporal domain by Tuytelaars [3], using the 3D Hessian as the saliency measure and improving the computational performance by approximation with box filters on an integral video structure. Vasileios and Anastasios [12] proposed a method, LIFT, which tracks the SIFT features present in each frame to form feature tracks, which are similarly moving, visually similar local regions. We extend LIFT to find interest points on feature tracks by considering points with significant motion and rejecting most of the background points, which are redundant.

1.2 Scope of Project

The thesis work mainly deals with the study of image and video interest point detectors and their evaluation. The image interest point detectors, namely the Harris corner detector, the SUSAN detector, the Harris-Laplace corner detector and SIFT, are studied in detail and their performance is evaluated. The video interest point detectors n-SIFT, MoSIFT, ESURF, STIP and LIFT are studied. We propose a method to extract interest points on the tracks of LIFT. All the above video detectors are evaluated with respect to repeatability and execution time.

1.3 Thesis Organization

The thesis is organized as follows:

• Chapter 2 gives a detailed study of image interest point detectors and their performance.

• Chapter 3 presents an overview of video interest point detectors and evaluates their performance.

• Chapter 4 concludes the thesis with a summary of the work done and suggestions for future work.

Chapter 2

Image Interest Point Detectors

Image interest points which are invariant to rotation and translation, as part of the affine class of transformations, and to scaling, brightness and contrast variations are used in problems such as image registration and object recognition. The literature on interest point detection in images is vast, and different types of detectors exist. This chapter presents a study and evaluation of corner and blob detectors. SUSAN [7], the Harris corner detector [5] and the Harris-Laplace corner detector [8] are corner detectors, and SIFT [6] is a blob detector. SIFT is used for image matching and the results are discussed.

2.1 SUSAN Detector

Corners are important local features in images. Generally a corner can be defined as the intersection of two edges. Corners are points of high curvature and lie at the junction of regions of different brightness.

The SUSAN (Smallest Univalue Segment Assimilating Nucleus) detector [7] was proposed by Smith and Brady for detecting corners. The SUSAN detector considers a circular mask with the center of the mask as the nucleus. The intensity of every pixel within the mask is compared with the nucleus value. Pixels having similar intensity form the USAN (Univalue Segment Assimilating Nucleus), and if the area of the USAN is less than half of the total mask area then the nucleus is considered a corner point. The computational complexity of the SUSAN detector is low, but it is sensitive to image deformations.
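The USAN test described above can be sketched as follows. This is a minimal illustration in plain Python and not the implementation used in this thesis: the mask radius, the hard similarity threshold `t` and the function names are assumptions, and no non-maximum suppression is applied, so several pixels around a true corner may be reported.

```python
def usan_area(img, r, c, radius=3, t=0.1):
    """Return (USAN area, mask area) for the nucleus at img[r][c]."""
    nucleus = img[r][c]
    area = total = 0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            if dr * dr + dc * dc > radius * radius:
                continue  # outside the circular mask
            total += 1
            # a pixel belongs to the USAN if its intensity is close to the nucleus
            if abs(img[r + dr][c + dc] - nucleus) <= t:
                area += 1
    return area, total


def susan_corners(img, radius=3, t=0.1):
    """List the (row, col) positions whose USAN covers less than half the mask."""
    h, w = len(img), len(img[0])
    corners = []
    for r in range(radius, h - radius):
        for c in range(radius, w - radius):
            area, total = usan_area(img, r, c, radius, t)
            if area < total / 2:
                corners.append((r, c))
    return corners
```

Here `img` is a list of rows of intensities in [0, 1]. Pixels closer than `radius` to the border are skipped because the mask would not fit.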


2.2 Harris Corner Detector

The Harris corner detector [5] is based on the fact that the eigenvalues of the second moment matrix represent the signal changes in two orthogonal directions. A corner has a large intensity variation in both directions, so both eigenvalues are large. For an image I, consider a window w(u, v) and the change E produced by shifting it by (x, y), given by Equation 2.1:

$$E(x, y) = \sum_{u,v} w(u, v)\,[I(u + x, v + y) - I(u, v)]^2 \tag{2.1}$$

To analyze E(x, y), the term I(u + x, v + y) is expanded in a Taylor series about (u, v). Neglecting the higher order terms we get

$$E(x, y) \approx \sum_{u,v} w(u, v)\left(x^2 I_x^2 + 2xy\, I_x I_y + y^2 I_y^2\right)$$

The matrix form of this equation is

$$E(x, y) \approx \begin{bmatrix} x & y \end{bmatrix} M \begin{bmatrix} x \\ y \end{bmatrix}$$

where

$$M = \sum_{u,v} w(u, v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

The derivatives are computed with a Gaussian kernel of standard deviation $\sigma_D$, which suppresses noise, and their products are averaged with a Gaussian window of standard deviation $\sigma_1$. The matrix M then becomes

$$M = \sigma_D^2\, g(\sigma_1) * \begin{bmatrix} I_x^2(\mathbf{x}, \sigma_D) & I_x I_y(\mathbf{x}, \sigma_D) \\ I_x I_y(\mathbf{x}, \sigma_D) & I_y^2(\mathbf{x}, \sigma_D) \end{bmatrix}$$

where $\mathbf{x} = (x, y)$ is the pixel position. The eigenvalues of M determine whether a corner is present. Instead of computing the eigenvalues explicitly, a quantity called cornerness is defined as

$$\text{cornerness} = \det(M) - k\,\operatorname{trace}(M)^2 \tag{2.2}$$

where k = 0.04. Since det(M) is the product of the eigenvalues and trace(M) is their sum, cornerness is the product of the eigenvalues minus k times the square of their sum. When the intensity changes in both the x and y directions, both eigenvalues are large and positive, so cornerness is large and positive at a corner point.
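A minimal sketch of the cornerness measure (2.2) in plain Python is given below. It makes simplifying assumptions that are not part of the method described above: gradients are taken by central differences without Gaussian smoothing, and the window w is a uniform box of half-width `half` instead of a Gaussian. The function name and parameters are illustrative only.

```python
def harris_cornerness(img, k=0.04, half=1):
    """Return a map of det(M) - k * trace(M)**2 for a 2-D list of intensities."""
    h, w = len(img), len(img[0])
    # central-difference gradients; border pixels keep a zero gradient
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix[r][c] = (img[r][c + 1] - img[r][c - 1]) / 2.0
            iy[r][c] = (img[r + 1][c] - img[r - 1][c]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for r in range(half, h - half):
        for c in range(half, w - half):
            a = b = d = 0.0  # entries of M = [[a, b], [b, d]]
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    gx, gy = ix[r + dr][c + dc], iy[r + dr][c + dc]
                    a += gx * gx
                    b += gx * gy
                    d += gy * gy
            resp[r][c] = (a * d - b * b) - k * (a + d) ** 2
    return resp
```

On a step corner the response is positive, on a straight edge it is negative (one eigenvalue is close to zero, so the trace term dominates), and in a flat region it is zero, which matches the interpretation of Equation 2.2.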


Figure 2.1: Results of the Harris corner detector on (a) a synthetic image and (b) a texture image

Example 2.1. Consider the synthetically generated checkerboard image shown in Figure 2.1(a). Almost all corners in that image are detected. The response of the Harris corner detector on a texture image is shown in Figure 2.1(b). The detector also performs well on this image, except that it fails to detect some corners with low intensity variation. If we reduce the threshold to detect such low-contrast corners, false detections appear.

The main drawback of the Harris corner detector is that it is not invariant to scale changes. It is invariant to additive intensity changes since it uses differences of intensity values rather than the values themselves.


2.3 Harris Laplace Corner Detector

Corners can be detected by the Harris corner detector [5], but they are not scale invariant. The Harris-Laplace detector [8] provides scale-invariant features. It was developed on the principle of the Harris corner detector and gives scale-adapted local features, i.e. interest points along with their scale of occurrence.

The Harris-Laplace corner detector constructs a scale space L(x, y; σ), which is a set of Gaussian-smoothed images. The Laplacian detector detects interest points whose location and characteristic scale are given by the maximum of the Laplacian operator. The Laplacian-of-Gaussian scale space is constructed for finding the scale of an interest point. At every scale of the image, corner points are found using the second moment matrix and the Harris cornerness measure (2.2). A corner is said to be an interest point if it attains an extremum over its 8 neighbors in the current scale and the 18 neighbors in the scales above and below the current scale.
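The 26-neighbor test (8 neighbors in the current scale plus 18 in the adjacent scales) can be sketched as below. The layout `stack[scale][row][col]` and the function name are assumptions made for illustration, and the caller is expected to pass only interior positions.

```python
def is_scale_space_extremum(stack, s, r, c):
    """True if stack[s][r][c] is strictly above or strictly below all 26 neighbours."""
    v = stack[s][r][c]
    neighbours = [
        stack[s + ds][r + dr][c + dc]
        for ds in (-1, 0, 1)      # scale below, current scale, scale above
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (ds, dr, dc) != (0, 0, 0)
    ]
    return all(v > n for n in neighbours) or all(v < n for n in neighbours)
```

The same test applies to the DoG stack used by SIFT later in this chapter.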

2.4 Scale Invariant Feature Transform

The Scale Invariant Feature Transform (SIFT) was developed by David Lowe [6]. SIFT interest points are scale invariant and robust under affine transformations. SIFT is based on the scale-space theory developed by Lindeberg [9].

2.4.1 Basic Algorithmic Principle

From a pyramid of images (images with increasing blur), SIFT detects a series of key points, which are mostly blob-like structures, and accurately determines their locations and scales. It then calculates the dominant orientation $\theta_{ref}$ over a region surrounding each key point. The knowledge of location, scale and orientation is used to compute a descriptor for the key point. The descriptor is computed from the normalized patch around the key point and encodes the spatial gradient distribution around the key point in a 128-dimensional vector. This descriptor makes SIFT robust to rotation, scale and translation variations. The main steps involved in SIFT, as explained by Lowe, are:

• Key-Point Detection: A pyramid of Gaussian-blurred images of different scales is formed and the extrema are located in the Difference of Gaussian (DoG) images.

• Key-Point Localization: Each key point is refined using a Taylor series model to obtain sub-pixel accuracy, and key points are selected based on their stability.

• Key-Point Orientation Computation: One or more orientations are assigned to a key point based on local gradients. These orientations are used in the subsequent steps, which makes SIFT rotation invariant.

• Key-Point Descriptor Computation: From the orientation-normalized patch around a key point, the descriptor is formed from the local image gradients at the selected scale and orientation.

2.4.2 Key-Point Detection

Gaussian Scale Space

Lindeberg, in his detailed mathematical analysis [9] of the multi-scale representation of images, showed that the Gaussian kernel is the unique scale-space kernel. The Gaussian scale space is simply a stack of increasingly blurred images. This blurring simulates the loss of detail produced when a photograph is taken with different zoom-out factors. For an image I(x, y), the Gaussian-blurred image L(x, y; σ) is defined as

$$L(x, y; \sigma) = g(x, y; \sigma) * I(x, y) \tag{2.3}$$

where g(x, y; σ) is the Gaussian kernel parameterized by its standard deviation σ, given by

$$g(x, y; \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/2\sigma^2}$$

The Gaussian smoothing operator satisfies the semigroup (semi-closed) relation given by Equation 2.4, in which the variances of successive blurs add. We can construct the Gaussian scale space by using the Gaussian kernel and its semigroup relation.

$$g\left(\cdot\,; \sqrt{\sigma_1^2 + \sigma_2^2}\right) = g(\cdot\,; \sigma_1) * g(\cdot\,; \sigma_2) \tag{2.4}$$

Digital Gaussian Smoothing

The two-dimensional Gaussian kernel is obtained by sampling and truncating the Gaussian function, given by

$$g(x, y; \sigma) = K e^{-(x^2 + y^2)/2\sigma^2} \quad \forall\ |x|, |y| \le \lfloor 4\sigma \rfloor \tag{2.5}$$

where $\lfloor \cdot \rfloor$ is the floor function, which truncates a fraction to an integer, and K is used for normalization. The scale space of images is constructed by 2-D convolution of the image with the digital Gaussian kernel.
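The sampled and truncated kernel of Equation 2.5 can be sketched as below. Since the 2-D Gaussian is separable, only the 1-D kernel is built here; this is an illustrative simplification, and the helper names are assumptions. The two helper functions allow a numerical check that cascading two blurs adds their variances.

```python
import math


def gaussian_kernel_1d(sigma):
    """Sampled Gaussian truncated at |x| <= floor(4 * sigma), normalised to sum to 1."""
    half = math.floor(4 * sigma)
    vals = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-half, half + 1)]
    total = sum(vals)  # plays the role of 1/K in Equation 2.5
    return [v / total for v in vals]


def convolve_full(a, b):
    """Full discrete convolution of two 1-D kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def kernel_variance(kernel):
    """Second central moment of a normalised, symmetric kernel."""
    centre = (len(kernel) - 1) / 2
    return sum(w * (i - centre) ** 2 for i, w in enumerate(kernel))
```

For example, convolving the kernels for σ = 1.5 and σ = 2.0 gives a kernel whose standard deviation, `math.sqrt(kernel_variance(...))`, is close to 2.5, the square root of 1.5² + 2.0². The small deviation comes from sampling and truncation.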

In his scale-space theory, Lindeberg [9] proposed the scale-normalized Laplacian for detecting scale. Lowe proposed the Difference of Gaussian (DoG) function as an approximation to the scale-normalized Laplacian and obtained key points of the image that are stable under scaling. The approximation of the Laplacian can be derived from the diffusion equation (2.6):

$$\frac{\partial g}{\partial \sigma} = \sigma \nabla^2 g \tag{2.6}$$

Approximating the derivative by a finite difference between the scales kσ and σ gives

$$\frac{\partial g}{\partial \sigma} \approx \frac{g(x, y; k\sigma) - g(x, y; \sigma)}{k\sigma - \sigma}$$

Combining this with the diffusion equation (2.6), we obtain

$$g(x, y; k\sigma) - g(x, y; \sigma) \approx (k - 1)\,\sigma^2 \nabla^2 g$$

The Difference of Gaussian (DoG), D(x, y; σ), given in Equation 2.7, is the difference of two Gaussian-smoothed images whose standard deviations are separated by a constant factor k.

$$D(x, y; \sigma) = L(x, y; k\sigma) - L(x, y; \sigma) \tag{2.7}$$

The scale space is constructed by Lowe with a step factor $k = 2^{1/s}$, where s is the number of scales in an octave, within which σ of the Gaussian kernel doubles. The scheme of constructing the scale space is shown in Figure 2.2. Key points are extrema in the DoG images, which are extracted as follows:

1. Linearly interpolate the input image to twice its size.

2. Pre-smooth the image with a Gaussian filter so that its blur is σ = 1.6 before it is given as input to the first octave. The blur of the input image is assumed to be 0.5, and this must be taken into account to obtain the initial blur of 1.6. Use the semigroup property of the Gaussian scale space (Equation 2.4) to calculate the extra blur to be added.

3. Blur the image with Gaussian filters of standard deviation σ, kσ, k²σ and so on to create the Gaussian-smoothed images in the octave; Lowe recommends s = 3. The blur at the i-th level of octave o is given by $\sigma = 1.6 \times 2^{\,o + i/s}$.

4. Subtract adjacent Gaussian-smoothed images as shown in Figure 2.2 to get s + 2 DoG images.

5. Detect all extrema in the DoG images as shown in Figure 2.2. An extremum is either a maximum or a minimum with respect to its 8 neighbors in the same scale and the 9 neighbors in each of the scales above and below the present scale.

6. Once the octave is processed, sub-sample the image to half its size in both dimensions by taking every second pixel in each row and column. The sub-sampled image is then processed in the next octave with steps 3 to 6, and the process continues over all octaves in the pyramid.

The number of octaves is $o = \log_2(\min(row, col))$, where row and col are the numbers of rows and columns of the image.
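The blur bookkeeping in the steps above can be sketched as follows. The function names are hypothetical. Two assumptions are made that the text does not state explicitly: doubling the image size also doubles its assumed blur from 0.5 to 1.0, and the octave count is truncated to an integer.

```python
import math


def extra_blur(target_sigma, current_sigma):
    """Blur to add so that current_sigma becomes target_sigma (variances add)."""
    return math.sqrt(target_sigma ** 2 - current_sigma ** 2)


def octave_sigmas(octave, scales_per_octave=3, sigma0=1.6):
    """Blur of each Gaussian image in one octave: sigma0 * 2**(octave + i/s)."""
    s = scales_per_octave
    # s + 3 Gaussian images yield s + 2 DoG images after adjacent subtraction
    return [sigma0 * 2 ** (octave + i / s) for i in range(s + 3)]


def number_of_octaves(rows, cols):
    """Octave count from the smaller image dimension, truncated to an integer."""
    return int(math.log2(min(rows, cols)))
```

Under these assumptions, `extra_blur(1.6, 1.0)` is about 1.25, which is the blur applied to the doubled image in step 2, and `octave_sigmas(0)` runs from 1.6 up to about 5.08 with the blur reaching 3.2 at the fourth image, where the next octave starts.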

Figure 2.2: SIFT DoG Pyramid and Extrema Detection [6]

2.4.3 Key Point Localization

SIFT aims for a descriptor invariant to affine transformations, which is necessary for image registration and object detection. Sub-pixel accuracy of the key-point location is desirable for addressing such problems. So far the location of the key point is confined to the sampling grid. Such a coarse location is not sufficient to reach full scale and translation invariance. SIFT refines the position of the key point by using the Taylor series expansion of the DoG function D(x, y; σ), given by

$$D(X) = D + \frac{\partial D}{\partial X}^{T} X + \frac{1}{2} X^{T} \frac{\partial^2 D}{\partial X^2} X \tag{2.8}$$

where D and its derivatives are evaluated at the key-point location and $X = (x, y, \sigma)^T$ is the offset from this location. The local extremum is obtained by setting the derivative of D(X) with respect to X to zero, which yields the displacement vector $\hat{X}$ from the key-point position, given by

$$\hat{X} = -\left(\frac{\partial^2 D}{\partial X^2}\right)^{-1} \frac{\partial D}{\partial X} \tag{2.9}$$

If $\hat{X}$ is smaller than 0.5 in every dimension, the key-point location is updated by $\hat{X}$; otherwise the key point is skipped and processing continues with the next point. Substituting $\hat{X}$ into D(X) (2.8) gives the function value at the extremum. Lowe used this value to eliminate weak, unstable points. If the magnitude of the function value is greater than a threshold of 0.03, where the image intensity is in [0, 1], the key point is considered an interest point.
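A minimal sketch of this refinement step is given below. It assumes that the gradient and Hessian of D at the sample point have already been estimated (for example by finite differences of neighbouring DoG samples) and are passed in as plain lists. The function names are hypothetical, and the default contrast threshold of 0.03 is the value from Lowe's paper, exposed as a parameter.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a x = b by Cramer's rule; None if singular."""
    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det3(a)
    if abs(d) < 1e-12:
        return None
    sol = []
    for col in range(3):
        # replace one column of a by b and take the determinant ratio
        m = [[b[r] if c == col else a[r][c] for c in range(3)] for r in range(3)]
        sol.append(det3(m) / d)
    return sol


def refine_keypoint(d0, grad, hess, contrast_thr=0.03):
    """Return (offset, value at extremum) for a stable key point, else None."""
    offset = solve3(hess, [-g for g in grad])  # Equation 2.9
    if offset is None or any(abs(o) >= 0.5 for o in offset):
        return None  # extremum lies closer to another sample point
    # D(X_hat) = D + 0.5 * grad^T X_hat, obtained by substituting 2.9 into 2.8
    value = d0 + 0.5 * sum(g * o for g, o in zip(grad, offset))
    if abs(value) <= contrast_thr:
        return None  # low-contrast point
    return offset, value
```

The simplification in the last step follows because the Hessian times $\hat{X}$ equals the negative gradient, so the quadratic term of (2.8) cancels half of the linear term.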


Eliminating Points on Edges

The DoG extrema were found to include strong responses along edges, which are not good interest points with respect to rotations or affine transformations. To detect edges, Lowe used the Hessian matrix (2.10). Since the eigenvalues of the Hessian matrix correspond to the principal curvatures, their ratio deviates from unity by a large factor if the point lies on an edge. If the eigenvalues have opposite signs, the point is a saddle point and hence not an extremum.

$$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix} \tag{2.10}$$

Lowe considered the ratio of the larger eigenvalue α to the smaller one β, α/β = r. For an edge key point this ratio is large, r > 10. The ratio enters through

$$\frac{\operatorname{Tr}(H)^2}{\operatorname{Det}(H)} = \frac{(\alpha + \beta)^2}{\alpha\beta} = \frac{(r\beta + \beta)^2}{r\beta^2} = \frac{(1 + r)^2}{r}$$

Taking $c_{edge} = 10$ as the threshold, we decide whether a key point is an edge point or not. If

$$\frac{\operatorname{Tr}(H)^2}{\operatorname{Det}(H)} > \frac{(c_{edge} + 1)^2}{c_{edge}}$$

the key point is discarded.
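The edge test needs only the trace and determinant of H, so no eigenvalues are computed. A minimal sketch follows, with a hypothetical function name; the second derivatives are assumed to be supplied by the caller.

```python
def is_edge_like(dxx, dyy, dxy, c_edge=10.0):
    """True if the key point should be discarded as an edge or saddle point."""
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # eigenvalues of opposite sign (or a zero eigenvalue): not a proper extremum
        return True
    return trace * trace / det > (c_edge + 1) ** 2 / c_edge
```

With the default threshold the right-hand side is 12.1. A blob with equal curvatures gives a ratio of 4 and is kept, while curvatures in the ratio 20:1 give 22.05 and are rejected.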

2.4.4 Key Point Orientation

Rotation-invariant descriptors can be classified into two categories. The first is based on image properties which are already rotation invariant. The second consists of descriptors which are normalized with respect to a reference orientation. SIFT belongs to the second category.

The key-point orientation corresponds to the dominant gradient orientation in the neighborhood of the key point. The orientation can be found by the following procedure.

1. At each (x, y; σ), compute the gradient magnitude and orientation using pixel differences as given below

$$
\begin{aligned}
m(x, y) &= \sqrt{(L(x+1, y; \sigma) - L(x-1, y; \sigma))^2 + (L(x, y+1; \sigma) - L(x, y-1; \sigma))^2} \\
\theta(x, y) &= \tan^{-1}\left(\frac{L(x, y+1; \sigma) - L(x, y-1; \sigma)}{L(x+1, y; \sigma) - L(x-1, y; \sigma)}\right)
\end{aligned} \tag{2.11}
$$

Weight the gradient magnitudes by a Gaussian window with σ equal to 1.6 times the scale of the key point. This gives more importance to the gradients closer to the key point.

2. Compute an orientation histogram of 36 bins covering the 360 degrees of gradient orientations, with each sample contributing its gradient magnitude to the bin corresponding to its gradient orientation.

3. Fit a parabola to the highest bin and its two neighboring bins to interpolate the peak position and thus determine the key-point orientation. If there are any bins with 80% or more of the magnitude of the highest bin, they are also used to create new key points at the same (x, y, σ), but with the orientations corresponding to these bins.

Thus the key point (x, y, σ, θ) is localized, and the descriptor for this key point now has to be computed from the gradients.
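Steps 2 and 3 of the procedure can be sketched as follows. The input is assumed to be a list of (weighted magnitude, angle in degrees) pairs already computed from Equation 2.11. The function names and the choice of reporting bin centres are assumptions made for this illustration.

```python
def orientation_histogram(samples, bins=36):
    """Accumulate gradient magnitudes into orientation bins covering 360 degrees."""
    hist = [0.0] * bins
    width = 360.0 / bins
    for mag, angle in samples:
        hist[int((angle % 360.0) // width) % bins] += mag
    return hist


def dominant_orientations(hist, peak_ratio=0.8):
    """Interpolated orientations of all local peaks within 80% of the highest bin."""
    bins = len(hist)
    width = 360.0 / bins
    top = max(hist)
    result = []
    for i, h in enumerate(hist):
        left, right = hist[i - 1], hist[(i + 1) % bins]  # circular neighbours
        if h < peak_ratio * top or h <= left or h < right:
            continue
        # vertex of the parabola through (i-1, left), (i, h), (i+1, right)
        denom = left - 2.0 * h + right
        shift = 0.5 * (left - right) / denom if denom != 0 else 0.0
        result.append(((i + 0.5 + shift) * width) % 360.0)
    return result
```

Each returned angle would give rise to one key point at the same (x, y, σ). A peak whose right neighbor is larger than its left neighbor is shifted toward the right, as expected from the parabola fit.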

2.4.5 Key Point Descriptor

The local patch of 16x16 is taken in L(x, y, σ) around the key point. Split the region into

sixteen 4x4 regions and in each region, an 8 bin histogram is computed by the same way

as the orientation histogram with each sample contributing its gradient magnitude to

the corresponding orientation bin as shown in Figure 2.3. The sixteen 8-bin histograms

are then concatenated to get the 128-bin key point descriptor.In addition to these steps,

Lowe introduced 2 additional steps to minimize the effects of lighting on the descriptor.

Figure 2.3: SIFT descriptor [6]

1. Normalize the descriptor to unit length. This reduces the effect of contrast changes, which multiply the image gradients in the neighborhood by a constant factor.

2. Clip any bin value above 0.2 to 0.2 and normalize the descriptor again. This reduces the effect of directional illumination changes, which can cause large changes in gradient magnitude in particular directions. Limiting the bin values to 0.2 reduces the influence of large gradient values on the descriptor.

Thus each SIFT key point (x, y, σ, θ) is described by a 128-bin histogram descriptor.
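The two lighting-related steps amount to normalizing, clipping and normalizing again. A minimal sketch, assuming the descriptor is a plain list of non-negative bin values (the function name is hypothetical):

```python
import math


def normalize_descriptor(desc, clip=0.2):
    """Unit-normalize, clip large bins to `clip`, then normalize again."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else list(v)

    clipped = [min(x, clip) for x in unit(desc)]
    return unit(clipped)
```

Because the first step divides by the vector length, multiplying all gradients by a constant leaves the result unchanged.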

2.4.6 Image Matching

SIFT has many applications, such as object recognition, image registration and image matching; of these, we considered image matching. It is done in the following steps:

1. Compute SIFT descriptors for both images.

2. Pair similar key points of the two images, i.e. those for which the Euclidean distance between the descriptors is below a threshold. Lowe suggested 0.6 as the threshold.

3. If the ratio of the distance of the best match to that of the second-best match is more than 0.8, the key point is discarded.
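The distance-ratio check in the matching procedure can be sketched as follows. This is an illustrative sketch only: descriptors are assumed to be lists of floats, the function name and the `max_ratio` parameter are hypothetical, and a brute-force nearest-neighbor search is used for clarity.

```python
import math


def match_descriptors(desc_a, desc_b, max_ratio=0.8):
    """Return (i, j) index pairs whose nearest neighbour passes the ratio test."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the ratio needs a best and a second-best candidate
        (best, j), (second, _) = dists[0], dists[1]
        if best < max_ratio * second:
            matches.append((i, j))
    return matches
```

A lower `max_ratio` keeps only the most distinctive matches, so fewer pairs survive.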

Figure 2.4: Image Matching Using SIFT Features. (a) SIFT feature matching under scale variation with 0.3 as threshold and (b) SIFT feature matching under scale variation with 0.6 as threshold

Example 2.2. Two images at different scales are passed to the matching algorithm. SIFT detected 1022 and 939 key points in the two images. With 0.6 as the threshold, 405 key points were matched between the two images, as shown in Figure 2.4(b). With 0.3 as the threshold, only 122 points were matched, as shown in Figure 2.4(a). Depending on the application and the nature of the images, the threshold can be varied to get better matching performance.


2.5 Evaluation

This section deals with the performance evaluation of image interest point detectors. We considered repeatability as the performance measure; it is used to test a detector's performance under image deformations such as compression, intensity variation, scale change and rotation. The SUSAN detector, the Harris corner detector, the Harris-Laplace corner detector and SIFT are evaluated. Since only SIFT has a descriptor, only the interest point detection parts of the algorithms are compared. The images used for the evaluation were taken from the SIPI [13] data set. It contains three classes of images, namely textures, aerials and miscellaneous images. One image from each class is shown in Figure 2.5. This offers a wide range of data to test the algorithms.

Figure 2.5: Three Image Classes from SIPI Data set

Repeatability Criterion

The repeatability criterion tests whether the detector is invariant to changes in imaging conditions and data storage. It is defined by Hendaoui and Abdellaoui in [14] as Equation 2.12:

r_i^d(I_i, I_i^d) = \frac{C(I_i, I_i^d)}{\mathrm{mean}(m_i, m_i^d)} \qquad (2.12)

where I_i and I_i^d are the original image and the deformed image respectively, C(I_i, I_i^d) is the number of matched interest points, and m_i and m_i^d are the numbers of interest points in I_i and I_i^d respectively. For performance evaluation, six images from each class of the data set were considered, and the repeatability was averaged over the six images of each class according to Equation 2.13:

r_{avg}^d = \frac{1}{6} \sum_{i=1}^{6} r_i^d \qquad (2.13)
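Equations 2.12 and 2.13 can be illustrated with a short sketch. How two points are declared "matched" is not fixed by the equations; here it is assumed, purely for illustration, that a point is matched when an unused point of the deformed image lies within `tol` pixels of it. The function names and `tol` are hypothetical.

```python
import math


def repeatability(points_ref, points_def, tol=1.5):
    """Matched points divided by the mean number of detections (Equation 2.12)."""
    unused = list(points_def)
    matched = 0
    for p in points_ref:
        # Greedily pair each reference point with its nearest unused neighbour.
        best = min(unused, key=lambda q: math.dist(p, q), default=None)
        if best is not None and math.dist(p, best) <= tol:
            matched += 1
            unused.remove(best)
    mean_count = (len(points_ref) + len(points_def)) / 2.0
    return matched / mean_count if mean_count else 0.0


def average_repeatability(values):
    """Average over the images of one class (Equation 2.13)."""
    return sum(values) / len(values)
```

Multiplying the returned ratio by 100 gives a percentage scale.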


Intensity Variations

The overall brightness of the image is varied from -40 to 40 using the relation Inew = Iimage ± d, and the image intensity is limited to [0 255]. Ideally, the interest point positions should be unchanged from the original image under intensity variations. However, as the plots show, the repeatability of the detectors falls with changes in intensity. Gradients, being differences of pixel intensities, should ideally be invariant under additive intensity variations. However, pixel intensities close to white (255) or dark (0) saturate in the process, and hence the neighborhood gradients are affected. With increasing additive values, more pixels become white or dark, and so detector performance comes down. The response of all detectors is affected by brightness variations in all three image classes. SUSAN gives a better response since it considers a circular neighborhood of size 7 to detect a point, and it is less common for the majority of these pixels to turn black or white due to intensity variation. The responses of SIFT and SUSAN are better than those of the Harris and Harris-Laplace corner detectors. Results of intensity variation are shown in Figure 2.6(a) for aerial images, Figure 2.6(b) for miscellaneous images and Figure 2.6(c) for texture images.
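The effect of saturation on gradients can be seen in a small sketch. The image is assumed to be a list of rows of integer intensities, and the helper names are hypothetical.

```python
def shift_brightness(image, d, lo=0, hi=255):
    """Add a constant d to every pixel and clip the result to [lo, hi]."""
    return [[min(hi, max(lo, p + d)) for p in row] for row in image]


def horizontal_gradient(image):
    """Central difference along x for the interior pixels of each row."""
    return [[row[x + 1] - row[x - 1] for x in range(1, len(row) - 1)]
            for row in image]


row = [[10, 50, 100, 240, 250]]
print(horizontal_gradient(row))                        # [[90, 190, 150]]
print(horizontal_gradient(shift_brightness(row, -5)))  # [[90, 190, 150]]
print(horizontal_gradient(shift_brightness(row, 10)))  # [[90, 190, 145]]
```

The shift by -5 leaves every gradient unchanged, while the shift by +10 saturates the brightest pixel and alters the gradient next to it.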

Figure 2.6: Average Repeatability with Intensity. Panels: (a) brightness variation of aerial images, (b) brightness variation of miscellaneous images, (c) brightness variation of texture images. Each panel plots average repeatability against brightness variation (−40 to 40) for SIFT, Harris-Laplace, Harris and SUSAN.


Image Rotations

The images and their rotated versions were taken from the SIPI [13] image database, which contains texture images with rotations spanning from 0 to 200 degrees. SIFT shows better repeatability with the descriptor, since the orientation of the interest point is normalized before the descriptor is calculated, which makes it rotation invariant. Without the descriptor, SIFT shows lower repeatability, which is almost constant for any angle. The Harris and Harris-Laplace detectors show almost no repeatability, since they use 1D gradients for detection, which are sensitive to rotation. The SUSAN detector shows better repeatability than Harris since it considers a circular neighborhood for detection, and rotating the image mostly does not change a circular neighborhood. Results are shown in Figure 2.7.

Figure 2.7: Average Repeatability with Rotation. The plot shows average repeatability against rotation variation (−80 to 80) for SIFT and SUSAN.

Scale Variations

The images and their scaled versions, from 50% to 200%, are used to evaluate the detectors' response to scaling. Note that the SIFT and Harris-Laplace detectors account for scale invariance, while the SUSAN and Harris corner detectors are not scale invariant. All detectors except SIFT show poor repeatability with scale. Harris-Laplace performs better than Harris but worse than SIFT. Results of scale variation are shown in Figure 2.8(a) for aerial images, Figure 2.8(b) for miscellaneous images and Figure 2.8(c) for texture images.

Figure 2.8: Average Repeatability with Scale. Panels: (a) scale variation of aerial images, (b) scale variation of miscellaneous images, (c) scale variation of texture images. Each panel plots average repeatability against scale variation (0 to 4) for SIFT, Harris-Laplace, Harris and SUSAN.

Compression Variations

Images were compressed in JPEG format on a scale of [1 99], where the scale represents the inverse of the compression ratio. Since the compression is lossy, it changes the pixel values, which in turn changes the gradient values. Hence every detector is expected to show a decrease in repeatability as compression increases. Detectors performed better under compression of aerial and texture images, which contain redundant information. The SIFT and Harris-Laplace detectors performed better than the others because they consider the scale space of images, so they also detect interest points at blurred or approximated scales. Results of compression variation are shown in Figure 2.9(a) for aerial images, Figure 2.9(b) for miscellaneous images and Figure 2.9(c) for texture images.

2.6 Conclusion

The investigations in this chapter show that SIFT exhibits better repeatability than the other detectors under most image deformations. Under scale and rotation changes, SIFT also has poor repeatability. The descriptor is responsible for the rotation invariance of SIFT, which has contributed to its popularity. SIFT also detects more interest points than the other detectors. The SUSAN detector has a better response to rotation. The Harris-Laplace detector has a better response than the Harris corner detector under scale variations.


Figure 2.9: Average Repeatability with Compression. Panels: (a) compression of aerial images, (b) compression of miscellaneous images, (c) compression of texture images. Each panel plots average repeatability against compression scale (0 to 100) for SIFT, Harris-Laplace, Harris and SUSAN.

Chapter 3

Video Interest Point Detectors

A number of video interest point detectors have been proposed for the detection of spatio-temporal interest points. Some use robust image detectors on a frame-by-frame basis for applications like duplicate video detection [15] and video stitching [16]. These are not true video descriptors, as they do not consider the time dimension, which is critical for videos. Image detectors have been extended to videos in many different forms. SIFT [6] was extended to n-SIFT [1] by considering time gradients in the saliency measure. MoSIFT [2] uses optical flow in the time domain and SIFT in the spatial domain. STIP [10] is an extension of the Harris-Laplace detector [8] that uses 3D gradients. ESURF [3] is a direct extension of SURF to videos that uses the integral video concept and box filters. Local Invariant Feature Tracks (LIFT) [12] form another class of video features: a trajectory-based technique that matches image descriptors between frames to obtain visually similar and similarly moving features. We propose an extension of LIFT, called ELIFT, which picks interest points on feature tracks. This chapter studies all these video descriptors and compares their performance under deformations such as scaling, change in brightness and compression. Video alignment was done using these descriptors and the results are discussed.

3.1 n-SIFT

n-SIFT [1] was proposed by Cheung and Hamarneh for the detection and matching of interest points in medical images, which include 3D MRI images as well as 3D+time CT images. It is a direct extension of SIFT from 2D to arbitrary nD images.


Figure 3.1: n-SIFT Image pyramid and DoG images [1]

3.1.1 Interest Point Detection

The nD image is progressively convolved with an nD Gaussian and sub-sampled for the next octave. The Difference of Gaussians (DoG) is formed by taking the difference between successive levels in an octave. A point that is an extremum when compared to its neighbors in n dimensions is an interest point. The interest point detection algorithm as described by Cheung [1] is given in Algorithm 1.

The pyramid construction is similar to SIFT and is shown in Figure 3.1. The steps in Algorithm 1 are as follows. We convolve the nD input image with nD Gaussians with sigma equal to σ, kσ, k^2σ and so on to create s+3 Gaussian-smoothed images in an octave. Their differences give s+2 DoG images. Interest points are the extrema in the DoG images. A voxel p_j(x_1, x_2, ..., x_n) is defined as a local extremum iff it satisfies the conditions

|p_j(x_1, x_2, \ldots, x_n)| \ge |p_{j+i_0}(x_1+i_1, x_2+i_2, \ldots, x_n+i_n)| \quad \text{or}

|p_j(x_1, x_2, \ldots, x_n)| \le |p_{j+i_0}(x_1+i_1, x_2+i_2, \ldots, x_n+i_n)|

\forall\, i_0, i_1, i_2, \ldots, i_n \in \{-1, 0, 1\} \qquad (3.1)

The image is then subsampled by half in all dimensions and passed as input to the next octave. For a video, the number of octaves is o = log2(min(m, n, t)), where m, n are the frame dimensions and t is the number of frames. However, it has been noted that no meaningful interest points are detected by n-SIFT after 3-4 octaves, so we take the number of octaves as min(4, o). The initial blur is taken as 1.6 and s = 3, as suggested by Lowe [6].
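As a small worked example of these parameter choices, the sketch below computes the number of octaves and the sigma values of one octave. Rounding log2 down to an integer and taking k = 2^(1/s), as in Lowe's SIFT construction, are assumptions made for illustration; the function names are hypothetical.

```python
import math


def num_octaves(height, width, frames, max_octaves=4):
    """min(4, o) with o = floor(log2(min(m, n, t)))."""
    o = int(math.floor(math.log2(min(height, width, frames))))
    return min(max_octaves, o)


def octave_sigmas(sigma0=1.6, s=3):
    """Sigma of each of the s+3 smoothed images in an octave, with k = 2**(1/s)."""
    return [sigma0 * 2.0 ** (j / s) for j in range(s + 3)]
```

For a 1280x720 video of 1000 frames, `num_octaves(720, 1280, 1000)` is capped at 4, and `octave_sigmas()` yields six values starting at 1.6.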

Algorithm 1 n-SIFT Interest Point Detection [1]

 1: I_0 = I
 2: F = ∅
 3: for i = 0 to maxImageLevel do
 4:   I_{i,j} = GaussianFilter(I_i, k^j σ)   (Gaussian filter image I_i with sigma = k^j σ)
 5:   dI_{i,j} = I_{i,j} − I_{i,j−1}
 6: end for
 7: for j = 1 to S+2 do
 8:   PI = FindExtrema(dI_{i,j−1}, dI_{i,j}, dI_{i,j+1})
 9:   for p ∈ PI do
10:     F = F ∪ GenerateFeatures(p, I_i)
11:   end for
12:   I_{i+1} = scale(GaussianFilter(I_i, σ), 0.5)
13: end for
14: return F

3.1.2 n-SIFT Descriptor Computation

Figure 3.2: n-SIFT each of the 4x4x4 regions for Descriptor computation [1].

n-SIFT uses the nD gradients to create a 2^{5n-3} dimensional descriptor for every feature point, similar to the descriptor of SIFT. The steps followed to calculate the descriptor are as follows:

1. Select a 16^n hypercube around the interest point, as shown in Figure 3.2. Then calculate the nD gradients and express them in terms of magnitude and orientation.

2. Weight the gradient magnitudes by a Gaussian centered at the interest point location, which gives more weight to voxels near the interest point.

3. Compute the orientation histogram over the (n-1) orientation directions, with each voxel's gradient magnitude added to the bin corresponding to its orientation.

4. Find the bin with the highest value. This gives the dominant orientation, which is subtracted from the orientations of the neighboring voxels to make the descriptor rotation invariant.

5. Divide the hypercube into 4^n sub-regions as shown in Figure 3.2. Each sub-region is described by 8 bins for each of the (n-1) directions. Thus each sub-region is represented by 8^{n-1} bins, and in total a 4^n 8^{n-1} = 2^{5n-3} dimensional feature vector is created.

6. Normalize the feature vector, as in the case of SIFT, to account for scaling caused by brightness variations.

Figure 3.3: n-SIFT Example: Frames with interest points marked
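The descriptor length quoted in the steps above follows from writing both factors as powers of two. The short derivation below also evaluates it for the 2D case (ordinary SIFT) and the 3D case (video treated as a volume):

```latex
4^{n} \cdot 8^{\,n-1} = 2^{2n} \cdot 2^{3(n-1)} = 2^{5n-3},
\qquad
n = 2:\; 2^{7} = 128,
\qquad
n = 3:\; 2^{12} = 4096 .
```

The n = 2 value agrees with the 128-bin SIFT descriptor of Chapter 2.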

Example 3.1. We applied the n-SIFT algorithm to a sample walking sequence. As shown in Figure 3.3, the algorithm detects points around the head, hands and legs, where there is motion. It also detects interest points where there is no motion; intensity variations between frames are the reason for such redundant points.

The most noticeable trait of n-SIFT is that it detects a large number of spatial interest points without motion. Another problem with n-SIFT is its large memory usage: since it treats a video as a 3D image and builds octaves from it, it requires a large amount of memory.

3.2 MoSIFT

MoSIFT, also known as Motion-SIFT [2], was proposed by Chen and Hauptmann to detect interest points for describing motion and for action recognition in videos. Unlike the previous algorithm, it uses optical flow for motion in time instead of the temporal gradient. A detailed view of this method is given in the subsequent sections.

3.2.1 Interest Point Detection

MoSIFT uses SIFT to detect interest points in the spatial domain on a frame-by-frame basis, and optical flow is used for the temporal domain. The flow chart of the algorithm is given in Figure 3.4. We calculate interest points in the spatial domain using SIFT, considering one frame at a time: we construct a Gaussian pyramid of octaves and then take differences to get the DoG. The interest points are the extrema in the DoG, as explained in Chapter 2. These interest points must be refined to get the actual MoSIFT interest points. The optical flow is used as the saliency measure for the time dimension.

Figure 3.4: MoSIFT Flow Chart [2].

The optical flow [17] is used to find motion interest points in the video. Optical flow is based on brightness and smoothness constraints. Let the image brightness at point (x, y) in the image plane at time t be denoted by I(x, y, t). The brightness constraint assumes that the brightness of a particular point in the pattern is constant, which gives the following:

\frac{dI}{dt} = 0

\Rightarrow \frac{\partial I}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial I}{\partial y}\frac{\partial y}{\partial t} + \frac{\partial I}{\partial t} = 0

I_x v_x + I_y v_y + I_t = 0 \qquad (3.2)

where v_x, v_y are the optical flow velocities in the x and y directions between consecutive frames, and I_x, I_y, I_t are the partial derivatives of I. There are many techniques for solving this equation. Here, the least-squares method proposed by Lucas-Kanade [17] is used.
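The least-squares idea can be sketched for a single window: every pixel in the window contributes one instance of the brightness constraint I_x v_x + I_y v_y + I_t = 0, and the flow is the solution of the resulting 2x2 normal equations. This is a simplified, single-level sketch with a hypothetical function name; the pyramidal processing used by MoSIFT is not shown.

```python
def lucas_kanade_window(Ix, Iy, It):
    """Least-squares flow (vx, vy) from the gradient samples of one window.

    Ix, Iy, It are equally long sequences of spatial and temporal derivatives.
    Returns None when the 2x2 system is singular (aperture problem).
    """
    sxx = sum(a * a for a in Ix)
    sxy = sum(a * b for a, b in zip(Ix, Iy))
    syy = sum(b * b for b in Iy)
    sxt = sum(a * c for a, c in zip(Ix, It))
    syt = sum(b * c for b, c in zip(Iy, It))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None
    # Solve [[sxx, sxy], [sxy, syy]] @ [vx, vy] = -[sxt, syt].
    vx = (-syy * sxt + sxy * syt) / det
    vy = (sxy * sxt - sxx * syt) / det
    return vx, vy
```

The magnitude of the returned vector is the quantity that would be compared against a motion threshold.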

Figure 3.4 shows the block diagram for computing the optical flow between successive frames. A pair of pyramids from two successive frames is used to compute the optical flow at the same image pyramid level at which the SIFT interest point is detected. If the magnitude of the optical flow at the SIFT interest point is above a threshold, the point is considered a spatio-temporal interest point in the video. We can also match SIFT interest points between frames and apply the optical flow threshold to the matched points. This reduces the number of interest points and improves their stability across frames.

3.2.2 Descriptor Computation

MoSIFT uses the DoG and optical flow, so both spatial gradients and optical flow are readily available. MoSIFT uses the histogram of gradients (HoG) and the histogram of optical flow (HoF) to describe an interest point. The MoSIFT descriptor is obtained by concatenating HoG and HoF to get a 256-bin vector.

The SIFT descriptor is computed as described in Chapter 2. To describe motion around the interest point, a descriptor is computed from the local optical flow in the same way as the SIFT descriptor is computed from the image gradients. The dominant orientation normalization is not applied here, since the actual direction of motion is important. Like the gradient, the optical flow has both a direction and a magnitude. Take a local patch of 16x16 around the interest point and divide it into sixteen 4x4 patches. Each 4x4 patch is described by an 8-bin orientation histogram. Each optical flow value contributes to the bin of its orientation with a weight given by its magnitude. The sixteen 8-bin histograms are concatenated into a 128-bin optical flow histogram.
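The grid-of-histograms construction can be sketched as below. Both inputs are assumed to be 16x16 grids of 2D vectors (image gradients and optical flow), and the function names are hypothetical. For brevity the sketch omits the Gaussian weighting and the dominant-orientation normalization of the gradient part, so it is not a complete SIFT or MoSIFT descriptor.

```python
import math


def orientation_histogram_8(vectors):
    """8-bin orientation histogram, each vector weighted by its magnitude."""
    hist = [0.0] * 8
    for vx, vy in vectors:
        mag = math.hypot(vx, vy)
        if mag == 0.0:
            continue
        angle = math.degrees(math.atan2(vy, vx)) % 360.0
        hist[int(angle // 45.0) % 8] += mag
    return hist


def grid_descriptor(field):
    """16x16 grid of (vx, vy) -> 128 values from sixteen 4x4 cells."""
    desc = []
    for by in range(0, 16, 4):
        for bx in range(0, 16, 4):
            cell = [field[y][x] for y in range(by, by + 4)
                    for x in range(bx, bx + 4)]
            desc.extend(orientation_histogram_8(cell))
    return desc


def mosift_descriptor(gradients, flow):
    """Concatenate the gradient part and the flow part into 256 values."""
    return grid_descriptor(gradients) + grid_descriptor(flow)
```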

Example 3.2. We applied the MoSIFT algorithm to a sample walking sequence. As shown in Figure 3.5, the algorithm detects points around the head, hands and legs, where there is motion between frames. Some points near them are also detected due to Gaussian weighting in the neighborhood.

However, the main noticeable trait of this technique is that points in uniform motion are captured in multiple frames, which is redundant, and the number of interest points detected is small.


Figure 3.5: MoSIFT Example: Frames with interest points marked (Frame 1, Frame 2, Frame 3, Frame 4)

3.3 STIP

3.3.1 Interest Points in the Spatio-Temporal Domain

Space-Time Interest Points (STIP) were developed by Laptev [10] using the 3D gradient structure along with the scale-space representation to find interest points in video. The video f(x, y, t) is convolved with a 3D Gaussian g(·; σ_l^2, τ_l^2) to give the scale-space representation:

L(x, y, t; \sigma_l^2, \tau_l^2) = g(\cdot; \sigma_l^2, \tau_l^2) * f

where

g(x, y, t; \sigma_l^2, \tau_l^2) = \frac{1}{\sqrt{(2\pi)^3 \sigma_l^4 \tau_l^2}} \exp\left(-\frac{x^2 + y^2}{2\sigma_l^2} - \frac{t^2}{2\tau_l^2}\right)

Similar to the Harris corner detector, we consider the second moment matrix given by Equation 3.3. First, the spatial and temporal gradients are smoothed by a Gaussian function.


The second moment matrix is given by

\mu = g(\cdot; \sigma_i^2, \tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix} \qquad (3.3)

where L_x, L_y, L_t are the first-order derivatives of L(x, y, t; σ_l^2, τ_l^2), and σ_i, τ_i are the integration scales, related to σ_l, τ_l by σ_i = sσ_l and τ_i = sτ_l, where s = 2 is the preferred value. Interest points in video are the points at which the matrix µ has large eigenvalues. Instead of computing the eigenvalues explicitly, we can apply the indirect method used by Harris to detect spatial interest points. The Harris corner function for videos is defined by Equation 3.4:

H = \det(\mu) - k\,\mathrm{trace}^3(\mu) = \lambda_1 \lambda_2 \lambda_3 - k(\lambda_1 + \lambda_2 + \lambda_3)^3 \qquad (3.4)

with k = 0.005. If H is greater than a positive threshold, the point is a spatio-temporal interest point. The spatial and temporal scales of the detected interest points are important for capturing events with different spatial and temporal extents.
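For a single point, the corner function of Equation 3.4 can be evaluated directly from the 3x3 second moment matrix without computing eigenvalues. A minimal sketch with hypothetical helper names, assuming µ is given as three rows of three numbers:

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def harris3d_response(mu, k=0.005):
    """H = det(mu) - k * trace(mu)**3 for a 3x3 second moment matrix."""
    trace = mu[0][0] + mu[1][1] + mu[2][2]
    return det3(mu) - k * trace ** 3
```

For example, a matrix with strong variation in all three directions, such as diag(2, 3, 4), gives a positive response, while diag(5, 5, 0), a spatial corner with no temporal variation, gives a negative one.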

3.3.2 Scale Selection in Space-Time

Interest points are determined as the maxima of the scale-normalized Laplacian defined by

\nabla^2_{norm} L = \sigma^2 \tau^{1/2} (L_{xx} + L_{yy}) + \sigma \tau^{3/2} L_{tt} \qquad (3.5)

Hence the interest points are the points that attain a maximum of the normalized Laplacian of Equation 3.5 as well as of the Harris corner function H given by Equation 3.4. The process to find such points is given in Algorithm 2.

Corner points are found in the spatio-temporal scale space using Equation 3.4, and these points are tracked in scale space for maxima of ∇²_norm L.

Example 3.3. We applied the STIP algorithm to a sample walking sequence. As shown in Figure 3.6, the algorithm detects points around the head, hands and legs, where there is motion between frames, along with some points near them. In addition, other background points with no motion are detected. This is due to intensity variation between frames.


Algorithm 2 Detection of Scale-Adapted Spatio-Temporal Interest Points

Require: Video f, spatial scales σ_l^2 = σ_{l,1}^2, σ_{l,2}^2, ..., σ_{l,n}^2 and temporal scales τ_l^2 = τ_{l,1}^2, τ_{l,2}^2, ..., τ_{l,n}^2
 1: Construct the scale space for the initial spatial and temporal scales and the integration scales.
 2: Find interest points p_j = (x_j, y_j, t_j, σ_{l,j}^2, τ_{l,j}^2), for j = 1, 2, ..., N, over the scale space using maxima of Equation 3.4.
 3: for each interest point j = 1 to N do
 4:   Compute ∇²_norm L using Equation 3.5 at the present scale σ_{l,j}^2, τ_{l,j}^2 and at the scales below and above the present scale.
 5:   Choose the scale which maximizes ∇²_norm L
 6:   if σ̃_{i,j}^2 ≠ σ_{i,j}^2 or τ̃_{i,j}^2 ≠ τ_{i,j}^2 then
 7:     Redirect interest point p_j to p̃_j = (x̃_j, ỹ_j, t̃_j, σ̃_{l,j}^2, τ̃_{l,j}^2), where σ̃_{l,j}^2, τ̃_{l,j}^2 are the adapted scales and (x̃_j, ỹ_j, t̃_j) is the new position of the interest point close to (x_j, y_j, t_j)
 8:     Set p_j = p̃_j and go to step 4
 9:   end if
10: end for

Figure 3.6: STIP Example: Frames with interest points marked


There are cases where spatial corners are detected as spatio-temporal interest points even though there is no motion around the point. This can be attributed to intensity variations between frames: a spatial interest point with high variation in both spatial dimensions becomes a spatio-temporal interest point when a high gradient in the temporal dimension results from brightness variation or noise rather than from any motion or event of interest. Points surrounding a corner are also detected due to Gaussian blurring at different scales.

3.4 ESURF

Speeded Up Robust Features (SURF) [11] is extended to detect spatio-temporal interest

points by Willems and Tuytelaars in [3]. It uses 3D Hessian as the saliency measure.

Box filters and integral image concepts in SURF are extended to integral video to get

better execution performance.

3.4.1 Hessian based Interest Point Detection and Scale Selection

Spatio-temporal interest points are detected using the 3D Hessian matrix:

H(\cdot; \sigma^2, \tau^2) = \begin{pmatrix} L_{xx} & L_{xy} & L_{xt} \\ L_{yx} & L_{yy} & L_{yt} \\ L_{tx} & L_{ty} & L_{tt} \end{pmatrix}

A point with det(H) > 0 is treated as an interest point. Unlike the 2D case, this does not imply that all eigenvalues have the same sign, so saddle points are detected in addition to blobs.

Scale Selection

We use γ normalization to find the correct scales σ_0 and τ_0 at the center of a Gaussian blob. At the center of the blob, only the first term of the determinant is nonzero and the remaining terms vanish. Thus

L_{xx}^{\gamma norm} L_{yy}^{\gamma norm} L_{tt}^{\gamma norm} = \sigma^{5} \tau^{5/2} L_{xx} L_{yy} L_{tt} \qquad (3.6)

For scale selection using Equation 3.6, we need to optimize two parameters, σ and τ. An iterative method is used to find the scales at which det(H) attains a maximum over the scales.

3.4.2 Implementation Details

Integral Video. SURF uses the integral image, which reduces the computation of the sum of values in a rectangular region to the addition of 4 terms. Similarly, ESURF uses the integral video concept to compute the sum of values within a rectangular volume using 8 additions. For a video of size MxNxT, the integral video at location (x, y, t) is equal to the sum of all pixel values over the rectangular region spanned by (0,0) and (x,y), summed over all frames [0,t]. For a video V, the integral video IV is defined by Equation 3.7.

IV(x, y, t) = \sum_{i=1}^{t} \sum_{j=1}^{y} \sum_{k=1}^{x} V(i, j, k) \qquad (3.7)
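The integral video and the eight-term box sum can be sketched as follows. The video is assumed to be a nested list indexed as video[t][y][x], the table is zero-padded by one entry in each dimension to avoid boundary checks, and the function names are hypothetical.

```python
def integral_video(video):
    """Zero-padded cumulative sums: iv[t+1][y+1][x+1] covers video[:t+1][:y+1][:x+1]."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    iv = [[[0] * (W + 1) for _ in range(H + 1)] for _ in range(T + 1)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                iv[t + 1][y + 1][x + 1] = (
                    video[t][y][x]
                    + iv[t][y + 1][x + 1] + iv[t + 1][y][x + 1] + iv[t + 1][y + 1][x]
                    - iv[t][y][x + 1] - iv[t][y + 1][x] - iv[t + 1][y][x]
                    + iv[t][y][x]
                )
    return iv


def box_sum(iv, t0, t1, y0, y1, x0, x1):
    """Sum over video[t0:t1][y0:y1][x0:x1] from eight corner lookups."""
    return (
        iv[t1][y1][x1] - iv[t0][y1][x1] - iv[t1][y0][x1] - iv[t1][y1][x0]
        + iv[t0][y0][x1] + iv[t0][y1][x0] + iv[t1][y0][x0] - iv[t0][y0][x0]
    )
```

Once the table is built, the cost of `box_sum` does not depend on the size of the box, which is what makes large box filters cheap.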

ESURF uses the determinant of the 3D Hessian matrix as the saliency measure, which involves the second-order derivatives in the spatio-temporal domain D_xx, D_yy, D_tt, D_xt, D_xy, D_yt. They are computed with rotated versions of the box filters shown in Figure 3.7(a) and Figure 3.7(b), with 8 additions using the integral video.

Figure 3.7: The two types of box filter approximations for the 2 + 1D second order partial derivatives (a) in one direction and (b) in two directions [3]

We build a pyramid of 3 octaves, each octave consisting of 5 scales, with a scaling ratio of 1.2 between consecutive scales. This task can be parallelized: because of the box filters and the integral video, each scale is independent of the others. The determinant of the Hessian matrix is computed at each scale in the temporal and spatial domains. The combination of spatial and temporal octaves o_σ and o_τ results in pairs of scales (σ_i, τ_i), which gives a cube structure filled with Hessian determinants. Once the cube is filled, non-maximum suppression is applied to obtain the extrema in the five-dimensional spatio-temporal scale space (x, y, t, σ, τ). The location of an extremum is the location of the interest point, and the scale of the interest point is the scale at which it is detected.

Example 3.4. We applied the ESURF algorithm to a sample walking sequence. As shown in Figure 3.8, the algorithm detects points around the head and jacket, where there is motion between frames, along with some points in the background.

Figure 3.8: ESURF Example: Frames with interest points marked

ESURF detects blobs both in motion and in the background, which gives redundant information for some applications. The execution time of this algorithm is low due to the integral video concept and the approximation of derivatives with box filters. Since each octave is independent of the others, execution can be parallelized, which is not possible for the other detectors, which work in a hierarchical way.

3.5 ELIFT

Local Invariant Feature Tracks (LIFT) [12] were proposed by Mezaris, Ioannis and Dimou. Feature tracks are sets of visually similar interest points in a video that have temporal continuity. A track is thus a set of points that are visually similar and move similarly. We extended this notion of tracks to obtain interest points in fixed-camera video by considering the magnitude and direction of motion.

3.5.1 Feature Track Extraction

Let S be a video of T frames, S = {I_t}_{t=1}^{T}. We apply SIFT [6] to each frame I_t to get feature points and feature descriptors Φ = {φ_m}_{m=1}^{M}, where M is the number of detected features and the descriptor φ_m is defined as φ_m = [φ_m^x, φ_m^y, φ_m^d]. φ_m^x, φ_m^y represent the location of the interest point and φ_m^d describes the interest point with a 128-bin vector. The interest point φ_m from I_t can have a temporal correspondence with a point in I_{t−1}. It is found by a local search in the neighborhood of the interest point, taking a square patch of side 2δ + 1. A point φ_n ∈ Φ_{t−1} is said to be tracked in I_t as φ_m if it satisfies Equation 3.8:

|\varphi_m^x - \varphi_n^x| \le \delta, \quad |\varphi_m^y - \varphi_n^y| \le \delta, \quad d(\varphi_m^d, \varphi_n^d) \le d_{min} \qquad (3.8)

δ = 7 and d_min = 0.6 are the values taken for evaluation. If more than one point matches the current point, we take the one with the smallest distance. When such an interest point φ_n exists, the interest point φ_m ∈ Φ_t is appended to the track containing φ_n; otherwise φ_m is taken as the first element of a new feature track. A set of feature tracks Ψ = {ψ_k}_{k=1}^{K} is formed by finding the temporal correspondences of all interest points in all frames of S. The feature track ψ_k is defined as ψ_k = [ψ_k^x, ψ_k^y, ψ_k^d], where ψ_k^d is the average descriptor of the feature track, obtained by element-wise averaging of the descriptors of its elements, and ψ_k^x = [ψ_k^{x,t_{k1}}, ψ_k^{x,t_{k1}+1}, ..., ψ_k^{x,t_{k2}}] with t_{k1} < t_{k2}. ψ_k^y is defined in the same way as ψ_k^x. The trajectory of the feature is ξ_k = [ψ_k^x, ψ_k^y]. If the length of the trajectory, t_{k2} − t_{k1}, is more than 5, the track is considered valid; otherwise it is discarded.
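The track-building procedure can be sketched as a greedy frame-to-frame linking. The sketch makes several assumptions for illustration: features are given per frame as (x, y, descriptor) tuples, a track can be extended by at most one point per frame, the averaged descriptor of each track is not computed, and the function name is hypothetical.

```python
import math


def link_tracks(frames, delta=7, d_min=0.6, min_length=5):
    """Link features of consecutive frames into tracks (Equation 3.8).

    frames: list over time of lists of (x, y, descriptor) tuples.
    Returns the tracks, each a list of (t, x, y, descriptor), whose
    temporal length exceeds min_length.
    """
    tracks = []
    active = []  # tracks whose last element lies in the previous frame
    for t, features in enumerate(frames):
        next_active = []
        for x, y, desc in features:
            best, best_dist = None, d_min
            for track in active:
                _, px, py, pdesc = track[-1]
                if abs(x - px) <= delta and abs(y - py) <= delta:
                    dist = math.dist(desc, pdesc)
                    if dist <= best_dist:
                        best, best_dist = track, dist
            if best is None:
                best = []  # no correspondence: start a new track
                tracks.append(best)
            else:
                active = [tr for tr in active if tr is not best]
            best.append((t, x, y, desc))
            next_active.append(best)
        active = next_active
    return [tr for tr in tracks if tr[-1][0] - tr[0][0] > min_length]
```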

3.5.2 Interest Points on Feature Track

Interest points are points on a feature track with special properties, such as points where the track changes its direction or motion. Select a feature track ψ_k with trajectory ξ_k. Algorithm 3 shows the steps followed to detect interest points on a feature track. Only tracks with an average displacement greater than a threshold are considered for detection, because stationary background points have no motion and are not truly spatio-temporal interest points, as shown in Figure 3.9(a). Points with nonuniform displacement are interest points on tracks, and they are local maxima or minima of the track. The points where a track changes its direction, or where it deviates from its normal path by more than a threshold, are interest points, as shown in Figure 3.9(b).

Example 3.5. We applied the ELIFT algorithm to a sample walking sequence. As shown in Figure 3.10(b), the algorithm detects points around the head and jacket, where there is motion between frames, along with some points in the background due to round-off errors. The points that were tracked are shown in Figure 3.10(a). There are 1782 feature tracks with 49577 tracked points, which were reduced to 2046 points.


Figure 3.9: (a) Track with no motion and (b) track with motion and deviation in path. Both panels plot the track over the axes x, y and t.

Algorithm 3 Detection of Interest Points on Feature Tracks

Require: Feature tracks Ψ
Ensure: Ψ ≠ ∅
 1: Interest points P = ∅
 2: for Feature Track ψ_k, k = 1 to K do
 3:   Calculate displacement in X and Y directions x_d and y_d
 4:   Calculate average displacement x_davg and y_davg
 5:   if average displacement x_davg > 1 && y_davg > 1 then
 6:     Calculate direction of displacement d = tan^{-1}(y_d / x_d)
 7:     for every point p_i ∈ ψ_k do
 8:       if p_i is local maxima or minima in x or y then
 9:         P = P ∪ p_i
10:         break;
11:       end if
12:       if direction d_i is opposite to d_{i−1} then
13:         P = P ∪ p_i
14:         break;
15:       end if
16:       if |d_i − d_{i−1}| > 40 then
17:         P = P ∪ p_i
18:       end if
19:     end for
20:   end if
21: end for
22: RETURN P
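A possible reading of Algorithm 3 for a single track is sketched below. Several choices are assumptions made for illustration and are not fixed by the pseudocode: displacements are taken between consecutive track points, the average uses absolute displacements, `atan2` replaces tan^{-1}(y_d / x_d) so that opposite directions differ by 180 degrees, each `break` is read as "move on to the next point", and the function name is hypothetical.

```python
import math


def track_interest_points(xs, ys, min_avg_disp=1.0, max_turn_deg=40.0):
    """Indices of interest points on one track given its x and y coordinates."""
    n = len(xs)
    if n < 3:
        return []
    dx = [xs[i + 1] - xs[i] for i in range(n - 1)]
    dy = [ys[i + 1] - ys[i] for i in range(n - 1)]
    avg_dx = sum(abs(v) for v in dx) / len(dx)
    avg_dy = sum(abs(v) for v in dy) / len(dy)
    if not (avg_dx > min_avg_disp and avg_dy > min_avg_disp):
        return []  # (nearly) stationary track: treated as background
    direction = [math.degrees(math.atan2(b, a)) for a, b in zip(dx, dy)]
    points = []
    for i in range(1, n - 1):
        extremum_x = (xs[i] - xs[i - 1]) * (xs[i + 1] - xs[i]) < 0
        extremum_y = (ys[i] - ys[i - 1]) * (ys[i + 1] - ys[i]) < 0
        turn = abs(direction[i] - direction[i - 1])
        turn = min(turn, 360.0 - turn)  # smallest angle between the two steps
        # A reversal of direction is a 180 degree turn, so it is covered here.
        if extremum_x or extremum_y or turn > max_turn_deg:
            points.append(i)
    return points
```

A straight diagonal track yields no interest points, while a track that rises and then falls yields the turning point.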

The main advantage of this algorithm is that it detects points with motion without optical flow, which is computationally expensive. Some background points still persist. This is due to features jumping between tracks, which arises when the spatial distance between them is very small: they share almost the same 16x16 patch used for descriptor computation in SIFT, so the distance between their descriptors is very small. This problem can be addressed by increasing d_min in Equation 3.8, but actual interest points are then also filtered out, so there is a tradeoff.

Figure 3.10: (a) The tracked features (LIFT) and (b) the modified interest points (ELIFT)

3.6 Evaluation

3.6.1 Evaluation Criteria

We considered repeatability and execution time as in [14] for performance measuring.

Repeatability

It determines the percentage of spatio-temporal interest points that are retained in spite of geometric transforms and photometric deformations. The higher this rate, the more robust the detector is to the transformations. It is defined as the ratio of the number of matched points to the mean of the numbers of interest points detected in the two videos, given by Equation 3.9:

r_{1,2} = \frac{c(V_1, V_2)}{\mathrm{mean}(n_1, n_2)} \qquad (3.9)

where c(V_1, V_2) is the number of matched pairs of points, and n_1 and n_2 are the numbers of interest points detected in V_1 and V_2 respectively.

Execution Time

Execution time is the time taken by a detector to detect the interest points in the video. It reflects the computational complexity of the detector. We also measure the number of points detected and determine whether they are sparse or dense.

3.6.2 Experimental Results

All programs were executed on a computer with an Intel Core 2 Duo processor and 16 GB of RAM, under the Windows 7 Professional operating system. Table 3.1 shows the execution time and the number of interest points detected. A 1000-frame 1280x720 (YouTube 720p) video with a frame rate of 25 fps is used for evaluation.

Brightness change

The video is subjected to a uniform change in intensity, Inew = Iframe + d, where d is a constant varying from -40 to 40 and the resulting image intensity is limited to [0, 255]. The corresponding performance of the video descriptors is shown in Figure 3.11. n-SIFT, ESURF and STIP perform better than MoSIFT and ELIFT. ELIFT and MoSIFT detect fewer interest points, and saturation of the intensity at the lower or upper limit may remove some of these points, so they are more sensitive to brightness change.
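The brightness transformation can be sketched as follows. To keep the example free of dependencies, a frame is assumed to be an 8-bit grayscale image stored as a list of rows; `change_brightness` is a hypothetical name.

```python
def change_brightness(frame, d):
    """Return Inew = Iframe + d for every pixel, limited to the range [0, 255]."""
    return [[min(255, max(0, pixel + d)) for pixel in row] for row in frame]


frame = [[0, 30, 128],
         [200, 240, 255]]

darker = change_brightness(frame, -40)    # [[0, 0, 88], [160, 200, 215]]
brighter = change_brightness(frame, 40)   # [[40, 70, 168], [240, 255, 255]]
```

In the toy frame, the pixels 0 and 30 both become 0 for d = -40, and 240 and 255 both become 255 for d = 40. This loss of local contrast at the limits is the saturation effect that can remove interest points.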


[Plot: repeatability (0 to 100) against brightness variation (−40 to 40) for ESURF, n-SIFT, STIP, MoSIFT and ELIFT]

Figure 3.11: Brightness change vs. repeatability

[Plot: repeatability (0 to 100) against scale variation (0 to 4) for ESURF, n-SIFT, STIP, MoSIFT and ELIFT]

Figure 3.12: Scale change vs. repeatability

Scale changes

The videos and their spatially scaled versions, from 50% to 200%, are used for evaluating the robustness of the descriptors to scaling. The result is shown in Figure 3.12. All the algorithms perform poorly when the video is downscaled, while upscaling gives some improvement in performance. This does not augur well for matching or aligning images of different scales.

Compression

The video is compressed in MPEG format on a scale of [1, 50], and the results are presented in Figure 3.13. Because the compression is lossy, it changes the pixel values and hence the gradient values, so every detector is expected to show a decrease in repeatability as compression increases. n-SIFT and STIP show better repeatability. ESURF, which uses box filters and the integral video concept, suffers additive losses of information and thus shows low repeatability.

[Plot: repeatability (0 to 100) against compression variation (1 to 7) for ESURF, n-SIFT, STIP, MoSIFT and ELIFT]

Figure 3.13: Compression change vs. repeatability

Execution Time

A comparison of the execution time and the number of interest points detected is given in Table 3.1. ESURF has the shortest execution time, which derives from its use of the integral video. MoSIFT and n-SIFT require the longest execution times. MoSIFT and ELIFT detect fewer interest points, with fewer background points. n-SIFT and ESURF detect a large number of points, which are dense. n-SIFT, STIP and ESURF detect background points along with points with motion.

Detector    Time of execution (sec)    No. of points detected

n-SIFT      1404.954570                31288

STIP        1127.339653                130261

ESURF       113.326833                 261960

MoSIFT      1829.52802                 72723

ELIFT       737.707349                 12046

Table 3.1: Execution Time Comparison
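To put the figures of Table 3.1 on a per-frame basis, the short sketch below divides them by the length of the test video. It assumes that the table was measured on the 1000-frame, 25 fps video described in Section 3.6.2, which the text suggests but does not state explicitly; `per_frame_stats` is a hypothetical helper.

```python
N_FRAMES = 1000   # assumed length of the test video (Section 3.6.2)
FPS = 25          # frame rate of the test video

# Values copied from Table 3.1: (execution time in seconds, points detected).
TABLE_3_1 = {
    "n-SIFT": (1404.954570, 31288),
    "STIP": (1127.339653, 130261),
    "ESURF": (113.326833, 261960),
    "MoSIFT": (1829.52802, 72723),
    "ELIFT": (737.707349, 12046),
}


def per_frame_stats(table, n_frames=N_FRAMES, fps=FPS):
    """Return seconds per frame, points per frame and slowdown relative to playback time."""
    duration = n_frames / fps  # playback time of the video in seconds
    stats = {}
    for name, (seconds, points) in table.items():
        stats[name] = {
            "sec_per_frame": seconds / n_frames,
            "points_per_frame": points / n_frames,
            "times_playback": seconds / duration,
        }
    return stats


stats = per_frame_stats(TABLE_3_1)
fastest = min(stats, key=lambda name: stats[name]["sec_per_frame"])
sparsest = min(stats, key=lambda name: stats[name]["points_per_frame"])
```

Under this assumption, ESURF needs about 0.11 s per frame, roughly 2.8 times the 40 s playback time of the video, and ELIFT yields about 12 points per frame.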

Chapter 4

Conclusions and Future Scope

4.1 Conclusion

This thesis mainly dealt with spatio-temporal interest point detectors for video and their stability under geometric and photometric deformations. The SUSAN detector, Harris corner detector, Harris-Laplace detector and SIFT spatial interest point detectors have been studied and their performance has been evaluated. SIFT exhibits better repeatability than the other detectors under image deformation. SIFT is also a more complete package, combining robust interest point detection and description with suitable matching techniques.

The spatio-temporal extensions of these image interest point detectors have been studied and their performance has been evaluated in terms of repeatability and execution time. n-SIFT has better repeatability than the others under video deformations. The proposed ELIFT detects the fewest interest points and few points without motion. MoSIFT detects the second fewest points, while ESURF detects the largest number of interest points. ESURF has the shortest execution time, followed by ELIFT. While ESURF has problems in detecting spatio-temporal interest points with motion, n-SIFT detects spatial interest points with and without motion.

4.2 Future Scope

The probable extensions of the work done in this thesis are the following:

• Further study is required to analyze stability under temporal scaling, i.e., different frame rates.

• The performance of the algorithms under nonuniform brightness variation needs to be evaluated.


• Detector performance together with the corresponding descriptors needs to be analyzed for different applications such as video classification, video alignment and action recognition.

• ELIFT needs to be extended to moving-camera video applications.

• Camera motion estimation, evaluation of the video descriptors on video sequences with a moving camera, and alignment of such sequences present another challenging research topic.

Bibliography

[1] W. Cheung and G. Hamarneh. N-sift: N-dimensional scale invariant feature trans-

form for matching medical images. In Biomedical Imaging: From Nano to Macro,

2007. ISBI 2007. 4th IEEE International Symposium on, pages 720–723, April 2007.

doi: 10.1109/ISBI.2007.356953.

[2] Ming-yu Chen and Alexander Hauptmann. Mosift: Recognizing human actions in

surveillance videos. 2009.

[3] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-

invariant spatio-temporal interest point detector. In Computer Vision–ECCV 2008,

pages 650–663. Springer, 2008.

[4] Tinne Tuytelaars and Krystian Mikolajczyk. Local invariant feature detectors: a survey. Foundations and Trends® in Computer Graphics and Vision, 3(3):177–280, 2008.

[5] Chris Harris and Mike Stephens. A combined corner and edge detector. In Alvey

vision conference, volume 15, page 50. Manchester, UK, 1988.

[6] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. ISSN 0920-5691. doi: 10.1023/B:VISI.0000029664.99615.94. URL http://dx.doi.org/10.1023/B%3AVISI.0000029664.99615.94.

[7] Stephen M. Smith and J. Michael Brady. Susan a new approach to low level image processing. International Journal of Computer Vision, 23(1):45–78, 1997. ISSN 0920-5691. doi: 10.1023/A:1007963824710. URL http://dx.doi.org/10.1023/A%3A1007963824710.

[8] International Journal of Computer Vision, 60(1), 2004. ISSN 0920 5691.

[9] Tony Lindeberg. Linear spatio-temporal scale-space. In Scale-Space Theory in

Computer Vision, pages 113–127. Springer, 1997.


[10] Ivan Laptev. On space-time interest points. International Journal of Computer

Vision, 64(2-3):107–123, 2005.

[11] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Ales Leonardis, Horst Bischof, and Axel Pinz, editors, Computer Vision–ECCV 2006, volume 3951 of Lecture Notes in Computer Science, pages 404–417. Springer Berlin Heidelberg, 2006. ISBN 978-3-540-33832-1. doi: 10.1007/11744023_32. URL http://dx.doi.org/10.1007/11744023_32.

[12] Vasileios Mezaris, Anastasios Dimou, and Ioannis Kompatsiaris. Local invariant

feature tracks for high-level video feature extraction. In Analysis, Retrieval and

Delivery of Multimedia Content, pages 165–180. Springer, 2013.

[13] University of Southern California. Sipi image database. URL

http://sipi.usc.edu/database/database.php.

[14] R Hendaoui, M Abdellaoui, and A Douik. Synthesis of spatio-temporal interest

point detectors: Harris 3d, mosift and surf-mhi. In Advanced Technologies for Signal

and Image Processing (ATSIP), 2014 1st International Conference on, pages 89–94.

IEEE, 2014.

[15] Zhijie Zhang, Chongxiao Cao, Ruijie Zhang, and Jianhua Zou. Video copy detection

based on speeded up robust features and locality sensitive hashing. In Automation

and Logistics (ICAL), 2010 IEEE International Conference on, pages 13–18. IEEE,

2010.

[16] Lin-na Li and Nan Geng. Algorithm for sequence image automatic mosaic based on

sift feature. In Information Engineering (ICIE), 2010 WASE International Con-

ference on, volume 1, pages 203–206. IEEE, 2010.

[17] J. Barron and N. Thacker. Tutorial: Computing 2d and 3d optical flow. 2005.
