3D motion detection using neural networks

CHAPTER 1

INTRODUCTION

In video surveillance, video signals from multiple remote locations are

displayed on several TV screens which are typically placed together in a control

room. In the so-called third generation surveillance systems (3GSS), all the parts of

the surveillance systems will be digital and consequently, digital video will be

transmitted and processed. Additionally, in 3GSS some 'intelligence' has to be

introduced to detect relevant events in the video signals in an automatic way. This

allows irrelevant time segments of the video sequences to be filtered out, so that

only those segments that require the attention of the surveillance operator are

displayed on the TV screen. Motion detection is a basic operation in the selection of

significant segments of the video signals. Once motion has been detected, other

features can be considered to decide whether a video signal has to be presented to the

surveillance operator. If the motion detection is performed after the transmission of

the video signals from the cameras to the control room, then all the bit streams must

first be decompressed; this can be a very demanding operation, especially if

there are many cameras in the surveillance system. For this reason, it is worthwhile to

consider the use of motion detection algorithms operating in the compressed

(transform) domain.

In this thesis we present a motion detection algorithm in the compressed

domain with a low computational cost. In what follows, we assume that

video is compressed using Motion JPEG (MJPEG), i.e. each frame is individually

JPEG compressed.

Motion detection from a moving observer has been a very important

technique for computer vision applications. In recent years especially, vision-based

navigation methods for autonomous driving and driver support systems have

received more and more attention worldwide.

One of the most important tasks is to detect moving obstacles such as cars,

bicycles or even pedestrians while the vehicle itself is moving at high speed.

Image differencing, either against a clear background or between adjacent

frames, is widely used for motion detection. But when the observer is also moving,

the background scene in the perspective projection image changes continuously, and

it becomes more difficult to detect the real moving objects by differencing

methods. To deal with this problem, many approaches have been proposed in recent

years. Previous work in this area falls mainly into two categories: 1) using the

difference between the optical flow vectors of the background and those of the

moving objects, and 2) calibrating the background displacement using the result of

the camera's 3D motion analysis. The first approach calculates the optical flow

and estimates the reliability of the flow vectors between adjacent frames. The

major flow vector, which represents the motion of the background, can then be used

to classify and extract the flow vectors of the real moving objects. However,

because of its huge computational cost and the difficulty of determining accurate

flow vectors, it is still impractical for real applications. Analyzing the

camera's 3D motion and calibrating the background is the other main method for

moving object detection. For on-board camera motion analysis, many motion-detection

algorithms have been proposed that depend on previous recognition results such as

road lane marks and the horizon vanishing point. These methods show good accuracy

and efficiency because of their detailed analysis of the road structure and the

measured vehicle motion. They are, however, computationally expensive and overly

dependent on road features such as lane marks, and therefore give unsatisfactory

results when the lane marks are covered by other vehicles or do not exist at all.

Compared with these previous works, a new method of moving object detection from

an on-board camera is presented in this thesis. To deal with the background-change

problem, our method uses the results of the camera's 3D motion analysis to

calibrate the background scene. With pure point matching and the introduction of

the camera's Focus of Expansion (FOE), our method is able, in theory, to determine

the camera's rotation and translation parameters using only three pairs of

matching points between adjacent frames, which makes it faster and more efficient

for real-time applications.

A neural network, also known as a parallel distributed processing network,

is a computing paradigm that is loosely modeled after cortical structures of the brain.

It consists of interconnected processing elements called nodes or neurons that work

together to produce an output function. The output of a neural network relies on the

cooperation of the individual neurons within the network to operate. Processing of

information by neural networks is characteristically done in parallel rather than in

series (or sequentially) as in earlier binary computers or Von Neumann machines.

Since it relies on its member neurons collectively to perform its function, a unique

property of a neural network is that it can still perform its overall function even if

some of the neurons are not functioning. In other words, it is robust to errors and

failures. All neural networks take numeric input and produce numeric output. The

transfer function of a unit is typically chosen so that it can accept input in any range,

and produces output in a strictly limited range (it has a squashing effect).

An artificial neural network (ANN), also called a simulated neural network

(SNN) or commonly just neural network (NN) is an interconnected group of artificial

neurons that uses a mathematical or computational model for information processing

based on a connectionist approach to computation. In most cases an ANN is an

adaptive system that changes its structure based on external or internal information

that flows through the network.

There are different topologies of neural networks that may be employed for

time series modeling. In our investigation we used radial basis function networks

which have shown considerably better scaling properties, when increasing the

number of hidden units, than networks with sigmoid activation function.

RBF networks were introduced into the neural network literature by

Broomhead/Lowe and Poggio/Girosi in the late 1980s. The RBF network model is

motivated by the locally tuned response observed in biological neurons, e.g. in the

visual or in the auditory system. RBFs have been studied in multivariate

approximation theory, particularly in the field of function interpolation. The RBF

neural network model is an alternative to the multilayer perceptron, which is perhaps

the most often used neural network architecture. A radial basis function (RBF)

network has a hidden layer of radial units, each modeling a Gaussian

response surface. Since these functions are nonlinear, it is not necessary to

have more than one hidden layer to model any shape of function: given enough radial

units, a single hidden layer can approximate any continuous function.

In a surveillance system, motion estimation is of great importance because it

enables various types of operations to be performed on the detected object. When

using motion estimation, an assumption is made that the objects in the scene have

only translational motion. This assumption holds only as long as there is no camera pan,

zoom, change in luminance, or rotational motion, which is a strong assumption.

After the process of estimation, the detected motion has to be extracted. With

the obtained boundary, two objects (with background) can then be extracted from

two image frames (both current image frame and previous image frame). Extracting

the moving object from its background can be done by the edge enhancement

network and the background remover.

At the algorithm level, complexity, regularity and precision are the main factors that

directly affect the power consumed in executing a motion estimation algorithm.

Concurrency and modularity are the requirements on algorithms that are intended to

execute on low-power architectures. This project aims to reduce the power

consumption of motion estimation at the algorithm level and the architectural level by

using neural network concepts.

1.1 PROBLEM STATEMENT

The goals for this thesis have been the following.

One goal has been to compile an introduction to motion detection

algorithms. A number of studies exist, but a complete reference on real-time

motion detection is not as common. We have collected material from journals, papers

and conferences and proposed the approach best suited to implementing real-time

motion detection.

Another goal has been to search for algorithms that can be used to implement

the RBF neural network.

A third goal is to evaluate their performance with regard to motion detected.

These properties were chosen because they have the greatest impact on the

implementation effort.

A final goal has been to design and implement an algorithm including object

extraction. This should be done in a high-level language or MATLAB. The source code

should be easy to understand so that it can serve as a reference for

designers who need to implement real-time motion detection.

CHAPTER 2

OVERVIEW OF NEURAL NETWORKS

Neural network theory is sometimes used to refer to a branch of computational

science that uses neural networks as models to simulate or analyze complex

phenomena and/or study the principles of operation of neural networks analytically.

It addresses problems similar to artificial intelligence (AI) except that AI uses

traditional computational algorithms to solve problems whereas neural networks use

'networks of agents' (software or hardware entities linked together) as the

computational architecture to solve problems. Neural networks are trainable systems

that can "learn" to solve complex problems from a set of exemplars and generalize

the "acquired knowledge" to solve unforeseen problems as in stock market and

environmental prediction; i.e., they are self-adaptive systems.

Traditionally, the term neural network has been used to refer to a network of

biological neurons. In modern usage, the term is often used to refer to artificial

neural networks, which are composed of artificial neurons or nodes. Thus the term

'Neural Network' has two distinct connotations:

1. Biological neural networks are made up of real biological neurons that are

connected or functionally-related in the peripheral nervous system or the

central nervous system. In the field of neuroscience, they are often identified

as groups of neurons that perform a specific physiological function in

laboratory analysis.

2. Artificial neural networks are made up of interconnecting artificial neurons

(usually simplified neurons) designed to model (or mimic) some properties of

biological neural networks. Artificial neural networks can be used to model

the modes of operation of biological neural networks, whereas cognitive

models are theoretical models that mimic cognitive brain functions without

necessarily using neural networks, while artificial intelligence consists of well-crafted

algorithms that solve specific intelligent problems without using a neural

network as the computational architecture.

2.1 The brain, neural networks and computers

While it is accepted by most scientists that the brain is a type of computer, it

is a computer with a vastly different architecture to the computers that most of us are

familiar with. The brain is massively parallel, even more so than advanced

multiprocessor computers. This means that simulating the behavior of a brain on

traditional computer hardware is necessarily slow and inefficient.

Neural networks, as used in artificial intelligence, have traditionally been

viewed as simplified models of neural processing in the brain, even though the

relation between this model and brain biological architecture is very much debated.

To answer this question, David Marr has proposed various levels of analysis which

provide us with a plausible answer for the role of neural networks in the

understanding of human cognitive functioning.

The question of what is the degree of complexity and the properties that

individual neural elements should have in order to reproduce something resembling

animal intelligence is a subject of current research in theoretical neuroscience.

Historically, computers evolved from the von Neumann architecture, based on

sequential processing and execution of explicit instructions. On the other hand, the

origins of neural networks are based on efforts to model information processing in

biological systems, which may rely largely on parallel processing as well as implicit

instructions based on recognition of patterns of 'sensory' input from external sources.

In other words, rather than sequential processing and execution, at their very heart,

neural networks are complex statistical processors.

2.2 Artificial Neural networks

An artificial neural network (ANN), also called a simulated neural network

(SNN) or commonly just neural network (NN) is an interconnected group of artificial

neurons that uses a mathematical or computational model for information processing

based on a connectionist approach to computation. In most cases an ANN is an

adaptive system that changes its structure based on external or internal information

that flows through the network.

In more practical terms neural networks are non-linear statistical data

modeling tools. They can be used to model complex relationships between inputs and

outputs or to find patterns in data.

2.3 Background

An artificial neural network involves a network of simple processing

elements (neurons) which can exhibit complex global behavior, determined by the

connections between the processing elements and element parameters. One classical

type of artificial neural network is the Hopfield net.

In a neural network model, simple nodes (called variously "neurons",

"neurodes", "PEs" ("processing elements") or "units") are connected together to form

a network of nodes — hence the term "neural network". While a neural network does

not have to be adaptive per se, its practical use comes with algorithms designed to

alter the strength (weights) of the connections in the network to produce a desired

signal flow.

In modern software implementations of artificial neural networks the

approach inspired by biology has more or less been abandoned for a more practical

approach based on statistics and signal processing. In some of these systems neural

networks or parts of neural networks (such as artificial neurons) are used as

components in larger systems that combine both adaptive and non-adaptive elements.

2.4 Models

Neural network models in artificial intelligence are usually referred to as

artificial neural networks (ANN); these are essentially simple mathematical models

defining a function $f : X \rightarrow Y$. Each type of ANN model corresponds to a class of

such functions.

Fig 1 Artificial Neural Network

Fig 2 A complex neural network

2.5 Employing artificial neural networks

Perhaps the greatest advantage of ANNs is their ability to be used as an arbitrary

function approximation mechanism which 'learns' from observed data. However,

using them is not so straightforward and a relatively good understanding of the

underlying theory is essential.

Choice of model: This will depend on the data representation and the

application. Overly complex models tend to lead to problems with learning.

Learning algorithm: There are numerous tradeoffs between learning

algorithms. Almost any algorithm will work well with the correct

hyperparameters for training on a particular fixed dataset. However, selecting and

tuning an algorithm for training on unseen data requires a significant amount

of experimentation.

Robustness: If the model, cost function and learning algorithm are selected

appropriately the resulting ANN can be extremely robust.

With the correct implementation, ANNs can be used naturally in online learning and

large dataset applications. Their simple implementation and the existence of mostly

local dependencies in their structure allow for fast, parallel implementations

in hardware.

2.6 Types of neural networks

2.6.1 Feed forward neural network

The feed forward neural networks are the first and arguably simplest type of

artificial neural networks devised. In this network, the information moves in only one

direction, forward, from the input nodes, through the hidden nodes (if any) and to the

output nodes. There are no cycles or loops in the network.

2.6.2 Single-layer perceptron

The earliest kind of neural network is a single-layer perceptron network,

which consists of a single layer of output nodes; the inputs are fed directly to the

outputs via a series of weights. In this way it can be considered the simplest kind of

feed-forward network. The sum of the products of the weights and the inputs is

calculated in each node, and if the value is above some threshold (typically 0) the

neuron fires and takes the activated value (typically 1); otherwise it takes the

deactivated value (typically -1). Neurons with this kind of activation function are

also called McCulloch-Pitts neurons or threshold neurons.

A perceptron can be created using any values for the activated and

deactivated states as long as the threshold value lies between the two. Most

perceptrons have outputs of 1 or -1 with a threshold of 0 and there is some evidence

that such networks can be trained more quickly than networks created from nodes

with different activation and deactivation values.

Perceptrons can be trained by a simple learning algorithm that is usually

called the delta rule. It calculates the errors between calculated output and sample

output data, and uses this to create an adjustment to the weights, thus implementing a

form of gradient descent.
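
The threshold unit and the error-driven weight update described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the bias term, the learning rate, the epoch count and the AND training set are all assumptions introduced for the example.

```python
def predict(weights, bias, inputs):
    """Threshold unit: output +1 if the weighted sum exceeds 0, else -1."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else -1


def train_perceptron(samples, learning_rate=0.1, epochs=50):
    """Train one threshold unit with an error-driven (delta-rule style) update.

    samples: list of (inputs, target) pairs with target in {-1, +1}.
    The bias, learning rate and epoch count are illustrative assumptions.
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Adjust each weight in proportion to the error and its input.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Logical AND is linearly separable, so a single unit can learn it.
AND_SAMPLES = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
and_outputs = [predict(weights, bias, x) for x, _ in AND_SAMPLES]
```

Replacing the training set with XOR would leave at least one sample misclassified no matter how long the loop runs, since a single unit can only separate the input plane with one straight line.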

Single-unit perceptrons are only capable of learning linearly separable

patterns; in 1969 in a famous monograph entitled Perceptrons Marvin Minsky and

Seymour Papert showed that it was impossible for a single-layer perceptron network

to learn an XOR function.

2.6.3 Multilayer perceptron

This class of networks consists of multiple layers of computational units,

usually interconnected in a feed-forward way. Each neuron in one layer has directed

connections to the neurons of the subsequent layer. In many applications the units of

these networks apply a sigmoid function as an activation function.

The universal approximation theorem for neural networks states that every

continuous function that maps intervals of real numbers to some output interval of

real numbers can be approximated arbitrarily closely by a multi-layer perceptron

with just one hidden layer. This result holds only for restricted classes of activation

functions, e.g. for the sigmoid functions.

Multi-layer networks use a variety of learning techniques, the most popular

being back-propagation. Here the output values are compared with the correct

answer to compute the value of some predefined error-function. By various

techniques the error is then fed back through the network. Using this information, the

algorithm adjusts the weights of each connection in order to reduce the value of the

error function by some small amount. After repeating this process for a sufficiently

large number of training cycles the network will usually converge to some state

where the error of the calculations is small. In this case one says that the network has

learned a certain target function. To adjust weights properly one applies a general

method for non-linear optimization that is called gradient descent. For this, the

derivative of the error function with respect to the network weights is calculated and

the weights are then changed such that the error decreases (thus going downhill on

the surface of the error function). For this reason, back-propagation can only be

applied to networks with differentiable activation functions.

Fig 3 XOR perceptron

A three-layer perceptron net capable of calculating XOR. The numbers within

the perceptrons represent each perceptron's explicit threshold. The numbers that

annotate arrows represent the weight of the inputs. This net assumes that if the

threshold is not reached, zero (not -1) is output. Note that the bottom layer of inputs

is not always considered a real perceptron layer.
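
A small threshold network of this kind can be written out directly. The weights and thresholds below are one common choice that computes XOR and are an assumption made for this sketch; they are not necessarily the values annotated in Fig 3. As in the figure, a unit outputs zero (not -1) when its threshold is not reached.

```python
def threshold_unit(inputs, weights, threshold):
    """Output 1 if the weighted input sum exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


def xor_net(a, b):
    """XOR built from an OR unit, an AND unit and one output unit.

    The weights and thresholds are illustrative assumptions.
    """
    h_or = threshold_unit((a, b), (1, 1), 0.5)    # fires if a or b is 1
    h_and = threshold_unit((a, b), (1, 1), 1.5)   # fires only if both are 1
    # The output fires for "OR but not AND", which is exactly XOR.
    return threshold_unit((h_or, h_and), (1, -1), 0.5)


xor_table = [xor_net(a, b) for a in (0, 1) for b in (0, 1)]
```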

2.6.4 Radial basis function (RBF) network

Radial Basis Functions are powerful techniques for interpolation in

multidimensional space. An RBF is a function that has a distance criterion with

respect to a centre built into it. Radial basis functions have been applied in the area of

neural networks where they may be used as a replacement for the sigmoidal hidden

layer transfer characteristic in multi-layer perceptrons.

2.6.5 Echo State Network

The Echo State Network (ESN) is a recurrent neural network with a sparsely

connected random hidden layer. The weights of output neurons are the only part of

the network that can change and be learned. ESNs are good at (re)producing temporal

patterns.

2.6.6 Stochastic neural networks

A stochastic neural network differs from a regular neural network in the fact

that it introduces random variations into the network. In a probabilistic view of

neural networks, such random variations can be viewed as a form of statistical

sampling, such as Monte Carlo sampling.

2.6.7 Neuro-fuzzy networks

A neuro-fuzzy network is a fuzzy inference system in the body of an artificial

neural network. Depending on the FIS type, there are several layers that simulate the

processes involved in a fuzzy inference like fuzzification, inference, aggregation and

defuzzification. Embedding an FIS in a general structure of an ANN has the benefit

of using available ANN training methods to find the parameters of a fuzzy system.

CHAPTER 3

RBF NETWORK

3.1 Radial Functions

Radial functions are a special class of function. Their characteristic feature is that

their response decreases (or increases) monotonically with distance from a central

point. The centre, the distance scale, and the precise shape of the radial function are

parameters of the model, all fixed if it is linear.

A typical radial function is the Gaussian which, in the case of a scalar input, is

$$h(x) = \exp\left(-\frac{(x - c)^2}{r^2}\right).$$

Its parameters are its centre c and its radius r. The figure illustrates a Gaussian RBF

with centre c = 0 and radius r = 1.

A Gaussian RBF monotonically decreases with distance from the centre. In contrast,

a multiquadric RBF which, in the case of scalar input, is

$$h(x) = \frac{\sqrt{r^2 + (x - c)^2}}{r},$$

monotonically increases with distance from the centre. Gaussian-like RBFs are local

(give a significant response only in a neighbourhood near the centre) and are more

commonly used than multiquadric-type RBFs which have a global response.
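
The contrast between the local Gaussian response and the global multiquadric response can be checked numerically. The sketch below assumes the usual scalar forms h(x) = exp(-(x - c)^2 / r^2) for the Gaussian and h(x) = sqrt(r^2 + (x - c)^2) / r for the multiquadric, with centre c and radius r; the function names and sample distances are chosen only for illustration.

```python
import math


def gaussian_rbf(x, c=0.0, r=1.0):
    """Gaussian radial function: equals 1 at the centre and decays with distance."""
    return math.exp(-((x - c) ** 2) / r ** 2)


def multiquadric_rbf(x, c=0.0, r=1.0):
    """Multiquadric radial function: equals 1 at the centre and grows with distance."""
    return math.sqrt(r ** 2 + (x - c) ** 2) / r


# Evaluate both functions at increasing distances from the centre c = 0.
distances = [0.0, 0.5, 1.0, 2.0]
gaussian_values = [gaussian_rbf(d) for d in distances]
multiquadric_values = [multiquadric_rbf(d) for d in distances]
```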

3.2 Radial Networks

An RBF is a function that has a distance criterion with respect to a

centre built into it. Radial basis functions have been applied in the area of neural networks where

they may be used as a replacement for the sigmoidal hidden layer transfer

characteristic in multi-layer perceptrons. RBF networks have 2 layers of processing:

In the first, input is mapped onto each RBF in the 'hidden' layer. The RBF chosen is

usually a Gaussian. In regression problems the output layer is then a linear

combination of hidden layer values representing mean predicted output. The

interpretation of this output layer value is the same as a regression model in statistics.

In classification problems the output layer is typically a sigmoid function of a linear

combination of hidden layer values, representing a posterior probability.

Performance in both cases is often improved by shrinkage techniques, known as

ridge regression in classical statistics and known to correspond to a prior belief in

small parameter values (and therefore smooth output functions) in a Bayesian

framework.

RBF networks have the advantage of not suffering from local minima in the

same way as multi-layer perceptrons. This is because the only parameters that are

adjusted in the learning process are the linear mapping from hidden layer to output

layer. Linearity ensures that the error surface is quadratic and therefore has a single

easily found minimum. In regression problems this can be found in one matrix

operation. In classification problems the fixed non-linearity introduced by the

sigmoid output function is most efficiently dealt with using iteratively reweighted least

squares.

RBF networks have the disadvantage of requiring good coverage of the input

space by radial basis functions. RBF centers are determined with reference to the

distribution of the input data, but without reference to the prediction task. As a result,

representational resources may be wasted on areas of the input space that are

irrelevant to the learning task. A common solution is to associate each data point

with its own centre, although this can make the linear system to be solved in the final

layer rather large, and requires shrinkage techniques to avoid overfitting.

Associating each input datum with an RBF leads naturally to kernel methods

such as Support Vector Machines and Gaussian Processes (the RBF is the kernel

function). All three approaches use a non-linear kernel function to project the input

data into a space where the learning problem can be solved using a linear model.

Like Gaussian Processes, and unlike SVMs, RBF networks are typically trained in a

Maximum Likelihood framework by maximizing the probability (minimizing the

error) of the data under the model. SVMs take a different approach to avoiding

overfitting: they maximize a margin instead. RBF networks are outperformed in most

classification applications by SVMs. In regression applications they can be

competitive when the dimensionality of the input space is relatively small.

3.3 RBF Architecture

RBF networks typically have three layers: an input layer, a hidden layer

with a non-linear RBF activation function and a linear output layer. The output,

$\varphi : \mathbb{R}^n \to \mathbb{R}$, of the network is thus

$$\varphi(\mathbf{x}) = \sum_{i=1}^{N} a_i \, \rho(\lVert \mathbf{x} - \mathbf{c}_i \rVert),$$

where N is the number of neurons in the hidden layer, $\mathbf{c}_i$ is the center vector for

neuron i, and $a_i$ are the weights of the linear output neuron. In the basic form all inputs

are connected to each hidden neuron. The norm is typically taken to be the Euclidean

distance and the basis function is taken to be Gaussian:

$$\rho(\lVert \mathbf{x} - \mathbf{c}_i \rVert) = \exp\left(-\beta \lVert \mathbf{x} - \mathbf{c}_i \rVert^2\right).$$

The Gaussian basis functions are local in the sense that

$$\lim_{\lVert \mathbf{x} \rVert \to \infty} \rho(\lVert \mathbf{x} - \mathbf{c}_i \rVert) = 0.$$

Changing parameters of one neuron has only a small effect for input values that are

far away from the center of that neuron.

RBF networks are universal approximators on a compact subset of $\mathbb{R}^n$. This means

that an RBF network with enough hidden neurons can approximate any continuous

function with arbitrary precision.

The parameters $a_i$, $\mathbf{c}_i$, and $\beta$ are determined in a manner that optimizes the fit between

$\varphi$ and the data.
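
The forward pass of such a network is short enough to write out in full. The sketch below assumes Gaussian basis functions exp(-beta * ||x - c_i||^2) with one shared width parameter beta and a weighted sum at the output; the two centers and weights are arbitrary example values, not parameters from this work.

```python
import math


def rbf_output(x, centers, weights, beta):
    """Weighted sum of Gaussian responses, one hidden unit per center."""
    total = 0.0
    for center, weight in zip(centers, weights):
        # Squared Euclidean distance between the input and this center.
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        total += weight * math.exp(-beta * sq_dist)
    return total


# Two hidden units in a two-dimensional input space (illustrative values).
centers = [(0.0, 0.0), (1.0, 1.0)]
weights = [1.0, -0.5]

at_first_center = rbf_output((0.0, 0.0), centers, weights, beta=1.0)
far_away = rbf_output((50.0, 50.0), centers, weights, beta=1.0)
```

The value far from both centers is effectively zero, which is the locality property of the Gaussian basis functions.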

Fig 4 Architecture of a radial basis function network.

3.4 Training

In an RBF network there are three types of parameters that need to be chosen

to adapt the network for a particular task: the center vectors $\mathbf{c}_i$, the output weights $w_i$

(denoted $a_i$ in Section 3.3), and the RBF width parameters $\beta_i$. In sequential training, the

weights are updated at each time step as data streams in.

For some tasks it makes sense to define an objective function and select the

parameter values that minimize its value. The most common objective function is the

least squares function

$$K(\mathbf{w}) = \sum_{t=1}^{\infty} K_t(\mathbf{w}),$$

where

$$K_t(\mathbf{w}) = \left[ y(t) - \varphi(\mathbf{x}(t), \mathbf{w}) \right]^2.$$

We have explicitly included the dependence on the weights. Minimization of the

least squares objective function by optimal choice of weights optimizes accuracy of

fit.
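
One common way to update the weights sequentially is a gradient step on the squared error of each incoming sample. The text does not specify the update rule, so the rule below, the learning rate, the fixed centers and the toy data are all assumptions made for this sketch.

```python
import math


def hidden_activations(x, centers, beta):
    """Gaussian response of each hidden unit for a scalar input x."""
    return [math.exp(-beta * (x - c) ** 2) for c in centers]


def sequential_update(weights, x, y, centers, beta, rate):
    """One gradient step on the squared error of a single sample (x, y)."""
    acts = hidden_activations(x, centers, beta)
    error = y - sum(w * g for w, g in zip(weights, acts))
    return [w + rate * error * g for w, g in zip(weights, acts)]


def total_squared_error(weights, data, centers, beta):
    """Least squares objective summed over a finite data set."""
    total = 0.0
    for x, y in data:
        acts = hidden_activations(x, centers, beta)
        total += (y - sum(w * g for w, g in zip(weights, acts))) ** 2
    return total


# Toy data: a triangular bump sampled at five points (illustrative values).
centers = [0.0, 0.5, 1.0]
data = [(0.0, 0.0), (0.25, 0.5), (0.5, 1.0), (0.75, 0.5), (1.0, 0.0)]
weights = [0.0, 0.0, 0.0]

error_before = total_squared_error(weights, data, centers, beta=4.0)
for _ in range(200):
    for x, y in data:  # samples arrive one at a time
        weights = sequential_update(weights, x, y, centers, beta=4.0, rate=0.2)
error_after = total_squared_error(weights, data, centers, beta=4.0)
```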

3.5 Interpolation

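RBF networks can be used to interpolate a function $y : \mathbb{R}^n \to \mathbb{R}$ when the

values of that function are known on a finite number of points:

$y(\mathbf{x}_i) = b_i$, $i = 1, \ldots, N$. Taking the known points $\mathbf{x}_i$ to be the centers of the

radial basis functions and evaluating the values of the basis functions at the same

points, $g_{ij} = \rho(\lVert \mathbf{x}_j - \mathbf{x}_i \rVert)$, the weights can be solved from the equation

$$\begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1N} \\ g_{21} & g_{22} & \cdots & g_{2N} \\ \vdots & & \ddots & \vdots \\ g_{N1} & g_{N2} & \cdots & g_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_N \end{bmatrix}.$$

It can be shown that the interpolation matrix in the above equation is non-singular if

the points $\mathbf{x}_i$ are distinct, and thus the weights $\mathbf{w}$ can be solved by simple linear

algebra:

$$\mathbf{w} = \mathbf{G}^{-1} \mathbf{b}.$$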
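
Strict interpolation can be illustrated with a few scalar points. The sketch below places one Gaussian basis function on every known point, builds the interpolation matrix G, and solves G w = b by Gaussian elimination instead of forming an explicit inverse. The solver, the width parameter beta = 1 and the sample values are assumptions made for the example.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix * w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def gaussian(distance, beta=1.0):
    """Gaussian basis function of the distance to a center."""
    return math.exp(-beta * distance ** 2)


# Known points x_i and function values b_i (illustrative values).
points = [0.0, 1.0, 2.0, 3.0]
values = [1.0, 3.0, 2.0, 5.0]

# Interpolation matrix: basis function i evaluated at point j.
G = [[gaussian(abs(xj - xi)) for xi in points] for xj in points]
w = solve_linear_system(G, values)


def interpolant(x):
    """Network output with one basis function centered on each known point."""
    return sum(wi * gaussian(abs(x - xi)) for wi, xi in zip(w, points))


reproduced = [interpolant(x) for x in points]
```

At the known points the interpolant reproduces the given values up to rounding error; between them it blends the neighbouring Gaussian responses.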

3.6 Function approximation

If the purpose is not to perform strict interpolation but instead more general

function approximation or classification the optimization is somewhat more complex

because there is no obvious choice for the centers. The training is typically done in

two phases: first fixing the widths and centers, and then the weights. This can be

justified by considering the different nature of the non-linear hidden neurons versus

the linear output neuron.

3.7 Training the basis function centers

Basis function centers can be either randomly sampled among the input instances or

found by clustering the samples and choosing the cluster means as the centers. The

RBF widths are usually all fixed to the same value, which is proportional to the

maximum distance between the chosen centers.

3.8 Pseudoinverse solution for the linear weights

After the centers $\mathbf{c}_i$ have been fixed, the weights that minimize the error at the output

are computed with a linear pseudoinverse solution:

$$\mathbf{w} = \mathbf{G}^{+} \mathbf{b},$$

where the entries of G are the values of the radial basis functions evaluated at the

points $\mathbf{x}_j$: $g_{ji} = \rho(\lVert \mathbf{x}_j - \mathbf{c}_i \rVert)$.

The existence of this linear solution means that, unlike multi-layer perceptron (MLP)

networks, RBF networks have a unique local minimum (when the centers are

fixed).
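
When there are fewer centers than data points, the pseudoinverse solution is the least squares fit. The sketch below assumes G has full column rank, in which case the pseudoinverse solution equals the solution of the normal equations (G^T G) w = G^T b; the elimination-based solver, the Gaussian basis with beta = 4 and the data are all illustrative assumptions.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix * w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def design_matrix(xs, centers, beta):
    """Row j holds the response of every basis function at sample x_j."""
    return [[math.exp(-beta * (x - c) ** 2) for c in centers] for x in xs]


def least_squares_weights(G, b):
    """Pseudoinverse solution via the normal equations (full column rank assumed)."""
    n = len(G[0])
    gtg = [[sum(row[i] * row[j] for row in G) for j in range(n)]
           for i in range(n)]
    gtb = [sum(row[i] * bj for row, bj in zip(G, b)) for i in range(n)]
    return solve_linear_system(gtg, gtb)


# Five samples fitted with three fixed centers (illustrative values).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
b = [0.0, 0.5, 1.0, 0.5, 0.0]
centers = [0.0, 0.5, 1.0]

G = design_matrix(xs, centers, beta=4.0)
w = least_squares_weights(G, b)

# At the minimum the residual is orthogonal to every column of G.
residuals = [sum(g * wi for g, wi in zip(row, w)) - bj for row, bj in zip(G, b)]
gradient = [sum(row[i] * r for row, r in zip(G, residuals))
            for i in range(len(w))]
```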

3.9 Advantages/Disadvantages

An RBF network trains faster than an MLP.

Another advantage that is claimed is that the hidden layer is easier to

interpret than the hidden layer in an MLP.

Although an RBF network is quick to train, once training is finished and it is

in use it is slower than an MLP, so where speed is a factor an MLP may

be more appropriate.

CHAPTER 4

OVERVIEW OF MOTION DETECTION ALGORITHMS

Given a number of sequential video frames from the same source, the goal is

to detect motion in the area observed by the source. When there is no motion, all

the sequential frames should be identical up to the influence of noise. When

motion is present, there is some difference between the frames. Every low-cost

system is subject to some noise, so even when there is no motion, two

sequential frames will not be identical. This is why the system must be smart

enough to distinguish between noise and real motion. When the system is

calibrated and stable enough, the noise shows up as each pixel value being

slightly different from the value in the other frame. As a first approximation, it is

possible to define a noise threshold per pixel (adaptable to any given state) that

specifies how much the value of a pixel at the same position in two

sequential frames may differ while still being treated as the same value.

More precisely, if the pixel with coordinates (Xa,Ya) in frame A differs from the

pixel with coordinates (Xb,Yb) in frame B by less than the TPP (threshold per pixel)

value, we treat them as pixels with equal values. This can be written as the

formula:

Pixel(Xa, Ya) is equal to Pixel(Xb, Yb) if abs(Pixel(Xa, Ya) - Pixel(Xb, Yb)) < TPP

By adapting the TPP value to the current system state we can make the system

noise-stable. By applying this threshold operation to every pixel pair we may assume

that all the preprocessed pixel values are noise-free. Any noise that is not

cancelled will be small relative to the rest of the signal.
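
The per-pixel threshold test can be sketched in a few lines. The example assumes grayscale frames stored as nested lists of integer intensities and compares pixels at the same position in both frames; the function names and the sample frames are hypothetical.

```python
def pixels_equal(value_a, value_b, tpp):
    """Two pixel values count as equal if they differ by less than TPP."""
    return abs(value_a - value_b) < tpp


def differing_pixel_mask(frame_a, frame_b, tpp):
    """True where the two frames differ by TPP or more, False elsewhere."""
    return [
        [not pixels_equal(a, b, tpp) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


# Two small grayscale frames (illustrative values): frame_b contains slight
# noise everywhere and one pixel that really changed.
frame_a = [[10, 10, 10],
           [10, 10, 10]]
frame_b = [[11, 10, 9],
           [10, 80, 10]]

mask = differing_pixel_mask(frame_a, frame_b, tpp=5)
```

With TPP = 5, the differences of one intensity level are suppressed as noise and only the pixel that jumped from 10 to 80 is marked.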

We then have to post-process these values to detect motion, if any. As

mentioned above, we have to work with the differing pixels in two

sequential frames to draw a conclusion about the motion.


First, to keep the system sensitive enough, the TPP value must not be set too
high. With the sensitivity kept high, any two frames will contain a small number
(depending on TPP) of differing pixels even when nothing moves, and these few pixels
must not be taken for motion. This is the first reason to define a TPF (threshold
per frame) value (adaptable to the current system state): the minimum number of
pixels that must differ between two sequential frames for the difference to be
treated as motion. The second reason to use TPF is to filter out small motion. For
instance, by tuning the TPF value we can ignore the motion of small objects
(insects etc.) while still detecting the motion of people. The exact meaning of TPF
can be written as a formula.

Let NDPPP be the Number of Different Pre-Processed (by TPP) Pixels. Then:

There is motion if and only if NDPPP > TPF.
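The two thresholds can be illustrated with a short sketch. It is not the thesis implementation (which was written in MATLAB); the function names and the sample frames are hypothetical, and frames are assumed to be equal-sized 2-D lists of grayscale values.

```python
def count_different_pixels(frame_a, frame_b, tpp):
    """Return NDPPP: the number of pixel pairs whose absolute difference is at least TPP."""
    ndppp = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            # Pixels that differ by less than TPP are treated as equal (noise only).
            if abs(pa - pb) >= tpp:
                ndppp += 1
    return ndppp


def has_motion(frame_a, frame_b, tpp, tpf):
    """Report motion when more than TPF pixels differ after TPP pre-processing."""
    return count_different_pixels(frame_a, frame_b, tpp) > tpf


# Two 3x3 frames: slight noise in a few pixels, one large change at the centre.
frame_a = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
frame_b = [[11, 10, 9], [10, 200, 10], [12, 10, 10]]

print(count_different_pixels(frame_a, frame_b, tpp=5))   # only the centre pixel counts
print(has_motion(frame_a, frame_b, tpp=5, tpf=0))        # reported as motion
print(has_motion(frame_a, frame_b, tpp=5, tpf=2))        # filtered out as small motion
```

Raising TPF from 0 to 2 suppresses the single changed pixel, which is the small-object filtering described above.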

Both the TPP and TPF values can be adjusted through the UI to obtain the optimal
system sensitivity. The thresholding also has a visual equivalent, which is used as
follows. After the pixels have been pre-processed (by TPP), all static pixels (those
that show no motion) are colored, say, black, and all dynamic pixels (those that
indicate motion) keep their original color. This produces a motion extraction
effect: all the static parts of the frame are black and only the moving parts are
displayed normally. The effect can be enabled or disabled through the GUI.
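The motion extraction effect can be sketched as a simple masking step. This is an illustration under assumptions: grayscale frames stored as 2-D lists, black represented by 0, and a hypothetical function name.

```python
def extract_motion(previous, current, tpp, background=0):
    """Return a copy of `current` in which static pixels are replaced by `background`."""
    result = []
    for row_prev, row_cur in zip(previous, current):
        out_row = []
        for p_prev, p_cur in zip(row_prev, row_cur):
            if abs(p_cur - p_prev) < tpp:
                out_row.append(background)   # static pixel: paint it black
            else:
                out_row.append(p_cur)        # dynamic pixel: keep its original value
        result.append(out_row)
    return result


previous = [[50, 50, 50], [50, 50, 50]]
current = [[52, 50, 180], [49, 175, 50]]
for row in extract_motion(previous, current, tpp=10):
    print(row)
```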

The Camera Manager provides routines for acquiring video frames from

CCD cameras. Any process can request a video frame from any video source. The

system manages a request queue for each source and executes them cyclically.


CHAPTER 5

PROPOSED DETECTION SYSTEM

This chapter presents the main software design and implementation issues. It
starts by describing the general flow chart of the main program, which was
implemented in MATLAB. It then explains each component of the flow chart in some
detail. Finally, it shows how the graphical user interface (GUI) was designed.

5.1 Basic Architecture

Fig 5 A basic architecture of surveillance system (blocks shown: camera system, video daughter card, network card, motion detection algorithm)

The block diagram above shows the surveillance system. It consists of a
camera system that monitors a particular area, a video daughter card that converts
the video signal into an electrical signal, a network card that connects the system
to a network, and the motion detection algorithms (SAD and correlation) together
with the RBF network.


5.2 Main Program Flow Chart

The main task of the software was to read the still images recorded by the
camera and then process these images to detect motion and take the necessary
actions. Figure 6 below shows the general flow chart of the main program.


Figure 6 Main Program Flow Diagram (boxes: Start; Setup & Initializations; decision "What is Flag value?" with branches Flag=1 and Flag=0; Image Acquisition; Motion Detection Algorithm; decision "Is image > threshold" with branches Yes and No; Actions on Motion Detection; Break & clear; Data Record; Stop)

The program starts with a general initialization of software parameters and
setup of objects. Once the program has started, it checks the flag value, which
indicates whether the stop button has been pressed. If the stop button has not been
pressed, the program reads the images and processes them using whichever of the two
algorithms the operator selected. If motion is detected, it starts a series of
actions and then goes back to read the next images; otherwise it goes directly to
read the next images. Whenever the stop button is pressed, the flag value is set to
zero and the program stops, memory is cleared and the necessary results are
recorded. This terminates the program and returns control to the operator to
collect the results.
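The control flow just described can be sketched as a loop guarded by the flag. This is only an outline under assumptions: frames come from a list rather than a camera, the stop button is simulated by the hypothetical `stop_after` argument, and `mean_abs_diff` is a stand-in detector rather than either of the algorithms used in the system.

```python
def run_detection(frames, threshold, detect, stop_after=None):
    """Process consecutive frame pairs until frames run out or the flag is cleared.

    `detect(a, b)` returns a score for a frame pair; `stop_after` simulates the
    operator pressing the stop button after that many pairs have been processed.
    """
    flag = 1                      # 1 = keep running, 0 = stop button pressed
    scores = []                   # every score is kept for later inspection
    motion_frames = []            # indices of frames in which motion was detected
    index = 1
    while flag == 1 and index < len(frames):
        score = detect(frames[index - 1], frames[index])   # acquisition + algorithm
        scores.append(score)
        if score > threshold:
            motion_frames.append(index)                    # actions on motion detection
        index += 1
        if stop_after is not None and len(scores) >= stop_after:
            flag = 0                                       # simulated stop button
    # Break and clear, then data record: hand the results back to the caller.
    return {"scores": scores, "motion_frames": motion_frames}


def mean_abs_diff(a, b):
    """Stand-in detector: mean absolute difference of two flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


frames = [[10, 10, 10, 10], [10, 11, 10, 10], [90, 95, 10, 10], [90, 95, 10, 10]]
result = run_detection(frames, threshold=5.0, detect=mean_abs_diff)
print(result["motion_frames"])
```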

The next sections explain each process of the flow chart in figure 6 in some
detail.

5.2.1 Setup and Initializations


Figure 7 Setup and Initializations Process (boxes: Start; Launch GUI; decision "Start button pressed" with branches Yes and No; Read Threshold Value; Read Algorithm Type; Setup Serial Port; Setup Video Object; Stop)

Figure 7 shows the flow chart for the setup and initialization process. This
process includes launching the graphical user interface (GUI), where the type of
motion detection algorithm is selected and the threshold value (the sensitivity of
the detection) is initialized.

During this stage the serial port and the video object are also set up. This takes
approximately 15 seconds, depending on the specifications of the PC used. For the
serial port, setup starts by selecting a communication port and reserving the memory
addresses for that port; the PC then connects to the device using the communication
settings mentioned in the previous chapter. The video object is part of the image
acquisition process, but it should be set up at the start of the program.

5.2.2 Image acquisition


Figure 8 Image Acquisition Process (boxes: Start; Read First Frame; Convert to Grayscale; Read Second Frame; Convert to Grayscale; Stop)

After the setup stage, image acquisition starts as shown in figure 8 above. This
process reads images from the PC camera and saves them in a format suitable for the
motion detection algorithm.

Three options were considered, of which one was implemented. The first
option was to use auto-snapshot software that takes images automatically and saves
them on the hard disk in JPEG format, with another program reading these images in
the same sequence as they were saved. The maximum speed attainable with this
software was found to be one frame per second, which limits the speed of detection.
In addition, the image processing program and the auto-snapshot software had to be
synchronized, because the next images need to be available on the hard disk before
they can be processed.

The second option was to display live video on the screen and then capture
the images from the screen. This is faster than the first approach, but it again
suffers from a synchronization problem: when the computer monitor goes into
power-saving mode, black images are produced for as long as the screen stays black.

The third option was to use the Image Acquisition Toolbox provided with
MATLAB 6.5.1 or higher versions. The Image Acquisition Toolbox is a collection of
functions that extend the capability of MATLAB. It supports a wide range of image
acquisition operations, including acquiring images through many types of image
acquisition devices, such as frame grabbers and USB PC cameras, previewing the live
video on the monitor, and reading the image data directly into the MATLAB workspace.

For this project, the videoinput function was used to initialize a video object
that connects to the PC camera directly. The preview function was then used to
display live video on the monitor, and the getsnapshot function was used to read
images from the camera and place them in the MATLAB workspace.


The last approach was implemented because it has several advantages over the
others. It achieved the fastest capture rate, about five frames per second,
depending on algorithm complexity and PC processor speed. Furthermore, the
problem of synchronization was solved because both capturing and processing of
images were done using the same software.

All images read were converted into two-dimensional monochrome images,
because the equations of the other algorithms in the system were designed for this
image format.
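The conversion to a two-dimensional monochrome image can be sketched as a weighted sum of the color channels. The text does not state which conversion was used; the weights below are the common ITU-R BT.601 luma weights and are an assumption, as is the hypothetical function name.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples into a 2-D list of luminance values."""
    return [
        # Assumed BT.601 luma weights; the original conversion is not specified.
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


rgb = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
gray = to_grayscale(rgb)
print(gray)
```

Each pixel becomes a single intensity, so the result has the same height and width as the input but only one value per position.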

5.2.3 Motion Detection Algorithm

A motion detection algorithm was applied to the previously read images.
Two approaches to motion detection were implemented. The first uses two-dimensional
cross-correlation, while the second uses the sum of absolute differences (SAD).
These are explained in detail in the next two sections.

5.3 Motion Detection Using Sum of Absolute Difference (SAD)

This algorithm is based on image differencing techniques. It is
mathematically represented using the following equation:

$$D(t) = \frac{1}{N} \sum \left| I(t_i) - I(t_j) \right|$$

where $N$ is the number of pixels in the image, used as a scaling factor,
$I(t_i)$ is the image at time $t_i$,
$I(t_j)$ is the image at time $t_j$, and
$D(t)$ is the normalized sum of absolute difference for that time. The sum runs over all pixels of the image.

In an ideal case when there is no motion, $I(t_i) = I(t_j)$ and $D(t) = 0$. However, noise is always present in images, and a better model of the
images in the absence of motion is

$$I(t_i) = I(t_j) + n(p)$$

where $n(p)$ is a noise signal.
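A minimal sketch of the normalized sum of absolute differences described above, assuming two equal-sized grayscale images stored as 2-D lists. The function name and the sample images are hypothetical.

```python
def normalized_sad(image_i, image_j):
    """Return D = (1/N) * sum(|I_i - I_j|) over all N pixels of two 2-D images."""
    total = 0
    count = 0
    for row_i, row_j in zip(image_i, image_j):
        for p_i, p_j in zip(row_i, row_j):
            total += abs(p_i - p_j)
            count += 1
    return total / count


still = [[100, 100], [100, 100]]
noisy = [[101, 99], [100, 102]]      # same scene plus small noise
moved = [[100, 100], [20, 220]]      # part of the scene has changed

print(normalized_sad(still, still))   # 0.0
print(normalized_sad(still, noisy))   # 1.0
print(normalized_sad(still, moved))   # 50.0
```

The noisy pair gives a small nonzero value, which is why the result has to be compared with a threshold rather than with zero.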

The value of the normalized sum of absolute difference can be compared with
a threshold value, as shown in figure 9 below.

The figure also shows a test case containing a large change in the scene monitored
by the camera, produced by moving the camera. Before the camera was moved the SAD
value was around 1.87, and once the camera was moved the SAD value was around 2.2.
If the detection threshold were fixed at a value below 2.2, the system would
continuously detect motion after the camera stopped moving.


Figure 9 Direct Thresholds for SAD Values

This approach removes the need to continuously re-estimate the threshold
value. Choosing a threshold of 1 × 10^-3 detects only the times at which the camera
is moved. The result is a robust motion detection algorithm that is not affected
by illumination changes and camera movements.

5.3.1 Actions on Motion Detection

Before explaining the series of actions that take place when motion is detected,
it is worth mentioning that every calculated variance value, whether above or below
the threshold, is stored in an array. The array is later used to produce a plot of
frame number vs. variance value. This plot helps in comparing the variance values
against the threshold in order to choose the optimum threshold value.

Whenever the variance value is less than the threshold, the image is dropped
and only the variance value is recorded. When the variance value is greater than
the threshold, a sequence of actions is started, as shown in figure 10 below.


Figure 10 Actions on Motion Detection (boxes: Start; Trigger Serial Port; Update Log File with Time, Date and Frame #; Display Image; Convert Image to Frame; Stop)

As the flow chart above shows, a number of activities take place when motion is
detected. First, the serial port is triggered by a pulse from the PC; this pulse is
used to activate external circuits connected to the PC. A log file is created and
then appended with the time and date of the motion and the number of the frame in
which the motion occurred. The image in which motion was detected is also displayed
on the monitor. Finally, that image is converted to a movie frame and added to the
film structure.
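The log file update can be sketched as appending one line per detection. The line layout, the function names and the file handling below are assumptions for illustration; the text only states that the time, date and frame number are recorded.

```python
from datetime import datetime


def format_log_entry(frame_number, when):
    """Build one log line holding the date, time and frame number of a detection."""
    return f"{when:%Y-%m-%d} {when:%H:%M:%S} frame={frame_number}"


def append_log_entry(path, frame_number, when=None):
    """Append a detection record to the log file, creating the file if needed."""
    when = when or datetime.now()
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(format_log_entry(frame_number, when) + "\n")


print(format_log_entry(42, datetime(2024, 1, 31, 13, 5, 9)))
```

Opening the file in append mode covers both cases in the text: the file is created on the first detection and extended on every later one.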

5.3.2 Break and clear Process

After the motion detection algorithm has been applied to the images, the program
checks whether the stop button on the GUI was pressed. If it was, the flag value is
changed from one to zero, and the program breaks out of the loop and returns control
to the GUI. Next, both the serial port object and the video object are cleared.
This is a cleaning stage in which the devices connected to the PC through those
objects are released and the memory space is freed.

5.3.3 Data Record

Finally, when the program is terminated, a data collection process starts in
which the variables and arrays holding results in memory are stored on the hard
disk. This approach was used to separate the real-time image processing from the
processing of results, and it has the advantage that the data can be recalled
whenever required. The variables stored from memory to the hard disk are the
variance values and the movie structure that contains all the frames with motion.
At this point control returns to the GUI, where the operator can recall the results
that were archived while the system was running. The next section explains the
design of the GUI, highlighting the results and callbacks of each button.


Fig 11 Flow chart for SAD algorithm (boxes: Start; Image Acquisition; Frame Separation; Divide Quadrants; Sum of Absolute Difference; comparison with threshold T; Data Record)

Fig 12 Frame separation

Fig 13 Divide Quadrants (Quadrant 1, Quadrant 2, Quadrant 3, Quadrant 4)


5.3.4 Graphical User Interface Design

The GUI was designed to facilitate interactive system operation. It can be
used to set up the program, launch it, stop it and display results.


Figure 14 GUI Flow Chart (boxes: Start; Clear all Previous Work; Variable Initialization & Setup; Launch program; Call Selected main Program; Terminate Program; View Results; decision "Start Again" with branches Yes and No; Exit; End)

During the setup stage the operator is prompted to choose a motion detection
algorithm and select the degree of detection sensitivity. Whenever the start/stop
toggle button is pressed, the system is launched and the selected program is
called to perform the calculations until the start/stop button is pressed again,
which terminates the calculation and returns control to the GUI. Results can be
viewed as a log file, a movie and a plot of frame number vs. variance value.
Figure 14 illustrates a flow chart of the steps performed using the GUI.

5.4 Motion detection using Correlation Network

A correlation neural network (CNN) which accounts for velocity sensitive

responses of neurons is suitable for analog circuit implementation of motion-

detection systems and has been successfully implemented on CMOS. The CNN

utilizes local motion detectors to correlate signals sampled at one location in the

image with those sampled after a delay at adjacent locations; however, an edge-

detection process is required in practical motion detection systems with the CNNs.

The term correlation can also mean the cross-correlation of two functions or

electron correlation in molecular systems. In probability theory and statistics,

correlation, also called correlation coefficient, indicates the strength and direction of

a linear relationship between two random variables. In general statistical usage,

correlation or co-relation refers to the departure of two variables from independence,

although correlation does not imply causation. In this broad sense there are several

coefficients, measuring the degree of correlation, adapted to the nature of data. A

number of different coefficients are used for different situations. The best known is

the Pearson product-moment correlation coefficient, which is obtained by dividing

the covariance of the two variables by the product of their standard deviations.


5.4.1 Mathematical properties

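The correlation $\rho_{X,Y}$ between two random variables $X$ and $Y$ with expected values $\mu_X$
and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E\big((X-\mu_X)(Y-\mu_Y)\big)}{\sigma_X \sigma_Y}$$

where $E$ is the expected value operator and cov means covariance. Since $\mu_X = E(X)$,
$\sigma_X^2 = E(X^2) - E^2(X)$ and likewise for $Y$, we may also write

$$\rho_{X,Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}$$

The correlation is defined only if both of the standard deviations are finite and both
of them are nonzero. It is a corollary of the Cauchy-Schwarz inequality that the
correlation cannot exceed 1 in absolute value.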

The correlation is 1 in the case of an increasing linear relationship, −1 in the case of a

decreasing linear relationship, and some value in between in all other cases,

indicating the degree of linear dependence between the variables. The closer the

coefficient is to either −1 or 1, the stronger the correlation between the variables.

If the variables are independent then the correlation is 0, but the converse is not true

because the correlation coefficient detects only linear dependencies between two

variables. Here is an example: suppose the random variable X is uniformly
distributed on the interval from −1 to 1, and Y = X². Then Y is completely determined
by X, so that X and Y are dependent, but their correlation is zero; they are
uncorrelated. However, in the special case when X and Y are jointly normal,
independence is equivalent to uncorrelatedness. A correlation between two variables
is diluted in the presence of measurement error around estimates of one or both
variables, in which case disattenuation provides a more accurate coefficient.
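These properties can be checked numerically with a small sketch. It computes the sample correlation from the expectation form, E(XY) − E(X)E(Y) divided by the product of the standard deviations. The function name is hypothetical, and the five symmetric sample points are an assumption standing in for a uniform distribution on the interval from −1 to 1.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation via E(XY) - E(X)E(Y) over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    if var_x <= 0 or var_y <= 0:
        raise ValueError("correlation is undefined when a standard deviation is zero")
    return (mean_xy - mean_x * mean_y) / (sqrt(var_x) * sqrt(var_y))


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(pearson(xs, [2 * x + 1 for x in xs]))   # increasing linear relationship
print(pearson(xs, [-3 * x for x in xs]))      # decreasing linear relationship
print(pearson(xs, [x * x for x in xs]))       # Y = X^2: dependent but uncorrelated
```

The first two results are 1 and −1 up to floating-point rounding, and the third is zero even though Y is completely determined by X.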


5.4.2 Geometric Interpretation of correlation

The correlation coefficient can also be viewed as the cosine of the angle

between the two vectors of samples drawn from the two random variables.

This method only works with centered data, i.e., data which have been shifted

by the sample mean so as to have an average of zero. Some practitioners prefer an

uncentered (non-Pearson-compliant) correlation coefficient. See the example below

for a comparison.

As an example, suppose five countries are found to have gross national
products of 1, 2, 3, 5, and 8 billion dollars, respectively. Suppose these same five
countries (in the same order) are found to have 11%, 12%, 13%, 15%, and 18%
poverty. Then let x and y be ordered 5-element vectors containing the above data:
x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18). By the usual procedure for
finding the angle between two vectors, the uncentered correlation coefficient is:

$$\cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{2.93}{\sqrt{103}\,\sqrt{0.0983}} \approx 0.9208$$

Note that the above data were deliberately chosen to be perfectly correlated:
y = 0.10 + 0.01 x. The Pearson correlation coefficient must therefore be exactly one.
Centering the data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) yields
x = (−2.8, −1.8, −0.8, 1.2, 4.2) and y = (−0.028, −0.018, −0.008, 0.012, 0.042), from which

$$\cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{0.308}{\sqrt{30.8}\,\sqrt{0.00308}} = 1 = \rho_{xy}$$

as expected.
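The worked example can be reproduced with a short sketch that treats the data as vectors and takes the cosine of the angle between them, before and after centering. The function names are hypothetical; the data are the ones given in the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def centered(v):
    """Shift a vector by its sample mean so that it averages to zero."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]


x = [1, 2, 3, 5, 8]
y = [0.11, 0.12, 0.13, 0.15, 0.18]

uncentered = cosine(x, y)
pearson_r = cosine(centered(x), centered(y))
print(round(uncentered, 4))   # uncentered coefficient
print(round(pearson_r, 4))    # Pearson coefficient of the centered data
```

Only the centered version recovers the perfect linear relationship, which is the difference between the uncentered and the Pearson coefficient.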


5.4.3 Interpretation of the size of a correlation

Several authors have offered guidelines for the interpretation of a correlation
coefficient. Cohen (1988), for example, has suggested the interpretations for
correlations in psychological research shown in Table 1.

As Cohen himself has observed, however, all such criteria are in some ways
arbitrary and should not be observed too strictly. This is because the interpretation of
a correlation coefficient depends on the context and purposes. A correlation of 0.9
may be very low if one is verifying a physical law using high-quality instruments,
but may be regarded as very high in the social sciences, where there may be a greater
contribution from complicating factors.

Correlation    Negative          Positive
Small          −0.29 to −0.10    0.10 to 0.29
Medium         −0.49 to −0.30    0.30 to 0.49
Large          −1.00 to −0.50    0.50 to 1.00

Table 1

Fig 15 A unit network of two-dimensional CNN.

Fig 16 Flow chart for correlation (boxes: Start; Image Acquisition; Frame Separation; Divide Quadrants; Correlation Network; Decision; Data Record)


CHAPTER 6

PROPOSED OBJECT EXTRACTION

Many attempts have been made to extract data from video and film in a form

suitable for use by animators and modelers. Such an approach is attractive, since

motions and movements for people and animals may be obtained in this way that

would be difficult using mechanical or magnetic motion capture systems. Visual

extraction is also appealing since it is non-intrusive and has the potential to capture,

from film, the motion and characteristics of people or animals long dead or extinct.

Almost all attempts to perform visual extraction have been based around

bespoke computer vision applications which are difficult for non-experts to use or

adapt to their own needs. This paper presents a generic approach to extracting data

from video. Whilst our approach allows low-level information to be extracted we

show that higher-level functionality is available also. This functionality can be

utilized in a manner that requires little knowledge of the underlying techniques and

principles. Our approach is to approximate an image using principal component

analysis, and then to train a multi-layer perceptron to predict the feature required by

the user. This requires the user to hand-label the features of interest in some of the

frames of the image sequence. One of the aims of this work is to keep to a minimum

the number of frames that need to be labeled by the user. The trained multi-layer

perceptron is then used to predict features for images that have never been labeled by

the user.

Other attempts to extract useful information from video sequences include the

use of edge-detection and contour or edge tracking, template matching and template

tracking. All such systems work well in some circumstances, but fail or require

adaptation to meet the requirements of new users. For instance, in the case of

template tracking, the user needs to be aware of the kinds of features that can be

tracked well in an image and also choose a suitable template size. This is not a trivial

task for non-specialists.


6.1 Method

The main steps in extraction using our system are detailed below:

The user selects the sequence (or set) of images from which they wish data to be
extracted. This may well consist of several shorter clips taken from different
parts of a film.

These images have some pre-processing performed on them (principal components

analysis) to reduce each image to a small set of numbers.

The user decides what feature(s) they wish to extract and labels this feature by

hand in a fraction of the images chosen at random. The labeling process may

involve clicking on a point to be tracked, labeling a distance or ratio of distances,

measuring an angle, making a binary decision (yes/no, near/far etc.) or classifying the

feature of interest into one of several classes.

Once this ground-truth data is available, a neural network is trained to predict the

feature values in images that have not been labeled by the user.

6.2 Feature Extraction

Principal components analysis (also known as eigenvector analysis) has been used

extensively in computer vision for image reconstruction, pattern matching and

classification.

Given the ith image in a sequence of images, each of which consists of M
pixels, we form the vector $x_i$ by concatenating the pixels of the image in raster scan
order and removing the mean image of the sequence. The matrix $X$ is created
using the $x_i$'s as column vectors. Traditionally, the principal modes, $q_i$, are extracted
by computing

$$X X^T q_i = \lambda_i q_i \qquad (1)$$

where the $\lambda_i$'s are the eigenvalues,
a measure of the amount of variance each of the eigenvectors accounts for.
Unfortunately, the matrix $X X^T$ is typically too large to manipulate since it is of size
M by M. Such computation is wasteful anyway since only N principal modes are
meaningful, where N is the number of example images. In all our work $N \ll M$.
Therefore we compute:

$$X^T X u_i = \lambda_i u_i \qquad (2)$$

and we can obtain the $q_i$'s that we actually require using:

$$q_i = X u_i \qquad (3)$$

In practice only the first P modes are used, P30N.
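The step from equation (2) to equation (3) can be checked directly. The derivation below is a sketch in the same notation, showing why an eigenvector of the small N by N matrix yields an eigenvector of the large M by M matrix with the same eigenvalue. The remark on normalization is an addition that the text does not state.

```latex
\begin{align*}
X^{T} X u_i &= \lambda_i u_i
  && \text{equation (2)} \\
X \left( X^{T} X u_i \right) &= X \left( \lambda_i u_i \right)
  && \text{multiply both sides by } X \\
\left( X X^{T} \right) \left( X u_i \right) &= \lambda_i \left( X u_i \right)
  && \text{regroup the products} \\
\left\| X u_i \right\|^{2} &= u_i^{T} X^{T} X u_i = \lambda_i \left\| u_i \right\|^{2}
  && \text{length of } q_i = X u_i
\end{align*}
```

The third line is equation (1) with $q_i = X u_i$, so the mode is obtained without ever forming the M by M matrix, provided $\lambda_i \neq 0$. The last line shows that if $u_i$ has unit length, dividing $X u_i$ by $\sqrt{\lambda_i}$ gives a unit-length mode.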

The principal mode extracted from a short film clip is shown in Figure 1 and is

used later to help an animator to construct a cartoon version of the clip.

It is tempting to think that such modes could be used directly to predict, say, the

rotation of the man's shoulders. However, the second mode also encodes information

about shoulder movement and it is only by combining information from many modes

that rotation can be reliably predicted.


