
[IEEE 2012 IEEE Conference on Technologies for Practical Robot Applications (TePRA) - Woburn, MA, USA (2012.04.23-2012.04.24)] 2012 IEEE International Conference on Technologies for Practical Robot Applications

Human Visual System Inspired Object Detection and Recognition

Debashree Mandal, Karen Panetta, Fellow, IEEE
Department of Electrical and Computer Engineering
Tufts University, Medford, MA, USA
[email protected], [email protected]

Sos Agaian, Senior Member, IEEE
Department of Electrical and Computer Engineering
University of Texas at San Antonio, San Antonio, TX, USA
[email protected]

Abstract— This paper presents a new generic framework for human visual system inspired object detection and recognition and introduces the idea of feature extraction based on human visual sensitivity. These methods can greatly enhance robotic vision applications. Additionally, a new computationally efficient object detection algorithm is presented based on image morphology and visual sensitivity. This new method surpasses the performance of the existing method based on traditional edge detectors. We also demonstrate the effectiveness of the algorithm on under-illuminated images.

Keywords-Human Visual System, Object Detection, Feature Extraction, Robot Vision, Image Morphology

I. INTRODUCTION

Computer vision techniques are widely used to address real-world problems that involve processing voluminous image data, requiring massive computing effort and a high level of accuracy. Developing computer vision algorithms for real-time applications in robot vision systems requires sophisticated image processing methods. Such applications include solving complex object recognition and classification problems in security and medical settings. In security and automatic surveillance systems, computer vision techniques are widely used to identify threat objects such as guns or other weapons in scanned images. Other well-known applications include face detection and detecting eyes within facial images. For example, a majority of car accidents are caused by driver fatigue. Hence automated systems have been developed that track the driver's eyes in subsequent frames and warn the driver if the eyes remain closed for an extended period. Pedestrian detection for security surveillance is also based on computer vision techniques. Medical applications include the detection of tumors and lesions in CT scan images.

All these methods are fundamentally based on training the system for the application and then characterizing a new object based on the training set. In computer vision applications, feature vector extraction is widely used as a method of object characterization. The primary goal behind this object representation technique is to reduce the dimensionality of the problem by generating a compressed object representation, resulting in less processing effort and lower memory requirements. Feature vector algorithms that are based on the binary edge map of the image include Cell Edge Distribution (CED) and Principal Projected Edge Distribution (PPED) [1][2]. Morphological operations are also used on binary edge maps to extract essential features from images [3]. The Histogram of Oriented Gradients (HOG) [4] is a feature extraction method based on the gradient image instead of edge maps. Feature vectors can also be extracted from the pixel intensity values alone; the Local Binary Patterns (LBP) [5][6] method uses intensity values as the basis for feature extraction. One widely used statistical method of feature extraction is Principal Component Analysis (PCA) [7]. Traditionally, feature vector extraction algorithms do not take into consideration the characteristics of the human visual system. In this paper, we propose an improved framework for feature extraction that is based on the human visual system.

The remainder of this paper is organized as follows. Section II provides background information regarding human visual system (HVS) based image decomposition and feature extraction based on HVS. Section III describes the improved algorithm of detecting eyes from facial images based on HVS and morphological image processing. Section IV presents experimental results and comparison.

II. BACKGROUND INFORMATION

A. Human Visual System (HVS) based image decomposition

HVS based image decomposition aims at emulating the way in which human eyes respond to visual stimuli [8]. Information received by the human eye is characterized by attributes such as brightness, edge information and color shades. Brightness is a psychological sensation associated with the amount of light stimulus entering the eye. Due to the great adaptive ability of the eye, the human eye cannot measure absolute brightness; rather, it measures relative brightness [9]. In this context, contrast C can be defined as the ratio of the difference in luminance of an object Bo and its immediate surrounding Bs to the surround luminance. Mathematically,

C = (Bo − Bs)/Bs = ∆B/Bs.    (1)
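As a quick illustration, Eq. (1) can be computed directly; the function below is a minimal sketch (the names are ours, not from the paper):

```python
def weber_contrast(b_object, b_surround):
    """Weber contrast C = (B_o - B_s) / B_s of an object against its
    immediate surround, per Eq. (1)."""
    return (b_object - b_surround) / b_surround

# An object twice as bright as its surround has unit contrast.
print(weber_contrast(200.0, 100.0))  # 1.0
```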

The visual increment threshold (or just noticeable difference) is defined as the amount of light ∆BT necessary to add to a visual field of intensity B such that it can be discriminated from a reference field of the same intensity B. At low light intensities, near the absolute visual threshold, the luminance increment threshold is constant; then, with increasing intensity, the threshold converges asymptotically to Weber behavior, i.e., ∆BT ∝ B.

978-1-4673-0856-4/12/$31.00 ©2012 IEEE

The characteristic response of the human observer can be represented in the log ∆BT vs. log B plane. A piecewise linear approximation of the curve is presented in Fig. 1.

Figure 1: Linear approximation of the Increment Threshold log ∆BT as a function of Reference Intensity log B

The Weber behavior is generally expressed by the unit slope of the logarithmic curve. The preceding region, with slope 1/2, is known as the De Vries-Rose region. It has been shown that if the central visual processor behaves as an optimum probabilistic detector, the incremental visual threshold follows the square-root law, i.e., ∆BT ∝ √B. In practice, however, this rule is followed only in a small restricted region. The region after the Weber region has a slope of 2 and represents the saturation region. Although this behavior is not commonly exhibited by the retinal cones, it can be expected in some restricted cases.

The linear equations defining the De Vries-Rose region, the Weber region and the saturation region are given respectively by,

log ∆BT = (1/2) log B + log K2    (2)

log ∆BT = log B + log K1    (3)

log ∆BT = 2 log B + log K3    (4)

Here K1, K2 and K3 are constants. The value of B corresponding to log B = xi is denoted Bxi, for i = 1, 2, 3.

Assuming x0 and Bt are the maximum values of log B and B, we can write

Bxi = αi Bt,  i = 1, 2, 3.    (5)

Here 0 < α1 < α2 < α3 < 1, and these represent the first set of parameters for HVS based image decomposition. The values of α are based upon the three different regions of the human visual response. The second parameter, β, represents the slope of the Weber region in terms of (∆BT/B)max. Equations (2), (3) and (4) are solved to yield the values of the constants as,

K1 = ∆BT/B = (β/100) (∆BT/B)max    (5)

K2 = K1 √Bx2    (6)

K3 = K1 / Bx3    (7)
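Under our reading of Eqs. (5)-(7) (the extracted text is ambiguous, so the continuity conditions below are an assumption on our part), the three constants can be computed as:

```python
import math

def hvs_constants(beta, dbt_over_b_max, b_x2, b_x3):
    """Constants of the piecewise threshold model: K1 from the Weber
    slope parameter beta, K2 and K3 by continuity of log(dBT) at the
    region boundaries B_x2 and B_x3 (our assumed reading)."""
    k1 = (beta / 100.0) * dbt_over_b_max
    k2 = k1 * math.sqrt(b_x2)  # De Vries-Rose / Weber boundary
    k3 = k1 / b_x3             # Weber / saturation boundary
    return k1, k2, k3

# Illustrative values only; the paper determines beta experimentally.
k1, k2, k3 = hvs_constants(beta=5.0, dbt_over_b_max=2.0, b_x2=25.0, b_x3=230.0)
print(k1, k2)  # 0.1 0.5
```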

The background intensity image B can be obtained by taking the local mean at each point in the image and is given by [10][11],

B(x,y) = (1/2) [ (1/4) Σ(i,j)∈Q X(i,j) + (1/(4√2)) Σ(k,l)∈Q' X(k,l) + X(x,y) ]    (8)

Here B(x,y) represents the background intensity at each pixel and X(x,y) is the input image. Q represents the pixels directly to the left, right, up and down of the pixel, and Q' represents the pixels diagonally one pixel away. ∆BT can be estimated by any standard gradient detection algorithm; for our experiments we have used the Sobel operator to estimate the gradient image X'(x,y).
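A sketch of Eq. (8) as we have reconstructed it (the extracted formula was garbled, so the exact weighting is our reading), using edge replication at the image borders:

```python
import numpy as np

def background_intensity(X):
    """Local weighted mean per our reading of Eq. (8): 4-neighbours Q
    weighted 1/4, diagonal neighbours Q' weighted 1/(4*sqrt(2)), plus the
    pixel itself, all scaled by 1/2. Borders use edge replication."""
    Xp = np.pad(X.astype(float), 1, mode="edge")
    q = Xp[:-2, 1:-1] + Xp[2:, 1:-1] + Xp[1:-1, :-2] + Xp[1:-1, 2:]   # Q
    qd = Xp[:-2, :-2] + Xp[:-2, 2:] + Xp[2:, :-2] + Xp[2:, 2:]        # Q'
    return 0.5 * (0.25 * q + qd / (4.0 * np.sqrt(2.0)) + X)

# For a constant image, every pixel receives the same background value.
B = background_intensity(np.full((5, 5), 100.0))
```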

Using this information, the image is first partitioned into the different regions of human visual response. These regions are characterized by the minimum difference between two pixel intensities required for the human visual system to register a difference. Next, these three regions are thresholded, removing the pixels which do not constitute a noticeable change for a human observer and placing them in a fourth image. The formulae are:

Im1 = X(x,y)  for  Bx2 ≥ B(x,y) ≥ Bx1  and  X'(x,y)/√B(x,y) ≥ K2    (8)

Im2 = X(x,y)  for  Bx3 ≥ B(x,y) ≥ Bx2  and  X'(x,y)/B(x,y) ≥ K1    (9)

Im3 = X(x,y)  for  B(x,y) ≥ Bx3  and  X'(x,y)/B(x,y)² ≥ K3    (10)

Im4 = X(x,y)  for all remaining pixels    (11)

The thresholded images given by (8), (9) and (10) represent the regions where human eyes can perceive a noticeable difference in intensity with respect to the image background. The fourth image, given by (11), represents regions of the image where the intensity values remain constant according to human visual perception. Typically the Weber region is given the maximum weight for image decomposition. The thresholding parameter β determines the amount of information placed in the fourth region as compared to the other regions, which represent the intensity-change information in an image. Fig. 2 shows an example of image decomposition based on HVS.
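The four-way split above can be sketched as follows; this is our illustrative reading (the square root in the De Vries-Rose test follows the square-root law discussed earlier), not the authors' code:

```python
import numpy as np

def hvs_decompose(X, B, Xg, b_x1, b_x2, b_x3, k1, k2, k3):
    """Partition image X into the De Vries-Rose, Weber, saturation and
    'Other' sub-images, given background B and gradient magnitude Xg."""
    eps = 1e-9  # guard against division by zero in dark regions
    dvr = (B >= b_x1) & (B <= b_x2) & (Xg / np.sqrt(B + eps) >= k2)
    web = (B > b_x2) & (B <= b_x3) & (Xg / (B + eps) >= k1)
    sat = (B > b_x3) & (Xg / (B + eps) ** 2 >= k3)
    other = ~(dvr | web | sat)
    return [np.where(m, X, 0.0) for m in (dvr, web, sat, other)]
```

The three perceptible regions are disjoint by construction, so summing the four sub-images reproduces the original image, mirroring Fig. 2(f).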

B. Generic Architecture For HVS Based Object Detection and Recognition

Human Visual System based image processing has traditionally been used for edge detection in images. In this paper we present a novel architecture for HVS based object detection and recognition. Typically an object detection and recognition system has three phases, namely training, testing and classification.

Figure 2: (a) Original Image; HVS based Images in the (b) Weber Region, (c) De Vries-Rose Region and (d) Saturation Region; (e) Remaining image pixels; and (f) Result of arithmetic addition of (b), (c), (d) and (e)

The training phase typically consists of extracting feature vectors from the training database containing objects of interest. In the proposed method, we decompose the training images into HVS based sub-images and extract feature vectors from each of them. In the next step, the feature vectors are weighted and combined, a process referred to as feature fusion. Fig. 3A represents the training phase of the proposed architecture.

Figure 3A: Schematic Diagram Representing the HVS based Training Phase for Object Detection

Figure 3B: Schematic Diagram Representing the HVS based Testing Phase for Object Detection

Figure 3C: Schematic Diagram Representing the Classification Phase

In the testing phase, the test image undergoes HVS based decomposition and feature extraction from each of the sub-images. Fig. 3B represents the testing phase.

The third phase classifies the feature vectors from the test images. Classification can be based on a set of rules applied to the feature vectors obtained from the test image; we use this method in our experiments for eye detection. In more generic terms, classification consists of comparing feature vectors from the training and testing databases and then making a decision depending on their similarity. The classifier used can be a simple distance classifier using Manhattan or Euclidean distances, or a more sophisticated classifier such as a Support Vector Machine (SVM) or a neural network. Fig. 3C shows the classification procedure.
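As a sketch of the simplest distance-based option mentioned above (the feature values and labels here are hypothetical, purely for illustration):

```python
import math

def classify(test_vec, training_set):
    """Nearest-neighbour classification: return the label of the training
    feature vector closest to test_vec in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(training_set, key=lambda label: dist(test_vec, training_set[label]))

# Hypothetical fused feature vectors for two classes.
train = {"eye": [0.9, 0.8, 0.1], "not_eye": [0.1, 0.2, 0.9]}
print(classify([0.8, 0.7, 0.2], train))  # eye
```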

III. A NEW METHOD FOR HVS BASED EYE DETECTION

In this paper we present a new algorithm for detecting eyes from facial images based on the edge density of the image. The original algorithm proposed in [3] has been modified by incorporating the human visual response. Experimental results are presented to show the improvement with respect to the original algorithm. Specifically, we demonstrate the improvement on images suffering from poor lighting conditions, which are known to plague images taken from robot platforms used in military applications. Our algorithm uses morphological image processing to extract edge density information from the image.


Figure 4: Schematic Diagram for Eye Detection Based on HVS based Image Decomposition and Morphological Image Processing

Fig. 4 shows the flow diagram of the HVS based algorithm for eye detection from facial images. Eyes generally appear as circular structures, and hence the objective of the eye detection algorithm is to look for circular features in the facial image.

Morphological operations are usually performed on binary images. In this algorithm, the decomposed images in the different regions are converted to binary images by applying the Sobel edge detector. Since the algorithm processes binary images after the edge detection step, it is computationally efficient and is suited to applications where hardware utilization is a constraint. Further, no training database is required in this case, because eyes are represented as circular structures on the binary image; the objective is simply to find blob-like features in the image after the morphological operations.

The algorithm takes an input image and converts it to grayscale. An example input image and its grayscale version are shown in Fig. 5(a). It is assumed that the input image satisfies the following conditions:

• The image is a head-shoulder image.

• Both eyes are visible.

• The head is tilted by no more than 45 degrees.

• The eyes are not closed.

A. Image Decomposition

The 2-dimensional image is then subjected to HVS based decomposition, resulting in four images: the Weber region, the De-Vries Rose region, the Saturation region and the “Other” region.

B. Edge Detection

Each of the images is then subjected to edge detection. In our experiments the Sobel edge detector has been used. The Canny edge detector tends to bring out more detail in the image using two threshold values; when we apply morphological operations on a Canny edge map, dilation connects the detailed regions, making it difficult to find distinct blobs representative of the eye.

C. Morphological Operations

The binary edge maps are subjected to morphological operations, namely dilation, hole filling and erosion. Dilation grows or thickens objects in a binary image and enhances the edges of the eye regions. The image is dilated twice using circular structuring elements of radius 4. The edges near the eyes thicken and holes in the eyes get filled after this step. To make sure no holes remain in the eye region, an additional hole-filling step is incorporated. During the dilation phase, some unwanted edges are also enhanced; to remove them, the image is eroded three times. Figs. 5(b), 5(c) and 5(d) show the results of morphological dilation and erosion in the Weber, “Other” and Saturation regions. The De-Vries Rose region does not contain any useful information and hence is not shown.
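The dilation and erosion steps can be sketched without a toolbox as follows (MATLAB's `imdilate`/`imerode` with a disk structuring element play the same role); this is an illustrative NumPy implementation, not the authors' code:

```python
import numpy as np

def disk(radius):
    """Circular (disk) structuring element, as used for the dilation steps."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return x * x + y * y <= radius * radius

def dilate(img, se):
    """Binary dilation: a pixel is set if the structuring element centred
    there overlaps any foreground pixel. Outside the image is background."""
    r = se.shape[0] // 2
    p = np.pad(img.astype(bool), r)
    h, w = img.shape
    out = np.zeros(img.shape, dtype=bool)
    for dy in range(se.shape[0]):
        for dx in range(se.shape[1]):
            if se[dy, dx]:
                out |= p[dy:dy + h, dx:dx + w]
    return out

def erode(img, se):
    """Binary erosion via duality: erode(A) = NOT dilate(NOT A). With this
    padding choice, pixels outside the image count as foreground."""
    return ~dilate(~img.astype(bool), se)
```

A single foreground pixel dilated by `disk(1)` grows to a 5-pixel plus shape; eroding with the same element restores the single pixel.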

D. Fusion

The morphologically processed images in the different regions of the human visual response are then fused using arithmetic operations. From our experiments we have seen that for well-illuminated images, the majority of the information lies in the Weber and “Other” regions of the image. Hence the De-Vries Rose and Saturation regions can be excluded from processing, thereby reducing the computational overhead. However, for images that are not well illuminated and have regions of shadow in them, we need to use all the components. After this step the resultant binary image contains blobs which need to be classified as eyes. Fig. 5(e) (left) shows the result after fusion.

E. Classification

The blobs are classified as eyes according to the following set of rules, applied successively:

• The aspect ratios (width/height) of the eyes regions are between 0.8 and 8.

• The orientation angle of eyes is not greater than 45 degrees.

• Size of both eyes is comparable. The ratio of the size of the blobs corresponding to the eyes should be between 0.4 and 2.5.

• The line joining the two eyes should not have a slope greater than 45 degrees.

• The eyes are not too close to the borders of the image.

The final result is shown in Fig. 5(e) (right).
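The rule set above can be sketched as a pair-wise check over candidate blobs. The blob representation (centroid, bounding box, area) and the border margin are our illustrative assumptions, and the per-blob orientation rule is omitted since it needs an ellipse fit:

```python
def plausible_eye_pair(b1, b2, img_w, img_h, margin=10):
    """Apply the classification rules to a candidate blob pair. Each blob
    is a dict with centroid x, y, bounding-box w, h, and area (a
    hypothetical representation of connected-component properties)."""
    for b in (b1, b2):
        if not 0.8 <= b["w"] / b["h"] <= 8:              # aspect ratio rule
            return False
        if not (margin < b["x"] < img_w - margin and
                margin < b["y"] < img_h - margin):       # border rule
            return False
    if not 0.4 <= b1["area"] / b2["area"] <= 2.5:        # comparable sizes
        return False
    # Line joining the eyes must not be steeper than 45 degrees.
    if abs(b2["y"] - b1["y"]) > abs(b2["x"] - b1["x"]):
        return False
    return True

left = {"x": 30, "y": 40, "w": 12, "h": 6, "area": 60}
right = {"x": 70, "y": 42, "w": 11, "h": 6, "area": 55}
print(plausible_eye_pair(left, right, 100, 100))  # True
```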


Figure 5: (a) Original Image and its grayscale version; (b) Result after dilation and erosion in the Weber region; (c) Result after dilation and erosion in the “Other” region; (d) Result after dilation and erosion in the Saturation region; (e) Image after fusion (left) and Final Result (right)

IV. EXPERIMENTAL RESULTS

The algorithm is implemented using the MATLAB Image Processing Toolbox and has been tested on the PICS facial database [12]. 93 images taken under regular conditions were selected. Fig. 6 shows some of the results. The facial images are mostly frontal, and some of the subjects wore glasses. Table I and Table II show the percentage accuracy and compare the HVS based method with the original non-HVS based algorithm. We compare at two stages of the algorithm: the blob-like feature extraction stage and the final eye extraction stage after classification. The alpha and beta parameters for HVS decomposition have been determined experimentally. For images having good illumination, the optimum parameter values are α1 = 0, α2 = 0.1, α3 = 0.9, β = 0.05. We have also shown the variation of the feature extraction rate with the thresholding parameter β for HVS decomposition; the plot is given in Fig. 7.

We have also performed experiments with images taken under poor lighting conditions. It has been observed that for such images, the region surrounding the eyes tends to have shadows. We have increased the span of the De-Vries Rose region to capture the details in the presence of shadows. The modified parameters are α1 = 0, α2 = 0.3, α3 = 0.9, β = 0.03. Fig. 8 shows an example image where the left half of the image is darker than the right half; non-HVS based processing fails to extract blobs in this case.

Figure 6: Experimental Results

TABLE I. COMPARISON OF DETECTION RATE IN FEATURE EXTRACTION STEP

METHOD      TOTAL IMAGES   CORRECT DETECTION   PERCENTAGE ACCURACY
HVS BASED   92             86                  93.4%
NON-HVS     92             74                  80.4%

TABLE II. COMPARISON OF DETECTION RATE IN FINAL EYE DETECTION STEP

METHOD      TOTAL IMAGES   CORRECT DETECTION   PERCENTAGE ACCURACY
HVS BASED   92             80                  86.9%
NON-HVS     92             66                  71.7%


Figure 7: Variation of the rate of blob extraction vs. the HVS thresholding parameter β: the results outperform non-HVS based processing for β in the range 0.03 to 0.09

Figure 8: (a) Results after morphological operations in the Weber Region; (b) Results after morphological operations in the De-Vries Rose Region; (c) Results after morphological operations in the “Other” Region; (d) Fusion of (a), (b) and (c); (e) Final Result; (f) Results after morphological processing with the non-HVS method; (g) Final result without HVS decomposition

V. CONCLUSION AND FUTURE WORK

In this paper we have presented a generic framework based on human visual system based image decomposition and have shown an example eye detection system that outperforms its non-HVS based counterpart. The method is fast and computationally efficient, making it suitable for robot vision systems, and it works faster than standard template matching algorithms for eye detection. As a next step, we have applied HVS based processing to the development of a face recognition system using eigenfaces, where comparable results have been obtained.

REFERENCES

[1] Y. Suzuki and T. Shibata, "Multiple-clue face detection algorithm using edge-based feature vectors," Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, vol. 5, pp. V-737-40, May 2004.

[2] S. Nercessian, K. Panetta, and S. Agaian, "Improving edge-based feature extraction using feature fusion," Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on, pp. 679-684, 12-15 Oct. 2008.

[3] M. Shafi and P. W. H. Chung, "Eyes extraction from facial images using edge density," Cybernetic Intelligent Systems, 2008. CIS 2008. 7th IEEE International Conference on, pp. 1-6, Sept 2008.

[4] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 886-893, June 2005.

[5] D. Maturana, D. Mery, and Á. Soto, "Face Recognition with Local Binary Patterns, Spatial Pyramid Histograms and Naive Bayes Nearest Neighbor Classification," Chilean Computer Science Society (SCCC), 2009 International Conference of the, pp. 125-132, Nov 2009.

[6] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, pp. 971-987, July 2002.

[7] A. I. Pustianu, A. Serbencu, and D. C. Cernega, "Mobile robot control using face recognition algorithms," System Theory, Control, and Computing (ICSTCC), 2011 15th International Conference on, pp. 1-6, Oct 2011.

[8] M. K. Kundu and S. K. Pal, "Thresholding for edge detection using human psychovisual phenomena," Pattern Recognition Letters 4, pp. 433-441, December 1986.

[9] G. Buchsbaum, "An Analytical Derivation of Visual Nonlinearity," IEEE Transactions on Biomedical Engineering, vol. BME-27, May 1980.

[10] E. Wharton, S. Agaian, and K. Panetta, "A logarithmic measure of image enhancement," Proc. SPIE 6250, 62500P, 2006.

[11] K. Panetta, E. Wharton, and S. Agaian, "Human Visual System-Based Image Enhancement and Logarithmic Contrast Measure," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 38, pp. 174-188, Feb 2008.

[12] PICS, 2003. Psychological Image Collection at Stirling. Available from <http://pics.psych.stir.ac.uk/>, University of Stirling Psychology Department.
