P-FAD: REAL-TIME FACE DETECTION SCHEME



P-FAD: Real-time Face Detection Scheme on Embedded Smart Cameras

Chapter 1

INTRODUCTION

Embedded smart cameras have driven a dramatic shift towards distributed surveillance systems by combining sensing, processing and communication on a single platform. A critical issue in embedded smart cameras is their limited resources, which poses great challenges in designing fast and efficient vision algorithms. It is therefore important to consider a vision algorithm's efficiency, memory requirements and portability to an embedded processor during algorithm design.

Face detection has been one of the most studied topics in the computer vision literature, and is the stepping stone to all facial analysis algorithms. As a fundamental computer vision problem, the goal of face detection is, given an arbitrary image, to determine whether or not there are any faces in the image and, if present, return the image location and extent of each face.

Most face detection approaches can be categorized as knowledge-based, feature-based, template-based and appearance-based methods. However, little attention has been paid to an algorithm's processing-time efficiency or to meeting the real-time requirements of resource-limited applications. For example, the face detection method for colour images proposed by Hsu et al. needs 540 seconds to process a 640x480 image on a 1.7GHz CPU. The neural network-based upright frontal face detection system presented by Rowley et al. takes approximately 383 seconds to process a 320x240 image. The so-called most successful and fastest Viola-Jones detector can process 384x288 images at 15 FPS (frames per second) on a conventional desktop but only 2 FPS on a low-power 200 MIPS StrongARM processor; moreover, that image size is too small to be preferable (a 640x480 resolution is commonly used), and 2 FPS on a StrongARM is clearly unacceptable for a real-time application based on embedded smart cameras. To obtain real-time performance on video streams, several optimized face detectors have appeared. Most of them use the Viola-Jones face detector and optimize the software and/or hardware implementation to improve system performance. Table 1.1 summarizes some embedded system-oriented implementations.

TABLE 1.1

IMPLEMENTATIONS OF THE VIOLA-JONES DETECTOR

Chapter 2

DESIGN GOALS

2.1 PROBLEM DEFINITION

In traditional algorithm design, more attention is paid to detection accuracy than to processing efficiency or resource-limited conditions. On the other hand, for some of the hardware- and/or software-optimized implementations in Table 1.1, the frame rates are moderate; however, all of these implementation platforms are ASICs, DSPs and FPGAs, which are highly specialised and customised processors. In a real smart camera network application, every camera mote may face changing tasks as the situation varies. The goal, therefore, was to design a light-weight face detector on an embedded smart camera with a general-purpose processor, one that consumes few resources while achieving real-time operation and acceptable detection performance.

2.2 PROPOSED SOLUTION

This work is based on the observation that computation and storage overhead in image processing grow in proportion to the amount of pixel manipulation. A natural approach is to construct a hierarchical scheme: identify face candidates with little computation and manipulation on the full image, then elicit the true faces from the candidates with a reliable algorithm. The key challenge in such a hierarchical scheme is how to construct a multi-layer architecture in which the complex processing is split from the pixel manipulation while detection accuracy is simultaneously guaranteed. This problem is solved by proposing Pyramid-like Face Detection (P-FAD), which consists of five layers whose operating units decrease dramatically from top to bottom while the operations on every unit increase gradually. P-FAD addresses this challenge using a three-stage coarse, shift and refine process. P-FAD first imposes coarse operations on every pixel for skin detection; this is extremely efficient without losing robustness to a changing environment. P-FAD then shifts the operating units: layers 2-4 move the operations from pixel manipulation to contour points, grouped regions and face candidates. Finally, P-FAD presents a modified Viola-Jones detector to refine the final results. The scheme was implemented both on a notebook and on an embedded smart camera platform. Experimental results demonstrate P-FAD's resource-aware properties: it can process a VGA image in just 7.23ms on a notebook and 28.3ms on a light-weight embedded smart camera while still holding acceptable detection accuracy compared to the OpenCV implementation of the Viola-Jones Haar detector. Moreover, P-FAD is not customised or optimised for any given hardware platform, so its resource-aware properties also carry over to other general-purpose smart camera platforms.
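The five-layer flow described above can be sketched as a chain of functions, each handing far fewer units to the next. The following Python skeleton is purely illustrative: the fixed CbCr rectangle, the function names, and the grouping/filtering stubs are assumptions for the sketch, not the authors' implementation.

```python
import numpy as np

def skin_detection(frame):
    """Layer 1: coarse per-pixel skin test (cheapest, runs on every pixel).
    Placeholder rule: Cb/Cr values inside a fixed rectangle (assumed bounds)."""
    cb, cr = frame[..., 1].astype(int), frame[..., 2].astype(int)
    return (cb > 77) & (cb < 127) & (cr > 133) & (cr < 173)

def contour_points(mask):
    """Layer 2: keep only transition points between skin and non-skin."""
    edges = mask[:, 1:] != mask[:, :-1]          # horizontal transitions
    ys, xs = np.nonzero(edges)
    return list(zip(ys.tolist(), xs.tolist()))

def dynamic_group(points):
    """Layer 3: group nearby contour points into candidate regions (stub)."""
    return [points] if points else []

def merge_and_filter(regions):
    """Layer 4: merge split regions, drop implausible ones (stub)."""
    return regions

def haar_refine(candidates):
    """Layer 5: verify the few remaining candidates with a Haar classifier (stub)."""
    return candidates

def p_fad(frame):
    """Run the five layers in order on a YCbCr frame."""
    mask = skin_detection(frame)
    return haar_refine(merge_and_filter(dynamic_group(contour_points(mask))))
```

Note how each stage's input shrinks: a full pixel array, then a point list, then a handful of regions, mirroring the inverted-pyramid structure described above.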

Chapter 3

PYRAMID-LIKE FACE DETECTION SCHEME

In this section, the hierarchical framework for face detection on embedded smart cameras is briefly introduced, and then the focus turns to the challenging issues in constructing the hierarchical scheme. P-FAD is a hierarchical detection scheme. More specifically, as shown in Fig. 3.1, P-FAD consists of five layers: skin detection, contour point detection, dynamic group, region merge and filter, and Haar face detection. P-FAD first uses a relatively coarse skin detection to detect skin; then, through contour point detection, dynamic grouping, and region merging and filtering, P-FAD shifts the operating unit from pixels to contour points, regions and face candidates; finally, the results are refined by the Haar face detection. The hierarchical detection scheme is tailored to real-time detection with low computation and storage overhead: the operating units decrease dramatically from top to bottom while the operations on each unit increase. It keeps pixel manipulation as low as possible to achieve a significant reduction in time cost, and guarantees detection accuracy through the later, more complex processing. Thus, P-FAD has an inverted-pyramid-like appearance in the scale of each layer's operating units, while the processing complexity increases with a pyramid-like shape. To achieve low overhead and high detection accuracy, the following critical issues must be answered in P-FAD:

Derive efficient detection regions with as few operations on the full image as possible.

Achieve a robust detector with high accuracy under a changing environment, such as varying illumination and different individuals.

Fig 3.1. Pyramid-like architecture. MA is short for memory access operation; NI means normal instruction.

Chapter 4

LAYERS OF PYRAMID LIKE FACE DETECTION SCHEME

4.1 SKIN DETECTION


Skin detection is the first layer of P-FAD involving pixel manipulation. Because pixel manipulation accounts for most of the processing time in image processing, a crucial issue in skin detection is process complexity. To reduce processing time significantly, the basic design principle is a relatively coarse but highly time-saving skin detection. In P-FAD, skin detection is based on skin-colour information, as skin colour provides computationally effective yet robust information against rotations, scaling and partial occlusions. Further, the skin colour is modelled in the CbCr subset of the YCbCr colour space; the CbCr subset can eliminate the luminance effect and provides nearly the best performance among different colour spaces. To classify a pixel as a skin pixel or a non-skin pixel, we choose the widely used Gaussian mixture model (GMM), which has relatively simple parameters without losing accuracy, to represent the skin-colour distribution through its probability density function (PDF) in the CbCr subspace, defined as:

p(x(t)) = ∑_{i=1..K} ωi,t · ηi(x(t); μi,t, Σi,t)     (1)

where t denotes the frame index, x(t) is a two-dimensional colour vector in the CbCr subspace, and ηi(x(t); μi,t, Σi,t) is the i-th single Gaussian model (SGM) component, contributing to the mixed model with weight ωi,t. The SGM is an elliptical (two-dimensional) Gaussian joint probability density function, determined by its mean vector μi,t and covariance matrix Σi,t. Finally, a pixel with colour vector x(t) is judged to be a skin-colour pixel or not by comparing p(x(t)) with a predefined threshold. The main difficulties in implementing the GMM in P-FAD are the following:

The fixed Gaussian parameters μi,t, Σi,t obtained by an offline training procedure on a large face dataset are not robust to a changing environment.

The computation overhead of evaluating Eq. (1) on every pixel, to judge whether it is a skin-colour pixel, is high.
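As a concrete illustration of Eq. (1), the sketch below evaluates the mixture density for a CbCr colour vector. The two-component parameters are made-up placeholders for the sketch, not trained values.

```python
import numpy as np

def sgm_pdf(x, mean, cov):
    """Two-dimensional single-Gaussian density eta(x; mu, Sigma)."""
    d = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ inv @ d)

def gmm_pdf(x, weights, means, covs):
    """Mixture density p(x) = sum_i w_i * eta(x; mu_i, Sigma_i) of Eq. (1)."""
    return sum(w * sgm_pdf(x, m, c) for w, m, c in zip(weights, means, covs))

# Placeholder two-component model in CbCr space (illustrative parameters).
weights = [0.6, 0.4]
means = [np.array([105.0, 150.0]), np.array([115.0, 160.0])]
covs = [np.eye(2) * 50.0, np.eye(2) * 80.0]

skin_like = gmm_pdf(np.array([107.0, 152.0]), weights, means, covs)
non_skin = gmm_pdf(np.array([20.0, 20.0]), weights, means, covs)
```

Comparing `skin_like` and `non_skin` against a threshold is exactly the per-pixel judgment whose cost motivates the simplified rectangle test introduced below.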

FIG 4.1. FLOW CHART OF THE REGION FORMATION PROCEDURE

To solve these two problems, an adaptive GMM skin-colour detection algorithm with online learning and a simplified judging criterion is used. The pseudo-code is given in Algorithm 1. First, two sample sets (Sskin, Sfake) derived from the final output of P-FAD are used to train the adaptive GMM online; the training speed α(t) has a different sign for the two sets, denoting learning and forgetting of each set's current distribution information respectively (lines 5 to 7). Then the parameters μi,t, Σi,t, ωi,t of each SGM are updated based on the current learning parameter α(t) (lines 9 to 15). Note that the 2.5-standard-deviation threshold corresponds to the confidence interval with 95% probability for confirming that the current skin-colour distribution belongs to the given SGM (line 10). Moreover, an SGM is added to or removed from the GMM if no existing SGM can approximately represent the current skin-colour distribution (lines 17 to 22). In the end, a pixel is judged to be a skin-colour pixel or not based on our simplified rectangle judgment.
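A single component's update step might look like the following simplified sketch, written in the spirit of the online learning described above; the exact rules of Algorithm 1 are not reproduced here, and the diagonal-covariance form, sign handling and constants are all assumptions.

```python
import numpy as np

def update_sgm(mean, var, weight, x, alpha):
    """Update one diagonal-covariance SGM with a 2-D colour sample x.

    alpha > 0: sample from Sskin (learn toward it);
    alpha < 0: sample from Sfake (forget / push away) -- an assumed convention.
    """
    d = x - mean
    # 2.5-sigma match test (~95% confidence region for a Gaussian)
    if np.all(np.abs(d) <= 2.5 * np.sqrt(var)):
        rho = abs(alpha)
        sgn = 1.0 if alpha > 0 else -1.0
        mean = mean + sgn * rho * d                     # pull/push the mean
        var = np.maximum(var + sgn * rho * (d * d - var), 1e-3)
        weight = weight + alpha * (1.0 - weight)        # grow/shrink the weight
    else:
        weight = weight * (1.0 - abs(alpha))            # unmatched: decay weight
    return mean, var, weight
```

In a full implementation this would be applied to every SGM in the mixture, followed by the add/remove step for unrepresented distributions.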

The basic idea of the simplification is to approximate the equal-value boundary of p(x(t)) by a single ellipse and project this ellipse onto the two axes to get its minimum enclosing rectangle. Specifically, Eq. (3) is used to calculate the mean values of the GMM's parameters μi,t and Σi,t as the approximate ellipse's position and shape, denoted μ(t) and Σ(t) respectively. Then the covariance matrix Σ(t) of the ellipse is decomposed to obtain the ellipse's rotation angle (from the orthogonal matrix C) and the lengths of its axes, so that Eq. (4) can be used to obtain the width and height of the ellipse's minimum enclosing rectangle, denoted Wcr(t) and Wcb(t) respectively. Obviously, the centre position of the rectangle is the same as that of the approximate ellipse.

Here Σ(t) = C D Cᵀ, where C is an orthogonal matrix, D is a diagonal matrix, and d11, d22 are the diagonal elements of D. Using the judgment threshold, the lengths of the approximate ellipse's axes can be obtained, and the ellipse's rotation angle can be easily computed from the orthogonal matrix C.
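The rectangle judgment can be illustrated directly: for an ellipse dᵀΣ⁻¹d = c, the projection onto axis k has half-width √(c·Σkk), which yields the minimum enclosing rectangle without explicitly rotating the ellipse. The threshold c and the example model parameters below are illustrative choices, not values from the scheme.

```python
import numpy as np

def enclosing_rectangle(cov, c=2.5 ** 2):
    """Half-widths (Wcb/2, Wcr/2) of the axis-aligned minimum enclosing
    rectangle of the ellipse d' Sigma^{-1} d = c."""
    return np.sqrt(c * np.diag(cov))

def is_skin(x, mean, cov, c=2.5 ** 2):
    """Cheap per-pixel test: inside the rectangle instead of the exact PDF.
    Only two subtractions and four comparisons per pixel."""
    half = enclosing_rectangle(cov, c)
    return bool(np.all(np.abs(x - mean) <= half))

# Illustrative skin model with correlated Cb/Cr.
mean = np.array([110.0, 150.0])
cov = np.array([[40.0, 10.0], [10.0, 30.0]])
```

The rectangle slightly over-covers the ellipse (it admits the ellipse's corners of its bounding box), which is the accuracy traded for the very low per-pixel cost.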

4.2 CONTOUR POINT DETECTION, DYNAMIC GROUP, AND REGION MERGE

After skin detection, the next step in P-FAD is to shift the operations from pixel manipulation to regions. To refine the process, P-FAD presents a three-layer architecture in which the operating units go from contour points, to grouped regions, to face candidates. Correspondingly, the operations on each unit increase from a few normal instructions (NIs), to tens of NIs, to hundreds of NIs. P-FAD adopts an AIMD-based (additive increase, multiplicative decrease) contour point detection scheme and a dynamic-group-based point classification method for foreground detection on embedded smart cameras. Region merging and filtering are then used in an integrated region formation procedure, as shown in Fig. 4.1. The region merge joins small regions, which are usually split by eyebrows or glasses, into a complete face candidate; the filter procedure eliminates non-face regions through prior knowledge, such as a height-width ratio ranging from 1.1 to 1.5 in our scheme. Table 4.1 shows the time consumption comparison between our proposed region formation and the traditional connected-component finding approach. The implementation of the connected-component approach is taken from OpenCV 2.3 [9], with a 3x3 morphological filter mask and 3 iterations to obtain a good filter performance. The results show that our region formation method achieves a remarkable time saving compared to the traditional component finding method with (Condition I) and without (Condition II) morphological filtering. Moreover, the time complexity of the integrated region formation scheme adapts to the number of faces.
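The merge and filter steps above might be sketched as follows; the (x, y, w, h) box representation and the bounding-box merge rule are assumptions for illustration, with only the 1.1-1.5 height-width ratio taken from the text.

```python
def filter_regions(regions, lo=1.1, hi=1.5):
    """Drop bounding boxes whose height-width ratio is implausible for a face
    (the prior-knowledge rule quoted in the text)."""
    return [(x, y, w, h) for (x, y, w, h) in regions
            if w > 0 and lo <= h / w <= hi]

def merge_pair(a, b):
    """Merge two boxes into their common bounding box, e.g. a face candidate
    split in two by glasses or eyebrows."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x, y = min(ax, bx), min(ay, by)
    return (x, y, max(ax + aw, bx + bw) - x, max(ay + ah, by + bh) - y)
```

Because these operations run on at most a handful of regions per frame, their cost is negligible next to the pixel-level layers, which is the point of the shift.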

TABLE 4.1.

TIME CONSUMPTION OF REGION FORMATION

4.3 MODIFIED HAAR DETECTOR

The above four layers of P-FAD may produce 0 to 5 face candidates per frame. To verify the final output, the most successful and fastest Viola-Jones Haar-feature face detector is chosen as the final layer. In a typical configuration, a VGA image produces nearly 881,484 sub-windows to be classified as face or non-face in the Viola-Jones detector. By exploiting the above four layers of P-FAD, however, only several hundred sub-windows remain, so the computation overhead of the Viola-Jones detector in P-FAD is reduced significantly. Moreover, the full cascade structure of the Viola-Jones detector is no longer strictly required, because P-FAD already performs early rejection in the four layers above. It is time-consuming when a face sub-window goes through the whole cascade; in the traditional situation, this overhead is compensated by the large time saving from the early rejection of non-face sub-windows. To determine which stage to start at in P-FAD, the cascade's time consumption for different start stages is first modelled:

t(x) = Nreject · t̄reject(x) + Naccept · t̄accept(x)

where t = [t(1) … t(n)]ᵀ collects the time consumption t(x) for each possible start stage x, and there are n stages in the cascade structure. Nreject and Naccept are the numbers of rejected and accepted sub-windows in the whole cascade structure respectively; t̄reject(x) and t̄accept(x) are the expected time consumption to reject and accept a sub-window respectively.

Here pij in matrix P denotes the probability that a sub-window starting from the i-th stage will be rejected in stage j, and aij in matrix A is the sum of the features from stage i to stage j. The time consumption is assumed to be proportional (by a factor k) to the number of processed features. The operator diag(·) maps a matrix to the vector of its diagonal elements. Thus, the detector's time consumption is determined by two sets of arguments: F = [F(1) … F(n)]ᵀ, where F(x) is the number of features in stage x; and P = [P(1) … P(n)]ᵀ, where P(x) is the probability of rejecting a non-face sub-window in stage x. From the OpenCV 2.3 baseline face detector we can obtain F, the cascade detector's feature quantity distribution over the stages; details can be seen in [9]. P(x) is assumed to increase linearly from 50% to 99%, which is consistent with the cascade structure [11]. Practical scanning conditions are simulated using the above formula, and an implementation was also run on a 2.2GHz notebook. Simulation and experimental results in Fig. 4.2 show that the time cost function is convex in the start stage, so the minimum can only be attained at the first or the last stage. In P-FAD, according to Fig. 4.2, the optimal start stage depends on the ratio γ = Nreject/Naccept. The overhead rate ρ = (t(1) − t(n))/(t(1) + t(n)) is defined as the effect of choosing the first stage to start, where a negative value means the choice saves time. Fig. 4.3 shows that when γ is extremely large, which matches the traditional Viola-Jones setting, ρ is near −1, indicating that starting at the first stage undoubtedly saves the most processing time. However, when γ is smaller than 12, ρ becomes positive.

Thus, in P-FAD, when the number of non-face sub-windows is small, choosing the last stage is optimal in terms of time consumption. Based on the above discussion, a modified Viola-Jones detector can be implemented in P-FAD according to an online estimate of γ. Note that the total number of sub-windows is determined by the scan strategy before classification, so only Naccept needs to be estimated. In our implementation, Naccept is simply assumed to be approximately equal to the number of face candidates, which is reasonable given the statistical data in Table 4.2. Obviously, further work could improve the estimation's accuracy.
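The start-stage rule above can be condensed into a few lines. The function below is an illustrative sketch: the crossover value 12 is the γ threshold quoted in the text, while the function name, interface and the "candidates ≈ accepted windows" estimate are assumptions.

```python
def choose_start_stage(n_candidates, n_subwindows, n_stages,
                       gamma_threshold=12.0):
    """Return the cascade stage (1-based) at which classification should begin.

    Large gamma (mostly non-face sub-windows, the traditional setting):
    start at stage 1 to reject cheaply.  Small gamma (P-FAD's setting, few
    sub-windows, mostly faces): start at the last stage.
    """
    n_accept = max(1, n_candidates)      # assume candidates ~ accepted windows
    gamma = (n_subwindows - n_accept) / n_accept
    return 1 if gamma > gamma_threshold else n_stages
```

With hundreds of sub-windows but only a handful of face candidates, γ can fall on either side of the threshold, which is why the decision is made online per frame.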

FIG 4.2. The Viola-Jones detector's time cost for different start stages

FIG 4.3. The overhead of time consumption when starting at the first stage

Table 4.2. Detection results of the video sequence

Chapter 5

P-FAD'S ALGORITHMIC COMPLEXITY

In this section, the computational complexity of the scheme is analysed. First of all, note that the scheme's overall computation is mainly determined by P-FAD's first and second layers; the last three layers can be neglected because of their far fewer operating units. Specifically, Layer 1 and Layer 2 are pixel manipulation, and they are completed simultaneously in a single image scan, which reduces repeated memory accesses. Suppose the image size is N pixels. Layer 1 needs N to 2N memory accesses to get the CbCr values, N to 4N comparison instructions to run the simplified rectangle judgment, and N instructions to link to the second layer. Layer 2 needs at most N/s memory accesses to store the contour points (where the sampling step s is usually 10) and 3N normal instructions. As a result, pixel manipulation needs N to 2.1N memory accesses in total, as well as 5N to 8N normal instructions. Secondly, the time consumption of Layer 3 and Layer 4 is extremely low because of their dynamic properties and their much smaller number of operating units; see Table 4.1. Finally, the time consumption of the Viola-Jones detector in P-FAD is reduced significantly because only hundreds of sub-windows need to be classified. In the traditional Viola-Jones detector, the number of sub-windows is nearly O(N²) owing to scaling and shifting, and the processing time is proportional to it; moreover, our modified Viola-Jones detector reduces the time consumption further. In summary, the computation overhead of P-FAD is O(N), similar to simple image processing functions and much lower than the O(N²) overhead of the traditional Viola-Jones detector.
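To make the traditional detector's near-O(N²) sub-window count concrete, the rough counter below enumerates scan levels for a sliding window. The scan parameters (24x24 base window, 1.25 scale factor, shift growing with the window size) are assumptions, so the exact total is illustrative rather than a reproduction of the 881,484 figure quoted earlier.

```python
def count_subwindows(width, height, base=24, scale=1.25, shift=1):
    """Count sub-windows scanned by a multi-scale sliding-window detector."""
    total, win = 0, float(base)
    while win <= min(width, height):
        # assumed convention: the pixel step grows with the window scale
        step = max(1, int(round(shift * win / base)))
        nx = (width - int(win)) // step + 1
        ny = (height - int(win)) // step + 1
        total += nx * ny
        win *= scale
    return total
```

Even under these modest assumptions a 640x480 image yields hundreds of thousands of sub-windows, while P-FAD hands the classifier only the few hundred windows surviving its first four layers.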

Chapter 6

EXPERIMENTAL RESULTS

The face detection scheme was implemented on our embedded smart camera platform and on a 2.20GHz notebook respectively, to evaluate its fast processing as well as its robust performance. The embedded smart camera platform consists of an Intel XScale PXA270 microprocessor and an OV9655 image sensor, and is based on the CITRIC architecture.

First, the scheme's adaptive GMM algorithm is evaluated on a video sequence. The skin-tone detection's PD (probability of detection) and FA (probability of false alarm) are compared in Fig. 5 among the adaptive GMM algorithm and two fixed rectangle models. The six selected frames correspond to pictures 2, 3, 5, 8, 10 and 16 respectively in Fig. 6.1. It can be seen that the last three frames are darker than the first three because a man stood by the windows, and the persons in different frames have various poses. The adaptive model is robust to this kind of environmental change.

FIGURE 6.1. Frames from the test video sequence

Finally, the P-FAD scheme is implemented on the embedded smart camera to show its resource-aware property. Fig. 6.2 shows the limited capability of the embedded platform by listing the run times of basic image processing functions at three frequencies. The face detection costs only 28.3 ms to process a VGA image, almost the same as a typical background subtraction operation.

FIGURE 6.2 TIME CONSUMPTION ON EMBEDDED SMART CAMERAS

Chapter 7

CONCLUSIONS

P-FAD, a hierarchical framework for reducing the computation and storage cost of face detection on embedded cameras, was implemented. The goal was to reduce pixel manipulation without compromising detection performance. This goal was met by devising a three-stage coarse, shift and refine process, which shifts the operating unit from pixels to contour points, regions and face candidates, and reserves the more complex processing for the remaining promising units. The experimental results exhibit the P-FAD scheme's resource-aware properties: it can process a VGA image in just 7.23ms on a notebook and 28.3ms on a light-weight embedded smart camera while still holding acceptable detection accuracy compared to the OpenCV implementation of the Viola-Jones Haar detector.

REFERENCES

[1] Qiang Wang, Jing Wu, Chengnian Long and Bo LI, "P-FAD: Real-time Face Detection Scheme on Embedded Smart Cameras",Shanghai Jiao Tong University, Shanghai, China

[2] L. Acasandrei and A. Barriga, "Accelerating Viola-Jones face detection for embedded and SoC environments," in Proc. ICDSC Conf., 2011.

[3] M. Bramberger, J. Brunner, B. Rinner, and H. Schwabach, "Real-time video analysis on an embedded smart camera for traffic surveillance," in Proc. ITAS Conf., Toronto, May 2004, pp. 174-178.

[4] P. Chen, P. Ahammad, C. Boyer, et al., "CITRIC: A low-bandwidth wireless camera network platform," in Proc. ICDSC Conf., Aug. 2008, pp. 1-10.

[5] J. Cho, S. Mirzaei, and R. Kastner, "FPGA-based face detection system using Haar classifiers," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2009.

[6] R.-L. Hsu, M. Abdel-Mottaleb, and A.K. Jain, "Face detection in color images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696-706, May 2002.

[7] P. Kakumanu, S. Makrogiannis, and N. Bourbakis, "A survey of skin-color modeling and detection methods," Pattern Recognition, vol. 40, pp. 1106-1122, 2007.

[8] R. Kleihorst, M. Reuvers, B. Krose, and H. Broers, "A smart camera for face recognition," in Proc. ICIP Conf., 2004.

[9] OpenCV: http://www.opencv.org.cn/opencvdoc/2.3.2/html/index.html

[10] H.A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38, 1998.

[11] K. Suzuki, I. Horiba, and N. Sugie, "Fast connected-component labeling based on sequential local operations in the course of forward raster scan followed by backward raster scan," in Proc. ICPR Conf., Aug. 2000, vol. 2, pp. 434-437.


Dept. of Electronics and Communication Engineering, MBCET.