
Vision Based Person Detection for Safe Navigation of Commercial Vehicle

Songlin Piao and Karsten Berns

University of Kaiserslautern, 67663, Germany, [email protected], [email protected], https://agrosy.informatik.uni-kl.de

Abstract. A vision-based solution for safe navigation of commercial vehicles with a fish-eye camera is presented in this paper. This work aims to develop a system that detects persons and objects around commercial vehicles in order to prevent accidents. This is achieved by integrating a classifier-based window-searching algorithm and an ego-motion-based algorithm into the system. The classifier is trained using a cascaded support vector machine, and the ego-motion is estimated solely from the images captured by the camera. Test results show that the designed system provides a high detection rate when applied to a commercial vehicle.

1 Introduction

Nowadays, safety is becoming more and more important in autonomous mobile robots. The word "autonomous" is gaining popularity in the mobile vehicle industry, and one of the main challenges ahead of the industry is to improve safety during autonomous operation. Vehicle manufacturers try to reduce the accident rate by integrating a diverse range of safety techniques into their products. However, this requirement is not easy to fulfill for commercial vehicles. There were 4216 fatalities caused by large trucks and buses in the U.S. in 2012 according to the Motor Carrier Safety Progress Report [1]. The reasons for such accidents can be categorized into two main types. First, there are many blind areas around these machines that the driver cannot observe directly because of the limited field of view, the tremendous size and the diverse shapes of the vehicles. Second, although various sensors are mounted around the vehicle to cover blind spots, these sensory systems possess only the functionality of measuring data and are not able to process them at the same time. These practical problems motivated us to design and develop a safety system for such scenarios.

As with other object detection tasks, human detection is a challenging endeavor because of sensing noise, environmental variations, similarity to the background signal, appearance variability and unpredictability, similarity to other people, and active deception, as stated in [2]. Environmental variation may be the most significant factor, since unexpected or abrupt changes in the outdoor environment directly affect the quality of the sensed data. For example, sudden variations of lighting, or foggy, snowy or rainy weather conditions could introduce tremendous noise into the sensory system, especially into vision systems. Such factors directly lead to a higher false positive rate and a lower detection rate. The false positive rate determines the usability of the safety system, whereas the detection rate determines its performance.

Various types of sensors have been used to detect humans so far. Radar sensors are widely utilized in the automotive industry nowadays, since they were first successfully deployed in the military decades ago. The typical long-range radar sensor from Bosch is already in its 4th generation since its first introduction to the market in the 1990s [3]. Adaptive Cruise Control, which is available in most cars nowadays, employs this sensor. Radar sensors are also applied to human detection in [4], [5]. There are also system implementations that use laser scanners and thermal infrared systems as alternatives. Braun et al. detected human faces with an RGB camera and tracked human legs using a laser scanner in [6], and Fuerstenberg developed a pedestrian protection system with only a laser scanner in [7]. Although all of these sensors can achieve a relatively high detection rate, they lose most of the visible information, so they are not well suited for short-range object detection. In contrast, vision sensors capture all visible data and are relatively cheap. Given the real-time requirements, one of the main problems is the reduction of the computational effort of the detection algorithm. We address this problem by applying a linear support vector machine (SVM) and integral images.

Regarding detection algorithms, a classifier needs to be trained offline for the specific object class. Two kinds of greedy searching methods are mainly used. One is to fix the size of the searching window and to change the size of the whole image in each pyramid layer. The other is to fix the size of the whole image and to adapt the size of the searching window for each scanning level. Dalal and Triggs used histograms of oriented gradients (HOG) to detect humans [8]. They trained a human detection classifier using a linear SVM and used the first type of searching method to scan the whole image. In contrast, Viola and Jones applied Haar-like features and the second searching strategy mentioned above to scan for face regions in the image [9]. Their classifier was trained using AdaBoost and the program could run in real time. The trick they used is the integral image, which allows computing the sum of the pixels inside a specific region in constant time. The performance of a detection algorithm depends on the features used to extract the descriptor, the length of the descriptor, the learning algorithm and the window scanning strategy.
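To make the two strategies concrete, the following sketch implements the first one (fixed window size, rescaled image per pyramid layer) in Python; the window size, scale factor, stride and the classify callback are illustrative assumptions, not values taken from this work.

```python
import cv2

def scan_image_pyramid(image, classify, win_h=108, win_w=36,
                       scale=1.2, stride=8):
    """Fixed-size window, variable image size: rescale the image for each
    pyramid layer and slide the window over it (hypothetical parameters)."""
    detections = []
    level = 0
    while True:
        factor = scale ** level
        h, w = int(image.shape[0] / factor), int(image.shape[1] / factor)
        if h < win_h or w < win_w:
            break
        resized = cv2.resize(image, (w, h))
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                window = resized[y:y + win_h, x:x + win_w]
                if classify(window):  # user-supplied classifier, e.g. a linear SVM
                    # map the window back to original image coordinates
                    detections.append((int(x * factor), int(y * factor),
                                       int(win_w * factor), int(win_h * factor)))
        level += 1
    return detections
```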

The remainder of this paper is organized as follows. Section 2 introduces the fish-eye camera and the mathematical model used in the system. Section 3 discusses the local binary pattern (LBP) feature applied in the system and the window scanning algorithm. Section 4 introduces the training procedure of the linear and histogram intersection kernel SVMs and their potential problems in more detail. Section 5 shows the process of estimating the ego-motion of a vehicle relative to the ground plane based on the vision sensor, and the principles of detecting foreground objects based on this ego-motion. The experimental results are described in Section 6, and Section 7 gives the conclusion and future work.


Fig. 1. Vision System Overview and Calibration Results: (a) Mobotix Camera, (b) Captured Image, (c) Re-projection Error, (d) Calibration

2 Fish-Eye Vision Sensor

One of the big advantages of a fish-eye camera is its 360-degree view range. Compared with stereo cameras, which need considerable time to compute a depth image in the preprocessing step, a single calibration step is sufficient for a fish-eye camera. Once the camera is calibrated and its mounting height is known, the distance between a detected object and the ground point below the camera center can be estimated.
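The following sketch illustrates this geometric idea, assuming the calibration provides a viewing ray for each image point and a frame whose z-axis points toward the ground; the mounting height and the function names are hypothetical.

```python
import numpy as np

CAMERA_HEIGHT_M = 2.5  # assumed mounting height above the ground plane

def ground_distance(ray_dir, camera_height=CAMERA_HEIGHT_M):
    """Distance from the point below the camera to where the viewing ray hits
    the ground plane. `ray_dir` is a unit vector in a camera-centered frame
    whose z-axis points straight down (hypothetical convention)."""
    ray = np.asarray(ray_dir, dtype=float)
    if ray[2] <= 0:
        return None                      # ray never reaches the ground
    scale = camera_height / ray[2]       # stretch the ray until it hits the ground
    hit = ray * scale
    return float(np.hypot(hit[0], hit[1]))  # horizontal range on the ground
```

In practice, `ray_dir` would come from the pixel-to-ray mapping provided by the calibration toolbox of [12].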

There is only limited literature available on fish-eye cameras. Scaramuzza et al. proposed omni-directional visual odometry in [10], and Gandhi and Trivedi used an omni-camera to detect interesting events such as independently moving persons and vehicles in [11] by estimating a parametric planar motion model. Neither of them focused on solving safety issues with such a sensor.


The vision sensor used in this work is a Mobotix Hemispheric Camera (www.mobotix.com) as shown in Fig. 1(a). It provides a 180-degree horizontal field of view and a 160-degree vertical field of view. One troublesome problem is that humans rarely appear in an upright posture in the fish-eye image, whereas most state-of-the-art algorithms assume that humans are upright in the image. Therefore, a pre-processing step is necessary to obtain a normal view. We use the algorithm and toolbox from [12] to calibrate the camera and obtain the mapping between world coordinates and camera coordinates. Fig. 1(b) is one of the chessboard images used for calibration; Fig. 1(c) shows the distribution of the re-projection error of each point over all chessboard images, where different colors correspond to different images of the chessboard; Fig. 1(d) displays the position of every chessboard with respect to the reference frame of the fish-eye camera. With these parameters, the critical area predefined inside the omni-view can be mapped to a normal view. In this procedure the reverse mapping technique is chosen to overcome aliasing effects. The final image processing is done on an image of size 720 by 240 pixels as shown on the right-hand side of Fig. 2. The width of the right image corresponds to the angle in the left one, and the height of the right image corresponds to the radius in the left one. It is not necessary to map the whole omni-view image into the perspective view, only the interesting area, because the distortion is very large for pixels near the center of the omni-view. In regions of large distortion, sensors like ultrasonic or other vision-based algorithms can be applied to detect objects. Each region of interest can be represented by four parameters: angle_min, angle_max, radius_min and radius_max. The units of angle and radius are radian and pixel, respectively. These values are set to 2.2236 rad, 6.1596 rad, 40.0 px and 270.0 px in the designed system.
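A minimal sketch of this reverse mapping with the parameters above, assuming a simple polar model around a given omni-view center; the actual system relies on the calibrated fish-eye model from [12] rather than this simplification.

```python
import cv2
import numpy as np

# Region of interest from the text above
ANGLE_MIN, ANGLE_MAX = 2.2236, 6.1596     # radian
RADIUS_MIN, RADIUS_MAX = 40.0, 270.0      # pixel
OUT_W, OUT_H = 720, 240                   # width ~ angle, height ~ radius

def unwrap_omni(omni_img, center):
    """Reverse mapping: for every pixel of the target landscape view, look up
    the corresponding pixel in the omni-view (simplified polar model)."""
    cx, cy = center
    u = np.arange(OUT_W, dtype=np.float32)
    v = np.arange(OUT_H, dtype=np.float32)
    angle = ANGLE_MIN + (ANGLE_MAX - ANGLE_MIN) * u / (OUT_W - 1)
    radius = RADIUS_MIN + (RADIUS_MAX - RADIUS_MIN) * v / (OUT_H - 1)
    ang, rad = np.meshgrid(angle, radius)             # shape (OUT_H, OUT_W)
    map_x = (cx + rad * np.cos(ang)).astype(np.float32)
    map_y = (cy + rad * np.sin(ang)).astype(np.float32)
    # cv2.remap samples the source image at (map_x, map_y) with interpolation,
    # which avoids the aliasing that forward mapping would introduce
    return cv2.remap(omni_img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```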

Fig. 2. Projecting the fish-eye camera image to a landscape view

3 Classifier based Detection



Fig. 3. Local Binary Pattern Descriptor: (a) patch divided into grids with the scanning block, (b) computation of the LBP index for a pixel

A lot of work on vision-based human detection has been done by other researchers. In the past, template-matching-based methods were used to detect humans inside the image, but recently descriptor-based methods have become the preferred choice for human detection. A human detection classifier is trained by learning positive and negative descriptors extracted from the samples of a training dataset. Then a greedy searching strategy is applied to scan the whole image with the pre-trained classifier to judge the existence of a human at a specific position. Since HOG [8] was first proposed in 2005, many descriptors have been proposed to obtain an optimized balance between speed and accuracy. Some researchers tried to use GPUs to obtain higher speed, but these implementations are computationally demanding and therefore not suitable for an on-board robotic system. Wu et al. proposed a real-time human detection algorithm using contour cues with the CENTRIST descriptor [13]. Inspired by [13], contour information is also used in the designed system to extract the human descriptor, but the local binary pattern (LBP) replaces the original CENTRIST feature. As seen in Fig. 3, each patch (here a Sobel edge image) is divided into several grids, and a scanning block (the green box in Fig. 3) slides over the whole patch and builds the final query descriptor. In the case of Fig. 3, the patch is divided into 3 by 4 small grids. For each block position, the length of the LBP descriptor is 256, and the total length of the final descriptor is 256 × 6 = 1536, since there are 2 × 3 = 6 possible positions of the block. For each pixel inside the block, the index of the LBP descriptor is computed as shown in Fig. 3(b). The index can be represented as (C1C2C3C4C5C6C7C8)_2, where C_i is set to 1 if the corresponding neighbor pixel value is higher than the current pixel value and to 0 otherwise. A cascaded detector that combines a linear SVM and a histogram intersection kernel SVM (HIK SVM) is trained at the end, as shown in Fig. 4. Human detectors with patch sizes of 90 by 40 and 108 by 36 pixels have been trained, respectively.
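The descriptor construction described above can be sketched as follows; the 2 × 2-cell block, the Sobel variant and all names are our own illustrative choices.

```python
import cv2
import numpy as np

def lbp_codes(patch):
    """8-neighbor LBP code for every interior pixel of a grayscale patch."""
    c = patch[1:-1, 1:-1]
    codes = np.zeros_like(c, dtype=np.uint16)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(shifts):
        neighbor = patch[1 + dy: patch.shape[0] - 1 + dy,
                         1 + dx: patch.shape[1] - 1 + dx]
        codes |= (neighbor > c).astype(np.uint16) << bit
    return codes

def lbp_descriptor(gray_patch, grid=(4, 3), block=(2, 2)):
    """Concatenate 256-bin LBP histograms of a block sliding over the grid
    cells: with a 3-by-4 grid and a 2x2-cell block this yields 6 x 256 = 1536."""
    edge = cv2.Sobel(gray_patch, cv2.CV_8U, 1, 0)   # simplified edge patch
    codes = lbp_codes(edge)
    rows, cols = grid
    ch, cw = codes.shape[0] // rows, codes.shape[1] // cols
    feats = []
    for by in range(rows - block[0] + 1):
        for bx in range(cols - block[1] + 1):
            window = codes[by * ch:(by + block[0]) * ch,
                           bx * cw:(bx + block[1]) * cw]
            hist, _ = np.histogram(window, bins=256, range=(0, 256))
            feats.append(hist)
    return np.concatenate(feats).astype(np.float32)
```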


Fig. 4. Cascade in Detection

4 Training Procedure

SVM has proven to be a good option for binary classification problems. As discussed in the previous section, two types of SVM are applied in the system: one is the linear SVM and the other is the HIK SVM. Given a training set consisting of a positive set P and a negative set N, the linear SVM is trained by an iterative approach. Each sample in the training set is represented by the extracted feature vector described in the previous section. In the first iteration, the classifier is trained on P and a randomly selected subset N_1 of N. We denote this classifier as L_1. Similarly, L_2 is trained on P and N_2, where N_2 is the set of hard examples from the negative set N produced by classifier L_1. In general, classifier L_n is trained on P and N_n, where N_n is the collection of hard examples that are classified as positive by all the classifiers L_1, ..., L_{n-1}. The final linear SVM is then trained on the positive sample set P and the union of the negative sample sets from all iterations, denoted as N_final = N_1 ∪ ··· ∪ N_n.
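A minimal sketch of this bootstrapping procedure using scikit-learn's LinearSVC as the linear SVM; the number of rounds, the subset size and the regularization constant are assumptions, not the values used in our implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_with_hard_negatives(pos, neg, rounds=3, subset_size=5000, seed=0):
    """pos, neg: 2-D arrays of descriptors (one row per sample)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(neg), size=min(subset_size, len(neg)), replace=False)
    used_neg = neg[idx]                         # N_1: random negative subset
    clf = None
    for _ in range(rounds):
        X = np.vstack([pos, used_neg])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(used_neg))])
        clf = LinearSVC(C=0.01).fit(X, y)
        hard = neg[clf.predict(neg) == 1]       # negatives classified as positive
        if len(hard) == 0:
            break
        used_neg = np.vstack([used_neg, hard])  # accumulate N_1 ∪ ... ∪ N_n
    return clf
```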

However, this linear SVM has a problem. If the final trained classifier L_final is applied to another negative test dataset N_test, it is very likely to produce hard examples inside the indistinct area shown in Fig. 5(b). This indistinct area can be depicted by the positive and negative support vectors as shown in Fig. 5(a). The HIK SVM is designed to solve this problem: it is trained on the positive set P and the hard examples produced on the new negative test set N_test by the final linear SVM classifier L_final. By cascading these two types of SVM, the false positive rate can be reduced to a much lower level while the precision of the classifier remains as high as possible; however, this cascaded structure does not affect the recall rate. According to our research, the recall rate mainly depends on the selection of the positive training set P. More details about the recall rate are discussed in the experiment section.
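At detection time, the cascade can be sketched as below: the fast linear SVM rejects most windows and only the remaining candidates are evaluated by the HIK SVM. The custom kernel callable and the thresholds are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def hik(X, Y):
    """Histogram intersection kernel: k(x, y) = sum_i min(x_i, y_i)."""
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

def train_hik_stage(pos, hard_neg):
    X = np.vstack([pos, hard_neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(hard_neg))])
    return SVC(kernel=hik).fit(X, y)            # second, slower stage

def cascade_predict(linear_clf, hik_clf, descriptors, linear_margin=0.0):
    """Stage 1: the linear SVM rejects most windows cheaply.
       Stage 2: the HIK SVM confirms the remaining candidates."""
    scores = linear_clf.decision_function(descriptors)
    mask = scores > linear_margin
    out = np.zeros(len(descriptors), dtype=bool)
    if mask.any():
        out[mask] = hik_clf.predict(descriptors[mask]) == 1
    return out
```

Here `linear_clf` is, for example, the classifier returned by the bootstrapping sketch above.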


Fig. 5. SVM Training: (a) positive and negative support vectors delimiting the indistinct area, (b) hard examples inside the indistinct area

5 Ego-Motion based Detection

Ego-motion is very important for vehicle localization, especially in the absence of GPS information or when GPS updates only slowly. Here, it is used to detect objects with independent motion or with a height that differs noticeably from the ground plane. Gandhi and Trivedi proposed a similar method in [11], but they considered a parametric planar motion model, which means they first transformed the omni-view to a perspective view and then estimated the homography. In contrast, the homography matrix is estimated directly in the proposed system, which is faster than the previous approach. If X_1 and X_2 are the coordinates of the same scene point in camera frames 1 and 2, then their relationship can be represented as

\[ X_2 = R X_1 + T \tag{1} \]

where R is the rotation matrix and T the translation vector. If X_1 is assumed to lie on the plane with normal vector n ∈ R^3, then the relationship can be written as

\[ n^T X_1 = h, \tag{2} \]

where h is the distance from the camera center to the plane. Substituting Eq. (2) into Eq. (1), we obtain

\[ X_2 = \left( R + \frac{T n^T}{h} \right) X_1. \tag{3} \]

Eq. (3) defines the homography relationship between X_1 and X_2, which can simply be written as

\[ X_2 = H X_1. \tag{4} \]

It is important to mention that the scaling factor is already considered in Eq. (4).


Fig. 6. Ego Motion Example: (a) tracked feature points, (b) candidate foreground regions

First, good corner points are calculated by the Shi-Tomasi corner detection algorithm [14], and these points are then tracked by the optical-flow-based KLT feature tracker. The result of this first part is shown in Fig. 6(a). In the next step, the random sample consensus (RANSAC) algorithm is applied to estimate an accurate homography matrix and to filter out possible outliers.
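A minimal sketch of this estimation step with OpenCV, assuming two consecutive grayscale frames; the parameter values are placeholders rather than the ones used in the implementation.

```python
import cv2

def estimate_homography(prev_gray, curr_gray):
    """Shi-Tomasi corners + KLT tracking + RANSAC homography between two frames."""
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                       qualityLevel=0.01, minDistance=8)
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   prev_pts, None)
    ok = status.ravel() == 1                      # successfully tracked points
    src = prev_pts[ok].reshape(-1, 2)
    dst = curr_pts[ok].reshape(-1, 2)
    # RANSAC rejects correspondences that do not fit the ground-plane motion
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H, src, dst, inlier_mask
```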

After the final homography matrix H has been estimated, a back projection from the current frame to the previous frame is performed using X_1 = H^{-1} X_2 to obtain the corresponding point in the previous frame. Using Eq. (5), regions that differ strongly from the current pixel values can be found. The result is shown in Fig. 6(b), where red points mark regions that may contain foreground.

\[ |f(X_2) - f(X_1)| > \theta, \quad X_1 \in \mathrm{Frame}_{\mathrm{previous}} \tag{5} \]
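Assuming the homography H from the previous sketch, the back projection and the threshold test of Eq. (5) can be expressed compactly by warping the previous frame and differencing; the threshold value is an assumption.

```python
import cv2

def foreground_mask(prev_gray, curr_gray, H, theta=30):
    """For every current pixel X2, compare f(X2) with f(H^-1 X2) in the
    previous frame; cv2.warpPerspective performs exactly this lookup."""
    h, w = curr_gray.shape
    predicted = cv2.warpPerspective(prev_gray, H, (w, h))  # prev warped into current frame
    diff = cv2.absdiff(curr_gray, predicted)
    return diff > theta                                    # boolean foreground candidates
```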

Post-processing is needed after this step. First, erosion and dilation operations are applied in order to filter out very small spurious regions. Second, a contour searching algorithm is applied to the resulting image to extract the possible shape of each object. Third, several conditions are checked for each detected contour; a sketch of the whole post-processing chain is given after the list. The conditions are summarized below:

– If the size of the contour is too small or too big, it is filtered out.
– If the ratio between the area of the contour and the area of the minimum enclosing circle of that contour is too small, the contour is filtered out.
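A sketch of this post-processing chain, assuming the boolean foreground mask from the previous sketch; the kernel size, area bounds and circularity ratio are placeholder values.

```python
import cv2
import numpy as np

def filter_contours(mask, min_area=150, max_area=20000, min_ratio=0.3):
    """Erode/dilate the foreground mask, then keep only contours whose size
    and compactness (area vs. minimum enclosing circle) look object-like."""
    img = mask.astype(np.uint8) * 255
    kernel = np.ones((3, 3), np.uint8)
    img = cv2.dilate(cv2.erode(img, kernel), kernel)      # remove small noise
    found = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = found[0] if len(found) == 2 else found[1]  # OpenCV 3/4 compatibility
    kept = []
    for c in contours:
        area = cv2.contourArea(c)
        if area < min_area or area > max_area:
            continue                                      # too small or too big
        (_, _), radius = cv2.minEnclosingCircle(c)
        circle_area = np.pi * radius * radius
        if circle_area > 0 and area / circle_area >= min_ratio:
            kept.append(c)
    return kept
```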

Fig. 7 shows the result of the ego-motion-based algorithm. Fig. 7(a) shows the analyzed optical flow; blue points are the outliers identified by the RANSAC algorithm. Most of the outliers lie on the boundary of the image and on the vehicle itself. Fig. 7(b) shows the intermediate result after back projection with the homography matrix H. There are many red points on the boundary of the vehicle because of vibration. Fig. 7(c) shows the final result, where possible contours are colored white and rejected contours are colored gray.

The test video used in Fig. 7 consists of 198 frames, in which a total of 76 people appear around the machine. Our ego-motion algorithm successfully detects humans 65 times with 2 false detections, which means that the precision and recall of the proposed algorithm on this video are about 97% and 86%, respectively.


Fig. 7. Ego Motion Result: (a) Optical Flow, (b) Foreground, (c) Final Result


6 Experiments

Currently, there is no public benchmark dataset available for fish-eye cameras; therefore, we created the ground truth ourselves. There are in total 430 frames in this test sequence. The test results of the designed human detection algorithm and the ego-motion-based algorithm are shown in Table 1.

\[ \frac{\mathrm{Area}(R_q \cap R_g)}{\mathrm{Area}(R_q \cup R_g)} > 0.5 \tag{6} \]

The matching criterion from Eq. (6) is used to compare two rectangles (the detected rectangle R_q and the ground-truth rectangle R_g) for the detection algorithm. The overall recall of the detection algorithm is 55.47%, but it rises to 70.46% when the distance between the human and the camera is in the range of 1-3 m, because the distortion in the transformed image is minimal in this range. Here the distance is measured from the object to the point on the ground directly below the camera center.
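The criterion in Eq. (6) is a standard intersection-over-union test; a small sketch for axis-aligned rectangles given as (x, y, w, h), with names of our own choosing:

```python
def boxes_match(rq, rg, threshold=0.5):
    """Return True if the intersection-over-union of two (x, y, w, h)
    rectangles exceeds the matching threshold of Eq. (6)."""
    ax, ay, aw, ah = rq
    bx, by, bw, bh = rg
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return union > 0 and inter / union > threshold
```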


Fig. 8 shows the recall rate of the human detection algorithm in different ranges in more detail. The ranges used in the test are 0-1 m, 1-2 m, 2-3 m, 3-4 m, 4-5 m and 5-8 m. Normally, there is large distortion in the transformed image at near range; on the other hand, if the distance from the camera is too large, objects appear too small to be detected. This is why the recall rates in the ranges from 0 m to 1 m and from 5 m to 8 m are very low. The recall rate in the range from 3 m to 4 m is relatively lower than in the neighboring ranges because two humans appeared in this range, and the occluded human was counted as a missed object.

Fig. 8. Recall Rate Analysis in Different Ranges

The ego-motion-based detector is designed to increase the recall rate within the range where the distortion is very large, so that the system can still detect humans where the normal human detection algorithm does not work well. The recall rate of the ego-motion-based detector reaches 91.38% on the test set. The correctness of this algorithm depends on the estimated homography matrix H from Eq. (4). In this test set, a total of 429 homography matrices H were computed, of which 37 were falsely estimated. This happens because of abrupt changes of sunlight, the shadow of the vehicle and the vibration of the vehicle.

                              Detection (Overall)   Detection (1 m - 3 m)   Ego Motion
Recall                        55.47%                70.46%                  91.38%
False Positives Per Frame     0.27                  0.20                    0.15
Average Time Per Frame        16.17 ms              16.17 ms                30.80 ms

Table 1. Test Results

Fig. 9 gives a visual impression of the designed system. After the detection of humans, the distance from the vehicle and the height of each detected human are estimated using the camera calibration information as in Fig. 9(a), and the results are then drawn into the original image as illustrated in Fig. 9(b). Fig. 9(c) shows a radar-like view in which the objects around the vehicle can be perceived.

Fig. 9. Visualization of the detection results: (a) estimated distance and height of each detected human, (b) detections drawn in the original image, (c) radar-like view of objects around the vehicle

7 Conclusions and Future Work

A safety system for commercial vehicles using a fish-eye camera has been proposed. The significant contributions of this paper can be summarized as follows. This is the first work that employs a fish-eye camera to address the safety issues of commercial vehicles. Classifier-based and ego-motion-based detection approaches are introduced, implemented and compared against each other based on experimental results. In order to further improve the detection rate on commercial vehicles, a fusion of both methods is a next target, as well as learning objects from different viewing angles.



Acknowledgements

This research has been supported by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 285417 and by the German Federal Ministry of Education and Research (BMBF).

References

1. Administration, F.M.C.S.: Motor carrier safety progress report (as of September 30, 2013). Technical report

2. Teixeira, T., Dublon, G., Savvides, A.: A survey of human-sensing: Methods for detecting presence, count, location, track, and identity. (2010)

3. Bosch: Fourth-generation long-range radar

4. Chang, S., Mitsumoto, N., Burdick, J.: An algorithm for UWB radar-based human detection. In: Radar Conference, 2009 IEEE. (May 2009) 1–6

5. Tutusaus, M., Koponen, S.: Evaluation of Automotive Commercial Radar for Human Detection. (2008)

6. Braun, T., Szentpetery, K., Berns, K.: Detecting and following humans with a mobile robot. In: Proceedings of the EOS Conference on Industrial Imaging and Machine Vision. (June 2005)

7. Fuerstenberg, K.: Pedestrian protection using laser scanners. In: Intelligent Transportation Systems, 2005. Proceedings. 2005 IEEE. (Sept 2005) 437–442

8. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Schmid, C., Soatto, S., Tomasi, C., eds.: International Conference on Computer Vision & Pattern Recognition. Volume 2., INRIA Rhone-Alpes, ZIRST-655, av. de l'Europe, Montbonnot-38334 (June 2005) 886–893

9. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vision 57(2) (May 2004) 137–154



10. Scaramuzza, D., Siegwart, R.: Appearance-guided monocular omnidirectional visual odometry for outdoor ground vehicles. Robotics, IEEE Transactions on 24(5) (2008) 1015–1026

11. Gandhi, T., Trivedi, M.: Parametric ego-motion estimation for vehicle surround analysis using an omnidirectional camera. Machine Vision and Applications 16 (2005) 85–95

12. Scaramuzza, D., Martinelli, A., Siegwart, R.: A flexible technique for accurate omnidirectional camera calibration and structure from motion. In: Proceedings of the Fourth IEEE International Conference on Computer Vision Systems. ICVS '06, Washington, DC, USA, IEEE Computer Society (2006) 45–

13. Wu, J., Tan, W.C., Rehg, J.M.: Efficient and effective visual codebook generation using additive kernels. J. Mach. Learn. Res. 999888 (November 2011) 3097–3118

14. Shi, J., Tomasi, C.: Good features to track. In: 9th IEEE Conference on Computer Vision and Pattern Recognition, Springer-Verlag (1994)