Fast Object Recognition from 3D Depth Data with Extreme Learning Machine


Somar Boubou, Tatsuo Narikiyo and Michihiro Kawanishi

Control System Laboratory, Toyota Technological Institute, Nagoya, Japan

Introduction

Object recognition from RGB-D sensors has recently emerged as a prominent and challenging research topic. Current systems often require large amounts of time to train the models and to classify new data. We propose an effective and fast object recognition approach from 3D data acquired by depth sensors such as the Structure or Kinect sensors. Our contribution in this work is a novel, fast and effective approach for real-time object recognition from 3D depth data:

- First, we extract simple but effective frame-level features, which we call differential frames, from the raw depth data.

- Second, we build a recognition system based on an Extreme Learning Machine classifier with a Local Receptive Field (ELM-LRF).

Feature extraction: Differential frames

Our differential frames are designed to capture the geometric characteristics of the object's 3D surface in order to facilitate object recognition from depth data.

Consider an input depth frame array $D_{U\times V}$ captured by a depth sensor, such as the one shown in Equation (1), where depth information is given as a function of the pixel coordinates.

$$
D =
\begin{bmatrix}
d(1,1) & \cdots & d(U,1) \\
\vdots & d(u,v) & \vdots \\
d(1,V) & \cdots & d(U,V)
\end{bmatrix}
\qquad (1)
$$

Differential values on the u and v axes are defined in terms of the cosines of the angles α and β between the surface normals $n$ of the two neighbouring pixels along each axis, as follows:

$$
\begin{aligned}
d^{\alpha}_{u,v} &= 1 + \cos(\alpha) = 1 + \frac{n_{u-1,v}\cdot n_{u+1,v}}{|n_{u-1,v}|\,|n_{u+1,v}|} \in [0,2] \\
d^{\beta}_{u,v} &= 1 + \cos(\beta) = 1 + \frac{n_{u,v-1}\cdot n_{u,v+1}}{|n_{u,v-1}|\,|n_{u,v+1}|} \in [0,2].
\end{aligned}
\qquad (2)
$$

Finally, $d^{\alpha,\beta}_{u,v} = d^{\alpha}_{u,v} + d^{\beta}_{u,v} \in [0,4]$. Therefore, the differential frame is defined as:

$$
D^{\alpha,\beta} =
\begin{bmatrix}
d^{\alpha,\beta}_{1,1} & \cdots & d^{\alpha,\beta}_{U,1} \\
\vdots & d^{\alpha,\beta}_{u,v} & \vdots \\
d^{\alpha,\beta}_{1,V} & \cdots & d^{\alpha,\beta}_{U,V}
\end{bmatrix}
\qquad (3)
$$
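As an illustration, the following is a minimal NumPy sketch of Equations (2)-(3). The normal-estimation step is an assumption on our part (the poster does not specify how the normals $n_{u,v}$ are obtained); here they are approximated from the depth gradients, and border pixels keep a zero term.

```python
import numpy as np

def surface_normals(depth):
    """Approximate unit surface normals from a depth map.
    Assumption: normals are estimated from depth gradients."""
    dz_dv, dz_du = np.gradient(depth.astype(np.float64))
    n = np.dstack([-dz_du, -dz_dv, np.ones_like(depth, dtype=np.float64)])
    return n / np.linalg.norm(n, axis=2, keepdims=True)

def differential_frame(depth):
    """Compute the differential frame D^{alpha,beta} of Eq. (3)."""
    n = surface_normals(depth)               # rows index v, columns index u
    # cos(alpha): dot product of unit normals of the left/right neighbours (u axis)
    cos_a = np.sum(n[:, :-2] * n[:, 2:], axis=2)
    # cos(beta): dot product of unit normals of the upper/lower neighbours (v axis)
    cos_b = np.sum(n[:-2, :] * n[2:, :], axis=2)
    d_alpha = np.zeros(depth.shape)
    d_beta = np.zeros(depth.shape)
    d_alpha[:, 1:-1] = 1.0 + np.clip(cos_a, -1.0, 1.0)   # d^alpha in [0, 2]
    d_beta[1:-1, :] = 1.0 + np.clip(cos_b, -1.0, 1.0)    # d^beta  in [0, 2]
    return d_alpha + d_beta                               # d^{alpha,beta} in [0, 4]
```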

Extreme Learning Machine classifier

The extreme learning machine (ELM) was initially presented for single-hidden-layer feed-forward neural networks with additive neurons:

- No need for iterative tuning of the hidden-neuron parameters.
- Extremely fast learning speed.
- Good generalization performance.

Output function of the ELM:

$$
f(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\,\beta
$$

- $\beta = [\beta_1, \ldots, \beta_L]^T$: the vector of output weights.
- $L$: the number of nodes in the hidden layer.
- $h_i$: a nonlinear piecewise continuous function, called the activation function of neuron $i$.
- $h(x) = [h_1(x), \ldots, h_L(x)]$: the output vector of the hidden layer with respect to the $d$-dimensional input data $x = [x_1, \ldots, x_d]$.

Now, given a data set with $N$ training samples and $m$ classes, the hidden-layer output matrix $H$ is given as:

$$
H =
\begin{bmatrix}
h(x_1) \\
\vdots \\
h(x_N)
\end{bmatrix}
=
\begin{bmatrix}
h_1(x_1) & \cdots & h_L(x_1) \\
\vdots & \ddots & \vdots \\
h_1(x_N) & \cdots & h_L(x_N)
\end{bmatrix}
\qquad (4)
$$

An efficient closed-form solution based on the orthogonal projection method is:

$$
\beta =
\begin{cases}
H^T \left(\dfrac{I}{C} + H H^T\right)^{-1} T & \text{if } N \le L \\[2ex]
\left(\dfrac{I}{C} + H^T H\right)^{-1} H^T T & \text{if } N > L,
\end{cases}
\qquad (5)
$$

where $T_{N\times m}$ is the training-data target matrix. $1/C$ is a positive value added to the diagonal of $H^T H$ or $H H^T$ in the calculation of the output weights $\beta$, so that the resulting solution is more stable and tends to have better generalization performance.
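A minimal NumPy sketch of Equation (5), assuming one-hot target rows in $T$; the function names and the use of np.linalg.solve instead of an explicit inverse are our choices:

```python
import numpy as np

def elm_output_weights(H, T, C=1.0):
    """Closed-form ELM output weights (Eq. 5).
    H: (N, L) hidden-layer output matrix, T: (N, m) target matrix."""
    N, L = H.shape
    if N <= L:
        # beta = H^T (I/C + H H^T)^{-1} T
        return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    # beta = (I/C + H^T H)^{-1} H^T T
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)

def elm_predict(H, beta):
    """Class prediction: index of the largest output f(x) = h(x) beta."""
    return np.argmax(H @ beta, axis=1)
```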

Local Receptive Field

- Generate an initial weight matrix $A^{init} \in \mathbb{R}^{r^2 \times K}$ from the standard Gaussian distribution.

- Orthogonalize it using the singular value decomposition (SVD) method.

- The columns of the resulting orthogonal matrix $A$, the $a_k$'s, are an orthonormal basis of $A^{init}$.

Figure: A chart of the ELM-LRF implementation.

The convolution node $(i, j)$ in the $k$-th feature map serves here as the node activation function and is given as:

$$
c_{i,j,k} = \sum_{m=1}^{r} \sum_{n=1}^{r} a_{m,n,k}\, x_{i+m-1,\, j+n-1},
\qquad
i = 1, \ldots, (U - r + 1),\quad j = 1, \ldots, (V - r + 1)
\qquad (6)
$$

A square/square-root pooling function is used in this work, and pooling map $k$ is calculated as:

$$
h_{p,q,k} = \sqrt{\sum_{i=p-e}^{p+e} \sum_{j=q-e}^{q+e} c^2_{i,j,k}},
\qquad
c_{i,j,k} = 0 \text{ if } (i,j) \text{ is out of bounds}
\qquad (7)
$$

For one input sample $x \in \mathbb{R}^{d = U\cdot V}$, we simply concatenate the values of all calculated $h_{p,q,k}$ into a row vector $h(x) \in \mathbb{R}^{L = K\cdot(U-r+1)\cdot(V-r+1)}$. Now, by stacking the rows of the $N$ input training samples, we obtain the matrix $H \in \mathbb{R}^{N\times L}$.
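A minimal NumPy sketch of the ELM-LRF hidden layer described above (random orthogonalized filters, the convolution of Eq. 6, and the square/square-root pooling of Eq. 7). The function name, the default values of r, K and e, and the random seed are our assumptions:

```python
import numpy as np

def lrf_hidden_layer(X, r=4, K=8, e=3, seed=0):
    """ELM-LRF hidden layer.
    X: (N, V, U) stack of input frames (e.g., differential frames).
    Returns H of shape (N, K*(V-r+1)*(U-r+1))."""
    rng = np.random.default_rng(seed)
    A_init = rng.standard_normal((r * r, K))        # requires r*r >= K
    # Orthonormalize: the columns of A are an orthonormal basis of A_init
    A, _, _ = np.linalg.svd(A_init, full_matrices=False)
    A = A.reshape(r, r, K)

    N, V, U = X.shape
    Hu, Hv = U - r + 1, V - r + 1
    H = np.empty((N, K * Hv * Hu))
    for s in range(N):
        # Convolution maps c_{i,j,k} (Eq. 6, valid convolution)
        C = np.zeros((Hv, Hu, K))
        for m in range(r):
            for n in range(r):
                C += X[s, n:n + Hv, m:m + Hu, None] * A[m, n, :]
        # Square/square-root pooling (Eq. 7); out-of-bound terms are zero
        C2 = np.pad(C ** 2, ((e, e), (e, e), (0, 0)))
        S = np.zeros_like(C)
        for dv in range(2 * e + 1):
            for du in range(2 * e + 1):
                S += C2[dv:dv + Hv, du:du + Hu, :]
        H[s] = np.sqrt(S).ravel()
    return H
```

Schematically, the pieces sketched so far chain as H = lrf_hidden_layer(np.stack([differential_frame(d) for d in depth_frames])), followed by beta = elm_output_weights(H, T) for training and elm_predict(H_test, beta) for classification.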

Results: Self-collected dataset

Depth Features               Accuracy (10-fold CV)   Training time (sec)   Testing time (sec)
HONV (LinSVM)                69.13%                  0.56                  0.03
DHONV (LinSVM)               79.75%                  0.42                  0.03
Raw depth frames (ELM-LRF)   96.25%                  0.306                 0.003
Presented approach           98.75%                  0.234                 0.003

- The dataset contains a total of 800 sample depth images.
- Input frame size: 320 × 240 pixels.
- Random viewpoints located on a horizontal plane around the targeted object.
- Full, unsegmented depth images.

Figure: Objects of the laboratory dataset: (a) electric-kettle, (b) water-pitcher, (c) pill-bottle, (d) tape-dispenser, (e) tennis-ball, (f) robo-car, (g) haptic-device, and (h) paper-box.

Results: RGB-D benchmark dataset

- The RGB-D object dataset consists of 51 categories containing 300 objects and around 50,000 sample depth images in total.

- Data was recorded with a camera mounted at three different heights, approximately 30°, 45° and 60° above the horizon.

- A reference pose is chosen for each category (e.g., the handles of the coffee mugs are always on the right at 0°).

- Irregular frame size.
- Pre-segmented objects.
- Three experimental schemes are used:
  - Category recognition.
  - Alternating Contiguous Frames (ACF).
  - Leave-Sequence-Out (LSO).

Depth Features                  Category   Instance (ACF)   Instance (LSO)
Spin-images                     53.1%      32.4%            32.3%
RF                              66.8%      52.7%            45.5%
kSVM                            64.7%      51.2%            46.2%
IDL                             70.2%      54.8%            N/A
HONV (LinSVM)                   38%        47.69%           32.6%
DHONV (LinSVM)                  58.8%      69.98%           44.1%
CNN-RNN                         78.9%      N/A              N/A
Kernel depth                    78.8%      N/A              54.3%
Unsupervised Feature Learning   81.2%      N/A              51.7%
Shape HKDES (depth)             65.8%      N/A              36.7%
Gradient HKDES (depth)          70.8%      N/A              39.3%
Fus-CNN (HHA)                   83.0%      N/A              N/A
Fus-CNN (jet)                   83.8%      N/A              N/A
Raw depth frames (ELM-LRF)      52.07%     65.79%           51.34%
Presented approach              57.88%     70.08%           60.00%

Conclusions

- We tested our presented approach on two datasets:
  - A self-collected dataset.
  - The public RGB-D object dataset.

- Our approach proved to be more robust to changes of the viewpoint.
- Lower computational time and better classification performance make the presented approach readily applicable to online recognition applications.

http://www.toyota-ti.ac.jp emails: {boubou,n-tatsuo,kawa}@toyota-ti.ac.jp