A multi-view recurrent neural network for 3D mesh segmentation

Truc Le, Giang Bui, Ye Duan
Department of Computer Science, University of Missouri, Columbia, MO 65211
Corresponding author: Ye Duan ([email protected])

Computers & Graphics 66 (2017) 103–112. Special Issue on SMI 2017. doi: 10.1016/j.cag.2017.05.011. © 2017 Elsevier Ltd. All rights reserved.
Article history: received 30 March 2017; revised 17 May 2017; accepted 25 May 2017; available online 13 June 2017.
Keywords: mesh segmentation; multi-view; 3D deep learning; CNN; RNN; LSTM

Abstract: This paper introduces a multi-view recurrent neural network (MV-RNN) approach for 3D mesh segmentation. Our architecture combines a convolutional neural network (CNN) and a two-layer long short-term memory (LSTM) to yield a coherent segmentation of 3D shapes. The image-based CNN effectively generates per-view edge probability maps, while the LSTM correlates these edge maps across different views and outputs a well-defined per-view edge image. Evaluations on the Princeton Segmentation Benchmark dataset show that our framework significantly outperforms other state-of-the-art methods.

1. Introduction

Mesh segmentation has been a classical yet challenging problem in computer graphics for many decades. The segmentation problem is ill-posed, and there is no general objective measurement that can be applied universally: judging the quality of a segmentation largely depends on the application. For instance, in a LiDAR scan of an urban environment, a desired segmentation should distinguish between different instances of buildings, people, cars, trees, ground, etc. In a part-based annotation of a 3D model (e.g. a human), however, the requirement is usually to segment the head, torso, left/right arms and left/right legs, and sometimes finer details such as the thumb or index finger, depending on the specific task. Consequently, in the scope of this paper, we tackle mesh segmentation with a data-driven approach. Given a training dataset of input meshes and the corresponding desired segmentations, we design a deep learning framework that learns the segmentation patterns in the training data so that it can segment an unseen mesh. As a result, we make no geometric or topological assumptions about the shape, nor do we exploit any hand-crafted descriptors.

In this paper, we propose a multi-view recurrent neural network (MV-RNN) deep learning framework for segmenting 3D models which significantly outperforms prior methods on the Princeton Segmentation Benchmark dataset [1]. It is worth mentioning that our goal is to partition the 3D model, not to perform semantic segmentation. In semantic segmentation, the two wings of an airplane are assigned a single label, wing. In mesh segmentation, on the other hand, the two wings belong to two different regions and carry no semantic label. In general, semantic segmentation provides a better understanding of a 3D model. However, mesh segmentation still has its merits, such as guiding mesh processing algorithms including skeleton extraction [2,3], modeling [4], morphing [5], shape-based retrieval [6] and texture mapping [7]. Moreover, in contrast to semantic segmentation, which requires a fixed set of semantic labels, many mesh segmentation algorithms can be generalized to unseen object categories.
As a result, instead of identifying the surface area of the 3D model within a segment, we predict its boundary (or edge). The benefits of doing so are twofold. First, it is usually more expensive to obtain dense surface annotations from humans than boundary annotations. Second, we only have two semantic labels, i.e. boundary versus non-boundary, which is simpler for the framework to learn than hundreds of semantic labels (e.g. hand, torso, leg, head, etc.). In fact, detecting 3D edges could also be useful for other tasks such as suggestive contours [8,9] and ridge-valley detection [10].

Our approach belongs to the multi-view paradigm, which has recently shown success in many visual recognition tasks such as classification and segmentation [11–14]. Typically, in multi-view segmentation, a 3D model is rendered from multiple views to generate multi-view images, each of which is fed forward through a (shared-weights) convolutional neural network to obtain densely labeled images before being mapped back to 3D. In general, a multi-view approach for segmentation must overcome several technical obstacles. Firstly, there must be enough views to minimize occlusions and cover the shape surface.



Fig. 1. Given a 3D model, we try to detect the boundaries between segments using a multi-view approach. For visualization, we apply non-maximum suppression [15] to the MV-CNN results shown in the second row. The main drawback of MV-CNN is its inconsistency across multiple views (e.g. the elbow and arm regions). Our MV-RNN, on the other hand, can correlate multiple views and generate more coherent results.

This can be achieved by generating a large number of views equally distributed around the object. Secondly, shape parts can be visible from more than one view; thus, the proposed method should effectively correlate information from multiple views. The main drawback of existing multi-view approaches such as the multi-view convolutional neural network (MV-CNN) [11,12] is that the different views may not be correlated, and hence a 3D area may receive totally different outcomes in different views. Consider the example of a standing person rotating counter-clockwise (Fig. 1). When the view is front facing, the boundary between the torso and the right arm is a real boundary. At a certain point, the right arm starts to be occluded. Then the boundary between the torso and the right arm is no longer a real boundary, but the MV-CNN cannot distinguish the two cases because it does not correlate the results over different views.

We propose the MV-RNN to overcome this limitation by treating the ordered multiple views as a temporal sequence and applying a recurrent neural network to capture the redundancy between adjacent views. More specifically, in this paper we employ the long short-term memory (LSTM) as the recurrent neural unit. The multi-view outputs from the CNN are correlated through a two-layer LSTM to obtain consistent fine-detail responses for every view. Finally, the boundary pixels are back-projected onto the 3D shape surface, followed by region growing and a Conditional Random Field (CRF), to obtain the final segmentation. The main contribution of our paper is the MV-RNN, which is, to the best of our knowledge, the first network treating multiple views as a temporal sequence and applying an LSTM to correlate adjacent views. Moreover, since the proposed framework is purely data driven, it can be easily adapted or extended to other tasks in shape modeling such as suggestive contours [8,9] and ridge-valley detection [10].

In the next section, we briefly discuss existing methods related to 3D segmentation with an emphasis on deep learning. To make the paper self-contained, we review the recurrent neural network in Section 3. Section 4 describes our approach in depth, followed by experimental results in Section 5. Section 6 concludes our work.

2. Related work

Hand-crafted features: Before the era of deep learning, many approaches were proposed (region growing [16,17], hierarchical clustering [3,18,19], spectral clustering [20], k-means [21], normalized cuts [22], random walks [23], heat walk [24], etc.) that segment a 3D model based on local features such as planarity of various forms, higher-degree geometric proxies (cylinders, cones, spheres, etc.), dihedral angles between triangles [25], curvatures (Gaussian or mean curvature) [26], geodesic distances on a mesh, slippage, symmetry, convexity, the medial axis, the shape diameter [27] and motion characteristics [28]. Shamir [28], Agathos et al. [29] and Theologou et al. [30] give comprehensive overviews of 3D segmentation methodologies. In general, these approaches are usually built on some particular property of 3D objects and hence do not generalize well.

Image-based CNN: CNNs [31–34] are currently the mainstream in many visual recognition problems and have been extensively applied to image semantic segmentation [35–39]. For example, the fully convolutional network (FCN) [36] was a breakthrough in deep-learning-based image semantic segmentation. In this approach, the fully connected layers of a standard CNN are replaced by convolutions with large receptive fields, and the segmentation image is obtained from coarse class score maps computed by feed-forwarding an input image. However, the deconvolution part of the network responsible for upsampling is fixed to bilinear interpolation, and only the CNN part of the network is fine-tuned. In contrast, Noh et al. [37] proposed the deconvolution network (DeconvNet) with unpooling layers followed by convolutions, which increases the network's capability to learn more complex deconvolutions than just bilinear interpolation.

Holistically-nested edge detection (HED) [40] casts classical edge detection as a CNN-based problem. An interesting idea of this work is that the final edge map is fused from multiple edge maps obtained at different scales. The multi-scale edge maps are side outputs of a VGG-16 network [32]; hence the shallow edge maps give fine-detail edges while the deeper ones capture the more salient edges. The final result is a linear combination of all the edge maps at multiple scales. Our MV-RNN approach adopts HED as a sub-module for its CNN part thanks to its high performance on natural images.

Deep learning for 3D: While deep learning has been very popular in 2D images for many years, it has only recently been applied in 3D because, unlike pixels in 2D images, 3D objects do not have a regular structure. As a result, early work used deep learning as a tool to learn high-level features from low-level (usually hand-crafted) cues. The unsupervised shape segmentation proposed by Shu et al. [41] starts by over-segmenting the input model and computing patch-based local features, and then uses a stacked auto-encoder to learn high-level features, followed by Graph-Cut based segmentation. Guo et al. [42] compute local features at different scales for each triangle and arrange them into a rectangular image, which is fed forward through a convolutional neural network (CNN) to predict the semantic label of each triangle. Although these two frameworks use deep learning techniques (stacked auto-encoder, CNN) to learn high-level features from local low-level ones, they do not exploit the full potential of deep learning.

A natural extension from 2D images to 3D shapes is to discretize the 3D object into voxels and apply 3D convolutions. 3D ShapeNets [43] used this approach for 3D object classification. Su et al. [11] were the first to apply a multi-view convolutional neural network (MV-CNN) to 3D recognition. The 3D shape is rendered in multiple views, each of which is passed through an identical image-based CNN. The features obtained from the multiple views are combined via view pooling (a max pooling) and then passed through another CNN to predict the final object label. Between volumetric and multi-view CNNs, the latter typically gives higher accuracy [13]. One reason might be the higher computation and memory cost of 3D convolutions, which in turn limits the image resolution [13]. A similar result has also been observed on other 3D data such as videos [44–46].


Table 1

The Rand Index scores of segmentation for each category with different methods. Smaller is better.

Object categories MV-RNN MV-CNN [Shu2016] WcSeg RandCuts ShapeDiam NormCuts CoreExtra RandWalks FitPrim KMeans

Human 0.106 0.196 0.116 0.128 0.131 0.179 0.152 0.225 0.219 0.153 0.163

Cup 0.100 0.100 0.096 0.171 0.219 0.358 0.244 0.307 0.358 0.413 0.459

Glasses 0.066 0.115 0.173 0.173 0.101 0.204 0.141 0.301 0.311 0.235 0.188

Airplane 0.085 0.157 0.150 0.089 0.122 0.092 0.186 0.256 0.248 0.166 0.211

Ant 0.021 0.044 0.001 0.021 0.025 0.022 0.047 0.065 0.068 0.086 0.131

Chair 0.051 0.078 0.040 0.103 0.184 0.111 0.088 0.187 0.156 0.212 0.213

Octopus 0.022 0.060 0.036 0.029 0.063 0.045 0.061 0.051 0.067 0.101 0.101

Table 0.072 0.091 0.040 0.091 0.383 0.184 0.093 0.244 0.131 0.181 0.369

Teddy 0.035 0.055 0.024 0.056 0.045 0.057 0.121 0.114 0.128 0.132 0.182

Hand 0.076 0.122 0.135 0.116 0.090 0.202 0.155 0.155 0.189 0.202 0.154

Plier 0.054 0.143 0.151 0.087 0.109 0.375 0.183 0.093 0.230 0.169 0.263

Fish 0.146 0.253 0.288 0.203 0.297 0.248 0.394 0.273 0.388 0.424 0.413

Bird 0.059 0.119 0.171 0.101 0.107 0.115 0.184 0.124 0.250 0.196 0.190

Armadillo 0.060 0.120 0.073 0.081 0.092 0.090 0.116 0.141 0.115 0.091 0.117

Bust 0.162 0.351 0.275 0.266 0.232 0.298 0.316 0.315 0.298 0.300 0.334

Mech 0.121 0.369 0.073 0.182 0.277 0.238 0.159 0.387 0.211 0.306 0.425

Bearing 0.080 0.104 0.056 0.122 0.124 0.119 0.183 0.398 0.246 0.188 0.280

Vase 0.106 0.216 0.212 0.161 0.133 0.239 0.236 0.226 0.246 0.257 0.387

FourLeg 0.135 0.213 0.140 0.152 0.174 0.161 0.208 0.191 0.218 0.185 0.193

Average 0.082 0.154 0.118 0.123 0.153 0.176 0.172 0.211 0.215 0.210 0.251

Fig. 2. Overview of our MV-RNN approach. Given an input 3D mesh model, we render it from a sequence of ordered viewpoints. Each view is passed through an identical (shared-weights) CNN to obtain a boundary probability map, which is correlated by a two-layer LSTM followed by a fully connected layer. The consistent edge images from the multiple views are unprojected back to 3D, followed by region growing and a CRF for boundary smoothing.

Xie et al. [12] used multi-view depth images with an extreme learning machine to generate per-view segmentations and combined them via Graph-Cut. This method is quite fast thanks to the easy training of the extreme learning machine, but it does not give high accuracy. Later, Kalogerakis et al. [47] proposed a more complete multi-view framework. They first render the 3D model from different views, each of which is processed through a shared CNN before being unprojected to 3D. Label consistency is enforced by a conditional random field (CRF), which is part of the network and is optimized in an end-to-end manner. Although this approach uses the CRF to resolve consistency after unprojection to 3D, the semantic label images from the multiple views are obtained in a max-pooling manner and are still not correlated.

Recently, Su et al. proposed PointNet [48] and SyncSpecCNN [49], which directly consume non-regular 3D data (point clouds and meshes, respectively). These networks demonstrate the flexibility of neural networks in many visual problems. However, in terms of performance, these structures still fall behind MV-CNN approaches (if the latter are equipped with a large enough number of views) [48].

3. Background on recurrent neural networks

In contrast to a normal feed-forward neural network, which is a one-shot function, a recurrent neural network (RNN) runs repeatedly through time, which simulates human brain processing capability. An RNN is a composition of identical feed-forward neural networks, one for each moment, or step in time, which we will refer to as RNN cells. These cells operate on their own output, allowing them to be composed. They can also operate on external input and produce external output. Note that this is a much broader definition of an RNN, depending on the choice of RNN cell (e.g. Vanilla RNN, LSTM, etc.). The algebraic description of a Vanilla RNN cell is

s_t = \phi(W x_t + U s_{t-1} + b)    (1)

where \phi is the activation function (e.g. sigmoid, tanh, ReLU [31,50]). Assuming d and h are the state input and output sizes, respectively, s_t \in \mathbb{R}^h is the current state (and current output), s_{t-1} \in \mathbb{R}^h is the prior state, x_t \in \mathbb{R}^d is the current input, and W \in \mathbb{R}^{h \times d}, U \in \mathbb{R}^{h \times h} and b \in \mathbb{R}^h are the weights and biases.
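To make Eq. (1) concrete, here is a minimal NumPy sketch of a Vanilla RNN cell unrolled over a sequence; the sizes and the tanh activation below are illustrative choices, not values from this paper.

```python
import numpy as np

def vanilla_rnn(xs, W, U, b, phi=np.tanh):
    """Unroll Eq. (1), s_t = phi(W x_t + U s_{t-1} + b), over a sequence.

    xs: (T, d) array, one input vector per time step.
    W: (h, d), U: (h, h), b: (h,) -- shared by every step (identical cells).
    Returns the (T, h) array of states s_1..s_T.
    """
    s = np.zeros(b.shape[0])          # s_0: initial state
    states = []
    for x in xs:                      # the same cell is applied at each step
        s = phi(W @ x + U @ s + b)    # it consumes its own previous output
        states.append(s)
    return np.stack(states)

# Toy usage with illustrative sizes.
rng = np.random.default_rng(0)
d, h, T = 8, 4, 5
out = vanilla_rnn(rng.normal(size=(T, d)),
                  0.1 * rng.normal(size=(h, d)),
                  0.1 * rng.normal(size=(h, h)),
                  np.zeros(h))
print(out.shape)  # (5, 4)
```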

Although simple and quite powerful, the Vanilla RNN has certain disadvantages. First, it is very difficult to exploit past information if the information constantly morphs, which leads to the degeneration problem [34]. Second, gradient vanishing and exploding are common when training a Vanilla RNN, because we train it with the back-propagation through time algorithm. If the gradients explode, we cannot train our model. If they vanish, it is difficult to learn long-term dependencies, since back-propagation will be too sensitive to recent distractions.

Fig. 3. Multiple views ordered in a helix-like sequence.

Fig. 4. LSTM learning process (only four views are shown due to space limits). First row: input shaded images to the CNN. Second row: outputs from the CNN. Third to tenth rows: edges returned by the LSTM during training. Last row: ground truth edges.

To tackle the drawbacks of the Vanilla RNN, the long short-term memory (LSTM) unit [51] was introduced to ensure the integrity of information thanks to its written memories. Furthermore, LSTMs use gates as a mechanism to selectively control and coordinate writing (i.e. the cell memory is written, read and forgotten selectively).

Unlike the Vanilla RNN, an LSTM network is well-suited to learning from experience to classify, process and predict time series when there are time lags of unknown size between important events. Consequently, LSTMs have achieved the best known results in natural language text compression and unsegmented connected handwriting recognition. Recently, researchers have been integrating LSTMs into computer vision tasks such as image segmentation [52], activity recognition, image captioning, video description [46] and 3D object reconstruction [14].

4. Multi-view recurrent neural network (MV-RNN)

Given an input 3D shape, our goal is to segment it into parts based on prior knowledge learned from a pre-segmented training dataset. We design the MV-RNN network to this end. Our network architecture is visualized in Fig. 2. It takes as input a set of images from multiple views equally distributed around the 3D model; segments these images by generating per-view boundary probability maps; correlates the maps with a two-layer LSTM followed by a fully connected layer; and returns consistent edges, which are back-projected to the 3D surface and finally integrated by a CRF. In the following sections, we elaborate on the input to our network, its layers, and the training procedure.

4.1. Input

The input to our whole algorithm is a 3D shape represented as a polygonal mesh. As a pre-processing step, we normalize and scale it to fit into the unit sphere. Our algorithm then renders the object from K different views (we set K = 60 based on our experiments). We first equally partition the unit sphere into K regions


using [53]. These regions serve as the camera locations. More importantly, to make these views learnable for the LSTM, we arrange the locations in a sequence so that adjacent locations are next to each other, as in Fig. 3. To make all views consistently oriented, we choose the camera up vector to point toward a fixed point very far away (e.g. [0, 0, 100]). The camera always looks at the origin, since the model is normalized.

In general, a CNN is quite robust to lighting illumination, so we render shaded grayscale images using the Phong reflection model [54], with the light source always behind the camera for every view. We also experimented with depth images (with HHA encoding [55]) and normal images, alone and in combination, but the results were no better than using the shaded images alone. To make training faster, we opt for an image resolution of 128 × 128, which does not sacrifice the overall segmentation accuracy of the framework.

In addition, for each camera setting, we store the 3D vertex corresponding to each pixel. The correspondence is determined by the proximity of the 3D point unprojected from the 2D pixel to the closest 3D vertex (the distance between them must be less than 10^{-3}; otherwise the pixel has no corresponding 3D vertex). The stored information is used later for the back projection.
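To make the view ordering concrete, here is a small sketch of a helix-ordered camera placement. It uses a spherical-spiral layout as a simple stand-in for the equal-area partition of [53] (which we do not reproduce here); like that partition, it yields a helix-like ordering (Fig. 3) in which consecutive viewpoints are neighbors on the sphere, together with the look-at convention described above (camera aimed at the origin, up vector toward the fixed far point [0, 0, 100]).

```python
import numpy as np

def helix_viewpoints(K=60):
    """K roughly uniform viewpoints on the unit sphere in a helix-like order.

    A spherical-spiral approximation, not the equal-area partition of [53]:
    z descends uniformly while the azimuth advances by the golden angle, so
    adjacent indices are adjacent on the sphere, which the LSTM ordering needs.
    """
    i = np.arange(K)
    z = 1.0 - 2.0 * (i + 0.5) / K
    phi = i * np.pi * (3.0 - np.sqrt(5.0))
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def look_at(eye, up_point=np.array([0.0, 0.0, 100.0])):
    """Camera basis: look at the origin, up vector toward a fixed far point."""
    forward = -eye / np.linalg.norm(eye)           # model is centered at origin
    right = np.cross(forward, up_point - eye)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return right, true_up, forward

cameras = helix_viewpoints(60)
print(look_at(cameras[0]))
```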

4.2. CNN module

The shaded images produced in the previous step are processed through identical image-based CNNs. There are many choices of CNN architecture, such as FCN [36], DeconvNet [37] and HED [40]. We opt for HED because of its edge-detection nature. Each HED module outputs a grayscale image of the same size as the input shaded image (i.e. 128 × 128), which is the boundary probability map. Specifically, in our implementation, we employ the HED architecture suggested in [40], which adopted the VGG-16 network [32] for dense prediction by truncating it after the pool5 layer and fusing multiple side outputs. Since HED is trained on RGB color images, we need to replicate our shaded grayscale images into three channels.
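The side-output-and-fuse pattern of HED can be illustrated with a deliberately tiny PyTorch module. This is a toy stand-in, with three small convolution blocks and made-up channel widths instead of the VGG-16 backbone and five side outputs of the real HED [40]; it only demonstrates the wiring described above.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyHED(nn.Module):
    """Toy HED-style edge detector: per-scale side outputs are upsampled to
    the input size and fused by a learned 1x1 convolution (cf. [40])."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())
            for c_in, c_out in [(3, 16), (16, 32), (32, 64)]
        ])
        self.sides = nn.ModuleList([nn.Conv2d(c, 1, 1) for c in (16, 32, 64)])
        self.fuse = nn.Conv2d(3, 1, 1)   # linear combination of side outputs

    def forward(self, x):
        size = x.shape[-2:]
        sides = []
        for block, side in zip(self.blocks, self.sides):
            x = block(x)
            sides.append(F.interpolate(side(x), size=size,
                                       mode='bilinear', align_corners=False))
            x = F.max_pool2d(x, 2)       # the next block sees a coarser scale
        return torch.sigmoid(self.fuse(torch.cat(sides, dim=1)))

# A shaded grayscale view replicated into three channels, as described above.
gray = torch.rand(1, 1, 128, 128)
prob_map = TinyHED()(gray.repeat(1, 3, 1, 1))
print(prob_map.shape)  # torch.Size([1, 1, 128, 128])
```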

Fig. 4 shows the boundary probability maps in multiple views (only four views are shown here). As we can see, the probability maps are neither well-localized nor consistent. The inconsistency problem cannot be solved by optimizing each view individually, but only by aggregating the views in a more intelligent way. Recurrent neural networks (RNNs) are a type of neural network with loop connections [56], which allow them to capture long-range dependencies through gates and memory structures (such as the LSTM [51]). Consequently, multiple views can be cast as a time series which can be learned by such an LSTM.

4.3. LSTM module

As mentioned in Section 1, the goal of this layer is to correlate multiple views and generate consistent boundary maps. An LSTM network is well-suited here, as it treats the view sequence as a time series. First, we unroll the 2D boundary probability maps and the ground truth boundary maps into vectors of size 128 × 128 = 16384. A two-layer LSTM (one LSTM stacked over the other) is deployed so that the first LSTM takes the sequence of ordered (unrolled) boundary probability maps and produces a sequence of hidden states, from which the second LSTM eventually outputs the sequence of coherent boundary maps. We use the same number of hidden units (1024) for both peephole LSTMs [57], with the following updates:

i_t = \mathrm{sigmoid}(W_i x_t + U_i c_{t-1} + b_i)    (2)

f_t = \mathrm{sigmoid}(W_f x_t + U_f c_{t-1} + b_f)    (3)

o_t = \mathrm{sigmoid}(W_o x_t + U_o c_{t-1} + b_o)    (4)

c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + b_c)    (5)

h_t = o_t \circ c_t    (6)

where x_t \in \mathbb{R}^d is the current input; h_t \in \mathbb{R}^h is the current output; c_t (and c_{t-1}) \in \mathbb{R}^h are the current (and prior) memory states; W_i, W_f, W_o, W_c \in \mathbb{R}^{h \times d}, U_i, U_f, U_o \in \mathbb{R}^{h \times h} and b_i, b_f, b_o, b_c \in \mathbb{R}^h are weights and biases; and '\circ' denotes element-wise multiplication. In our case, d = 128 × 128 = 16384 and h = 1024. The output of the second LSTM is passed through a fully connected layer that maps back to a d-dimensional edge image.
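Eqs. (2)-(6) translate line by line into code. The following minimal NumPy sketch uses toy dimensions (the paper's are d = 128 × 128 = 16384 and h = 1024) and is shown only to make the peephole update concrete; note that, as printed above, the gates read the previous memory c_{t-1} rather than a separate hidden state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, c_prev, p):
    """One step of Eqs. (2)-(6); p holds weights W_*, U_* and biases b_*."""
    i_t = sigmoid(p['W_i'] @ x_t + p['U_i'] @ c_prev + p['b_i'])   # Eq. (2)
    f_t = sigmoid(p['W_f'] @ x_t + p['U_f'] @ c_prev + p['b_f'])   # Eq. (3)
    o_t = sigmoid(p['W_o'] @ x_t + p['U_o'] @ c_prev + p['b_o'])   # Eq. (4)
    c_t = f_t * c_prev + i_t * np.tanh(p['W_c'] @ x_t + p['b_c'])  # Eq. (5)
    h_t = o_t * c_t                                                # Eq. (6)
    return h_t, c_t

# Toy dimensions and random parameters, for illustration only.
rng = np.random.default_rng(0)
d, h = 6, 3
p = {k: 0.1 * rng.normal(size=(h, d)) for k in ('W_i', 'W_f', 'W_o', 'W_c')}
p.update({k: 0.1 * rng.normal(size=(h, h)) for k in ('U_i', 'U_f', 'U_o')})
p.update({k: np.zeros(h) for k in ('b_i', 'b_f', 'b_o', 'b_c')})

c_t = np.zeros(h)
for x_t in rng.normal(size=(5, d)):    # a short view sequence
    h_t, c_t = peephole_lstm_step(x_t, c_t, p)
print(h_t.shape)  # (3,)
```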

Fig. 4 illustrates how the LSTM helps correct and correlate the edge probability maps produced by the MV-CNN. For example, the boundaries between the torso and the two legs are quite different among the four views, which may result in inconsistent edge information when unprojecting them to the 3D mesh. However, as the LSTM consumes the whole view sequence, the edges at convergence are all consistent.

4.4. Training

We train our network in a two-stage approach. In the first stage, we train the HED module. We randomly rotate each 3D model in 16 different ways. The network takes as input a pair of images: a shaded image and a ground truth boundary map. We use the sigmoid cross-entropy loss for all five side outputs and the fused output. The network is initialized from VGG-16 weights [32]. We use the Adam optimizer [58] with a fixed learning rate of 10^{-7}, a batch size of 16, and train for 100,000 iterations. The first-stage training takes three days on an NVIDIA Titan X.

After the HED module is trained, it is fixed while training the LSTM module in the second stage. The two-layer LSTM takes as input a pair of sequences: the boundary probability maps from the HED and the ground truth boundary maps. We again use the Adam optimizer [58] with an initial learning rate of 0.01 (as this optimization algorithm computes adaptive learning rates for each parameter), a batch size of 1 (due to the memory limit), and train for 7000 epochs. Each view sequence is processed bidirectionally, which yields two sequences per shape. The second-stage training takes three days on an NVIDIA Titan X.
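For concreteness, here is a minimal PyTorch-style sketch of the second stage under stated assumptions: torch.nn.LSTM is a standard (non-peephole) stand-in for the peephole LSTM of Section 4.3, the dimensions are shrunk so the toy runs anywhere, and the tensors are random placeholders for the HED outputs and ground truth boundary maps.

```python
import torch
from torch import nn

d, h, K = 256, 64, 60                  # the paper uses d = 16384, h = 1024
lstm = nn.LSTM(input_size=d, hidden_size=h, num_layers=2, batch_first=True)
fc = nn.Linear(h, d)                   # maps hidden states back to edge images
opt = torch.optim.Adam(list(lstm.parameters()) + list(fc.parameters()), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()       # sigmoid cross-entropy loss

hed_seq = torch.rand(1, K, d)                   # batch size 1, as in the paper
gt_seq = (torch.rand(1, K, d) > 0.9).float()    # placeholder ground truth

# Each view sequence is processed bidirectionally: two sequences per shape.
for seq, gt in [(hed_seq, gt_seq), (hed_seq.flip(1), gt_seq.flip(1))]:
    opt.zero_grad()
    hidden, _ = lstm(seq)              # (1, K, h) sequence of hidden states
    loss = loss_fn(fc(hidden), gt)
    loss.backward()
    opt.step()
```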

4.5. Back projection to 3D and post-processing

The consistent boundary maps produced by the LSTM network are back projected to the 3D surface using the stored pixel-to-vertex information (see Section 4.1). It is possible that many pixels (typically from different views) map to the same vertex, so we take the maximum response as the final value. To each edge of the mesh model we assign a boundary probability, defined as the average of the boundary probabilities of the two vertices that it connects. Finally, a binary boundary edge map is created by thresholding (we set the threshold to 0.5). These boundary edges function as the borders of the regions to be segmented. Thus, we use a simple region growing to find the initial segmentation, with the boundary edges as blockers. A region with a large enough area is considered a segment. The polygons near the boundaries may be unlabeled due to projection error.


Fig. 5. Representative segmentation results produced by our MV-RNN on PSB dataset.

Fig. 6. Performance plots of different segmentation algorithms with respect to four evaluation metrics. Lower value is better.


Fig. 7. Comparison of segmentation algorithms.

Denote by h_v the initial label of polygon v, where h_v = 0 if v has no label. We expect that correct labels will be propagated to such polygons via a CRF. Let V be the set of all polygons in a 3D shape; a CRF f with unary and pairwise terms operating on the surface representation is defined as follows:

E(f) = \sum_{v \in V} E_{\mathrm{unary}}(f_v) + \sum_{(u,v) \in V^2} E_{\mathrm{pairwise}}(f_u, f_v)    (7a)

E_{\mathrm{unary}}(f_v = l) = \begin{cases} 0, \ \forall l & \text{if } h_v = 0 \\ 0 & \text{if } h_v = l \\ \infty & \text{otherwise} \end{cases}    (7b)

E_{\mathrm{pairwise}}(f_u = l_u, f_v = l_v) = \begin{cases} e^{-d^2(u,v)} & \text{if } l_u \neq l_v \\ e^{-(1 - d(u,v))^2} & \text{if } l_u = l_v \end{cases}    (7c)

where d(u, v) is the geodesic distance [59,60] between polygons u and v. All distances are normalized to [0, 1].

The unary term says that we only want to correct unlabeled polygons, while the pairwise term favors the same label for adjacent polygons. We use the mean-field approximation [61] to solve (7a).
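As a worked check of Eqs. (7a)-(7c), the sketch below evaluates the CRF energy for a toy labeling and picks the cheaper of two candidates by brute force; the paper instead solves (7a) with the mean-field approximation [61], and the distances here are made-up normalized geodesic distances rather than ones computed via [59,60].

```python
import math

INF = float('inf')

def e_unary(f_v, h_v):
    """Eq. (7b): any label is free for unlabeled polygons (h_v = 0);
    labeled polygons must keep their initial label."""
    if h_v == 0:
        return 0.0
    return 0.0 if f_v == h_v else INF

def e_pairwise(f_u, f_v, d_uv):
    """Eq. (7c): nearby polygons are encouraged to share a label."""
    if f_u != f_v:
        return math.exp(-d_uv ** 2)
    return math.exp(-(1.0 - d_uv) ** 2)

def energy(labels, initial, pairs, dist):
    """Eq. (7a): total CRF energy of a candidate labeling."""
    total = sum(e_unary(labels[v], initial[v]) for v in labels)
    total += sum(e_pairwise(labels[u], labels[v], dist[(u, v)]) for u, v in pairs)
    return total

# Three polygons: 'a' and 'c' keep their labels; 'b' is unlabeled (h_b = 0).
initial = {'a': 1, 'b': 0, 'c': 2}
pairs = [('a', 'b'), ('b', 'c')]
dist = {('a', 'b'): 0.1, ('b', 'c'): 0.6}    # normalized geodesic distances
candidates = [{'a': 1, 'b': 1, 'c': 2}, {'a': 1, 'b': 2, 'c': 2}]
best = min(candidates, key=lambda l: energy(l, initial, pairs, dist))
print(best)  # b is much closer to a, so it inherits a's label
```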

5. Evaluation

In this section, we present experimental validations and analyses of our approach. We test the segmentation algorithm on the well-known Princeton Segmentation Benchmark dataset [1]. This dataset has been used intensively to evaluate 3D shape segmentation and 3D shape retrieval algorithms. It contains 19 object categories with 20 objects per category, resulting in 380 models in total. For each category, we randomly select 16 models for training and 4 models for testing. Since there are multiple human-generated segmentations for each model, we manually select the one segmentation which is the most consistent within the object category. The ground truth edge images can be easily obtained by rendering the edges between different segments, overlaid by the 3D shape rendered in the same color as the background. To further enhance the quality of the ground truth images, we use polygon offset in OpenGL. The ground truth edge images are used in training both the MV-CNN and the LSTM. Fig. 5 shows some representative segmentations of our MV-RNN approach on this dataset.

To evaluate our segmentation method, we adopt the four metrics defined by Chen et al. [1]: Rand Index, Cut Discrepancy, Hamming Distance and Consistency Error. Rand Index, named after William M. Rand, measures the similarity between two segmentations of the same shape. From a mathematical point of view, the Rand Index is related to accuracy, but is applicable even when class labels are not used. In this paper, we use the Rand Index Error, which equals one minus the Rand Index. Cut Discrepancy is a boundary-based method evaluating the distance between different cuts. It sums the distances from points along the cuts in the computed segmentation to the closest cuts in the ground truth segmentation, and vice versa. Hamming Distance, named after Richard Hamming, is a region-based method and measures the number of substitutions required to change one region into the other. Hamming Distance is directional; hence it includes the missing rate (Rm) and false alarm (Rf) distances. The Consistency Errors, whether the global version (GCE) or the local version (LCE), compute the hierarchical differences and similarities between segmentations; they are based on the theory that human perceptual organization imposes a hierarchical tree structure on objects. For all four metrics, a smaller value indicates a better result.

Fig. 8. More comparisons of segmentation algorithms.
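Since the Rand Index underlies Table 1, here is a minimal sketch of the Rand Index Error between two face labelings (one minus the Rand Index, computed over all face pairs); this is a straightforward pairwise implementation for illustration, not the benchmark's evaluation code [1].

```python
from itertools import combinations

def rand_index_error(seg_a, seg_b):
    """1 - Rand Index: fraction of face pairs on which the segmentations
    disagree about 'same segment' versus 'different segments'."""
    pairs = list(combinations(range(len(seg_a)), 2))
    agree = sum((seg_a[i] == seg_a[j]) == (seg_b[i] == seg_b[j])
                for i, j in pairs)
    return 1.0 - agree / len(pairs)

# Two segmentations of six faces that disagree only on face 2.
ours = [1, 1, 1, 2, 2, 3]
truth = [1, 1, 2, 2, 2, 3]
print(round(rand_index_error(ours, truth), 3))  # 0.267
```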

Comparison: We compare our method with the following segmentation algorithms:

• MV-CNN: we apply non-maximum suppression [15] to the boundary probability maps returned from the multi-view CNN (HED in this case) and unproject them back to 3D (without the LSTM), followed by the CRF. This serves as a baseline for the multi-view paradigm.
• [Shu2016] [41]: unsupervised 3D shape segmentation via stacked auto-encoders.
• WcSeg [62]: approximate convexity analysis.
• RandCuts [22]: randomized cuts.
• ShapeDiam [27]: shape diameter function.
• NormCuts [22]: normalized cuts.
• CoreExtra [63]: core extraction.
• RandWalks [23]: random walks.
• FitPrim [18]: fitting primitives.
• KMeans [64]: k-means.

Figs. 7 and 8 provide a side-by-side comparison of the segmentations obtained by the various algorithms. Although there are large shape variations, the vast majority of our segmentation results are desirable and consistent with human perception. The baseline MV-CNN indeed yields better segmentations than some of the methods based on hand-crafted features, such as k-means, fitting primitives and random walks. Due to the inconsistency of the boundary probability maps across multiple views, the MV-CNN is still not as good as the shape diameter function. However, the added LSTM contributes significantly to the overall robustness, which vastly improves the multi-view paradigm.

Numerical comparison: The Rand Index score statistics of our segmentation on the dataset, as well as those of the other methods, are detailed in Table 1, from which we can see that our algorithm obtains an average Rand Index of 0.084, outperforming the related algorithms. In addition to the Rand Index, our MV-RNN also stands out from the other methods with respect to the other evaluation metrics (see Fig. 6 and Table 2). Compared with the baseline MV-CNN, the LSTM in our framework indeed brings a significant improvement because it correlates the outputs of the CNN across multiple views.

Different numbers of views: We also experiment with various values of K. According to Fig. 9, using too few views is not good due to occlusion. As more views equally distributed around the object are used, the object's surface area is covered more fully; hence we get higher accuracy (i.e. a lower Rand Index score). We choose K = 60 as a reasonable trade-off between accuracy and time/memory consumption.


Table 2

Average Cut Discrepancy, Hamming Distance and Consistency Error scores of segmentation with different methods, averaged over all categories. Smaller is better.

MV-RNN MV-CNN [Shu2016] WcSeg RandCuts ShapeDiam NormCuts CoreExtra RandWalks FitPrim KMeans

Cut Discrepancy 0.144 0.220 0.212 0.211 0.263 0.275 0.282 0.375 0.367 0.341 0.409

Hamming 0.075 0.129 0.124 0.116 0.136 0.166 0.177 0.169 0.203 0.239 0.277

Hamming-Rm 0.061 0.104 0.130 0.118 0.152 0.187 0.195 0.126 0.209 0.293 0.345

Hamming-Rf 0.089 0.153 0.118 0.114 0.119 0.146 0.158 0.213 0.198 0.186 0.209

GCE 0.060 0.107 0.099 0.098 0.126 0.130 0.159 0.135 0.179 0.217 0.251

LCE 0.041 0.062 0.070 0.065 0.073 0.082 0.102 0.086 0.104 0.142 0.168

Fig. 9. The Rand Index with respect to the number of views. We choose K = 60 as a reasonable trade-off between accuracy and time/memory usage.

Fig. 10. Limitation of our approach. The area under the torso is occluded and hence the left and right thighs are not separated, although our MV-RNN can detect 2D edges correctly in all views.

5.1. Limitation

Because our approach belongs to the multi-view paradigm, it has the common occlusion issue. For example, the left and right thighs of the man in Fig. 10 are not separated due to occlusion (i.e. the area under the torso is not revealed in any of the K = 60 views). Increasing the number of views could reduce the occlusions at the cost of more computation. Since we can easily compute the occluded areas given the current set of views, we plan to use adaptive best-view prediction to focus the camera on these areas, similar to the next-best-view prediction in the 3D attention model proposed by Xu et al. [65].

6. Conclusion

We have presented our novel MV-RNN for 3D shape segmentation, which combines the MV-CNN and an LSTM to enhance the multi-view paradigm. To the best of our knowledge, we are the first group to treat multiple views as a temporal sequence and apply an RNN to predict the edge images by aggregating the corresponding edge probability maps obtained by feed-forwarding an MV-CNN. Our MV-RNN detects 3D edges in an end-to-end manner, and the segmentation is obtained as a post-processing step. The 3D edges can be either semantic-based (e.g. semantic segmentation) or geometric-based (e.g. CAD model segmentation, suggestive contours, ridges and valleys). According to our experimental results on the Princeton Segmentation Benchmark dataset, our MV-RNN compares favorably with other state-of-the-art methods on mesh segmentation.

In the future, we would like to conduct more experiments on different datasets, such as those in [49,66]. Additionally, our framework currently works on meshes only; in the future, we would like to extend it to handle point clouds as well. The proposed framework is purely data-driven, so we would also like to extend our method to other interesting problems in shape modeling, such as suggestive contours [8,9] and ridge-valley detection [10].

Acknowledgment

We would like to acknowledge the authors of the Princeton Segmentation Benchmark [1], who made the dataset public and provided the evaluation toolbox. We also appreciate the authors of HED [40] for their edge detection network. Last but not least, we would like to thank all the authors of the other segmentation algorithms [18,22,23,27,41,62–64] for contributing their segmentation results on the Princeton Segmentation Benchmark dataset.

References

[1] Chen X, Golovinskiy A, Funkhouser T. A benchmark for 3D mesh segmentation. ACM Trans Graph 2009;28(3):73:1–73:12.
[2] Biasotti S, Marini S, Mortara M, Patane G. An overview on properties and efficacy of topological skeletons in shape modeling. In: Shape Modeling International; 2003. p. 245–54.
[3] Katz S, Tal A. Hierarchical mesh decomposition using fuzzy clustering and cuts. ACM Trans Graph 2003;22(3):954–61.
[4] Funkhouser T, Kazhdan M, Shilane P, Min P, Kiefer W, Tal A, et al. Modeling by example. ACM Trans Graph 2004;23(3):652–63.
[5] Zockler M, Stalling D, Hege H-C. Fast and intuitive generation of geometric shape transitions. Vis Comput 2000;16(5):241–53.
[6] Zuckerberger E, Tal A, Shlafman S. Polyhedral surface decomposition with applications. Comput Graph 2002;26(5):733–43.
[7] Levy B, Petitjean S, Ray N, Maillot J. Least squares conformal maps for automatic texture atlas generation. ACM Trans Graph 2002;21(3):362–71.
[8] DeCarlo D, Finkelstein A, Rusinkiewicz S, Santella A. Suggestive contours for conveying shape. ACM Trans Graph 2003;22(3):848–55.
[9] Burns M, Klawe J, Rusinkiewicz S, Finkelstein A, DeCarlo D. Line drawings from volume data. ACM Trans Graph 2005;24(3):512–18.
[10] Ohtake Y, Belyaev A, Seidel H-P. Ridge-valley lines on meshes via implicit surface fitting. ACM Trans Graph 2004;23(3):609–12.
[11] Su H, Maji S, Kalogerakis E, Learned-Miller EG. Multi-view convolutional neural networks for 3D shape recognition. In: IEEE International Conference on Computer Vision; 2015.
[12] Xie Z, Xu K, Shan W, Liu L, Xiong Y, Huang H. Projective feature learning for 3D shapes with multi-view depth images. Comput Graph Forum 2015;34(7):1–11.
[13] Qi CR, Su H, Nießner M, Dai A, Yan M, Guibas L. Volumetric and multi-view CNNs for object classification on 3D data. In: IEEE International Conference on Computer Vision and Pattern Recognition.
[14] Choy CB, Xu D, Gwak J, Chen K, Savarese S. 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: European Conference on Computer Vision; 2016. p. 628–44.
[15] Dollar P, Zitnick CL. Structured forests for fast edge detection. In: IEEE International Conference on Computer Vision; 2013.
[16] Vieira M, Shimada K. Surface mesh segmentation and smooth surface extraction through region growing. Comput Aided Geom Des 2005;22:771–92.
[17] Jagannathan A, Miller E. Three-dimensional surface mesh segmentation using curvedness-based region growing approach. IEEE Trans Pattern Anal Mach Intell 2007;29(12):2195–204.
[18] Attene M, Falcidieno B, Spagnuolo M. Hierarchical mesh segmentation based on fitting primitives. Vis Comput 2006;22:181–93.
[19] Garland M, Willmott A, Heckbert PS. Hierarchical face clustering on polygonal surfaces. In: Proceedings of the Symposium on Interactive 3D Graphics. New York, NY, USA: ACM; 2001. p. 49–58.
[20] Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 2000;22(8):888–905.
[21] Yamauchi H, Lee S, Lee Y, Ohtake Y, Belyaev AG, Seidel H-P. Feature sensitive mesh segmentation with mean shift. In: Proceedings of the International Conference on Shape Modeling and Applications. IEEE; 2005. p. 238–45.
[22] Golovinskiy A, Funkhouser T. Randomized cuts for 3D mesh analysis. ACM Trans Graph 2008;27(5):145:1–145:12.
[23] Lai Y-K, Hu S-M, Martin RR, Rosin PL. Fast mesh segmentation using random walks. In: Proceedings of the ACM Symposium on Solid and Physical Modeling. New York, NY, USA: ACM; 2008. p. 183–91.
[24] Benjamin W, Polk AW, Vishwanathan S, Ramani K. Heat walk: robust salient segmentation of non-rigid shapes. Comput Graph Forum 2011;30(7):2097–106.
[25] Xiao D, Lin H, Xian C, Gao S. CAD mesh model segmentation by clustering. Comput Graph 2011;35(3):685–91.
[26] Lavoue G, Dupont F, Baskurt A. A new CAD mesh segmentation method based on curvature tensor analysis. Comput Aided Des 2005;37(10):975–87.
[27] Shapira L, Shamir A, Cohen-Or D. Consistent mesh partitioning and skeletonisation using the shape diameter function. Vis Comput 2008;24(4):249–59.
[28] Shamir A. A survey on mesh segmentation techniques. Comput Graph Forum 2008;27(6):1539–56.
[29] Agathos A, Pratikakis I, Perantonis S, Sapidis N, Azariadis P. 3D mesh segmentation methodologies for CAD applications. Comput Aided Des Appl 2007;4(6):827–41.
[30] Theologou P, Pratikakis I, Theoharis T. A comprehensive overview of methodologies and performance evaluation frameworks in 3D mesh segmentation. Comput Vis Image Underst 2015;135:49–82.
[31] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems; 2012. p. 1097–105.
[32] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations.
[33] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 1–9.
[34] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770–8.
[35] Farabet C, Couprie C, Najman L, LeCun Y. Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 2013;35(8):1915–29.
[36] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: IEEE International Conference on Pattern Recognition; 2015.
[37] Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: IEEE International Conference on Computer Vision; 2015.
[38] Sharma A, Tuzel O, Jacobs DW. Deep hierarchical parsing for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015. p. 530–8.
[39] Hong S, Oh J, Lee H, Han B. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. arXiv; 2015. URL: https://arxiv.org/abs/1512.07928.
[40] Xie S, Tu Z. Holistically-nested edge detection. In: IEEE International Conference on Computer Vision; 2015. p. 1395–403.
[41] Shu Z, Qi C, Xin S, Hu C, Wang L, Zhang Y, et al. Unsupervised 3D shape segmentation and co-segmentation via deep learning. Comput Aided Geom Des 2016;43:39–52.
[42] Guo K, Zou D, Chen X. 3D mesh labeling via deep convolutional neural networks. ACM Trans Graph 2015;35(1):3:1–3:12.
[43] Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, et al. 3D ShapeNets: a deep representation for volumetric shapes. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 1912–20.
[44] Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L. Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition; 2014. p. 1725–32.
[45] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems; 2014. p. 568–76.
[46] Donahue J, Hendricks LA, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, et al. Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 2017;39(4):677–91.
[47] Kalogerakis E, Averkiou M, Maji S, Chaudhuri S. 3D shape segmentation with projective convolutional networks. arXiv; 2016. URL: http://arxiv.org/abs/1612.02808.
[48] Qi CR, Su H, Mo K, Guibas LJ. PointNet: deep learning on point sets for 3D classification and segmentation. arXiv; 2016. URL: https://arxiv.org/abs/1612.00593.
[49] Yi L, Kim VG, Ceylan D, Shen I-C, Yan M, Su H, et al. A scalable active framework for region annotation in 3D shape collections. ACM Trans Graph 2016;35(6):210:1–210:12.
[50] Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning; 2010. p. 807–14.
[51] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
[52] Li Z, Gan Y, Liang X, Yu Y, Cheng H, Lin L. LSTM-CF: unifying context modeling and fusion with LSTMs for RGB-D scene labeling. In: European Conference on Computer Vision; 2016. p. 541–57.
[53] Leopardi P. A partition of the unit sphere into regions of equal area and small diameter. Electron Trans Numer Anal 2006;25.
[54] Phong BT. Illumination for computer generated pictures. Commun ACM 1975;18(6):311–17.
[55] Gupta S, Girshick R, Arbeláez P, Malik J. Learning rich features from RGB-D images for object detection and segmentation. In: European Conference on Computer Vision; 2014. p. 345–60.
[56] Schmidhuber J. A local learning algorithm for dynamic feedforward and recurrent networks. Connect Sci 1989;1:403–12.
[57] Gers FA, Schmidhuber E. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Trans Neural Netw 2001;12(6):1333–40.
[58] Kingma DP, Ba J. Adam: a method for stochastic optimization. In: International Conference on Learning Representations; 2015.
[59] Hilaga M, Shinagawa Y, Kohmura T, Kunii TL. Topology matching for fully automatic similarity estimation of 3D shapes. ACM Trans Graph 2001:203–12.
[60] Zhang E, Mischaikow K, Turk G. Feature-based surface parameterization and texture mapping. ACM Trans Graph 2005;24(1):1–27.
[61] Krahenbuhl P, Koltun V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Neural Information Processing Systems; 2011. p. 109–17.
[62] Kaick OV, Fish N, Kleiman Y, Asafi S, Cohen-Or D. Shape segmentation by approximate convexity analysis. ACM Trans Graph 2014;34(1):4:1–4:11.
[63] Katz S, Leifman G, Tal A. Mesh segmentation using feature point and core extraction. Vis Comput 2005;21(8):649–58.
[64] Shlafman S, Tal A, Katz S. Metamorphosis of polyhedral surfaces using decomposition. Comput Graph Forum 2002;21(3):219–28.
[65] Xu K, Shi Y, Zheng L, Zhang J, Liu M, Huang H, et al. 3D attention-driven depth acquisition for object identification. ACM Trans Graph 2016;35(6):238:1–238:14.
[66] Wang Y, Asafi S, van Kaick O, Zhang H, Cohen-Or D, Chen B. Active co-analysis of a set of shapes. ACM Trans Graph 2012;31(6):165:1–165:10.