
View synthesis from multi-view RGB data using multi-layered representation and volumetric estimation

Zhaoqi SU1, Tiansong ZHOU2, Kun LI2, David BRADY3, Yebin LIU1,

1 TNList and the Department of Automation, Tsinghua University, Main Building, Beijing 100086, China.

2 Tianjin University, Tianjin 300072, China.

3 Duke University, Kunshan, China.

Corresponding author: Yebin Liu. E-mail: [email protected], [email protected]

Abstract Aiming at free-viewpoint exploration of complicated scenes, this paper presents a method for interpolating views among multiple RGB cameras. We combine cost volumes, which represent the 3D information of the scene, with 2D semantic segmentation to accomplish view synthesis of complicated scenes. We use the cost volume to estimate the depth and confidence map of the scene, and use a multi-layer representation with layer-dependent volume resolution to improve the view synthesis of the main objects. By treating the layers of the volume differently, we can handle complicated scenes containing multiple people and plentiful occlusion. We also propose the view-interpolation → multi-view reconstruction → view interpolation pipeline to iteratively optimize the results. We test our method on various multi-view scenes and generate convincing results.

Keywords view interpolation, cost volume, multi-layer processing, MVE, iterative optimization.

1 Introduction

With the rapid development of AR/VR devices, there is growing demand for immersive experiences of real-world scenes such as boxing games or virtual tours. Most AR/VR material for such scenes is produced by reconstructing and texturing a 3D model of the whole scene. However, because of the massive data needed to represent the 3D model and the unavoidable cavities at the model boundaries, such methods are impractical for dynamic scenes. Alternatively, some works directly interpolate views among the known views [1], [16]. These methods can interpolate novel views among cameras and directly generate images of given viewpoints. However, they either cannot reproduce subtle background detail [1], or may not perform well in complicated scenes with heavy occlusion and multiple people [16].

In addition, with developing image segmentation techniques such as [9] and [18], the input image can be segmented into different layers or labels according to semantic information. The main objects (usually the main persons) in the pictures can be accurately segmented as foreground objects. We therefore use the segmentation from such


methods to generate a multi-layer representation of the data, and treat the layers differently so that the foreground and background regions can be handled separately by the algorithm. We first segment the input multi-view RGB data using [6] and use the input images as the RGB information to optimize the segmentation. Next we define volume spaces of different resolutions for different layers, and estimate the depth and the confidence map of the scene following the cost-volume idea of [16] with the help of the segmentation prior. Then four nearby views are used to generate the novel-view image, using the confidence map estimated in the previous step. We also propose the view-interpolation → multi-view reconstruction → view interpolation pipeline to iteratively optimize our view interpolation results while generating better 3D models than 3D reconstruction from the input RGB data alone.

The main contributions of our work are as follows:

• We combine the idea of cost volumes from [16] with semantic segmentation to handle view interpolation of complicated scenes with heavy occlusion and many objects, which has not been done before.

• We propose the view-interpolation → multi-view reconstruction → view interpolation pipeline to iteratively optimize both the 3D reconstruction of the scene and the view synthesis results.

The rest of this paper is organized as follows: Section 2 reviews related work on 3D reconstruction and view synthesis; Section 3 describes our view synthesis method built on the view-interpolation → multi-view reconstruction → view interpolation pipeline; Section 4 presents the experimental results of our method.

2 Related work

This section reviews related research on view synthesis and on stereo depth estimation and reconstruction. Previous view synthesis work falls mainly into two categories: the first reconstructs and renders a 3D model, while the second generates novel views directly from the input views.

The first approach leverages 3D reconstruction and texture mapping to generate the new views [12][5][7]. For example, [12] uses an optical-flow based method to optimize the initial visual-hull model captured and computed by an indoor multi-camera capture system, and renders it with texture; it is designed for generating free viewpoints of a particular object. [5] targets more general scenes such as buildings and landscapes: it takes several photos shot with consumer-level cameras and reconstructs a model of the scene. [7] uses a fusion-based method to generate a 3D model of human motion with albedo and texture information; in such a system, once the human body is completely captured, free-viewpoint renderings of the person can be created. [15] uses a segment-based method to generate a 4D dynamic model of indoor scenes, which has the potential for rendering arbitrary views with texture information, but the use of texture is not implemented in that paper. These methods can generate free views of the scene, but often with holes in the background area and distortion artifacts of the 3D model [5], since some background areas of the input pictures have fewer correspondences, or they can only generate free views of particular objects or humans [12][7].

The second approach renders the novel view directly from the input pictures. Some view interpolation works use traditional ways of estimating novel views, while others rely on deep learning. Among the traditional methods, [3] is the first paper to propose image-based view synthesis, integrating image morphing into an interactive view interpolation pipeline. [20] uses a layered representation of the scene to produce continuous view changes within the viewing range of its 8 cameras. [1] uses a planar approximation of the foreground object and a 3D model of the background to handle view changes around a human in a large scenario. [16] uses the idea of a confidence volume to estimate the depth and then the confidence map of the 3D scene, and can generate excellent results on some scenes. [13] proposes a new view synthesis framework and makes good use of neighboring complementary views for hole filling in the view synthesis scenario. These methods either achieve good results only on relatively simple scenes with little occlusion [20][16], compared to complicated scenes like live boxing games, or generate a blurred background [1]. Our method is based on [16], and with semantic segmentation and 3D-reconstruction-based iterative optimization, we can handle view interpolation of more complicated scenes.

Deep learning based methods are mostly used for view interpolation of static scenes. For example, [11] uses a CNN to estimate the disparity map for 2x2 light field cameras, which is then used to warp source images into the novel views. [4] uses an end-to-end network to predict pixels in novel views: nearby views are used to estimate a color tower and a depth tower, and this information is then used to synthesize the novel view. [19] uses a layered representation to generate multi-plane images (MPIs), which contain both RGB and alpha information of the scene, and blends the MPI layers into novel views. [8] uses different views warped according to the estimated depth as view mosaics, and trains a network to learn the blending weights of neighboring views. [14] trains an MPI for each input view, using the existing views as ground truth, and then warps and blends nearby MPIs into the novel view. [17] also estimates an MPI representation of the scene, leveraging flow information, in order to perform view extrapolation. As deep learning based methods rely heavily on the diversity of the training data, and there are few multi-view datasets suitable for complex dynamic scenes, these methods mainly focus on static scenes. Our method uses semantic information and traditional 3D reconstruction, and can therefore tackle view synthesis for dynamic scenes.

3 Overview

Since [16] performs well on view interpolation for several kinds of scenes, we use its pipeline as the backbone and adapt it to our data. However, the method of [16] has some drawbacks: it only considers color and edge information, and even with consistency estimation across neighboring views it is hard to apply to complicated scenes where the objects occlude each other heavily and the baseline between views is wide and irregular. We therefore propose our method, based on [16] but with modifications to the algorithm: we use semantic segmentation to assign particular objects to specified layers, and use the view-interpolation → multi-view reconstruction → view interpolation iteration in suitable cases to leverage global 3D information.

Our algorithm includes three main stages and an optional fourth. Stage 1 is data segmentation and optimization, described in Section 3.1; Stage 2 is depth and confidence volume estimation, described in Section 3.2; Stage 3 is color blending and view interpolation, described in Section 3.3. Stage 4 is optional: the multi-view reconstruction → view interpolation iteration for optimizing the results, described in Section 3.4.


3.1 Segmentation

Figure 1: Choosing the corresponding segment for a particular layer.

The multi-view RGB data is the input to our algorithm. In the first step, we segment the RGB pictures into different layers, thus forming the multi-layer representation of the scene. The reasons for the segmentation are as follows:

• When the scene is complicated, with abundant occlusion, and especially when the focus of the scene (such as the players in a boxing game) occupies only a small part of the picture, traditional view interpolation methods may struggle with such data. With segmentation, the relationship between different layers, such as foreground and background objects, becomes clearer.

• Segmentation of the scene also allows us to use different volume resolutions in different layers, which makes the depth estimation of the foreground object finer and improves the quality of the synthesized foreground objects.

Our segmentation method is based on Mask R-CNN [9]: each image from every view in every frame is fed to the pre-trained network of [6], which produces the segmentation of layers. For each object or person we want to set as the $h$-th layer $\mathcal{L}_h$, we specify the corresponding segment in one view of one frame, for example the $m$-th segment of the 0-th view in the 0-th frame, denoted $\mathcal{S}_{0,0}^{m}$, and then use a rough depth estimation (described in Section 3.2) to automatically determine the corresponding segment of the same object in the other views and frames.

To be more specific, if the $m$-th segment is determined in view $i$ of the $j$-th frame as $\mathcal{S}_{i,j}^{m}$, and the segments generated by [6] for view $i+1$ are $\mathcal{S}_{i+1,j}^{m'}$, $m' \in \text{segments}$, we want to determine $m'$ for the object. First we estimate the rough depth as described in Section 3.2; then, for every point $(x,y)$ inside $\mathcal{S}_{i,j}^{m}$, we use the $i$-th camera parameters to map it into 3D space and project it into view $i+1$:

$$(x',y') = C_{i+1}\!\left(C_i^{-1}(x,y)\right) \qquad (1)$$

where $C_i$ and $C_{i+1}$ are the camera matrices of the two views. After every point inside $\mathcal{S}_{i,j}^{m}$ is projected into view $i+1$, we take the segment that contains the most projected points as the correct segment for layer $\mathcal{L}_h$. The process is shown in Figure 1.
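For concreteness, the following Python sketch illustrates the segment-propagation step of Eq. (1): foreground pixels of a segment are lifted into 3D with the rough depth, reprojected into the neighboring view, and the destination segment that receives the most votes is chosen. The helper names (backproject, project, propagate_segment) and the pinhole-camera conventions are illustrative assumptions, not part of the paper.

```python
# Minimal sketch of the segment-propagation step (Eq. 1), assuming numpy and
# simple pinhole cameras (K, R, t). All names are illustrative.
import numpy as np

def backproject(px, depth, K, R, t):
    """Lift pixels (N,2) with per-pixel depth (N,) from a view into world space."""
    ones = np.ones((px.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([px, ones]).T).T      # camera-frame rays
    cam_pts = rays * depth[:, None]                            # points in camera frame
    return (R.T @ (cam_pts - t).T).T                           # world coordinates

def project(world_pts, K, R, t):
    """Project world points (N,3) into another view, returning pixel coords (N,2)."""
    cam_pts = (R @ world_pts.T).T + t
    uv = (K @ cam_pts.T).T
    return uv[:, :2] / uv[:, 2:3]

def propagate_segment(mask_src, depth_src, cams_src, cams_dst, segments_dst):
    """Vote for the segment of the destination view that receives the most
    projected foreground pixels of the source segment (Eq. 1)."""
    ys, xs = np.nonzero(mask_src)
    px = np.stack([xs, ys], axis=1).astype(np.float64)
    world = backproject(px, depth_src[ys, xs], *cams_src)
    uv = np.round(project(world, *cams_dst)).astype(int)
    h, w = segments_dst.shape
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    labels = segments_dst[uv[valid, 1], uv[valid, 0]]
    labels = labels[labels >= 0]                               # ignore unlabeled pixels
    return np.bincount(labels).argmax() if labels.size else -1
```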


After each object to be selected as foreground is determined automatically, the basic segmentation is available for every view in every frame. The segments from Mask R-CNN [9] sometimes fail to capture the limbs of a person when that person is occluded by another, as shown in Fig. 2; we therefore use multi-view information to iteratively optimize the segments of all views together. For view $i$, we choose four nearby views plus view $i$ itself to estimate the segment-confidence map $\mathrm{Seg\text{-}conf}_i$:

$$\mathrm{Seg\text{-}conf}_i(x,y) = \mathrm{Seg}_i(x,y) + \sum_{j \in \mathcal{N}_i} \epsilon(\Delta d)\, \mathrm{Seg}_j\!\left(C_j(C_i^{-1}(x,y))\right) \qquad (2)$$

where $\mathcal{N}_i$ denotes the neighboring views of view $i$, $\mathrm{Seg}_i(x,y)$ is 1 or 0 according to whether $(x,y)$ belongs to the foreground region, and $\Delta d$ is the absolute difference between the depth estimated at the projected pixel and the depth computed in view $j$ from $C_i^{-1}(x,y)$. $\epsilon(\cdot)$ is a Gaussian function that gives more weight to more consistent depth estimates.

After the segment-confidence map $\mathrm{Seg\text{-}conf}_i$ is estimated, the segmentation of view $i$ is determined from it: when $\mathrm{Seg\text{-}conf}_i(x,y)$ is above a threshold $\epsilon_{seg}$, the point is regarded as foreground, otherwise it is classified as background.
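A minimal sketch of Eq. (2) follows, assuming the neighbor segmentations and their depth-difference maps have already been warped into view $i$ (for example with the projection helper sketched above); sigma_d and eps_seg are illustrative parameter names, not values given in the paper.

```python
# Minimal sketch of the segment-confidence map (Eq. 2).
import numpy as np

def segment_confidence(seg_i, warped_segs, depth_diffs, sigma_d=0.05, eps_seg=1.5):
    """seg_i:       (H,W) 0/1 foreground mask of view i
       warped_segs: list of (H,W) 0/1 neighbor masks warped into view i
       depth_diffs: list of (H,W) |depth difference| maps (Delta d in Eq. 2)
       Returns the refined 0/1 mask of view i."""
    conf = seg_i.astype(np.float64)
    for seg_j, dd in zip(warped_segs, depth_diffs):
        weight = np.exp(-(dd ** 2) / (2.0 * sigma_d ** 2))   # Gaussian epsilon(Delta d)
        conf += weight * seg_j
    return (conf > eps_seg).astype(np.uint8)
```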

Figure 2: Illustration of the role of using multi-view information to optimize the segmentation. (a)(b) Segmentation using Mask R-CNN [9]. (c) Segmentation after optimization.

3.2 Volumetric Estimation

The volumetric estimation includes two parts: depth estimation and confidence volume estimation. The basic idea follows [16], but with the multi-layer representation of the scene, the estimation exploits the multi-layer information and produces more robust results in complicated scenes.


As in [16], for each view in every frame we make plane sweeps and generate a discrete ray for every pixel, thus forming a volume $V(x,y,z)$, where $(x,y)$ are pixel coordinates and $z$ denotes depth. We estimate the depth estimation volume of every view as in [16]:

$$E_{raw}(x,y,z) = \frac{\sum_{k \in N} E_k(x,y,z)\,\mathcal{L}_k(x,y,z)}{\sum_{k \in N} \mathcal{L}_k(x,y,z)}, \qquad
E(x,y,z) = \sum_{(\hat{x},\hat{y}) \in W} w(x,y,\hat{x},\hat{y})\,E_{raw}(\hat{x},\hat{y},z) \qquad (3)$$

where $N$ is the set of views near the current view, $E_k(x,y,z)$ is the absolute color difference between the pixel and the point with coordinates $(x,y,z)$ projected into view $k$, and $\mathcal{L}_k(x,y,z)$ indicates whether $(x,y)$ in the current view belongs to the same layer as the projected point. $W$ is the support window of the guided filter at pixel $(x,y)$, and $w(x,y,\hat{x},\hat{y})$ is the guided filter kernel [10], which is set to 0 when $(x,y)$ and $(\hat{x},\hat{y})$ are labeled as different layers.

With the depth estimation volume computed, the raw depth image of each view is determined as follows:

$$D_{raw}(x,y) = \arg\min_z E(x,y,z) \qquad (4)$$
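The following sketch shows how Eqs. (3) and (4) can be evaluated once the per-neighbor cost volumes $E_k$ and layer-agreement volumes $\mathcal{L}_k$ have been built by the plane sweep; a uniform box filter stands in for the guided filter of [10], and all names and the window size are illustrative assumptions.

```python
# Minimal sketch of Eqs. (3)-(4): layer-masked cost aggregation and per-pixel
# argmin over the depth hypotheses. A box filter replaces the guided filter.
import numpy as np
from scipy.ndimage import uniform_filter

def depth_from_cost_volume(E_list, L_list, depths, window=9):
    """E_list, L_list: lists of (D,H,W) volumes, one per neighbor view
       depths:         (D,) candidate depth values of the plane sweep
       Returns the raw depth map D_raw (Eq. 4)."""
    num = sum(E * L for E, L in zip(E_list, L_list))
    den = sum(L for L in L_list)
    E_raw = num / np.maximum(den, 1e-6)                 # Eq. 3, first part
    # spatial aggregation per depth slice (box filter as guided-filter stand-in)
    E_filt = uniform_filter(E_raw, size=(1, window, window))
    best = np.argmin(E_filt, axis=0)                    # Eq. 4
    return depths[best]
```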

As the depth estimated this way is limited to pixel-level computation and lacks global information, we use the multi-view optical-flow based method of [12], which proposes the Multiple Starting Scales (MSS) framework and exploits a coarse-to-fine image pyramid to capture global patch information; this is used to improve the quality of the depth image:

$$Flow_d = \mathrm{DepthToFlow}(D_{raw}), \qquad
Flow_i = \mathrm{UpsampleAndOptimize}(Flow_{i+1}), \qquad
D = \mathrm{FlowToDepth}(Flow_0) \qquad (5)$$

where $Flow_i$ ($i = d-1, d-2, \ldots, 0$) denotes the optical flow at the $i$-th pyramid layer, and at each pyramid layer the flow is optimized by minimizing the optical-flow based energy function of [12]:

$$E(w) = E_D(w) + \alpha E_S(w)$$
$$E_D(w) = \int_{\Omega} \psi_D\!\left(|I_r(p + d(w)) - I_t(p)|^2 + \gamma\,|\nabla I_r(p + d(w)) - \nabla I_t(p)|^2\right) dx\,dy$$
$$E_S(w) = \int_{\Omega} \psi_S\!\left(|\nabla w|^2\right) dx\,dy \qquad (6)$$

where $E_D(w)$ and $E_S(w)$ denote the data term and the smoothness term of the energy function, $w$ is the flow, and $d(w)$ is the flow projected onto the epipolar line of the target image. $\psi_D$ and $\psi_S$ are the robust functions used in [2]. The finest layer of the pyramid is then converted to the depth image $D$.

After the depth image $D_i$ is computed for each view $i$, the surface consensus volume, adapted from [16], is estimated as follows:

$$Cons_{raw}(x,y,z) = \frac{\sum_{k \in N} w_k\, e^{-|D_k(x',y') - z'|}}{\sum_{k \in N} w_k}, \qquad
Cons(x,y,z) = \sum_{(\hat{x},\hat{y}) \in W} w(x,y,\hat{x},\hat{y})\, Cons_{raw}(\hat{x},\hat{y},z) \qquad (7)$$

where $N$ and $W$ are defined as in Eq. (3), and $(x',y',z')$ is the point $(x,y,z)$ transformed from view $i$ to view $k$ using the camera parameters. The weight

$$w_k = \left(|D_k(x',y') - z'| < \epsilon_{depth} \;\wedge\; \mathcal{L}_i(x,y) = \mathcal{L}_k(x',y')\right) \qquad (8)$$

indicates whether the point $(x',y',z')$ is visible in view $k$ and whether it carries the same layer label as the corresponding point in view $i$. The $Cons$ volume is also computed with the layer-labelled guided filter, as in Eq. (3).
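The sketch below evaluates Eqs. (7) and (8) under the assumption that the warped neighbor depths $D_k(x',y')$, the transformed depths $z'$, and the layer-agreement volumes are already available; eps_depth is an illustrative parameter name.

```python
# Minimal sketch of the raw surface-consensus volume (Eqs. 7-8), before the
# layer-labelled guided filtering step.
import numpy as np

def consensus_volume(Dk_list, zprime_list, same_layer_list, eps_depth=0.02):
    """Dk_list:         list of (D,H,W) neighbor depths sampled at (x',y')
       zprime_list:     list of (D,H,W) transformed depths z'
       same_layer_list: list of (D,H,W) boolean layer-agreement volumes
       Returns Cons_raw (Eq. 7, numerator/denominator over the neighbor views)."""
    num = np.zeros_like(Dk_list[0], dtype=np.float64)
    den = np.zeros_like(Dk_list[0], dtype=np.float64)
    for Dk, zp, same in zip(Dk_list, zprime_list, same_layer_list):
        diff = np.abs(Dk - zp)
        w_k = ((diff < eps_depth) & same).astype(np.float64)   # Eq. 8
        num += w_k * np.exp(-diff)
        den += w_k
    return num / np.maximum(den, 1e-6)
```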

To increase robustness, the above procedure is iterated: the surface consensus volume can in turn be used to improve the depth estimation, since it adds information about the consistency across views. We therefore recalculate the depth estimation volume as follows:

$$E_{raw}^{iter}(x,y,z) = \frac{\sum_{k \in N} E_k(x,y,z)\, Cons(x',y',z')\, \mathcal{L}_k(x,y,z)}{\sum_{k \in N} Cons(x',y',z')\, \mathcal{L}_k(x,y,z)}, \qquad
E(x,y,z) = \sum_{(\hat{x},\hat{y}) \in W} w(x,y,\hat{x},\hat{y})\, E_{raw}^{iter}(\hat{x},\hat{y},z) \qquad (9)$$

The remaining steps are then repeated, iteratively improving the quality of the depth image and, in turn, the robustness of the surface consensus volume.
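The consensus-reweighted aggregation of Eq. (9) then reduces to a small change in the cost accumulation, sketched below with the same array conventions as the Eq. (3) example; Cons_list holds the consensus values $Cons(x',y',z')$ sampled per neighbor view.

```python
# Minimal sketch of the consensus-reweighted raw cost volume (Eq. 9);
# the result is then guided-filtered as in Eq. (3).
import numpy as np

def reweighted_cost_volume(E_list, Cons_list, L_list):
    num = sum(E * C * L for E, C, L in zip(E_list, Cons_list, L_list))
    den = sum(C * L for C, L in zip(Cons_list, L_list))
    return num / np.maximum(den, 1e-6)   # E_raw^iter
```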

3.3 View Interpolation

For a novel view in which we want to synthesize the scene, we use the 4 nearby views to interpolate the final image. To account for both visibility and the confidence volume, we generate the scene representation volume of view $i$ as follows:

$$scene_i(x,y,z) = \max\!\left(0,\; \min\!\left(Cons_i(x,y,z),\; 1 - \sum_{\hat{z} < z} Cons_i(x,y,\hat{z})\right)\right) \qquad (10)$$
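Eq. (10) amounts to clipping each voxel's consensus by the consensus already accumulated in front of it along the ray, which approximates visibility. A short sketch follows, assuming depth index 0 is nearest to the camera.

```python
# Minimal sketch of the scene-representation volume (Eq. 10).
import numpy as np

def scene_volume(cons):
    """cons: (D,H,W) consensus volume of one input view, depth index 0 = nearest."""
    occluding = np.cumsum(cons, axis=0) - cons          # sum over z_hat < z
    return np.clip(np.minimum(cons, 1.0 - occluding), 0.0, None)
```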

The scene representation and consensus volumes of the nearby views can then be used to generate the color volume and the consensus volume of the novel view $N$ for layer $k$:

$$color_{N,k}(x,y,z) = \frac{\displaystyle\sum_{i \in \mathcal{N}_N,\; \mathcal{L}_i(x_i',y_i') = k} scene_i(x_i',y_i',z_i')\, I_i(x_i',y_i')}{\displaystyle\sum_{i \in \mathcal{N}_N,\; \mathcal{L}_i(x_i',y_i') = k} scene_i(x_i',y_i',z_i')}, \qquad
Cons_{N,k}(x,y,z) = \sum_{i \in \mathcal{N}_N,\; \mathcal{L}_i(x_i',y_i') = k} Cons_i(x_i',y_i',z_i') \qquad (11)$$

where $\mathcal{N}_N$ is the set of views near view $N$, and $\mathcal{L}_i(x_i',y_i') = k$ denotes that pixel $(x_i',y_i')$ in the $i$-th image is labeled with semantic layer $k$. Each pixel $(x,y)$ of the novel view, for each layer $\mathcal{L}_k$, thus receives a color value and a confidence value that exploit the segmentation information of Section 3.1. The advantage is that the color blending is less likely to mix colors of different objects in the scene, which benefits view interpolation of complicated scenes.

We then compute the label volume and the final consensus volume as follows:

$$Cons_N(x,y,z) = \max_k Cons_{N,k}(x,y,z), \qquad
\mathcal{LV}_N(x,y,z) = \arg\max_k Cons_{N,k}(x,y,z) \qquad (12)$$

To synthesize the novel view, we search over $z$ for each pixel $(x,y)$ to find the best-matched depth. To account for occlusion, the search range of $z$ is given by:

$$\mathcal{R}_N(x,y) = \left\{ z \;\middle|\; \sum_{\hat{z} < z} Cons_N(x,y,\hat{z}) < \epsilon_{cons} \right\} \qquad (13)$$

We first mark the layer label of each pixel $(x,y)$ using the search range computed above:

$$\mathcal{L}_N(x,y) = \mathcal{LV}_N\!\left(x,y,\; \arg\max_{z \in \mathcal{R}_N(x,y)} Cons_N(x,y,z)\right) \qquad (14)$$


As there may be salt-and-pepper noise in the layer labels, we propagate the label value to a pixel if most of its neighbors have a consistent label. After that, we can finally compute the RGB value of that pixel:

$$RGB_N(x,y) = color_{N,\mathcal{L}_N(x,y)}(x,y,z_0), \qquad
z_0 = \arg\max_{z \in \mathcal{R}_N(x,y)} Cons_{N,\mathcal{L}_N(x,y)}(x,y,z) \qquad (15)$$
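A compact sketch of the novel-view composition in Eqs. (12)-(15) is given below, assuming the per-layer color and consensus volumes of Eq. (11) are already available; eps_cons and the array layout are illustrative assumptions, and the salt-and-pepper label smoothing is omitted.

```python
# Minimal sketch of Eqs. (12)-(15): layer selection, occlusion-aware depth
# search, and final color lookup. Shapes: cons_k (K,D,H,W), color_k (K,D,H,W,3),
# depth index 0 = nearest to the novel view.
import numpy as np

def compose_novel_view(cons_k, color_k, eps_cons=1.0):
    cons_n = cons_k.max(axis=0)                                  # Cons_N, Eq. 12
    label_vol = cons_k.argmax(axis=0)                            # LV_N,  Eq. 12
    # occlusion-aware search range: consensus accumulated in front must stay small
    in_range = (np.cumsum(cons_n, axis=0) - cons_n) < eps_cons   # Eq. 13
    masked = np.where(in_range, cons_n, -np.inf)
    z_best = masked.argmax(axis=0)                               # best z inside R_N
    yy, xx = np.indices(z_best.shape)
    labels = label_vol[z_best, yy, xx]                           # L_N, Eq. 14
    # pick z maximizing the consensus of the chosen layer inside the range (Eq. 15)
    cons_sel = np.take_along_axis(cons_k, labels[None, None], axis=0)[0]
    z0 = np.where(in_range, cons_sel, -np.inf).argmax(axis=0)
    rgb = color_k[labels, z0, yy, xx]                            # RGB_N, Eq. 15
    return rgb, labels
```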

Figure 3: Depth estimation before and after optimization from reconstruction. (a) Input image. (b) Depth estimation without optimization. (c) Depth estimation with optimization.

3.4 Optimizing from Reconstruction

To improve the robustness of the algorithm, we propose the view-interpolation → multi-view reconstruction → view interpolation pipeline to iteratively optimize the result, especially when the major part of the scene is static. We first generate the view interpolation result of the scene, then take the output images and use the multi-view reconstruction method [5] to generate a 3D model, which is in turn used to improve the depth estimation in our pipeline. As shown in Fig. 3, after optimization from multi-view reconstruction we obtain a better estimated depth of the scene with fewer outliers, which benefits our final view synthesis results.

In practice, since such 3D reconstruction is very time-consuming, we reconstruct only the static background, using the segmentation computed earlier. We then project the generated 3D model into each camera to obtain a background depth image, and use this depth image to replace the depth estimated in the first step of Section 3.2 in every frame. The reason for using 3D reconstruction to optimize the results is that it tends to eliminate outliers of the 3D scene, giving a more robust depth initialization for our pipeline. Moreover, the 3D reconstruction exploits global information rather than only local information, which complements our current pipeline.

One more point about this pipeline: we do not use the original input images as the 3D reconstruction input. The reason is that, due to the wide baseline between the images, the method of [5] tends to generate a scene with large holes. To illustrate this, we use one of the datasets of [15], which contains 6 static camera views, to generate 40 output views, and then feed respectively the 6 input images and the 40 output images to MVE; the resulting 3D models are shown in Figure 4. The MVE model generated from the output views clearly has better quality than the other one.


Figure 4: MVE reconstruction results from different inputs.

After the model is generated and projected into each input view, the depth image is acquired and can be used to iteratively optimize the results by feeding it back into the first step.

4 Results

In this section we show results of generating novel views from multi-view input data. As our work mainly focuses on complicated scenes, which have hardly been used for view interpolation before, comparisons with previous work are limited. The following experiments are conducted on three datasets: the boxing dataset contains a scene of 500 frames captured from 32 views; the dataset of [12] contains a 20-view scene inside a green-background "cage"; and the dataset of [15] has just 6 views of a girl inside a room.


Figure 5: Generated novel views of the boxing data.

Fig. 5 shows the multi-view boxing data and the corresponding view generation results. The first and third columns are nearby input views of the multi-view data, and the second column is the generated view. The scene is complicated: there is occlusion between the players on the court, and between the boxing ring, the audience, and the referee, and there are many people and objects in the input pictures. Fig. 5 shows that we can handle this kind of scene and generate good results.


Figure 6: View interpolation results on data from [12]. The top row shows the two nearby input views; the other four pictures are generated novel views. Note that the nearby cameras have a relatively small baseline compared to the other datasets; the video demo demonstrates this result more clearly.

We also generate views from the input data of [12], which contains multi-view data of a single person inside a "cage" surrounded by a green screen. With this setup the human can easily be segmented, and we generate views of the human only. As shown in Fig. 6, we can generate high-quality free-viewpoint images of the human.


Figure 7: Generated novel views of the dance data from [15].

The data of [15] is also used to test our view interpolation. In this data the scene is static except for the human, which is well suited to our view-interpolation → multi-view reconstruction → view interpolation pipeline. We first use [6] to segment the human in the scene, then use only the background part to generate 40 views between the 6 input views with the method described in Section 3, and then use the 40 output views as input to generate a 3D model with [5]. The generated model is shown in Figure 4. The model is then projected into each input view to obtain a more robust depth estimate of the background, which replaces the depth estimated in the first step. The result is shown in Figure 7.

Figure 8 compares this pipeline against variants that omit the image segmentation or the reconstruction-based optimization on this data. Without image segmentation, the inaccurate depth estimation of the foreground object produces artifacts around the boundary of the human. Without optimization from reconstruction, artifacts appear in the background area because of the complicated occlusion between objects and the less precise depth estimation obtained without such optimization.

We compare our method with [16]. As shown in Fig. 9, [16] generates blurred results, particularly around the boundaries of the humans, whereas our method generates clearer boundaries thanks to our semantic segmentation strategy. [16] also cannot estimate accurate depth, especially in the area between the two persons, but with our view-interpolation → multi-view reconstruction → view interpolation pipeline and semantic layer representation we can avoid the ghosting effects on the floor area between the two boxers. We also compare our method with [14]. As shown in Fig. 9, due to the complexity of our scene, [14] cannot generate clear view interpolation results, as it mainly targets simple scenarios with few objects; our method, in contrast, can handle plausible view interpolation of this kind of scene.

In short, we propose a view interpolation method leveraging both volumetric information and semantic information of the input images, and use the view-interpolation → multi-view reconstruction → view interpolation pipeline to add robustness to view interpolation of dynamic scenes. Our work still has limitations: scenes without clear semantic layers may not yield plausible results with our pipeline, and the quality of the result relies heavily on the quality of the segmentation method; if some body parts (hands, legs) are not accurately segmented, the result may show blurring in these areas. Moreover, as our pipeline mainly targets dynamic scenes and few corresponding methods are open source, it is difficult to make comparisons with current view synthesis works.

As future work, a more robust depth estimation could be integrated into our pipeline, and 3D deep learning approaches for volumetric estimation could be explored. The temporal continuity of the interpolated scene between frames could also be taken into account.

Figure 8: Illustration of the role of segmentation and of the view-interpolation → multi-view reconstruction → view interpolation pipeline. (a) Result without the image segmentation procedure. (b) Result without optimization from reconstruction. (c) Result with optimization.


Figure 9: Comparison between our method, Soft3D [16], and local light field fusion [14]. (a) Result generated by our method. (b) Result generated by [16]. (c) Result generated by [14].

Acknowledgments

The authors would like to thank Tsinghua University, Tianjin University and Duke University for supporting the research.

References

[1] Luca Ballan, Gabriel J. Brostow, Jens Puwein, and Marc Pollefeys. Unstructured video-based rendering: Interactive exploration of casually captured videos. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2010), pages 1–11, July 2010.
[2] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In Tomáš Pajdla and Jiří Matas, editors, European Conference on Computer Vision (ECCV), May 2004.
[3] Shenchang Eric Chen and Lance Williams. View interpolation for image synthesis. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '93, pages 279–288, New York, NY, USA, 1993. ACM.
[4] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. DeepStereo: Learning to predict new views from the world's imagery. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[5] Simon Fuhrmann, Fabian Langguth, and Michael Goesele. MVE – a multi-view reconstruction environment. In European Conference on Computer Vision (ECCV), September 2014.
[6] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[7] Kaiwen Guo, Feng Xu, Tao Yu, Xiaoyang Liu, Qionghai Dai, and Yebin Liu. Real-time geometry, albedo and motion reconstruction using a single RGBD camera. ACM Transactions on Graphics (TOG), 2017.
[8] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (SIGGRAPH Asia Conference Proceedings), 37(6), November 2018.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[10] Kaiming He, Jian Sun, and Xiaoou Tang. Guided image filtering. In European Conference on Computer Vision (ECCV), September 2010.
[11] Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. Learning-based view synthesis for light field cameras. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2016), 35(6), 2016.
[12] Yebin Liu, Xun Cao, Qionghai Dai, and Wenli Xu. Continuous depth estimation for multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009.
[13] S. Li, C. Zhu, and M. Sun. Hole filling with multiple reference views in DIBR view synthesis. IEEE Transactions on Multimedia, 20(8):1948–1959, August 2018.
[14] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.
[15] Armin Mustafa, Hansung Kim, Jean-Yves Guillemaut, and Adrian Hilton. Temporally coherent 4D reconstruction of complex dynamic scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[16] Eric Penner and Li Zhang. Soft 3D reconstruction for view synthesis. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6), 2017.
[17] Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[18] Shui-Hua Wang, Junding Sun, Preetha Phillips, Guihu Zhao, and Yu-Dong Zhang. Polarimetric synthetic aperture radar image segmentation by convolutional neural network using graphical processing units. Journal of Real-Time Image Processing, 15(3):631–642, October 2018.
[19] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In SIGGRAPH, 2018.
[20] C. Lawrence Zitnick, Sing Bing Kang, Matt Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. In ACM SIGGRAPH, volume 23, pages 600–608. Association for Computing Machinery, Inc., August 2004.