
2013 6th International Congress on Image and Signal Processing (CISP 2013)

Construction and Compression of Face Models for Multi-Party Videoconferencing With Multi-Camera

Xiangran Sun
School of Information Engineering
Communication University of China, Beijing, China
[email protected]

Yunyun Wei¹, Suiyu Zhang², Jiaying Pan²
¹School of Animation and Digital Arts, ²School of Information Engineering
Communication University of China, Beijing, China

Abstract—Multi-party videoconferencing systems are being used more and more, but they still face several challenges and problems. In this paper we present a novel method and framework for creating face models and reconstructing the gaze and face direction, which makes the user experience of multi-party videoconferencing more immersive and interactive. We use multiple cameras and one display screen as the user terminal, enabling face modeling, gaze tracking, and view synthesis. The proposed method compresses the transferred data and reduces the required network bandwidth. Moreover, it improves the effectiveness of the video content and lowers the hardware requirements of the system.

Keywords-multi-party videoconferencing; gaze tracking; gaze and face direction; view synthesis

I. INTRODUCTION

Videoconferencing systems for face-to-face video communication are increasingly used for natural contact over networks. The general setup of a typical desktop videoconferencing system faces several problems, such as reduced effectiveness of the video content and reduced efficiency of the conversation [1]. Researchers and engineers have made great efforts to develop advanced techniques for multi-party videoconferencing, and recent developments enable users to communicate and work with immersive and perceptually realistic media. A typical desktop face-to-face videoconferencing setup commonly uses only a single camera placed on top of the display screen and one screen facing the user. When the system is used for a multi-party conversation, the display screen must show the different participants at the same time and is therefore divided into several parts.

Although videoconferencing has improved considerably in recent years, it still faces many problems, such as hardware requirements, network bandwidth, and the lack of eye contact, which deserve more attention. In conversation, humans are highly sensitive to the gaze of others. Gaze directed at their eyes reveals that a person is looking at them and signals eye contact and attention. Eye contact improves the feeling of a natural conversation and the effectiveness of communication. When talking around a shared table in an ordinary conference, participants can tell when someone facing them is looking at their eyes. Head orientation, like the gaze direction, reveals whom a person is looking at. Depending on the angle of the gaze and face direction, humans can judge whom a person is talking to and looking at in a conference. However, most videoconferencing systems are not good at providing a friendly interface and preserving these gaze and face directional cues. One reason is that each person has only one camera, typically placed above the eyes of the user on the screen. Because of this parallax, the eye gaze appears lower than it actually is. In conversation a participant generally looks at the image on the screen rather than straight into the camera, so he cannot make eye contact with the remote party. The other reason is that a single camera cannot create a complete 3D face model or provide face views from different angles to the other participants. It is therefore very difficult for the participants to understand whom a given participant is looking at and talking to. These problems are directly caused by the lack of information about the gaze and face direction of the participants, information that can indicate with great accuracy who is talking or listening to whom. The lack of such information strongly affects the management of a multi-party conversation. We therefore propose a novel framework and hardware setup, consisting of three cameras and one display screen, which can construct the face model, display multiple participants, and provide a corrected gaze and face view.

Determining the gaze and face direction is one of the most important issues for an immersive multi-party videoconferencing system. Although much conventional work has been done, these studies have not yet achieved efficient reconstruction of virtual gaze and face views from different angles. Because some existing systems rely on special hardware, their expensive cost and bulky setup prevent them from being used in a ubiquitous way [2] [3]. The video streams are too large to be transferred over a multi-party videoconferencing network, especially from the three cameras of every terminal. A participant therefore cannot transmit the three camera images directly to the other terminals, nor can the other terminals receive all of that data. In order to reduce the transmitted signal and compress the data, we construct the user's 3D face model and determine whom he is focusing on, and only these data are transferred to the other terminals. The computer of each participant then reconstructs and displays the gaze and face view of the other participants.
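To make the bandwidth saving concrete, the following is a minimal sketch of what a per-frame payload transferred between terminals could look like, assuming a simple serialization; the class and field names (FacePayload, gaze_target, and so on) are illustrative, not the paper's actual wire format.

```python
# Hypothetical compact per-frame payload: a sparse face mesh plus gaze
# information, instead of three raw 640x480 video frames per terminal.
from dataclasses import dataclass
import numpy as np

@dataclass
class FacePayload:                 # illustrative container, not the paper's format
    vertices: np.ndarray           # (N, 3) float32, sparse 3D face mesh
    key_points: np.ndarray         # (5, 3) float32, eye/nose/mouth markers
    gaze_target: int               # index of the participant being looked at
    head_rotation_deg: float       # one of 0, +/-15, +/-30

    def nbytes(self) -> int:
        return self.vertices.nbytes + self.key_points.nbytes + 8

if __name__ == "__main__":
    payload = FacePayload(
        vertices=np.zeros((300, 3), dtype=np.float32),
        key_points=np.zeros((5, 3), dtype=np.float32),
        gaze_target=2,
        head_rotation_deg=15.0,
    )
    raw_frames = 3 * 640 * 480 * 3          # three uncompressed RGB frames
    print(f"model payload: {payload.nbytes()} B vs raw frames: {raw_frames} B")
```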


To solve these problems and increase the effectiveness of multi-party videoconferencing, we propose a novel approach that produces gaze and face views from different angles using computer vision and computer graphics methods. It is based on the fact that if we can obtain a 3D description of the scene using three cameras, we can construct an image seen from any desired viewpoint. The approach consists of two procedures: the local terminal processes the face model, and the remote terminal reconstructs the gaze and face view. We further divide the approach into four steps: face modeling, gaze tracking, determination of the gaze and face direction, and gaze and face view synthesis.

In this paper, we design an immersive multi-party videoconferencing system consisting of three cameras and one display screen and propose a method built from two procedures running on the local and remote terminals. The main contributions of our work are a compact transferred data structure, a practical gaze tracking solution, and a simplified view synthesis method, which together improve the immersive and interactive experience and reduce network bandwidth usage in multi-party videoconferencing.

The rest of this paper is organized as follows. Section II briefly reviews the architecture of the multi-party videoconferencing system and the software processing approach. Section III presents the processing method in detail. Experimental results and conclusions are presented in Section IV.

II. AN OVERVIEW OF SYSTEM AND ARCHITECTURE

Fig. 1 outlines our proposed multi-party videoconferencing system. We use three identical cameras, placed at the bottom left, bottom right, and top of the display screen, and numbered the first, second, and third camera respectively. The three cameras are rigidly fixed to the screen so that they form a triangle, and they face the user from different angles. The three cameras are connected to a computer, and together these devices form a terminal.

Figure 1. Specification of our proposed multi-party videoconferencing system

Furthermore, during the conversation the user should be positioned as close as possible to the display. The user should also be located in the middle of the triangle formed by the three cameras, so that the cameras can capture the user's face well. We choose the triangular setup shown in Fig. 1 because it provides pictures from different angles, wider coverage of the subject, and better disambiguation in feature matching. The ambiguity in matching symmetric facial features is resolved using the two bottom cameras.

Fig. 2 illustrates the block diagram of our approach in the multi-party videoconferencing system. The overall processing consists of two procedures, local terminal processing and remote terminal processing, which together comprise five major steps: face region recognition, face modeling, gaze tracking, determination of the gaze and face direction, and gaze and face view synthesis.

Figure 2. Overall scheme of our proposed approach

In order to reduce the data volume and computation time, we down-sample the captured images to a resolution of 640x480. In the first step, we take the three images of the scene from three different angles and extract and match edges using an improved trinocular edge matching technique based on the method in [4]. Matching ambiguity usually involves symmetric facial features, such as eyes and lip contours, that are horizontally aligned in the processed images. A simplified individual face model is then constructed with a rapid face modeling tool [5]. We simplify gaze tracking by estimating the gaze direction from the change in the shape and position of the eyes in an input image together with the face model. The gaze and face correction is then calculated from the layout of the participants on the display screen. In the last step, the view synthesis algorithm is designed and performed, and the corrected gaze and face image is finally displayed on the screen of each user's terminal.
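As a rough illustration of this two-procedure split, the sketch below down-samples the three captured frames and routes them through the five stages described above; only the down-sampling step is concrete, and the stage functions are stubs standing in for the methods of Section III, not the authors' implementation.

```python
# Skeleton of the local/remote processing split described above.
import cv2

TARGET_SIZE = (640, 480)  # width, height used in the paper

def process_local(frames):
    """frames: three captured images (bottom-left, bottom-right, top cameras)."""
    small = [cv2.resize(f, TARGET_SIZE, interpolation=cv2.INTER_AREA) for f in frames]
    face_region = recognize_face_region(small)            # step 1 (stub)
    face_model = build_face_model(small, face_region)     # step 2 (stub)
    gaze = track_gaze(small, face_model)                  # step 3 (stub)
    return face_model, gaze                               # compact data sent to peers

def process_remote(peer_payloads, layout):
    """Reconstruct each remote participant's corrected view at its screen slot."""
    views = {}
    for peer_id, (face_model, gaze) in peer_payloads.items():
        direction = correct_direction(gaze, layout, peer_id)      # step 4 (stub)
        views[peer_id] = synthesize_view(face_model, direction)   # step 5 (stub)
    return views

# Minimal stubs so the skeleton is importable; real implementations follow Sec. III.
def recognize_face_region(frames): return None
def build_face_model(frames, region): return {}
def track_gaze(frames, model): return {"target": None, "yaw_deg": 0.0}
def correct_direction(gaze, layout, peer_id): return 0.0
def synthesize_view(model, direction): return None
```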


III. PROCESSING OF THE MULTI-PARTY VIDEOCONFERENCING

A. Face Region Recognition

In the first step, we determine the face region in each image from the three cameras. The three down-sampled images are taken in a normal room by the three cameras from different viewing angles (see Fig. 3). We can distinguish at least three major kinds of objects shared across the three views: the background, the head, and other parts of the body such as the shoulders. Unless we separate them successfully, it is very difficult to construct an accurate facial model. Since the cameras are fixed and aimed at the head, we can remove the background by subtracting one image from the other two. However, because the face position and color change smoothly, a portion of the face may be marked as background. Another problem with this image subtraction technique is that the moving head and body are hard to distinguish from each other.
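A minimal sketch of this inter-view subtraction idea follows, assuming the three down-sampled frames are roughly aligned on the subject; the threshold value and morphological clean-up are illustrative choices, not taken from the paper.

```python
# Rough foreground (head/body) mask by subtracting one view from the other two.
# Thresholds and clean-up are illustrative, not the paper's parameters.
import cv2

def foreground_mask(view0, view1, view2, thresh=30):
    """Return a binary mask of pixels that differ between view0 and the other views."""
    g0, g1, g2 = (cv2.cvtColor(v, cv2.COLOR_BGR2GRAY) for v in (view0, view1, view2))
    diff = cv2.max(cv2.absdiff(g0, g1), cv2.absdiff(g0, g2))
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    # small morphological closing to fill holes where the face colour changes smoothly
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```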

We use a trinocular edge matching technique based on dynamic programming and edge string ambiguity resolution to extract and match edges, which allows us to determine the face region quickly and accurately. An advantage of this method is that it does not require camera calibration as a prerequisite: the information needed for epipolar edge matching can be recovered from the images themselves, as described in [6]. Starting from the three down-sampled images of the scene, arranged roughly at the corners of a triangle, we extend the edge matching method to the uncalibrated trinocular case. The method consists of three parts: estimating the epipolar geometry for each pair of cameras, trinocular matching of edge pixels, and matching connected strings of edge pixels.
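The first of these three parts, pairwise epipolar geometry estimation without explicit calibration, could be sketched as follows using RANSAC-based fundamental matrix estimation over matched edge points; the use of OpenCV here is an assumption for illustration, not the authors' implementation.

```python
# Sketch of the uncalibrated pairwise epipolar geometry step: estimate a
# fundamental matrix for each camera pair from matched edge points (RANSAC).
import cv2

def pairwise_epipolar_geometry(matched_points):
    """matched_points[(i, j)] -> (pts_i, pts_j): two (N, 2) float32 arrays of
    corresponding edge points between camera i and camera j (N >= 8)."""
    fundamentals = {}
    for (i, j), (pts_i, pts_j) in matched_points.items():
        F, inliers = cv2.findFundamentalMat(pts_i, pts_j, cv2.FM_RANSAC, 1.0, 0.99)
        fundamentals[(i, j)] = (F, inliers)   # F maps points in view i to epipolar lines in view j
    return fundamentals
```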

Figure 3. Face modeling based on Delaunay triangulation over matched points

B. Face Modeling

In the facial shape modeling step, the facial model must not only be constructed quickly but also compressed as much as possible. We present methods to find the correction angle and the facial shape model used in the rendering step to generate the gaze-corrected image. This provides the data needed to reconstruct the gaze and face view while avoiding the transfer of the whole face model and all three images. After the face region is recognized, we use a rapid face modeling tool [5] to compute the 3D face geometry from the three images. Because the face region is known accurately, the face modeling process is both improved and accelerated.

We create the 3D face model from the three images using the rapid face modeling tool. The user then locates five markers in each of the images (see Fig. 3). The five markers, corresponding to the two outer eye corners, the nose top, and the two mouth corners, are also identified on the face model. The video sequences can be tracked to build a complete facial texture map by blending frames of the sequence. We observe that even though it is difficult to extract dense 3D facial geometry from the images, it is possible to match a sparse set of corners and use them to compute the head motion and the 3D locations of these corner points. We then fit a linear class of human face geometries to this sparse set of reconstructed corners to generate the complete face geometry. Linear classes of face geometry and image prototypes have previously been used to construct 3D face models from images in a morphable model framework.
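As a rough illustration of this final fitting step, the sketch below solves for the coefficients of a linear class of face geometries from the sparse reconstructed corner points by least squares; the basis shapes and corner-to-vertex correspondences are assumed to be given, and the regularization term is an illustrative addition rather than part of the paper's method.

```python
# Sketch: fit a linear class of face geometries to sparse reconstructed 3D corners.
import numpy as np

def fit_linear_face_model(mean_shape, basis, vertex_ids, corners_3d, reg=1e-3):
    """
    mean_shape: (V, 3) mean face geometry
    basis:      (K, V, 3) deformation basis of the linear face class
    vertex_ids: indices of model vertices matched to the sparse corners
    corners_3d: (M, 3) reconstructed 3D corner positions
    Returns (coeffs, full_geometry).
    """
    A = basis[:, vertex_ids, :].reshape(len(basis), -1).T      # (3M, K)
    b = (corners_3d - mean_shape[vertex_ids]).reshape(-1)      # (3M,)
    K = A.shape[1]
    # regularized least squares: (A^T A + reg I) x = A^T b
    coeffs = np.linalg.solve(A.T @ A + reg * np.eye(K), A.T @ b)
    full_geometry = mean_shape + np.tensordot(coeffs, basis, axes=1)
    return coeffs, full_geometry
```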

C. Gaze Tracking

While the user is talking to another participant in the multi-party videoconferencing system, the user's eyes should focus on that person's eyes and his face should be oriented toward that person on the screen. Because it is essential to identify which part of the screen the user's gaze is directed at, we need an accurate model of the user's face in 3D space. After the face model is constructed and the five key points are determined, we use them to compute an approximation of the gaze and face orientation.

A geometric model is defined by the gaze and face direction vector and the face plane, as shown in Fig. 4. We choose four of the five key points, corresponding to the two inner eye corners and the two mouth corners, and map them to the face plane in 3D space. The direction vector is the normal of this plane passing through the nose point. The origin is located at the nose point of the face model, the Z axis points toward the screen, and the Y axis points up. The angle between the direction vector and the X-Y plane indicates approximately which part of the screen the user is focusing on.

Figure 4. Direction vector determined by the face model
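A small numerical sketch of this construction, assuming the key points are already expressed in the face-model coordinate frame (Z toward the screen, Y up), could look like the following; fitting the plane by SVD and reporting yaw/pitch angles are reasonable choices made for illustration, not necessarily the authors' exact formulation.

```python
# Sketch: fit a plane to the two inner eye corners and two mouth corners,
# take its normal through the nose point as the gaze/face direction vector,
# and report its horizontal and vertical angles.
import numpy as np

def gaze_direction(eye_l, eye_r, mouth_l, mouth_r, nose):
    pts = np.array([eye_l, eye_r, mouth_l, mouth_r], dtype=float)
    centered = pts - pts.mean(axis=0)
    # the plane normal is the right singular vector with the smallest singular value
    _, _, vt = np.linalg.svd(centered)
    normal = vt[-1]
    if normal[2] < 0:                       # make it point toward the screen (+Z)
        normal = -normal
    normal /= np.linalg.norm(normal)
    yaw = np.degrees(np.arctan2(normal[0], normal[2]))     # left/right rotation
    pitch = np.degrees(np.arctan2(normal[1], normal[2]))   # up/down rotation
    return np.asarray(nose, dtype=float), normal, yaw, pitch
```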


D. Determination of Gaze and Face Direction

Immersive videoconferencing should make the participants feel that they are seated symmetrically around a shared table, with each conferee appearing on a certain part of the screen. The symmetric geometry of this setup supports eye contact, gaze awareness, and gesture reproduction, so everyone in the conversation can observe, under the correct perspective, who is talking to whom or who is pointing at what. For example, if the person in front of the terminal in Fig. 1 talks to the person on the left while gesturing toward the person on the right, this third person can easily recognize that the other two are talking about him. First, a seamless transition between the real working desk in front of the display and the virtual conference table on the screen gives the user the impression of being part of an extended perception space. Second, the remote participants are rendered seamlessly and under the correct perspective into the virtual conference scene.

The basic idea of our system concept is to place 3D video reproductions of a given number of participants at predefined positions in a shared virtual environment. For this purpose the conferees are captured at each terminal by the multi-camera setup, and the desired 3D video representation of the local conferee is extracted from the multi-view images (see Fig. 4). The 3D video objects of all conferees are then grouped virtually around the shared table. In the four-party case shown in Fig. 1, there are five poses of the head direction for each person, corresponding to the different positions. We reproduce the gaze and face correction corresponding to head rotations of 0, +/-15 and +/-30 degrees, as shown in Fig. 5. Based on the result of gaze tracking we determine who is talking to whom and who is focusing on whom, from which the head rotation and the gaze and face direction of every person are known.
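The mapping from the tracked gaze angle to one of these five discrete head rotations could be as simple as the following sketch; the nearest-pose rule is an assumption made for illustration.

```python
# Sketch: snap the tracked horizontal gaze angle (degrees) to the nearest of the
# five reproduced head rotations and return the corresponding pose index.
POSES_DEG = (-30.0, -15.0, 0.0, 15.0, 30.0)   # head rotations reproduced by the system

def nearest_pose(gaze_yaw_deg: float):
    """Return (pose index, pose angle) closest to the tracked yaw angle."""
    idx = min(range(len(POSES_DEG)), key=lambda i: abs(POSES_DEG[i] - gaze_yaw_deg))
    return idx, POSES_DEG[idx]

# e.g. nearest_pose(11.0) -> (3, 15.0): the user is treated as facing the
# participant displayed at the +15 degree position on the screen.
```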

E. Gaze and Face View Synthesis

Finally, we synthesize the virtual view from the face model and the camera images at each participant's terminal. On one hand, the corrected gaze and face direction and the constructed face model provide sufficient information to synthesize the novel virtual view of a participant; on the other hand, the amount of information required is small, which keeps the system requirements low. There are two critical issues, real-time implementation and occlusion handling, that must be considered in a practical design. In this paper, we improve the view synthesis method based on an efficient synthesis algorithm developed in [7], which is able to take these issues into account. Fig. 5 illustrates a synthesis result showing the good quality of the synthetic view. Through the preceding modeling, tracking, and correction steps, we obtain a set of points, lines, and images of the participant's face that can be used to synthesize the new virtual view.

We present a view synthesis algorithm that combines two methods: one based on view morphing and the other on multi-texture blending. The view morphing technique synthesizes virtual views along the paths connecting the optical centers of the three cameras. The view morphing factors $c_m$ and $d_m$ control the exact view position and usually lie between 0 and 1. When $c_m = 0$ the view corresponds exactly to the first camera, and when $c_m = 1$ it corresponds to the second camera; likewise, $d_m = 0$ corresponds to the third camera view. Any value in between represents a virtual viewpoint somewhere along the path from one camera to another. The multi-texture rendering method creates a 2D triangular mesh using Delaunay triangulation in the first camera's image space. Following the basic method, we process the three images and offset each vertex's coordinate by its disparity modulated by the view morphing factors $c_m$ and $d_m$. The offset mesh is fed to the renderer with three sets of texture coordinates, one for each camera image. Since all the images and the mesh are located in the rectified coordinate space, the viewing matrix must be set to the inverse of the rectification matrix so that the resulting image is generated in its normal view position; this is equivalent to the post-warp in view morphing. If powerful graphics hardware is available, the blending scheme can be elaborate. The weights $W_i$ and $V_i$ are based on the product of the total area of adjacent triangles and the view-morphing factors, where $W_i$ is the horizontal factor and $V_i$ the vertical factor:

$$W_i = \frac{(1 - c_m)\sum S_i^1}{(1 - c_m)\sum S_i^1 + c_m \sum S_i^2} \qquad (1)$$

$$V_i = \frac{(1 - d_m)\sum S_i^3}{d_m \left( W_i \sum S_i^1 + (1 - W_i)\sum S_i^2 \right) + (1 - d_m)\sum S_i^3} \qquad (2)$$

where $S_i^1$ are the areas of the adjacent triangles in the first image, $S_i^2$ the areas of the corresponding triangles in the second image, and $S_i^3$ those in the third image. By changing the view morphing factors $c_m$ and $d_m$, we can synthesize the correct view with the desired eye gaze and face direction in real time.
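A minimal numerical sketch of equations (1) and (2) is shown below, assuming the per-vertex sums of adjacent-triangle areas in each image are already computed; the epsilon safeguard and the schematic texture-blending helper are illustrative additions, not part of the paper.

```python
# Per-vertex blending weights from equations (1) and (2).
# s1, s2, s3: summed areas of the triangles adjacent to each vertex in the
# first, second and third camera images; c_m, d_m: view morphing factors.
import numpy as np

def blend_weights(s1, s2, s3, c_m, d_m, eps=1e-9):
    s1, s2, s3 = (np.asarray(s, dtype=float) for s in (s1, s2, s3))
    w = (1.0 - c_m) * s1 / ((1.0 - c_m) * s1 + c_m * s2 + eps)            # eq. (1)
    v = (1.0 - d_m) * s3 / (d_m * (w * s1 + (1.0 - w) * s2)
                            + (1.0 - d_m) * s3 + eps)                     # eq. (2)
    return w, v

def blend_textures(t1, t2, t3, w, v):
    """Schematically combine per-vertex texture samples from the three cameras."""
    w = w[..., None]
    v = v[..., None]
    horizontal = w * t1 + (1.0 - w) * t2      # blend of the two bottom cameras
    return v * t3 + (1.0 - v) * horizontal    # blend with the top camera
```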

IV. CONCLUSION

From the analysis of the proposed method and the experimental results, we can create virtual gaze and face views for head rotations of 0, +/-15 and +/-30 degrees, as shown in Fig. 5. In the experiments, we built the complete multi-party videoconferencing system and tested it with real images, using three cameras and one display screen as each participant's terminal. The results show that the synthesized virtual views from different angles can fulfill the requirements of immersive multi-party videoconferencing. The system runs at 12 to 15 frames per second, with all test data captured at a resolution of 640x480 at 30 frames per second.

In this paper, we have presented a novel hardware scheme and software procedure for achieving eye contact and generating virtual gaze and face views in multi-party videoconferencing. We have also presented a new gaze tracking approach and a simplified view synthesis method to generate a gaze- and face-corrected image from three cameras. Experimental results indicate that the system can reconstruct natural gaze and face images efficiently in real time, improve the immersive and interactive experience in multi-party videoconferencing, and reduce network bandwidth usage.

REFERENCES

[1] L. Mühlbach, B. Kellner, A. Prussog, and G. Romahn, "The importance of eye contact in a videotelephone service," in Proc. 11th International Symposium on Human Factors in Telecommunications, Cesson-Sévigné, France, 1985.

[2] S.-B. Lee, I.-Y. Shin, and Y.-S. Ho, "Gaze-corrected view generation using stereo camera system for immersive videoconferencing," IEEE Transactions on Consumer Electronics, vol. 57, no. 3, pp. 1033-1040, 2011.

[3] I. Feldmann et al., "Real-time depth estimation for immersive 3D videoconferencing," in Proc. 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2010.

[4] S. Pollard et al., "View synthesis by trinocular edge matching and transfer," Image and Vision Computing, vol. 18, no. 9, pp. 749-757, 2000.

[5] Z. Liu, Z. Zhang, C. Jacobs, and M. Cohen, "Rapid modeling of animated faces from video," in Proc. Third International Conference on Visual Computing (Visual 2000), Mexico City, September 2000, pp. 58-67.

[6] Y. Ohta and T. Kanade, "Stereo by intra- and inter-scanline search," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 7, no. 2, pp. 139-154, 1985.

[7] R. Yang and Z. Zhang, "Eye gaze correction with stereovision for video-teleconferencing," in Computer Vision - ECCV 2002, Springer Berlin Heidelberg, 2002, pp. 479-494.

[8] L. Granai, M. Hamouz, J. R. Tena, et al., "Compression for 3D face recognition applications," in Proc. IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2007), 2007, pp. 33-38.

[9] J. M. Lien, G. Kurillo, and R. Bajcsy, "Multi-camera tele-immersion system with real-time model driven data compression," The Visual Computer, vol. 26, no. 1, pp. 3-15, 2010.

[10] Z. Yang, K. Nahrstedt, Y. Cui, et al., "TEEVE: the next generation architecture for tele-immersive environments," in Proc. Seventh IEEE International Symposium on Multimedia, 2005.

[11] J. Choi, G. Medioni, Y. Lin, et al., "3D face reconstruction using a single or multiple views," in Proc. 20th International Conference on Pattern Recognition (ICPR), 2010, pp. 3959-3962.

[12] M. D. Levine and Y. Yu, "State-of-the-art of 3D facial reconstruction methods for face recognition based on a single 2D training image per person," Pattern Recognition Letters, vol. 30, no. 10, pp. 908-913, 2009.

[13] C. Kuster et al., "Gaze correction for home video conferencing," ACM Transactions on Graphics (TOG), vol. 31, no. 6, article 174, 2012.

[14] Q. Shi, S. Nobuhara, and T. Matsuyama, "3D face reconstruction and gaze estimation from multi-view video using symmetry prior," Information and Media Technologies, vol. 7, no. 4, pp. 1544-1555, 2012.

[15] W. B. Culbertson and T. Malzbender, "Method and system for communicating gaze in an immersive virtual environment," U.S. Patent No. 7,532,230, 12 May 2009.

Figure 5. Final synthesized view in different angles