MirrorTrack – A Real-Time Multiple Camera Approach for Multi-touch Interactions on Glossy Display Surfaces



Abstract—This paper presents a real-time multiple camera approach for a multi-touch interaction system that takes advantage of specular display surfaces (such as conventional LCD displays) and the mirror effect at a low-azimuth camera angle to detect and track fingers and their reflections simultaneously. Building on our prior work: 1. We use multi-resolution processing to greatly improve the runtime performance of the system; 2. We employ different edge detection and pattern recognition algorithms at each processing resolution to detect fingers more accurately and efficiently; 3. We track both the location of a fingertip and its pointing direction so it can be identified more effectively; 4. We use a full stereo algorithm to compute finger locations in 3D space more accurately. Our system has many advantages: 1. It works with any glossy flat-panel display; 2. It avoids the clumsy set-up of a top-down camera with its concomitant screen-glare problems; 3. It supports both touch and hover operation; 4. It works with large vertical displays without the usual occlusion problems. We describe our approach and implementation in detail.

I. INTRODUCTION

Multi-touch systems have attracted considerable interest in recent years [6]. There have been many well-known commercial products using multi-touch technologies, such as Apple's iPhone series and MacBook Air [1,2]. The success of these products has fueled further interest in such technologies. Multi-touch interfaces enable users to use multiple finger touches and hand gestures to interact with digital objects and media directly on display surfaces. The advantages of multi-touch interfaces over traditional input devices such as the mouse and keyboard are that they are better suited for use on table-top surfaces [8], and that they are less restrictive and provide more intuitive interactions.

Many multi-touch systems have been developed using a wide range of touch-sensing technologies [3], such as [4,7,10,13]. The sensing mechanisms used depend mostly on the type of surface. Systems with rear-projected displays such as [11] mount optical sensing devices behind the projected surface because this setup is convenient and able to track points of contact on the surface accurately. The disadvantages, however, are that it is difficult if not impossible to detect finger hovering, which offers many advantages and a new level of interaction [9], and that such systems do not work on conventional luminescent display surfaces such as LCD monitors, which happen to be the most dominant display technology. Systems that use resistive surfaces are very common on devices such as touch-screen checkout machines. They are cheap to implement, but have relatively low sensing clarity and also do not support hovering. Capacitive touch surfaces provide precise sensing of multiple contacts and small-distance hovering over the screen, but they have limitations. First, some capacitive systems require users to be connected to a receiver, such as a metal chair, in order to detect signals emitted from an antenna under the display surface [8]. This restricts user movement. Second, they require specialized equipment, making such systems very costly. Other systems use computer vision techniques to track finger locations with top-down cameras [12]. This type of system also faces significant challenges. First, it is difficult to detect whether or not a finger is in contact with the screen from a top-down camera angle. Second, glossy display surfaces (e.g., LCD screens) can become a problem in common office-type environments with overhead illumination. Third, it is not possible to use such a camera setup with vertical displays due to self-occlusion.

In our previous work, we introduced MirrorTrack [5], a multi-touch system using vision-based finger detection that takes advantage of glossy display surfaces. Figure 1 illustrates the design concept: by placing cameras at a low azimuth to the display surface, the glossy screen approaches a perfect mirror. In this paper, we discuss our real-time processing enhancements to the system. As discussed in our previous work, our system offers many advantages: First, it is software driven and does not require specialized hardware. Second, the cameras are side-mounted and easy to set up. Third, it is easy to distinguish between finger hovering and touching. The main contribution of this paper is a real-time, multi-camera approach for this system. In the following sections, we present the design and implementation of our system in detail.

Pak-Kiu Chung, Bing Fang, Roger W. Ehrich, and Francis Quek
Center of Human Computer Interaction, Virginia Polytechnic Institute and State University, Blacksburg, VA 24060, USA
[email protected]


II. SYSTEM DESCRIPTION

A. System Overview

The physical setup of our system is illustrated in Figure 1. As shown, our system uses three side-mounted Logitech Orbit AF webcams. Each pair of the three cameras is stereo-calibrated ahead of time.

Figure 1 The low-azimuth camera configuration.

Figure 2 MirrorTrack system overview.

As illustrated in Figure 2, our system can be partitioned into two parts: 2D processing and 3D processing. In 2D processing, the system takes image frames from the video sources, which are stereo-calibrated cameras, and applies computer vision techniques to extract possible fingertip locations from these frames. In 3D processing, the 2D coordinates from the stereo-calibrated cameras are triangulated into 3D space, and correspondences are resolved. These 3D finger positions are then converted into the coordinate system of the display surface. In the following subsections, we describe each system component in detail.

B. 2D Processing

In our original design in [5], we could not achieve real-time performance when processing multiple cameras on the same machine. To improve performance, we employ multi-resolution processing. The system first applies image processing techniques to a low-resolution image frame to find interest areas, which are regions of the image where hands and fingers may be located. These areas are then processed in a high-resolution image frame to find more accurate fingertip locations. The advantage of this approach is two-fold. First, the number of pixels that need to be processed is only a small fraction of what would be needed if the complete image were processed at high resolution. Second, this top-down hierarchical approach, where we first locate the hands or fingers in the low-resolution image and then proceed to extract the positions of the fingertips, helps eliminate false positives.
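As a rough illustration of this coarse-to-fine idea, the sketch below (our illustration, not the paper's code; function names, the 4x-per-axis downsampling factor, and the difference threshold are all assumptions) differences two block-averaged low-resolution frames and then returns only the high-resolution pixels inside the detected interest area:

```python
import numpy as np

def find_interest_areas(lowres_diff, thresh=30, block=4):
    """Return bounding boxes (in high-res pixel coordinates) covering the
    low-res blocks whose frame-to-frame difference exceeds a threshold."""
    ys, xs = np.nonzero(lowres_diff > thresh)
    if len(ys) == 0:
        return []
    # One coarse bounding box; a real system would cluster connected blobs.
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    return [(y0 * block, x0 * block, y1 * block, x1 * block)]

def coarse_to_fine(prev, curr, block=4):
    """Multi-resolution pass: difference the downsampled frames, then return
    only the high-resolution pixels inside the interest areas. Assumes the
    frame dimensions are divisible by `block`."""
    h, w = curr.shape
    # Downsample by block averaging: each low-res pixel is a block x block mean.
    lo_prev = prev.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    lo_curr = curr.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    boxes = find_interest_areas(np.abs(lo_curr - lo_prev), block=block)
    return [curr[y0:y1, x0:x1] for (y0, x0, y1, x1) in boxes]
```

Only the cropped high-resolution regions then need the more expensive per-pixel processing described below.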

Figure 3 Low resolution processing.

1) Low Resolution Image Processing: The resolution we use for low-resolution processing is one sixteenth the size of the original image. In the low-resolution image, we first employ an image differencing technique similar to what we described in [5]. An image differencing algorithm computes the difference of pixel values between two image frames, as shown in Figure 3. Since our new system targets changing backgrounds, where a person could be sitting in front of some of the cameras, this step is no longer as significant as it used to be. Nonetheless, it helps eliminate significant parts of the image that are not interest areas (possible regions of hands and fingers), and the algorithm is simple and contributes very little to the overall computation time.

In the resulting interest areas, the parts of the image that we consider to be foreground, we perform a unidirectional Sobel edge detection technique similar to what we described in [5]. However, since we are processing a low-resolution image, the goal here is to find the hands and possibly the arms, rather than the exact locations of fingers and fingertips. We use a unidirectional edge detection strategy because we observe that, given the low camera angle, hands tend to be more vertical than horizontal. By ignoring horizontal edges, we gain the added benefit of ignoring noisy edges that could confuse the system.
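A minimal sketch of such a unidirectional pass, assuming a single-channel image and an illustrative threshold: it applies only the horizontal-gradient Sobel kernel, so purely horizontal edges produce no response.

```python
import numpy as np

# Horizontal-gradient Sobel kernel: responds to vertical edges (intensity
# changes along x), the orientation favored by the low-azimuth view.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

def vertical_edges(img, thresh=100):
    """Return a boolean map of vertical edges via a single-direction Sobel
    pass (naive loop version for clarity, not speed)."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=bool)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            g = np.sum(SOBEL_X * img[y - 1:y + 2, x - 1:x + 2])
            out[y, x] = abs(g) > thresh
    return out
```

A horizontal step edge yields a gradient of zero under this kernel, which is exactly the "ignore horizontal edges" behavior described above.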


To represent the edges resulting from the above steps, we employ a unidirectional region-growing algorithm, which differs from our previous approach. Our old system used a 4-way region-growing approach that represents an area as a single region as long as all pixels in the area are connected; for instance, if the end point of one edge touches the middle of another edge, the two edges would be recognized as one region. In our new approach, they are recognized as separate regions, which is a clear advantage. The resulting regions largely represent edges that resemble straight lines, so the next step of our approach is to represent these regions as straight lines; we use slope-intercept form to record them. The issue with the edge detection approach described above is that these edges are often broken up due to lighting conditions or poor image quality (from the low resolution), so we need to reconnect the broken edges.
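One way to record a near-vertical region in slope-intercept form is to least-squares-fit x as a linear function of y, which avoids the infinite slope of a perfectly vertical line; the helper below is our illustrative sketch under that assumption, not the paper's implementation.

```python
def fit_vertical_line(points):
    """Least-squares fit of a near-vertical edge region as x = m*y + b.
    `points` is a list of (x, y) pixel coordinates with at least two
    distinct y values."""
    n = len(points)
    y_mean = sum(y for _, y in points) / n
    x_mean = sum(x for x, _ in points) / n
    denom = sum((y - y_mean) ** 2 for _, y in points)
    m = sum((y - y_mean) * (x - x_mean) for x, y in points) / denom
    b = x_mean - m * y_mean
    return m, b
```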

Figure 4 Approach to connect broken edges.

Figure 4 illustrates the approach we use to connect broken edges. To consider R1 and R2 part of the same edge, the two lines must satisfy the following criteria: First, direction1 must be approximately the same as direction2. Second, distance1, calculated as the point-line distance between R1 and R2, must be within a trained threshold; otherwise they are simply considered parallel lines. Third, distance2, calculated as the distance between the end points of R1 and R2, must also be within a trained threshold, or they are considered not to lie along the same edge.
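These three criteria might be checked as follows; the thresholds here are placeholders, not the trained values the paper refers to, and the segment representation is our own choice for the sketch.

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the infinite line through a, b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((by - ay) * px - (bx - ax) * py + bx * ay - by * ax)
    return num / math.hypot(bx - ax, by - ay)

def same_edge(r1, r2, max_angle=0.15, max_perp=3.0, max_gap=20.0):
    """r1, r2: segments given as ((x0, y0), (x1, y1)). Returns True when all
    three merge criteria hold: similar direction, small point-line distance,
    and nearby end points (thresholds illustrative, not trained)."""
    (a1, b1), (a2, b2) = r1, r2
    ang1 = math.atan2(b1[1] - a1[1], b1[0] - a1[0])
    ang2 = math.atan2(b2[1] - a2[1], b2[0] - a2[0])
    d_ang = abs(ang1 - ang2) % math.pi       # compare directions, not orientations
    d_ang = min(d_ang, math.pi - d_ang)
    if d_ang > max_angle:
        return False                          # criterion 1: direction1 ~ direction2
    if point_line_distance(a2, a1, b1) > max_perp:
        return False                          # criterion 2: distance1 (else: parallel)
    gap = min(math.dist(p, q) for p in (a1, b1) for q in (a2, b2))
    return gap <= max_gap                     # criterion 3: distance2 (end-point gap)
```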

Once this step is done, we apply pattern recognition to the resulting edges to eliminate noisy edges and extract those that could be part of hands, arms, or fingers. We exploit the fact that any edge must have a reflection on the glossy surface to be considered useful. The reason is simple: if we cannot detect the reflection of a finger, the finger must be either too far from the screen or outside the screen boundary, and in such cases it can be ignored. These reflections can be detected quite easily, as they are straight lines whose angles have opposite signs but similar absolute values. The closest end points of the two lines must have a horizontal distance within a trained threshold. The system uses the above steps to find regions of the image where fingertips could potentially be located. The results are then passed along to the next step, high-resolution processing.

2) High Resolution Image Processing: The system applies the following steps, at high resolution, to each region identified as an area where fingertips may be located. In each of these regions, edge detection using the Canny algorithm with non-maximum suppression is applied. The results are edges of one-pixel width. As in the low-resolution processing steps, these edges are represented as straight lines; horizontal lines are ignored, while vertical and diagonal lines are recorded. Pattern recognition techniques are applied to eliminate regions whose edge lines are unlikely to be fingers. As shown in Figure 5, the system then matches the regions containing potential fingertips against other regions to find finger-reflection pairs; only these pairs are kept, and the rest are ignored. In each of these pairs, we identify the 2D coordinates of the fingertip and the angle/direction in which the finger points.
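A toy version of the finger-reflection pairing test might look like the following; the candidate representation and the tolerances are our assumptions for illustration, not values from the paper.

```python
def reflection_pair(f1, f2, angle_tol=0.2, x_tol=5.0):
    """f1, f2: candidate finger lines as (tip_x, tip_y, angle_radians), with
    the tip taken as the end point closest to the mirror line. They form a
    finger-reflection pair when their angles have opposite signs but similar
    absolute values, and their tips are horizontally close."""
    x1, _, a1 = f1
    x2, _, a2 = f2
    opposite = a1 * a2 < 0 and abs(abs(a1) - abs(a2)) < angle_tol
    return opposite and abs(x1 - x2) < x_tol
```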

Figure 5 High resolution processing.

C. 3D Processing

The results obtained from the above steps are 2D coordinates in each camera view. To turn these results into a more meaningful form, we must convert them into the coordinate system of the display surface. To do so, we employ full stereo to obtain position information in 3D space. In doing so, we also face the challenge of correspondence, that is, how to match a finger in one camera view to the same finger in another. In the following subsections, we describe our approach to stereo calibration, triangulation, and solving the correspondence problem.

1) Stereo Calibration and Triangulation: As mentioned above, since the video sequences from the cameras provide only 2D information, the cameras must first be stereo-calibrated before 3D positions can be triangulated from 2D positions in multiple views. To do so, we first compute the intrinsic and extrinsic parameters of the cameras using Tsai's calibration model [14]. Tsai's algorithm requires a minimum of 11 sets of corresponding control points, but in our previous experience 20-60 sets are usually used in the calibration process. In our approach, we use a calibration box with 48 control points. A pair of 2D coordinates (one from each camera) is recorded for each of these 48 control points, and Tsai's calibration algorithm is then applied to estimate the camera parameters. We can then triangulate 3D world coordinates from a pair of corresponding 2D coordinates in the camera frames. Since we use three cameras in our system, each pair of them is calibrated separately.
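Although the paper uses Tsai's model for calibration, the triangulation step itself can be sketched with a standard linear (DLT) solve; the 3x4 projection matrices below are assumed to come from a prior calibration, and the normalized coordinates in the example are contrived for illustration.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation: given 3x4 projection matrices P1, P2 from
    a stereo-calibrated pair and one corresponding 2D point per view, solve
    A X = 0 for the homogeneous 3D point via SVD and dehomogenize."""
    (u1, v1), (u2, v2) = uv1, uv2
    A = np.array([u1 * P1[2] - P1[0],
                  v1 * P1[2] - P1[1],
                  u2 * P2[2] - P2[0],
                  v2 * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                     # null-space vector = homogeneous 3D point
    return X[:3] / X[3]
```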

2) Solving Correspondence: The main challenge we face when performing stereo operations is finding correspondence. This is particularly so in our system, as the cameras face drastically different directions, and the same object (finger or hand) can look quite different from different camera views. How to match a finger in one camera view to a finger in another is a problem we must solve. Our approach combines four methods, because no single one is guaranteed to solve the problem on its own. First, we divide the display surface into grids, as shown in Figure 6; this helps approximate the position of a fingertip on the display surface, and a finger in camera view 1 can only match a finger in camera view 2 if they appear in the same grid cell in both views. Second, stereo triangulation provides a depth hypothesis for each pair of 2D coordinates. As shown in Figure 7, two objects (O1 and O2) are projected into two cameras (cam1 and cam2). We can easily determine that O3 is not a corresponding position, because it is outside the bounds of the display area. Third, corresponding positions must be consistent across all camera pairs. As mentioned in the previous subsection, each pair of the three cameras is calibrated separately; if we have obtained a correct correspondence, the triangulation results from all pairs should be approximately the same. Fourth, we track corresponding positions through their historical information. By employing trajectory tracking, which is discussed next, we can exclude unlikely correspondences by analyzing the fingers' motion trajectories.
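The first (grid) filter can be sketched as follows; the cell size in surface coordinates is an illustrative assumption, as is the representation of fingers as approximate (x, y) surface positions.

```python
def grid_cell(p, cell=50):
    """Map an approximate on-surface position (x, y) to a grid cell index."""
    return (int(p[0] // cell), int(p[1] // cell))

def candidate_matches(view1, view2, cell=50):
    """Grid-based correspondence filter: a finger in one view may match a
    finger in the other only if their approximate surface positions fall
    into the same grid cell. Returns (index_in_view1, index_in_view2) pairs
    for the remaining methods to disambiguate."""
    return [(i, j)
            for i, p in enumerate(view1)
            for j, q in enumerate(view2)
            if grid_cell(p, cell) == grid_cell(q, cell)]
```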

Figure 6 Solving correspondence by grids.

D. Trajectory Tracking

When working with cameras, it is not always possible to locate the needed visual features in every single frame due to occlusion and lighting/camera noise. In our previous work, we employed trajectory tracking in each camera view to track the motion trajectories of fingers in 2D space, using a least-squares method to track each fingertip's movement along the x and y axes separately. In our new approach, instead of tracking finger movements only in the 2D camera views, we also track their 3D motion trajectories. In addition, we keep track of the angles of the fingers, which makes it significantly easier to identify multiple fingertips in very close proximity as long as they are pointing in different directions.
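The per-axis least-squares idea can be sketched as follows (our illustration; the paper does not give its exact formulation): fit a line to one coordinate's recent samples and extrapolate one frame ahead, applying the same helper to x, y, and, in 3D, z.

```python
def predict_next(samples):
    """Least-squares linear fit of one coordinate over recent frames,
    evaluated one frame ahead. `samples` holds the coordinate's values at
    frame indices 0..n-1; the prediction is for frame index n."""
    n = len(samples)
    ts = range(n)
    t_mean = sum(ts) / n
    s_mean = sum(samples) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    slope = sum((t - t_mean) * (s - s_mean) for t, s in zip(ts, samples)) / denom
    intercept = s_mean - slope * t_mean
    return slope * n + intercept    # extrapolate to the next frame
```

A predicted position like this can stand in for a fingertip that is briefly lost to occlusion or noise, and can flag an unlikely correspondence when a matched 3D point jumps far from its predicted trajectory.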

III. DISCUSSION

In this section, we discuss the overall benefits of our new approach and its improvements over our previous work. First, our new system employs a full-stereo multi-camera approach, which helps with occlusion. Since we have three cameras, and each pair of them is calibrated separately, we have a total of three stereo-calibrated camera pairs. Essentially, as long as two cameras have a clear view of a fingertip, we are able to triangulate a 3D position for that finger. Accuracy improves further when all three cameras have a clear view, as the triangulation results can be combined. It is also easy to extend our setup to more than three cameras; essentially, the more cameras we employ, the more accurate the results we can achieve. A major improvement in our new approach is runtime performance. By employing the multi-resolution approach, we are able to cut processing time significantly and obtain real-time performance using three cameras. With the performance issue resolved, the next step of our research is to implement a set of gesture interactions with our system.

Figure 7 Solving correspondence by depth hypothesis.

IV. CONCLUSION

With the new approaches, we have made significant improvements in the runtime and accuracy of our system. In a preliminary test, we were able to obtain real-time performance (25 FPS on average). We have also resolved the correspondence problem and applied a full-stereo multi-camera solution. Furthermore, by tracking finger motion and finger pointing directions in 3D space, we are able to identify fingers more accurately and minimize the issues caused by fingers moving very close to each other.

The next step of our research is to implement a full set of gesture interactions and perform a formal user study to determine the robustness of our system in actual use.


ACKNOWLEDGMENT

This research has been partially supported by NSF grants "Embodied Communication: Vivid Interaction with History and Literature," IIS-0624701, "Interacting with the Embodied Mind," CRI-0551610, and "Embodiment Awareness, Mathematics Discourse and the Blind," IIS-0451843.

REFERENCES

[1] Apple iPhone. [cited Mar. 16, 2008]; http://www.apple.com/iphone.
[2] Apple MacBook Air. [cited Mar. 16, 2008]; http://www.apple.com/macbookair.
[3] A. Agarwal et al., "High Precision Multi-touch Sensing on Surfaces using Overhead Cameras," in Tabletop, 2007.
[4] J. Ahlberg, "Real-time Facial Feature Tracking Using an Active Model with Fast Image Warping," in International Workshop on Very Low Bit-rate Video Coding, Athens, Greece, 2001.
[5] P. Chung, B. Fang, and F. Quek, "MirrorTrack – A Vision Based Multi-Touch System for Glossy Display Surfaces," VIE, 2008.
[6] W. Buxton, "Multi-Touch Systems I Have Known and Loved," 2007. [cited Mar. 16, 2008]; http://www.billbuxton.com/multitouchOverview.html.
[7] P. Dietz and D. Leigh, "DiamondTouch: a Multi-User Touch Technology," UIST, 2001.
[8] P. Dietz and D. Leigh, "DiamondTouch: a multi-user touch technology," in UIST '01: Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology, New York, NY, USA: ACM Press, 2001.
[9] T. Grossman et al., "Hover widgets: using the tracking state to extend the capabilities of pen-operated devices," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Montreal, Quebec, Canada: ACM, 2006.
[10] J. Y. Han, "Low-Cost Multi-Touch Sensing through Frustrated Total Internal Reflection," UIST, 2005.
[11] J. Y. Han, "Low-cost multi-touch sensing through frustrated total internal reflection," in Proceedings of the 18th Annual ACM Symposium on User Interface Software and Technology, Seattle, WA, USA: ACM, 2005.
[12] J. Letessier and F. Bérard, "Visual tracking of bare fingers for interactive surfaces," in Proceedings of the 17th Annual ACM Symposium on User Interface Software and Technology, Santa Fe, NM, USA: ACM, 2004.
[13] A. Wilson, "TouchLight: An Imaging Touch Screen and Display for Gesture-Based Interaction," ICMI, 2004.
[14] R. Y. Tsai, "A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses," IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp. 323–344, 1987.