



A multi-projector CAVE system with commodity hardware and gesture-based interaction

Carlos Andújar, Pere Brunet, Àlvar Vinacua, Miguel A. Vico and Jesús Díaz García

MOVING Research Group, Jordi Girona 1-3

08034 Barcelona, Spain

Abstract

Spatially-immersive systems such as CAVEs provide users with surrounding worlds by projecting 3D models on multiple screens around the viewer. Compared to alternative immersive systems such as HMDs, CAVE systems are a powerful tool for collaborative inspection of virtual environments due to better use of peripheral vision, less sensitivity to tracking errors, and higher communication possibilities among users. Unfortunately, traditional CAVE setups require sophisticated equipment including stereo-ready projectors and tracking systems with high acquisition and maintenance costs. In this paper we present the design and construction of a passive-stereo, four-wall CAVE system based on commodity hardware. Our system works with any mix of a wide range of projector models that can be replaced independently at any time, and achieves high resolution and brightness at a minimum cost. The key ingredients of our CAVE are a self-calibration approach that guarantees continuity across the screen, as well as a gesture-based interaction approach based on a clever combination of skeletal data from multiple Kinect sensors.

Keywords: Virtual Reality Systems, Immersive Systems, Projector Calibration, Gesture-based interaction

1 Introduction

CAVE systems are known to be highly immersive back-projected VR rooms. Classical CAVE systems from the 90's and the beginning of the past decade used expensive CRT projectors and magnetic head and hand tracking. Unfortunately, the projection systems of most existing CAVEs are becoming obsolete after several years, and updating them can be rather expensive. In the meantime, however, multi-projector systems and cheap motion capture systems have appeared. These recent technologies can be efficiently used for CAVE updates, resulting in higher image resolution, better brightness and lower costs.

In this paper, we present a novel four-wall multi-projector CAVE architecture which is powered by 40 off-the-shelf projectors controlled by 12 PCs. It operates in passive stereo, providing high brightness at 2000 × 2000 pixel resolution on each of the 4 walls. We have achieved high resolution while significantly reducing the cost and increasing versatility: the system works with any mix of a wide range of projector models that can be individually substituted at any time for more modern or cheaper ones. The uniformity of the final image is achieved using specially designed self-calibration software which adapts each of the 40 projectors and guarantees concordance and continuity. The main contributions of our approach are:

• The design and construction of a passive-stereo, four-wall CAVE system with commodity hardware is presented. It is based on 40 off-the-shelf DLP projectors and 12 PCs. The CAVE design achieves higher resolution and brightness than alternative installations, at a significantly lower total cost.

• The system is versatile: it works with any mix of a wide range of projector models that can be individually substituted at any time for more modern or cheaper ones. The software and system architecture can be easily adapted to different screen configurations.

• Uniformity of the final image is guaranteed by a specially designed self-calibration software which adapts each of the 40 projectors and guarantees concordance and continuity. Independent self-calibration of the different CAVE walls is sufficient.

• A gesture-based, ergonomic interaction paradigm, based on dynamically merging the information from two nearly-orthogonal Kinect sensors. Interaction is intuitive and cable-less.

The paper is organized as follows. After reviewing previous work in the next section, Section 3 presents the architecture of the CAVE system. Sections 4 and 5 are devoted to the self-calibration algorithms and the software infrastructure for application development. Section 6 presents the Kinect-based gestural interaction scheme, while Section 7 discusses the system performance through a number of examples. Conclusions and future work are listed in Section 8.

2 Previous work

Multi-projector systems and calibration

The work on multi-projector displays using camera-based registration started at the end of the 90's, see for instance [21]. The design of projection-based tiled display systems was considered shortly afterwards [12], and the use of homographies and hierarchical homography settings for automatic geometric calibration started in 2002 with the work of Chen et al. [7]. Since then, a number of papers have addressed the problem of designing tiled displays for planar PowerWalls, but, with few exceptions, like [23], almost no work has been done for CAVE systems.

The color variation in multi-projector displays is due to the spatial variation in the color gamut across the display [17, 18]. This variation can be classified into three different categories: intra-projector variation (within a single projector), inter-projector variation (across different projectors), and overlap variation. For an analysis of perceptual aspects, see for instance [19]. In [22] a method was presented which addresses the spatial variation in both luminance and chrominance in tiled projection-based displays. Their approach morphs the spatially-varying color gamut of the display in a smoothly constrained manner while retaining the white point.

The system which is probably closest to our proposal is the StarCAVE project [8]. It uses passive stereo and multi-projection on a five-sided small room with three horizontal display layers, with a total of 15 projection surfaces and two projectors per surface. However, the StarCAVE does not use commodity hardware and it is not able to benefit from existing CAVE settings. Our proposal, instead, works with any mix of commodity projector models and results in a higher resolution and brightness at a significantly reduced total cost.

Kinect-based interaction

The Microsoft Kinect provides a convenient and inexpensive depth sensor and skeleton tracking system for VR applications [29, 26]. Different authors have measured and reported the performance of the Kinect tracking software. Most of these studies refer to the first version of the sensor [15, 16, 6, 9] whereas only a few [11, 1] report on the Kinect V2 sensor. Livingston et al. [15] have conducted multiple tests to evaluate the noise, accuracy, resolution, and latency of the Kinect V1 skeleton tracking software. At 1.2 m the position noise was found to be 1.3 mm (sd = 0.75 mm), whereas at 3.5 m the noise increased to 6.9 mm (sd = 5.6 mm). Noise was found to differ by dimension (z values were noisier than x and y values) and joint (wrist and hand exhibited more noise than other joints). Average end-to-end latency on a machine equipped with two Intel Core2 6600 2.4 GHz processors was found to be approximately 125 ms [15]. Minimum latency has been reported to be 102 ms for V1 and 20-60 ms for V2 [1]. In all these tests users were facing a single Kinect sensor during the data acquisition.

Several authors have explored the use of multiple Kinect cameras for a number of applications ranging from real-time 3D capture of scenes [16] to gesture-based interaction [6]. The infrared dot patterns projected by multiple Kinects [10] are known to interfere with each other [9]. Several works have addressed the multi-Kinect interference problem, including hardware solutions for time-multiplexing [24, 3, 9] and software solutions [16, 6]. Maimone and Fuchs [16] have observed that the difference between the depth images with and without interferences corresponds mostly to missing data rather than differences in depth values. They propose a modified median filter to fill missing points (using data from other units), allowing the 3D reconstruction of dynamic scenes for telepresence systems. Maimone and Fuchs also show how to combine 2D eye recognition with depth data to provide head-tracked stereo.

Caon et al. [6] use two Kinect sensors to get skeletal data from multiple users. Their skeleton fusion algorithm calculates the coordinates of every joint of the final skeleton as a weighted average of the joints at the individual skeletons, where weights are assigned according to the number of joints detected by each sensor.



Figure 1: Architecture of the system. It consists of two Kinects controlled by a dedicated PC, a master PC which handles interaction and orchestrates rendering, 10 rendering PCs and 40 DLP projectors. The system also includes four digital cameras for calibration, not shown in the image.

We also use two Kinects, but our fusion algorithm uses more elaborate, per-joint data to estimate position reliability.

3 System Architecture

The architecture of the system is shown in Figure 1. A first PC receives events from two tracking Kinects and sends them to the Master PC via VRPN. The Master PC sends camera events to a total of 10 rendering PCs. Rendering PCs have 12-16 GB of RAM, 2.66 GHz processors and two Nvidia GTX 580 boards, each PC driving 4 projectors. Rendering PCs are network synchronized, each drawing the same scene with different cameras and viewports. Display on each vertical CAVE wall is performed by 3 PCs which manage 12 projectors. Display on the CAVE floor is performed by a single PC, the output being directed to four HD projectors. We used Stewart FilmScreen 100 screens for the CAVE walls.

The projection setup includes two towers per vertical CAVE wall, supporting a total of 12 DLP 1024 × 768 projectors per wall (Figure 2). Direct (without mirrors) rear projection is used. Four digital cameras (Canon EOS 1100-D) are used for the calibration step. Cameras capturing vertical walls have been fixed to the room walls to minimize the need for repeating the first phase of the calibration step (Section 4). We use front-projection on the floor with four HD DLP projectors at the top of the front wall and a mirror. The camera that captures the floor surface is fixed at the top of the wooden structure at the CAVE entrance.

Figure 2: The supporting towers, holding 12 projectors per vertical CAVE wall.

A mix of 26 Mitsubishi XD550U (1024 × 768, 3000 lm), 12 Casio XJ-S32 (1024 × 768, 2300 lm) and 2 Mitsubishi FD630U (1920 × 1080, 4000 lm) projectors is used in the present setup, see Figure 3. New DLP projectors like the BenQ MX768, the ViewSonic PJD7333, or similar models could also be used. LED-laser DLP technology, however, is not recommended, because of the interaction between polarization and color.

The overall system achieves higher resolution and brightness while significantly reducing the total construction and maintenance cost, as discussed in Section 7.4.

Two Kinect devices are located at the top of the two edges between the side walls and the front wall. Our current setup uses Kinect V1 sensors (with a 57.5° horizontal field of view) although we plan to also test V2 sensors (with a 70° horizontal field of view). The sensors are positioned in such a way that users at the center of the CAVE are in the area of maximum tracking resolution for both of them. The location of the Kinects at a height of three meters is not optimal, but it is the best option to minimize optical disturbances. Lower limbs are hardly detected, but good tracking of the head and upper limbs is achieved. A MultiKinect software module receives the 3D coordinates of the user skeleton joints as detected by both Kinects, and merges them according to the user position and orientation, as detailed in Section 6. The VRPN client in the Master PC receives the merged skeleton, estimates the user gesture type and computes the new rendering settings for the rendering PCs.
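As an illustration, the minimal VRPN client below sketches how a node might consume the merged skeleton; the tracker name ("MultiKinect@masterpc") and the use of one VRPN sensor per joint are our assumptions, not the paper's documented configuration.

    #include <vrpn_Tracker.h>
    #include <cstdio>

    // Invoked whenever the MultiKinect server reports a joint update;
    // t.sensor identifies the joint and t.pos holds its (x, y, z) position.
    static void VRPN_CALLBACK onJoint(void*, const vrpn_TRACKERCB t)
    {
        std::printf("joint %d at (%.3f, %.3f, %.3f)\n",
                    (int)t.sensor, t.pos[0], t.pos[1], t.pos[2]);
    }

    int main()
    {
        // Hypothetical tracker name; the real address depends on the setup.
        vrpn_Tracker_Remote tracker("MultiKinect@masterpc");
        tracker.register_change_handler(nullptr, onJoint);
        for (;;)
            tracker.mainloop();   // poll the connection and dispatch callbacks
    }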

The next three sections describe the self-calibration software, our software infrastructure for application development, and the gesture interaction module.



Figure 3: The auto-calibration algorithm supports mixed projectors from different suppliers, as shown in these towers holding the projectors of one wall.

4 Adaptive auto-calibration system

Manual calibration in a projection system with 40 projectors is unaffordable. Solving the calibration problem requires finding a set of suitable transformations to be used in each rendering PC, the final goal being that users should perceive a unique, smooth and surrounding image along the CAVE walls. We now describe an algorithm for the automatic calibration of the whole set of projectors which succeeds in obtaining a unique image in the CAVE without user intervention. The algorithm is able to calibrate any mix of different projectors, even from different suppliers, as shown in Figure 3.

The auto-calibration is an off-line process which can be run at any time (although, in our experience, once a week is sufficient). It is performed in three steps, as detailed in the next paragraphs. During calibration, projectors are made to project specific patterns on the CAVE walls, which are captured by four computer-controlled cameras (one per CAVE wall). Calibration involves three types of coordinate systems, as shown in Figure 4: the projector coordinate systems (40 in total), the camera coordinate systems (four cameras in total) and the CAVE coordinate system. Figure 2 shows the setup for one of the vertical CAVE walls, involving two towers with 12 projectors and one digital camera to capture the projected patterns.

Auto-calibration output consists of a deformation matrix and a texture mask per projector.

Figure 4: Geometric calibration involves three types of coordinates: projector coordinates, image coordinates and CAVE coordinates.

Deformation matrices will be inserted in the rendering pipeline just after the projection matrix, in order to produce a projection adapted to the targeted region of the CAVE wall. Texture masks, on the other hand, will be used by fragment shaders, as detailed below, to blend the projections of neighboring projectors. Calibration involves two basic steps: geometric calibration [7] and chromatic calibration. Deformation matrices are produced by the geometric calibration, whereas the chromatic step creates the texture masks. Our method, well adapted to planar screen configurations and consumer hardware, is simpler than that proposed in [23].
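As a sketch of how this output is consumed at render time (our names, assuming GLM; the engine's real data structures are not described in the paper), the deformation matrix D simply pre-multiplies the standard transform chain:

    #include <glm/glm.hpp>

    // Per-projector output of the auto-calibration step.
    struct ProjectorCalibration {
        glm::mat4 deformation;  // warps the frustum onto the projector's wall region
        unsigned  maskTexture;  // blending/gamut mask sampled in the fragment shader
    };

    // The deformation matrix sits just after the projection matrix, so the
    // complete vertex transform becomes D * P * V * M.
    glm::mat4 buildTransform(const ProjectorCalibration& c, const glm::mat4& proj,
                             const glm::mat4& view, const glm::mat4& model)
    {
        return c.deformation * proj * view * model;
    }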

Before calibration, projectors are uncoordinated and neighbor images do not match, Figure 5(a). Geometric calibration starts by detecting the eight vertices of the CAVE cube. A lamp is located inside (Figure 5(b)) and images of the CAVE walls are captured by the four cameras, Figure 5(c). Camera distortions are then removed, and we use implicit information to link vertices that appear in two or in three images: the cube has 4 vertices which belong to a single wall, 2 vertices which belong to two neighbor walls and 2 vertices belonging to three walls (two vertical walls and the floor). We then estimate the positions of the relevant wall vertices in each of the undistorted camera images by computing them as the intersections of the wall edge lines in camera coordinates. These refined positions are used to compute the homographies [7] H(C, W) for each CAVE wall W.



The homographies H(C, W) convert from camera image coordinates of wall W to normalized CAVE coordinates (Figure 4), allowing us to use the cameras as an optical measuring system for the physical CAVE walls. We take special care in minimizing errors when computing the vertex positions in the captured wall images and the four camera-wall homographies H(C, W). Captured images of the front wall and floor have four relevant vertices each, whereas the two side walls have three each, as they have one unshared vertex. Proper matching among CAVE walls in the final result depends on the 2D image coordinates of these 14 relevant vertices. We therefore iterate the geometric calibration algorithm to optimize the 2D coordinates of the 14 relevant vertices: we estimate them, we complete the second geometric calibration phase, we detect mismatches between neighbor walls, and we use these mismatches to improve the estimates of the 2D image coordinates of the 14 vertices and to run the second geometric calibration phase again. We rely on the fact that the 2D coordinates of the relevant vertices and the homographies H(C, W) are quite stable, as they only depend on the camera and CAVE wall locations, and cameras are usually well fixed. Recomputation of the camera-wall homographies H(C, W) is therefore unnecessary in most cases and should only be performed from time to time. It should be observed that, once proper homographies H(C, W) have been computed, the geometric calibration problem has been decoupled and simplified: the second geometric calibration phase consists of four independent CAVE wall calibrations.
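For illustration, the camera-wall homography of this first phase can be obtained directly from the four detected wall corners; a minimal sketch assuming OpenCV, with made-up pixel coordinates:

    #include <opencv2/imgproc.hpp>

    // Wall corners detected in the undistorted camera image (hypothetical
    // values), in a fixed order: bottom-left, bottom-right, top-right, top-left.
    const cv::Point2f cameraCorners[4] = {
        {412.f, 1580.f}, {3660.f, 1604.f}, {3628.f, 340.f}, {450.f, 312.f}
    };

    // The same corners in normalized CAVE wall coordinates.
    const cv::Point2f wallCorners[4] = { {0, 0}, {1, 0}, {1, 1}, {0, 1} };

    // H(C, W): maps camera image coordinates to normalized CAVE coordinates.
    const cv::Mat H_CW = cv::getPerspectiveTransform(cameraCorners, wallCorners);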

In the second phase of the geometric calibration algorithm we sequentially visit the individual CAVE walls. The algorithm for each wall runs in two iterations, the first one involving all projectors for the left eye and the second one including all projectors which generate images for the right eye. The process involves six projectors for the vertical walls and two projectors in the case of the floor. Each projector Pk displays a pattern grid with 30 vertices, as shown in Figure 5(d), and this projection is captured by the corresponding camera. After removing image distortions, the coordinates of the 30 grid vertices in the captured image are computed at subpixel resolution. We use standard image processing algorithms for line detection followed by the analytical computation of line intersections in image space. By solving a least-squares problem, we compute the homography matrix H(Pk, C) that converts from projector Pk coordinates to camera C coordinates (we have 60 equations for the 8 unknown components of H(Pk, C), as any grid vertex generates two independent equations). At this point, the homography H(W, Pk) = (H(Pk, C) · H(C, W))^-1 relates points in CAVE wall coordinates to their images in projector coordinates. Now, by just imposing suitable target positions in CAVE coordinates for the 4 corner vertices of the pattern grid, we obtain the deformation matrix for Pk which precisely ensures a projection on the desired rectangular region of the CAVE wall. Figure 7 shows the result after running the calibration algorithm on the four CAVE walls. Observe that the matrices H(Pk, C) are used to calibrate the projections within each of the CAVE walls, whereas the four matrices H(C, W) are responsible for seamless image continuity between neighbor walls.
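The per-projector step admits a similar sketch (again assuming OpenCV; variable names are ours): the 30 grid correspondences determine H(Pk, C) by least squares, and composing with H(C, W) and inverting yields H(W, Pk).

    #include <opencv2/calib3d.hpp>
    #include <vector>

    // Computes H(W, Pk) from the 30 detected grid correspondences and the
    // camera-wall homography H(C, W) of the first phase.
    cv::Mat wallToProjector(const std::vector<cv::Point2f>& gridInProjector,
                            const std::vector<cv::Point2f>& gridInCamera,
                            const cv::Mat& H_CW)
    {
        // Least-squares homography from projector to camera coordinates
        // (60 equations for the 8 unknowns).
        cv::Mat H_PkC = cv::findHomography(gridInProjector, gridInCamera);

        // H(W, Pk) = (H(Pk, C) * H(C, W))^-1; with OpenCV's column-vector
        // convention, camera-to-wall is applied after projector-to-camera.
        return (H_CW * H_PkC).inv();
    }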

For the chromatic calibration we have implemented the algorithms of Sajadi et al. [22]. Individual color gamuts are measured for each projector and these gamuts are morphed along horizontal and vertical overlap bands. The result is a texture mask for each one of the individual projectors. During the interactive visualization, fragment colors will be multiplied by the value of the corresponding texel in the mask before being written to the frame buffer, to compensate gamut differences and to smoothly blend overlapping areas between projectors. Figure 6 shows the CAVE with two projected models after this final calibration step. The complete calibration procedure (geometric and chromatic) takes less than half an hour; most of that time is consumed by the chromatic calibration, which seldom needs to be repeated.

Our auto-calibration software is flexible, and it is straightforward to adapt it to the calibration of VR systems with different screen configurations in terms of size, relative positions and number of projection walls. It is also possible to use other projector combinations. For example, each wall may be served by only four FullHD projectors, splitting each wall in two halves, simplifying the architecture at the expense of brightness.

5 Software infrastructure for application development

To maximize the ease with which new developers can write software for this CAVE, or adapt old applications, we developed a generic rendering engine that abstracts the multi-projector architecture and the calibration process, providing a relatively lean API.

In fact, this API sits between the user application and whichever other software is being used for controlling the projectors, so one installation may choose to use VRJuggler [4], for instance, another may use Unity3D [14], and yet another the Qt framework (http://www.qt.io); for a recent review of frameworks for VR applications see [5].





Figure 5: Calibration process: (a) uncalibrated projections, (b) interior lighting for wall image capturing, (c) capture of one CAVE wall by the corresponding camera, (d) pattern as displayed by one of the projectors.

Figure 6: Two medical models, displayed after the calibration process. The camera does not have a filter, so the picture shows the left-eye and right-eye images overlapped.



Figure 7: Patterns in Figure 5(a), redisplayed after geometric calibration.

A developer may be interested in developing on one of these platforms, perhaps on a desktop, and then deploying the application in the CAVE using a different one. Because of this, the abstractions the API offers have been conceived to make this difference unimportant and transparent.

To create a new application, the user creates an application object of class 'App' and assigns a scene and a set of events to which the application shall respond. The scene class contains implementations of all of its methods (most of them empty), so the user only needs to refine the ones relevant to the application. The two main methods one must redefine are the initialization (an opportunity to initialize any resources the application will need later) and the render method (which must only take care of sending the adequate information to the pipeline; the library otherwise concerns itself with the low-level details of integrating this stream with whichever rendering platform is being used).

Events are handled through a callback system that allows the user to subscribe a function to a certain event (all input and all relevant user actions trigger events that can be handled in this way). The App class provides a method associateEvent() to declare this connection. Furthermore, there is a plugin mechanism to reuse navigation or interaction modes. The user may dynamically load or replace these plugins to alter the behavior of the application without exiting it, and may easily develop new plugins to test new interaction modes.
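The following sketch illustrates the intended usage pattern. Only 'App' and associateEvent() are named in the text; every other type and signature here is a stand-in we invented for illustration.

    #include <functional>
    #include <string>

    // Stand-in declarations for the engine types (guessed, not the real API).
    struct Event { std::string name; };
    struct Scene {
        virtual ~Scene() = default;
        virtual void init() {}     // acquire the resources the app needs later
        virtual void render() {}   // feed geometry to the pipeline; the engine
                                   // handles per-projector deformation/blending
    };
    struct App {
        void setScene(Scene*) {}
        void associateEvent(const std::string&,
                            std::function<void(const Event&)>) {}
        bool loadPlugin(const std::string&) { return true; }
        int  run() { return 0; }
    };

    // A user scene refines only the methods relevant to the application.
    struct MyScene : Scene {
        void init() override   { /* load models, textures, shaders */ }
        void render() override { /* submit the scene geometry */ }
    };

    int main()
    {
        App app;
        app.setScene(new MyScene);
        app.associateEvent("select", [](const Event&) {
            // react to a selection gesture
        });
        app.loadPlugin("fly_navigation");   // reusable interaction mode
        return app.run();
    }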

Our auto-calibration process puts minimum requirements in terms of shader programming. For example, our basic shader for Unity has 7 lines of code for the vertex program (this includes the application of the deformation matrix) and 9 lines of code for the fragment program (the blending mask is encoded as a texture).
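The actual Unity shaders are not listed in the paper; the GLSL sketch below (our code) shows the two ingredients the text describes: the vertex stage applies the deformation matrix after the usual model-view-projection transform, and the fragment stage modulates the shaded color by the blending-mask texture.

    // Vertex stage: apply the per-projector deformation after the MVP.
    const char* vertexSrc = R"(
        #version 330
        uniform mat4 mvp, deformation;   // deformation from auto-calibration
        in vec4 position;
        out vec4 clipPos;
        void main() {
            gl_Position = deformation * mvp * position;
            clipPos = gl_Position;       // forwarded to locate the mask texel
        }
    )";

    // Fragment stage: modulate the color by the blending/gamut mask.
    const char* fragmentSrc = R"(
        #version 330
        uniform sampler2D mask;          // per-projector mask texture
        in vec4 clipPos;
        out vec4 color;
        void main() {
            vec2 uv = 0.5 * (clipPos.xy / clipPos.w) + 0.5;
            vec4 shaded = vec4(1.0);     // stand-in for the app's shading
            color = shaded * texture(mask, uv);
        }
    )";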


Figure 8: Interaction space with one Kinect (a) and two Kinects (b).

Closed-source applications relying on OpenGL can also be supported via function call interception [30] to activate the appropriate shaders, although we have not experimented with this option yet.

6 Kinect-based Gestural Interaction

In this section we discuss how we support head-tracking and gesture-based interaction in our CAVE by means of Kinect cameras. Although the Kinect does not reach the position accuracy of traditional tracking systems, it provides full-body motion capture at a much lower cost.

The Kinect V1 sensor has a field of view of 57.5 degrees (horizontal) and 43.5 degrees (vertical), with a physical depth range of 0.8 m to 4 m [20]. Depth values of objects beyond the 4 m limit can also be retrieved, but at the expense of tracking noise.

6.1 Analysis of Kinect placement

Figure 8(a) shows a single-Kinect setup along with the resulting interaction space. It turns out that this setup has two important limitations. On the one hand, a large portion of the CAVE is outside the field of view of the sensor, resulting in a rather limited interaction space. Since the head position provided by the Kinect is used for head-tracking, this limits the freedom of movement of the tracked user. On the other hand, skeletal tracking is stable when the user faces the sensor, but gets noisier and unreliable as the user turns left or right, or gets partially occluded by other users. Since CAVE users often face the side screens, a single Kinect provides unreliable tracking data even with a single user. In practice, existing CAVEs using a single Kinect force the user to face the front screen, thus severely reducing the advantages of head-tracking.

The above limitations can be addressed using additional depth cameras with overlapping fields of view. However, the infrared dot patterns projected by multiple Kinects are known to interfere with each other. Experimentally, we have found that using two Kinects to illuminate the target area does increase noise in the depth images, but only at a moderate level (having almost no effect on skeletal tracking accuracy), whereas using three or more sensors results in more severe artifacts.




Figure 9: Depth images and reconstructed skeleton of a static user from two slightly rotated Kinects. The images in (a) were taken sequentially, with one Kinect being covered while the other captured the subject, whereas those in (b) were taken simultaneously and thus suffer from mutual interference.

This is also in agreement with previous studies on multi-Kinect interference issues [9, 6]. Figure 9 illustrates the effect of such mutual interference on the depth images. Although the depth images in Figure 9(b) are clearly noisier, the skeletal tracking of the Kinect [25] seems to be quite resilient to noise, as evidenced by the nearly identical skeletons reconstructed with and without interference noise.

Since we aim at improving both the physical area covered by the Kinects and the stability of gesture recognition, a reasonable option is to place the Kinects on the edges of the front screen (Figure 10). We tested several combinations of height and tilt angles: (a) at 3 m from the floor with a tilt angle of -21.75 degrees (half the vertical field of view, Fig. 10a), (b) at 1.8 m with no tilt rotation (Fig. 10b), and (c) at 0.04 m from the floor with a tilt angle of 21.75 degrees (Fig. 10c). In all three cases the rotation around the vertical axis was approximately ±28.75 degrees (half the horizontal field of view of the Kinect), as shown in Figure 8b. Since we get skeletal data through the Kinect for Windows SDK, which predicts 3D positions of body joints through a classifier optimized for people facing the sensor [25], we did not consider alternative configurations (such as placing one Kinect on top of the CAVE looking downwards).

An informal user study with a variety of selection, manipulation and navigation gestures revealed that placing the Kinects at floor level (configuration c) is the worst option, because the head is often occluded by the arms, both during interaction gestures and inadvertent gestures. The other two options provided more robust skeletal tracking. In our experiments we adopted configuration (b) because the physical space with full-body tracking better matches the CAVE space, and because it better mimics the front-facing scenario the Kinect software body-part classifier has been trained for [25].


Figure 10: Different setups with two Kinects. Both Kinects are placed at the top (a), middle (b) or bottom (c) of the vertical edges of the front screen. Only one frustum is shown for clarity; the other frustum is placed symmetrically on the opposite vertical edge of the front screen.


6.2 Gesture interaction with two Kinects

The relative pose between the two Kinects can be determined in a number of ways. We adopted the joint depth and color calibration procedure described in [13] which, like most camera calibration procedures, is based on capturing a checkerboard pattern at different poses. We first register both Kinects with respect to a high-resolution camera (placed at the center of the front screen), and then we compose the resulting rigid transformations to get the relative pose between the Kinects.
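Schematically (our notation): if T_CL and T_CR are the 4×4 rigid transforms taking points from the left and right Kinect frames into the external camera frame, the left-to-right relative pose is their composition.

    #include <opencv2/core.hpp>

    // T_CL, T_CR: rigid transforms from each Kinect frame to the external
    // camera frame, obtained with the calibration procedure of [13].
    cv::Matx44d leftToRight(const cv::Matx44d& T_CL, const cv::Matx44d& T_CR)
    {
        // Go from the left Kinect frame to the camera frame, then invert the
        // right Kinect's mapping to land in the right Kinect frame.
        return T_CR.inv() * T_CL;
    }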

Since each Kinect provides its own skeleton reconstruction, the two skeletons need to be combined into a single one before joint data can be delivered to the VR application. Let P_i^L (resp. P_i^R) be the (x, y, z) position of the i-th joint of the skeleton as reported by the Kinect placed on the left (resp. right) edge. From now on we will use P_i^j with j ∈ {L, R} to refer to these joint positions. We use a simple skeleton combination algorithm based on assigning a confidence value to each joint. This confidence value is the product of the confidence values described below.

We define the detection confidence of P_i^j as

    cf_D(P_i^j) = 2^(-n(P_i^j))    (1)

where n(P_i^j) is the number of frames the i-th joint has not been detected by the j-th Kinect (this value is reset when the joint is detected again). The joint detection flag is provided by the Kinect software. Notice that Equation 1 halves the confidence of a joint each time it is not detected.

Figure 11: Depth image from a Kinect sensor. The right hand appears unoccluded, and thus most depth values in the vicinity (yellow circle) approximately match the depth of this joint as reported by the Kinect software. Conversely, the left hand is completely occluded; if the Kinect software still estimates its position behind the user, its visibility confidence value would be close to zero.

We define the within-frustum confidence of P_i^j as

    cf_F(P_i^j) = 1 − max(0, (|NDC_x(P_i^j)| − 0.8) / 0.3)    (2)

where NDC_x(P_i^j) returns the x coordinate of point P_i^j in the normalized device space of the j-th Kinect. The bounds 0.8 and 1.1 have been found empirically. Equation 2 assigns a confidence value of 1 to joints with |NDC_x| in [0, 0.8] (thus reasonably apart from the frustum boundary) and linearly decreases the confidence to 0 in the interval [0.8, 1.1]. We use the 1.1 bound (which indicates that the joint is actually outside the frustum) because when some body parts leave the frustum, the Kinect software often still continues to predict the clipped parts, but with much less accuracy.

Let A_i^j be the screen projection of a small sphere centered at P_i^j with respect to the j-th Kinect (we used a sphere with a radius of 4 cm in our tests). The visibility confidence cf_V(P_i^j) is defined as the ratio of pixels (labeled as part of the user) in A_i^j whose depth value (from the j-th Kinect's depth image) is less than the depth value corresponding to P_i^j. In the comparison we use a 12 cm offset to account for the fact that the Kinect returns joints at some learnt distance from the body surface, rather than on the surface [25]. This confidence value attempts to identify poorly-recognized joints due to self-occlusion (see Figure 11).

We also compute an alignment confidence value that roughly measures to what extent the upper torso of the j-th skeleton is facing the sensor. We compute cf_A(P_i^j) as 1 − |s_j · v_j|, where s_j is the vector joining the left and right shoulder joints (according to the j-th Kinect), and v_j is the view vector of the j-th Kinect.

Figure 12: Pose reconstruction examples. The skeleton shown on the left (resp. right) corresponds to the Kinect located on the left (resp. right) edge. The middle skeleton is the combined skeleton. Joints in red correspond to non-detected nodes (shown at the last known position). In (a), the left Kinect is unable to detect the right arm due to occlusion, but the combined skeleton uses joint data from the other Kinect. A symmetric situation is shown in (b), this time also affecting the left ankle. In both cases the combined skeleton provides an accurate estimation of the pose.

Finally, the confidence value cf(P_i^j) of a joint is the product of the four confidence values defined above. The combined position P_i of the i-th joint is defined as the weighted average

    P_i = (cf(P_i^L)/W) P_i^L + (cf(P_i^R)/W) P_i^R,  where W = cf(P_i^L) + cf(P_i^R).
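A compact sketch of the fusion rule with simplified types (our code, not the released implementation [28]): the per-joint confidence is the product of the four terms, and the merged position is the confidence-weighted average of the two reports.

    #include <array>
    #include <cmath>

    struct JointReport {
        std::array<double, 3> pos;   // (x, y, z) from one Kinect
        double cfD, cfF, cfV, cfA;   // the four confidence terms of Section 6.2
    };

    // Detection confidence (Equation 1): halved for every frame missed.
    double detectionConfidence(int framesMissed) {
        return std::pow(2.0, -framesMissed);
    }

    // Total confidence: product of the four terms.
    double confidence(const JointReport& j) { return j.cfD * j.cfF * j.cfV * j.cfA; }

    // Merged joint position; assumes at least one Kinect reports the joint
    // with nonzero confidence.
    std::array<double, 3> fuse(const JointReport& left, const JointReport& right) {
        const double wL = confidence(left), wR = confidence(right);
        const double W = wL + wR;
        std::array<double, 3> p;
        for (int k = 0; k < 3; ++k)
            p[k] = (wL * left.pos[k] + wR * right.pos[k]) / W;
        return p;
    }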

Source code and binaries of our software for combining two or more Kinect skeletons are publicly available [28]. For maximum portability, the skeletal tracking is implemented as a VRPN [27] server, allowing an arbitrary number of clients (running on Linux in our CAVE) to connect to the VRPN server (currently running on Windows 7 and Kinect for Windows 1.8) to receive joint data. Figure 12 shows different pose estimations from two Kinects configured as in Figure 10(b).

7 Results and Discussion

7.1 Calibration

Geometric calibration results are shown in Figure 7 and in the accompanying video. The feedback from users about the calibration is discussed in Section 7.3.

Regarding chromatic calibration, however, there is still room for improvement, as discussed in Section 8. The main problem is related to the differences between wall and floor projections, as we use four HD DLP projectors for front-projection on the floor and 12 standard 1024 × 768 DLP projectors in the back of each vertical wall.



Figure 13: Path followed by the head of a moving user. The middle skeleton is the combined skeleton, which is resilient to the severe head misplacements that occur when the user is partially outside a Kinect's frustum.

The result is that the floor is darker than the walls, as seen in Figure 6. We have decided to accept some degree of brightness discontinuity between walls and floor in order not to darken the wall projectors excessively, with the goal of keeping the maximum CAVE brightness. Tuning this brightness discontinuity can be a good avenue for future research. The main fact, however, is that the brightness discontinuity does not affect the immersive experience. In our user tests, brightness and chromatic calibration discontinuities were perceived by less than 2% of the users.

7.2 Interaction

We now discuss some performance measures of the Kinect-based tracking system in our CAVE. For the experiments we used two Kinect V1 sensors placed on the vertical edges of the front screen, as in Figure 10(b).

The physical space with head tracking (which will be referred to as the target space) covers the most relevant portion of the CAVE. For an average adult user standing up, head tracking is available in 78% of the CAVE floor (about 7 m²). This contrasts with the 53% (4.7 m²) available in a single-Kinect configuration.

Figure 13 shows the path followed by the head of a moving user. The relative pose of the Kinects guarantees robust head tracking also near the side walls of the CAVE (see the accompanying video).

We did not measure head-tracking accuracy, since a minor shift in the head location does not cause important artifacts in the rendered images. Instead, we compared the noise of the head position coordinates of a still user adopting several poses. We measured the noise as √(σx² + σy² + σz²), where σx is the standard deviation of the x coordinate of the reported head position in a 400 ms window (12 samples on a 30 fps Kinect). We found the average head noise within the target space to be less than 0.1 mm, but this increased up to 5 mm in poses with the arms partially occluding the head. Using data from a single Kinect also provided an average deviation of about 0.1 mm (on a restricted target space though), but noise increased up to 56 mm in the problematic poses, making the single-Kinect option less usable.

Figure 14: Pointing with two Kinects. The combined skeleton (middle) is resilient to non-detected joints on one of the Kinects.

We also evaluated the suitability of Kinect-based tracking for virtual pointing, as this is the prevalent interaction metaphor for 3D selection tasks [2]. Again, the use of two Kinects was found to be of great value (Figure 14). We conducted an informal user study to measure the accuracy of the path followed by the pointing tool (a virtual ray) when the user was asked to follow a predefined path (both the predefined path and the pointing tool were shown on the screen). The path consisted of five horizontal segments at elevation angles ±45, ±22.5 and 0 degrees, with heading angle −45 ≤ θ ≤ 45 degrees. We tested three different pairs of joints for defining the pointing direction: the elbow-wrist pair, the shoulder-wrist pair, and the hip-wrist pair. We will refer to these conditions as ELBOW, SHOULDER and HIP. The hip point considered was the central hip joint returned by the Kinect, whereas the elbow, shoulder and wrist correspond to the user's dominant hand, which was selected manually. Figure 15 shows the paths followed by the pointing tool for a representative right-handed user. We measured the standard deviation of the angle between the actual pointing vector and its closest match in the path. The ELBOW condition was the noisiest option, with an average deviation among the six participants of σ = 2.4 degrees. The SHOULDER condition was more stable (σ = 2.0 degrees), although it still had significant jittering, especially when pointing at or above the horizon. The HIP condition was slightly more stable (σ = 1.7 degrees).

In practice, the suitability of these options for 3D selection depends on a number of factors [2]. When pointing accuracy is not critical (e.g. large, unoccluded targets) the ELBOW condition is a suitable option requiring little physical effort from the user. In other scenarios the SHOULDER or HIP conditions are more suitable, at the expense of less natural movements and higher physical effort.
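For reference, the pointing ray in each condition is just the normalized difference of the two joints involved; a small sketch with our simplified types and an assumed axis convention:

    #include <array>
    #include <cmath>

    using Vec3 = std::array<double, 3>;

    // Pointing direction for a joint pair, e.g. shoulder -> wrist in the
    // SHOULDER condition.
    Vec3 pointingDirection(const Vec3& from, const Vec3& to) {
        Vec3 d = { to[0] - from[0], to[1] - from[1], to[2] - from[2] };
        const double len = std::sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
        return { d[0]/len, d[1]/len, d[2]/len };
    }

    // Elevation/heading of a unit direction, in degrees, assuming y is up
    // and -z points at the front screen (our convention, not the paper's).
    const double kRadToDeg = 180.0 / 3.14159265358979323846;
    double elevationDeg(const Vec3& d) { return std::asin(d[1]) * kRadToDeg; }
    double headingDeg(const Vec3& d)   { return std::atan2(d[0], -d[2]) * kRadToDeg; }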

In terms of gesture-based interaction, our two-Kinect configuration provides two significant advantages: (a) a relaxation of the front-facing constraint, increasing by about 28 degrees the range of orientations for which most gestures can be recognized (with respect to a single Kinect), and (b) a more stable pose estimation in the presence of other users inside the CAVE, with gestures being recognized as long as the joints involved are seen by at least one of the two Kinects.



Figure 15: Path followed by the pointing tool using the ELBOW (a), SHOULDER (b) and HIP (c) conditions. We use an equirectangular projection to map unit vectors to the plane (d).

According to our experiments, most typical gestures can be recognized within a ±90 degree rotation with respect to the front-facing direction, which in practice is enough for casual CAVE interaction. The accompanying video shows how the combined skeleton facilitates pose reconstruction for different gestures and at different facing conditions.

7.3 User feedback

A large number of people (well over a hundred) have tested our CAVE environment over the last months. The vast majority of these users were not practitioners. Across the board, the informal feedback gathered validates both the geometric calibration and the image blending, as no users complained of any defects, and when asked, they concurred that the visualization was satisfactory and perceptually correct.

The immersive experience was usually rated as 'high', and users only became aware of the chromatic calibration discontinuities at the end, after having been informed by us of the problem. Our main conclusion is that users are more tolerant to chromatic calibration biases during immersive experiences, and that perceptual metrics for chromatic calibration discontinuities in immersive VR settings should be investigated.

Users with previous CAVE experience unanimously considered that this multi-projector CAVE was brighter than the previous ones, with a significantly higher resolution and visual quality. The main critical comment was related to the gesture-based interaction, which is still not intuitive, and to the Kinect latency.

7.4 Cost analysis

We have compared the multi-projector CAVE system presented in this paper with a standard commercial system. This commercial system uses active stereo with four projectors and four front-surface mirrors.

Synchronization is achieved through Nvidia Quadro boards with Nvidia Sync cards. The results are shown in Table 1.

The table shows that we achieve a much higher luminosity (by a factor of ≈ 5.25) at a higher resolution while having a significantly reduced cost. The projector maintenance cost is the cost of replacing a projector when it fails. This cost is remarkably low in the proposed system as a consequence of using off-the-shelf DLP projectors. Moreover, we have observed that circular polarization in conjunction with bright images projected on the walls produces a highly immersive perception and a significant presence level, probably similar to active-stereo VR systems. Experiments to confirm this last hypothesis will be part of our future work.

8 Conclusions and Future Work

We have presented a novel four-wall multi-projector CAVE architecture, powered by 40 off-the-shelf projectors and controlled by 12 PCs. It operates in passive stereo, providing high brightness at 2000 × 2000 pixel resolution on each of the four walls. We have achieved high resolution while significantly reducing the cost and increasing versatility: the system works with any mix of a wide range of projector models that can be substituted at any time for more modern or cheaper ones. The uniformity of the final image is achieved using specially designed self-calibration software which guarantees concordance and continuity. Interaction is intuitive and cable-less, operating on a gesture-based, ergonomic interaction paradigm. We dynamically merge the information from two nearly-orthogonal Kinect sensors. The two-Kinect configuration results in a more stable pose estimation and a relaxation of the front-facing constraint. Most typical gestures are well recognized for any face and body rotation up to 90 degrees with respect to the frontal CAVE direction.

In the future we plan to optimize the chromatic calibration differences between wall and floor projections, in order to have maximum brightness on the vertical walls while keeping a highly immersive perception. Perceptual metrics for chromatic calibration discontinuities in CAVE settings should also be investigated. We also plan to position the Kinects in configuration (b) of Figure 10 for optimal gesture recognition, as well as to test Kinect V2 sensors.



                            Cost of projectors   Cost of projector   Resolution     Lumens
                            and computers        maintenance         per wall       per wall
Standard commercial CAVE    299 K€               n.a.                1920 × 1200    4000
Multi-projector CAVE        52 K€                0.6 K€              2000 × 2000    ≈ 21000

Table 1: Comparison between a commercial CAVE-like system and the proposed setup.

Acknowledgements

We would like to thank Oscar Argudo, Pedro Hermosilla, Jordi Moyes, Isabel Navazo and Adrià Vernetta for their valuable contributions. This work has been partially funded by the Spanish Ministry of Science and Innovation under grants TIN2010-20590-C01-01 and TIN2014-52211-C2-1-R.

References

[1] Clemens Amon and Ferdinand Fuhrmann. Evaluation of the spatial resolution accuracy of the face tracking system for Kinect for Windows V1 and V2. In Proceedings of the 6th Congress of the Alps-Adria Acoustics Association, October 2014.

[2] Ferran Argelaguet and Carlos Andujar. A survey of 3D object selection techniques for virtual environments. Computers & Graphics, 37(3):121–136, 2013.

[3] Kai Berger, Kai Ruhl, Yannic Schroeder, Christian Bruemmer, Alexander Scholz, and Marcus A. Magnor. Markerless motion capture using multiple color-depth sensors. In VMV, pages 317–324, 2011.

[4] Allen Bierbaum, Christopher Just, Patrick Hartling, Kevin Meinert, Albert Baker, and Carolina Cruz-Neira. VR Juggler: A virtual platform for virtual reality application development. In Proceedings of IEEE Virtual Reality 2001, pages 89–96. IEEE, 2001.

[5] Rozenn Bouville, Valerie Gouranton, Thomas Boggini, Florian Nouviale, and Bruno Arnaldi. #FIVE: High-level components for developing collaborative and interactive virtual environments. In Eighth Workshop on Software Engineering and Architectures for Realtime Interactive Systems (SEARIS 2015), in conjunction with IEEE Virtual Reality (VR), Arles, France, March 2015.

[6] Maurizio Caon, Yong Yue, Julien Tscherrig, Elena Mugellini, and Omar Abou Khaled. Context-aware 3D gesture interaction based on multiple Kinects. In AMBIENT 2011, The First International Conference on Ambient Computing, Applications, Services and Technologies, pages 7–12, 2011.

[7] H. Chen, R. Sukthankar, G. Wallace, and K. Li. Scalable alignment of large-format multi-projector displays using camera homography trees. In IEEE Visualization 2002, pages 339–346, 2002.

[8] Thomas A. DeFanti, Gregory Dawe, Daniel J. Sandin, Jurgen P. Schulze, Peter Otto, Javier Girado, Falko Kuester, Larry Smarr, and Ramesh Rao. The StarCAVE, a third-generation CAVE and virtual reality OptIPortal. Future Generation Computer Systems, 25(2):169–178, 2009.

[9] Florian Faion, Simon Friedberger, Antonio Zea, and Uwe D. Hanebeck. Intelligent sensor-scheduling for multi-Kinect-tracking. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 3993–3999. IEEE, 2012.

[10] D. Freedman, A. Shpunt, M. Machline, and Y. Arieli. Depth mapping using projected patterns. US patent 2010/0118123, May 2010.

[11] Hermann Furntratt and Helmut Neuschmied. Evaluating pointing accuracy on Kinect V2 sensor. In Proceedings of the 2nd International Conference on Human-Computer Interaction, pages 124-1–124-5, August 2014.

[12] M.A. Hereld, I.R. Judson, and R.L. Stevens. Introduction to building projection-based tiled display systems. IEEE Computer Graphics and Applications, 20(4):22–28, 2000.

[13] Daniel Herrera C., Juho Kannala, and Janne Heikkila. Joint depth and color camera calibration with distortion correction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):2058–2064, 2012.

[14] J. Jerald, P. Giokaris, D. Woodall, A. Hartbolt, A. Chandak, and S. Kuntz. Developing virtual reality applications with Unity. In IEEE Virtual Reality (VR), 2014 Tutorials, pages 1–3, March 2014.

[15] M.A. Livingston, J. Sebastian, Zhuming Ai, and J.W. Decker. Performance measurements for the Microsoft Kinect skeleton. In Virtual Reality Short Papers and Posters (VRW), 2012 IEEE, pages 119–120, 2012.

[16] A. Maimone and H. Fuchs. Encumbrance-free telepresence system with real-time 3D capture and display using commodity depth cameras. In Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, pages 137–146, 2011.

[17] Aditi Majumder and M. Gopi. Modeling color properties of tiled displays. Computer Graphics Forum, 24(2):149–163, 2005.

[18] Aditi Majumder and Rick Stevens. Color nonuniformity in projection-based displays: Analysis and solutions. IEEE Transactions on Visualization and Computer Graphics, 10(2):177–188, March 2004.



[19] Aditi Majumder and Rick Stevens. Perceptual photometric seamlessness in projection-based tiled displays. ACM Transactions on Graphics, 24(1):118–139, January 2005.

[20] Microsoft Corporation. Kinect for Windows. Human Interface Guidelines v1.8, 2013.

[21] R. Raskar, M.S. Brown, Ruigang Yang, Wei-Chao Chen, G. Welch, H. Towles, B. Scales, and H. Fuchs. Multi-projector displays using camera-based registration. In Proceedings of Visualization '99, pages 161–522, 1999.

[22] Behzad Sajadi, Maxim Lazarov, M. Gopi, and Aditi Majumder. Color seamlessness in multi-projector displays using constrained gamut morphing. IEEE Transactions on Visualization and Computer Graphics, 15(6):1317–1326, 2009.

[23] Behzad Sajadi and Aditi Majumder. Autocalibration of multiprojector CAVE-like immersive environments. IEEE Transactions on Visualization and Computer Graphics, 18(3):381–393, March 2012.

[24] Yannic Schroder, Alexander Scholz, Kai Berger, Kai Ruhl, Stefan Guthe, and Marcus Magnor. Multiple Kinect studies. Technical Report 09-15, ICG, October 2011.

[25] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116–124, 2013.

[26] Evan A. Suma, David M. Krum, Belinda Lange, Sebastian Koenig, Albert Rizzo, and Mark Bolas. Adapting user interfaces for gestural interaction with the flexible action and articulated skeleton toolkit. Computers & Graphics, 37(3):193–201, 2013.

[27] Russell M. Taylor II, Thomas C. Hudson, Adam Seeger, Hans Weber, Jeffrey Juliano, and Aron T. Helser. VRPN: a device-independent, network-transparent VR peripheral system. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, pages 55–61. ACM, 2001.

[28] Miguel Angel Vico. Multi-Kinect VRPN server, 2013. github.com/mvm9289/multikinect.

[29] Zhengyou Zhang. Microsoft Kinect sensor and its effect. IEEE Multimedia, 19(2):4–10, 2012.

[30] David J. Zielinski, Ryan P. McMahan, Solaiman Shokur, Edgard Morya, and Regis Kopper. Enabling closed-source applications for virtual reality via OpenGL intercept-based techniques. In IEEE Workshop on Software Engineering and Architectures for Realtime Interactive Systems (SEARIS), pages 59–64. IEEE, March 2014.
