
Video Tourism: A Video Extension to Photo Tourism

Austin Reiter

Abstract— Photo tourism is a system for interactively exploring an unstructured collection of photographs: it reconstructs the viewpoint of the camera that took each photograph and also produces a sparse point cloud of the scene. This project extends the concept to video. Any video taken of a scene that has already been reconstructed can be displayed as the trajectory, in both 3D position and orientation (pose), that the camera followed while capturing it. Just as users can manipulate the virtual 3D environment and switch between different people's photographs at varying levels of detail, the following work proposes a method to incorporate video taken from different viewpoints and to show how people moved while capturing it.

I. INTRODUCTION

The sharing of personal photographs among large groups of unrelated people has become very popular in recent years with advances in cell phone technology and photo-sharing websites such as Flickr, Google's Picasa, and Facebook. These days, just about everybody carries a high-powered computer with HD imaging capabilities in their pocket in the form of a smart phone. More recently, these phones have gained HD video capabilities, bringing about a new wave of capturing experiences as moving images rather than single-shot still photographs.

Photo tourism [1] is a system developed in a joint collaboration between researchers at the University of Washington and Microsoft Corporation. It compiles a large collection of unstructured photographs, all presumably aimed at the same scene but taken from varying viewpoints and with vastly different imaging devices. Using a technique called structure-from-motion (SFM), the positions and orientations of each of the cameras, as well as a sparse 3D point cloud of the scene, are reconstructed in a joint optimization procedure. Essentially, by viewing the same scene point across multiple views, triangulation can be used to deduce the 3D position of that scene point, while at the same time finding the globally-optimal solution for where each of the cameras must have been to see that scene point as they did. Users can then, after the fact, see where they were when a photograph was taken and also manipulate this virtual 3D environment to browse through other people's personal photographs in a scene-consistent manner, moving across different levels of resolution, content, and detail. The authors have provided an open-source implementation of this software, called Bundler, at http://phototour.cs.washington.edu/bundler/.
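To make the triangulation step concrete, the following is a minimal sketch (in Python with OpenCV and NumPy) of recovering a single scene point from two views whose projection matrices are known. The intrinsics, poses, and pixel coordinates here are hypothetical values chosen only for illustration; in SFM the camera poses are themselves unknowns of the joint optimization.

import numpy as np
import cv2

# Hypothetical intrinsics and two camera poses (illustration only).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                  # reference camera
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])  # second camera, shifted

# The same scene point observed in both images (2x1 pixel coordinates).
pts1 = np.array([[320.0], [240.0]])
pts2 = np.array([[280.0], [240.0]])

# Linear triangulation returns the point in homogeneous coordinates.
X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)
X = (X_h[:3] / X_h[3]).ravel()   # roughly (0, 0, 10) for these observations
print("triangulated scene point:", X)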

A. Filling the Model

A subsequent paper [2] uses the output of Bundler and produces a denser 3D model of the environment using

Fig. 1. The concept of Photo tourism is to collect scattered photographs of the same scene into a sparse 3D point cloud plus the 3D positions and orientations from which each camera took each photograph. Video Tourism extends this to video, recovering the positions of videos taken of the same scene, which can then be stitched in for manipulation in a virtual 3D environment in the same way.

rectangular image patch data, in a method called Patch-based Multi-view Stereo (PMVS). This can be used to improve the virtual 3D environment which the user can manipulate, providing more photo-realism. In this work, I use PMVS to provide a better model of the scene than the sparse output of Bundler. The PMVS source code is also open-source and available at: http://grail.cs.washington.edu/software/pmvs/.

B. Supplementing With Video

The goal of this work is to supplement the previously-described system with video. One can imagine a video-sharing website, similar in spirit to Flickr, whereby users upload personal videos taken with their cell phones or personal video cameras, say of vacations, and share them with family and friends (different from YouTube). In the same way that we can collect photographs of a particular scene, we can also collect videos. Videos are advantageous in that they capture the spirit of a moment in somebody's life more vividly than a single photograph can. The motion of the people in the video, and how things change in the environment, can pull in a viewer more than a single snapshot ever could.

Using the reconstruction from Bundler, I describe a method to recover the 3D trajectory (in position and orientation) of a video viewing that same scene, as a person moves around and films in this environment. The features that come associated with the sparse point cloud help to


deduce the camera pose at each frame in the video. This can be done efficiently using a variation on an algorithm called Pose from Orthography and Scaling with Iterations (POSIT) [3]. Next I describe the method, then show some results and discuss shortcomings.

II. METHOD

The input to the system described here is simply a collection of unordered photographs of the same scene, taken with an everyday digital camera (or several different cameras). The approximate focal length should be known; it can come either from the camera manufacturer's specifications or from an offline camera calibration. The Bundler algorithm then refines the focal length of each image's camera inside its optimization procedure. In the end, a text file is produced which specifies the 3D position and orientation of the camera associated with each photograph, followed by a list of 3D points. First I'll discuss how this part works.

A. Using Bundler

Bundler uses SIFT [4] features, which are computed on each image, and computes an initial set of potential point matches from each image to every other image. The traditional way to do this is, for each image, to construct a KD-tree of the 128-dimensional SIFT feature vectors from that image. Then, for every other image, match its SIFT features against this KD-tree and collect the potential matches. Repeating this for all input images, the matches are stored in a text file as input for the next stage.
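A rough sketch of this matching stage is shown below, using OpenCV's SIFT and a FLANN KD-tree index (my actual pipeline used a custom C++ SIFT implementation, and the filenames here are placeholders, but the structure is the same). The ratio threshold is one plausible value, not a prescribed one.

import itertools
import cv2

def extract_sift(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    return sift.detectAndCompute(gray, None)   # (keypoints, Nx128 descriptors)

def match_pair(desc_train, desc_query, ratio=0.6):
    # KD-tree over one image's descriptors, queried with the other image's.
    flann = cv2.FlannBasedMatcher({'algorithm': 1, 'trees': 4}, {'checks': 64})
    knn = flann.knnMatch(desc_query, desc_train, k=2)
    # Keep a match only if it is clearly better than the runner-up (Lowe's ratio test).
    return [m for m, n in knn if m.distance < ratio * n.distance]

image_paths = ['img_000.jpg', 'img_001.jpg', 'img_002.jpg']   # placeholder filenames
features = {p: extract_sift(p) for p in image_paths}
matches = {}
for a, b in itertools.combinations(image_paths, 2):
    matches[(a, b)] = match_pair(features[a][1], features[b][1])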

The Bundler software suggests using the binaries provided by David Lowe (http://www.cs.ubc.ca/~lowe/keypoints/) to extract the SIFT features. However, in my experiments (described below) I use a NIKON D100 digital camera with an image resolution of 3008x2000 pixels. Lowe's software crashes due to memory issues, and I was not able to use his binaries without downsampling my images, which I did not want to do. I therefore used my own C++ implementation of SIFT, built on the OpenCV library, which can process larger images. Lowe's binaries output the SIFT keypoints in a particular file format which Bundler expects, so I adapted my code to write out the same file format for each set of SIFT keypoints. This is then used as input to the Bundler software.
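Below is a sketch of the writer I targeted, assuming the layout used by Lowe's binaries: a header line with the keypoint count and descriptor length, then for each keypoint a row/column/scale/orientation line followed by its 128 descriptor values as integers. The exact precision and line-wrapping conventions should be checked against Lowe's documentation.

def write_key_file(path, keypoints, descriptors):
    """keypoints: list of (row, col, scale, orientation); descriptors: Nx128 values in [0, 255]."""
    with open(path, 'w') as f:
        f.write('%d 128\n' % len(keypoints))
        for (row, col, scale, orientation), desc in zip(keypoints, descriptors):
            f.write('%.2f %.2f %.2f %.3f\n' % (row, col, scale, orientation))
            # Descriptor values written as integers, a handful per line.
            for i in range(0, 128, 20):
                f.write(' '.join(str(int(v)) for v in desc[i:i + 20]) + '\n')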

Using this list of potential matches across all of the frames, Bundler iteratively performs SFM to find the best combination of 3D scene points and camera poses corresponding to the matches that were provided. A global joint optimization computes a sparse 3D point cloud and the positions and orientations of cameras consistent with this result. Some cameras may be excluded for not having sufficient data to support the reconstruction, and these must be ignored in subsequent steps.

On final output, the poses of the cameras which succeeded are listed in a text file, along with a list of the 3D scene points which survived. Along with each 3D point is a list of the camera views which were able to successfully view that scene point, and the index of the SIFT feature associated with that scene point in each such image. This information is

essential to my approach. I take in this text file and collect all of the 3D scene points. For each scene point, I collect the SIFT feature vectors associated with it from each of the camera images which viewed that point. In the end, I have several SIFT descriptors per scene point, but I only want a single feature vector to describe it. Because of the robustness of SIFT features, I can take the average of these matching SIFT descriptors, since they are presumed to be close already; that is how they matched in the first place. I then re-normalize as described in [4]: the averaged descriptor is first normalized to unit length, then thresholded at 0.2, and then re-normalized to unit length. This step was important for getting good 3D scene matches in the video step later on.
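A minimal sketch of this averaging and re-normalization step (in Python/NumPy, under the assumption that the per-view descriptors for one scene point have already been gathered from the Bundler output):

import numpy as np

def scene_point_descriptor(view_descriptors):
    """view_descriptors: list of 128-D SIFT vectors observed for one scene point."""
    mean = np.mean(np.asarray(view_descriptors, dtype=np.float64), axis=0)
    mean /= np.linalg.norm(mean)    # normalize the averaged descriptor to unit length
    mean = np.minimum(mean, 0.2)    # threshold large components at 0.2, as in [4]
    mean /= np.linalg.norm(mean)    # re-normalize to unit length
    return mean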

B. Using PMVS

Incorporating PMVS into the system was very convenient because the researchers who wrote the Bundler code were aware of this research and provide a wrapper program which takes their output and converts it into a format that can be ingested by the PMVS software. PMVS, a computationally-intensive procedure, takes the images that were input into Bundler, along with the sparse 3D point cloud and camera positions, and produces a denser 3D model using the image patch data. Results are shown in the experiments section below.

C. Using POSIT

Finally, I take a video of the same scene as the reconstruction and attempt to recover the 3D trajectory that was followed to produce that video, to be displayed within the 3D virtual environment. For the purposes of this project, I only show the 3D positions; however, note that the 3D orientation is also being computed and verified (loosely).

Using only the 3D scene points and the associated (averaged) SIFT descriptor for each scene point, I first build a KD-tree of these scene descriptors. Then, for each frame of the video sequence, I do the following:

1) Extract SIFT features from the video frame.
2) Match the SIFT features into the KD-tree of scene feature descriptors, employing the same matching technique as in Lowe's paper: the second-closest point must be a ratio away from the closest, so that false matches are minimized.
3) Compute the pose of the camera for that frame using a variation on POSIT.
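The sketch below shows the shape of this per-frame loop, assuming the scene points and their averaged descriptors are available as NumPy arrays and the camera intrinsics K are known from calibration. The helper estimate_pose_ransac is the pose step described in the remainder of this section; the ratio threshold is illustrative.

import cv2
import numpy as np

def track_video(video_path, scene_points, scene_descs, K, ratio=0.7):
    """scene_points: Mx3 array; scene_descs: Mx128 averaged scene descriptors."""
    sift = cv2.SIFT_create()
    matcher = cv2.FlannBasedMatcher({'algorithm': 1, 'trees': 4}, {'checks': 64})
    matcher.add([scene_descs.astype(np.float32)])   # KD-tree over the scene descriptors
    cap = cv2.VideoCapture(video_path)
    poses = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        kps, descs = sift.detectAndCompute(gray, None)           # step 1: SIFT on the frame
        pts3d, pts2d = [], []
        for m, n in matcher.knnMatch(descs, k=2):                # step 2: ratio-test matching
            if m.distance < ratio * n.distance:
                pts3d.append(scene_points[m.trainIdx])
                pts2d.append(kps[m.queryIdx].pt)
        if len(pts3d) >= 5:                                      # step 3: pose (see below)
            poses.append(estimate_pose_ransac(np.array(pts3d), np.array(pts2d), K))
    return poses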

POSIT takes as input a list of 3D-2D correspondences and requires at least four non-coplanar points. However, it is not robust to false matches: even a single bad match will produce an incorrect pose. But when the matches are correct, a 3D pose is generated, consisting of a 3x3 rotation matrix and a 3x1 translation vector, representing the 3D rigid-body transformation that maps points from the model's coordinate system to the camera's coordinate system. This yields the pose of the camera that took the image, which is what we seek.


Fig. 2. 42 still frame images used as input to Bundler to reconstruct a wall-sized bookshelf in a small office.

This works only because Bundler produces, for each scene point, the photo-realistic image descriptors that produced that point. Often in model-based vision techniques, only a virtual model is available, and so matching using image features simply does not work well beyond using edges. In order to use more robust feature methods such as SIFT in a model-based approach, we need to associate SIFT descriptors with the 3D scene points; this is the key component that most methods lack (i.e., 2D image features that can be matched against photo-realistic descriptors attached to the 3D points). In this case, we get that for free from Bundler, and we can therefore match 2D images against the 3D model using robust feature techniques. However, the matching described above will almost certainly yield some false matches. Although SIFT is quite good at wide-viewpoint feature matching, each scene descriptor is a combination of very different viewpoints of the same scene point, so false matches are inevitable, and these will break POSIT.

The fix to this issue is to apply RANSAC [5] to POSIT. RANSAC works as follows: iteratively select the minimum number of points needed to fit a model (in this case, I choose ~5 3D-2D point correspondences for POSIT), apply that model (the 3D pose) to all of the potential point matches, and compute an error score (i.e., how well each 3D point back-projects to its matched pixel). Then save the best model, namely the one with the lowest overall error score. This serves two important purposes: first, as long as there are enough correct matches, RANSAC will likely find the correct pose even in the presence of outliers; second, any matches that were incorrect can be fixed by re-projecting them to their correct location in the image (consistent with the correct 3D pose of the camera in the scene).
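The following sketch shows the structure of this RANSAC loop. I use OpenCV's solvePnP here merely as a stand-in for the POSIT variant (my implementation used POSIT itself), and the iteration count and error cap are illustrative; the scoring is the back-projection error described above.

import cv2
import numpy as np

def estimate_pose_ransac(pts3d, pts2d, K, iters=500, sample_size=5, err_cap=4.0):
    """pts3d: Nx3 scene points; pts2d: Nx2 matched pixels; returns (3x3 R, 3x1 t) or None."""
    best_score, best_pose = np.inf, None
    for _ in range(iters):
        idx = np.random.choice(len(pts3d), sample_size, replace=False)
        ok, rvec, tvec = cv2.solvePnP(pts3d[idx], pts2d[idx], K, None,
                                      flags=cv2.SOLVEPNP_EPNP)   # stand-in for POSIT
        if not ok:
            continue
        # Score the hypothesis: back-project all 3D points and measure the pixel error,
        # capping each residual so outliers do not dominate the total.
        proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
        err = np.linalg.norm(proj.reshape(-1, 2) - pts2d, axis=1)
        score = np.minimum(err, err_cap).sum()
        if score < best_score:
            best_score, best_pose = score, (cv2.Rodrigues(rvec)[0], tvec)
    return best_pose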

In the end, POSIT with RANSAC can compute the pose of the camera for the current video frame very efficiently. The following section shows results on different scenes.

III. RESULTS

I chose three different scenes to evaluate this algorithm: (A) a wall bookshelf in a small office, (B) a statue of the Columbia University Lion, and (C) a large building (Butler Library at Columbia University).

Fig. 3. Bundler output of the sparse point cloud and camera viewpoints of the office scene. The point cloud is colored using RGB colors from the camera images. The camera poses are shown as red/green/yellow dots in 3D space, showing where each image was taken with respect to the shelves.

A. Office

Figure 2 shows thumbnails of the 42 sample images of a wall-sized bookshelf in a small office. SIFT features are computed on these and, along with the images, are input into Bundler to produce a sparse point cloud and 3D camera poses for each of the images. The results of this reconstruction are shown in figure 3, where a sparse point cloud with RGB colors from the images is displayed, with the camera positions shown as red/green/yellow dots on the right-hand side of the figure. As you can see, the point cloud is very sparse, and it is hard to even make out the structure of the bookshelf.

To improve this, we next run the output of Bundler through PMVS to get a denser reconstruction of the scene, shown at the top of figure 4. This is a more photo-realistic 3D model rendering. Finally, using a standard Logitech webcam,

Fig. 4. [Top] PMVS output of a dense point cloud of the office scene. Here, the structure and color of the scene are much clearer to see in a virtual 3D environment, and would be much more user-friendly to manipulate. [Bottom] Video camera reconstruction of the office scene. Red dots show individual camera positions by a desk in front of the shelves, as the camera is waved in a circular pattern at a roughly constant distance.


Fig. 5. 39 still frame images used as input to Bundler to reconstruct a statue of the Columbia Lion.

I waved the camera in front of the bookshelf, roughly in a circle at a fairly constant depth. The bottom of figure 4 shows these results inside the model produced by PMVS, where red dots show the individual 3D locations of the image frames in the video (no orientation is shown). Although there are some outliers from incorrect camera recoveries, for the most part the cameras were recovered at the correct locations (in the absence of ground truth). However, this was not a structured movement and is hard to evaluate, so I next tried trajectories for which it is easier to judge whether the result worked.

B. Columbia Lion

For the next test, I took still-shot images in a circle around a statue of the Columbia Lion, located on Columbia University's campus. Figure 5 shows thumbnails of the 39 images which were used in Bundler to produce a sparse point cloud and the camera locations. Even though I circled 360 degrees around the statue, the software only recovered half of the lion (see figure 6). It is possible that I did not take enough pictures to tie the scene together. In any case, this was still sufficient to test the video approach.

To get a better 3D model, I ran the output through PMVS and obtained the model shown at the top of figure 7. The resulting model is more visually appealing for viewing the 3D structure of the lion statue. Finally, I took a video along a trajectory that was predictable and easier to evaluate: I moved in a straight line along one side of the lion (the part that was reconstructed correctly) at a constant height. The recovered trajectory should therefore look roughly straight alongside the lion, with a somewhat smooth movement.

The bottom of figure 7 shows the recovered camera trajectory. The red dots show the individual camera positions, and the result is indeed a smooth, linear shape. This time there were fewer incorrect camera pose estimates, because there were many more positive matches for the POSIT pose estimation. The significant texture on the statue facilitated good matches and made the RANSAC approach less susceptible to false solutions.

Fig. 6. Bundler output of the sparse point cloud and camera viewpoints of the Lion statue. The point cloud is colored using RGB colors from the camera images. The camera poses are shown as red/green/yellow dots in 3D space, showing where each image was taken. Even though a full 360-degree circle was made around the lion, only half of it was recovered (likely due to insufficient data to tie the two sides together).

C. Butler Library

Finally, I tested on a much larger scene, Butler Library on the Columbia University campus. Figure 8 shows thumbnails of the 60 images used as input to Bundler. Figure 9 shows the results of reconstructing the sparse point cloud and camera positions. This reconstruction was much sparser than I expected, although I took images from very different viewpoints in terms of orientation and distance, which it seemed to recover well. Finally, a more appealing model is computed from PMVS and shown at the top of

Fig. 7. [Top] PMVS output of a dense point cloud of the Lion statue. Here, the structure and color of the scene are much clearer to see in a virtual 3D environment. [Bottom] Video camera reconstruction of the Lion statue. Red dots show individual camera positions in a smooth linear trajectory around the outside of half of the statue.


Fig. 8. 60 still frame images used as input to Bundler to reconstruct Butler Library.

figure 10.

There are two lawns in front of Butler, with a long straight path between them that leads up to the front door of the library. My test video trajectory was a straight walk down this path, although it was somewhat less smooth in reality because the camera shook with my walking motion (with a computer in my hand). The bottom of figure 10 shows the recovered camera trajectory. You can see that although the straight path was roughly recovered, there were many more incorrect pose estimates. This is possibly due to poorer feature matching, attributable to the repetitive structures on the front of the building. However, it might be possible to smooth this trajectory and use frame-to-frame estimates to constrain large movements over short periods of time.

IV. FEATURE MATCHING EVALUATION

The ability to match features accurately lies at the heart of this method. SIFT has been extremely successful in the computer vision community because of its ability to match features across very wide viewpoints and perspectives. In this work, SIFT out-performed even my own expectations. However, it is not magic and still often produces bad matches. As long as we get enough good matches as input to the RANSAC procedure, we will estimate the correct pose, and then we can fix the bad feature matches. In this section, I show results for both of these cases.

Figure 11 shows some feature matching results for the Lion statue sequence. There are three rows of images. In the left column, I show some still-shot images which were used as the inputs to Bundler to reconstruct the initial camera positions. Recall that on output, we get a list of 3D scene points that make up the sparse point cloud, and for each 3D

Fig. 9. Bundler output of the sparse point cloud and camera viewpoints of Butler Library. The point cloud is colored using RGB colors from the camera images. The camera poses are shown as red/green/yellow dots in 3D space.

point we get a list of the 2D images which can see that scene point, as well as the SIFT descriptor that goes along with it in each such image frame. When matching into the scene descriptor KD-tree, I store one of these images for reference; I verified manually, offline, that the stored image points are actually the same scene point across the frames, indicating that Bundler does a good job of making this information available. This allowed me to verify that the feature matching with the video frames is actually working. In the left column, the red circle indicates a single feature of interest, namely the one labeled as the best match for some feature in the video frame on the right. In the right column, the green circles indicate the match obtained from the initial descriptor matching against the scene feature tree (to the red circle on the left). The red circles on the right show the back-projection of the 3D scene point after I obtain the

Fig. 10. [Top] PMVS output of a dense point cloud of Butler Library. Here, the structure and color of the scene are much clearer to see in a virtual 3D environment. [Bottom] Video camera reconstruction of a straight path towards Butler. Red dots show individual camera positions recovered by the current method along the straight path towards Butler's front door.


POSIT pose estimate. Using a pinhole camera model, the 3D point back-projects to a camera pixel and should represent the same 3D scene point in both frames.
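For reference, the back-projection itself is just the pinhole model applied to the estimated pose; a minimal sketch, assuming the pose (R, t) maps scene coordinates into the camera frame and the intrinsics fx, fy, cx, cy are known:

import numpy as np

def back_project(X_scene, R, t, fx, fy, cx, cy):
    """Project a 3D scene point to a pixel with the pinhole camera model."""
    Xc = R @ X_scene + t.ravel()     # scene point expressed in camera coordinates
    u = fx * Xc[0] / Xc[2] + cx      # perspective division, then principal point offset
    v = fy * Xc[1] / Xc[2] + cy
    return np.array([u, v])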

The top row shows a successful match into the scene descriptor tree, across a very wide viewpoint change. The red circle overlays the green circle on the right, indicating a good initial match and a good subsequent pose estimate. If the pose estimate were wrong, the physical 3D scene point would not back-project to the correct pixel in the video frame. This is a loose verification of the pose estimate, but still a valid one. The middle and bottom rows, however, show bad initial matches. Nevertheless, I get the correct pose estimate from POSIT, and you can see the back-projection land at the correct location in the images, essentially fixing the feature match. It is also a verification that RANSAC is finding the right pose. The middle row shows an initial match to part of the building in the background that should have gone with the lion's head, and the correct pose indicates this. The bottom row shows confusion with a tree, but this too gets fixed in the end.

Figure 12 shows feature matching results for Butler Library. Using the same color coding, I show four examples (two good matches and two bad matches). The top two rows show good initial matches and subsequent pose estimates. The bottom two rows show bad initial matches with good pose estimates, which in turn fix the locations in the image to the correct scene points. The third row shows a mismatch with part of a bush in the foreground of the scene, correctly relocated to the top-left corner of Butler. The last row shows a confused match with one of the windows at the top of Butler, which is understandable given the repetitive nature of the structures on the building. However, there were enough proper matches to fix this through a good camera pose estimate.

Note that in all of these figures, the circles being drawn do not indicate the scale of the SIFT feature, as is commonly done. They simply indicate locations of interest in the image frames.

V. CONCLUSIONS

There are a few drawbacks to this method. In my experiments, I do not account for lens distortion, and that may affect the accuracy of the results. In addition, I calibrated the video camera offline to retrieve the intrinsics (focal length and principal point), but the auto-focus feature can (and does) change the focal length, so the focal length used in POSIT is not exactly the ideal value for any given video frame.

There are also many visibly bad pose estimates, especially in the Butler sequence. This tends to happen when more bad matches than good matches go into the RANSAC procedure. There are some things that could be done to alleviate this. One would be after-the-fact smoothing, such as fitting a spline. Alternatively, I could apply a Kalman filter to penalize large jumps between consecutive frames, which should not occur in practice since people tend to move somewhat smoothly from frame to frame. Finally,

more could be done to improve the feature matching capabilities, such as experimenting with other descriptors.
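As one concrete form of the after-the-fact smoothing suggested above, a smoothing spline could be fit independently to each coordinate of the recovered camera centers; the sketch below uses SciPy, with the smoothing factor left as a tunable assumption.

import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth_trajectory(positions, s=1.0):
    """positions: Nx3 array of recovered camera centers, one row per video frame."""
    t = np.arange(len(positions))
    return np.column_stack([UnivariateSpline(t, positions[:, d], s=s)(t)
                            for d in range(3)])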

I do believe this could be a useful extension to Photo Tourism and, if done well, could provide some enjoyment to people browsing these reconstructed 3D virtual environments with video. There might be some robotics applications as well, such as helping a robot map its environment based on pre-built models, potentially more robustly than current SLAM algorithms because of the use of robust, SIFT-like feature matching.

Finally, some words on computational efficiency. The current bottleneck is mostly the SIFT extraction on each frame of the video; SIFT is notoriously not a real-time algorithm. Other similar algorithms have been proposed, such as Speeded Up Robust Features (SURF) [6], which are less computationally intensive. But I would not want to move to patch-based features, which are much worse at matching across viewpoints. The goal would be to keep the robust matching capabilities of SIFT with a faster runtime, which might be worth looking into to speed up the processing. The runtime of POSIT with RANSAC, however, is quite low (roughly 25-40 ms).

REFERENCES

[1] Snavely, N., Seitz, S. M., and Szeliski, R. "Photo tourism: Exploring photo collections in 3D," ACM Transactions on Graphics (SIGGRAPH Proceedings), 25(3), 2006, pp. 835-846.

[2] Furukawa, Y. and Ponce, J. "Accurate, Dense, and Robust Multi-View Stereopsis," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, July 2007.

[3] DeMenthon, D. and Davis, L. S. "Model-Based Object Pose in 25 Lines of Code," International Journal of Computer Vision, 15, pp. 123-141, June 1995.

[4] Lowe, D. G. "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60(2), 2004, pp. 91-110.

[5] Fischler, M. A. and Bolles, R. C. "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, 24(6), pp. 381-395, June 1981.

[6] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding, 110(3), pp. 346-359, 2008.


Fig. 11. Feature matching for the Lion statue. For each of the 3 rows of images, the left side is the look-up image from the still images used in Bundler. Recall that for each 3D scene point, we have a list of the camera views that were able to see that point (and where); these are examples of one of the possibly many views that saw each scene point. The right side is the view from the video sequence. On the left, red circles indicate a single feature of interest. On the right, the green circles indicate the match obtained from the KD-tree SIFT matching of the video frame's features to the scene descriptors. The red circles on the right show the corrected feature location after POSIT with RANSAC, more specifically the back-projection of the 3D scene point into the video image using the pose estimate from POSIT. When the pose is correct, a good match gets back-projected onto itself (red overlaid on green on the right); otherwise, you see a correction of the image position consistent with the pose of the camera, shown by the blue lines. [Top] An example of a good feature match across a very wide viewpoint, where SIFT does very well. [Middle, Bottom] Bad matches are shown in green on the right, but the correct pose estimate from POSIT corrects the positions in the image frame.


Fig. 12. Feature matching for Butler Library. The color coding for the circles is the same as in figure 11. [Rows 1,2] Examples of good feature matches across wide viewpoints and distances, where SIFT does very well. The back-projection of the 3D scene point, after POSIT, reprojects to the correct location. [Rows 3,4] Bad matches are shown in green on the right, but the correct pose estimate from POSIT corrects the positions in the image frame.