Image Mosaicing for Tele-Reality Applicationswhitaker/classes/image_proc/crl-94-2.pdfreality in this...
Transcript of Image Mosaicing for Tele-Reality Applicationswhitaker/classes/image_proc/crl-94-2.pdfreality in this...
ImageMosaicingfor Tele-Reality
ApplicationsRichardSzeliski
Digital EquipmentCorporation
CambridgeResearchLab
CRL 94/2 May, 1994
Digital Equipment Corporation has four research facilities: the Systems Research Center and theWestern Research Laboratory, both in Palo Alto, California; the Paris Research Laboratory, inParis; and the Cambridge Research Laboratory, in Cambridge, Massachusetts.
The Cambridge laboratory became operational in 1988 and is located at One Kendall Square,near MIT. CRL engages in computing research to extend the state of the computing art in areaslikely to be important to Digital and its customers in future years. CRL’s main focus isapplications technology; that is, the creation of knowledge and tools useful for the preparation ofimportant classes of applications.
CRL Technical Reports can be ordered by electronic mail. To receive instructions, send a mes-sage to one of the following addresses, with the word
������� in the Subject line:
On Digital’s EASYnet: CRL::TECHREPORTSOn the Internet: [email protected]
This work may not be copied or reproduced for any commercial purpose. Permission to copy without payment isgranted for non-profit educational and research purposes provided all such copies include a notice that such copy-ing is by permission of the Cambridge Research Lab of Digital Equipment Corporation, an acknowledgment of theauthors to the work, and all applicable portions of the copyright notice.
The Digital logo is a trademark of Digital Equipment Corporation.
Cambridge Research LaboratoryOne Kendall SquareCambridge, Massachusetts 02139
TM
ImageMosaicingfor Tele-Reality
ApplicationsRichardSzeliski
Digital EquipmentCorporation
CambridgeResearchLab
CRL 94/2 May, 1994
Abstract
While a largenumberof virtual reality applications,suchasfluid flow analysisandmolecular
modeling,dealwith simulateddata,many newer applicationsattemptto recreatetrue reality as
convincinglyaspossible.Buildingdetailedmodelsfor suchapplications,whichwecall tele-reality,
is a major bottleneckholding back their deployment. In this paper, we presenttechniquesfor
automaticallyderiving realistic2-D scenesand3-D texture-mappedmodelsfrom videosequences,
whichcanhelpovercomethis bottleneck.Thefundamentaltechniqueweuseis imagemosaicing,
i.e., the automaticalignmentof multiple imagesinto larger aggregateswhich are then usedto
representportionsof a 3-D scene. We begin with the easiestproblems,thoseof flat sceneand
panoramicscenemosaicing,andprogressto morecomplicatedscenes,culminatingin full 3-D
models.We alsopresenta numberof novel applicationsbasedon tele-realitytechnology.
Keywords: imagemosaics,imageregistration,imagecompositing,planarscenes,panoramic
scenes,3-D scenerecovery, motionestimation,structurefrom motion,virtual reality, tele-reality.
c�
Digital EquipmentCorporation1994.All rightsreserved.
Contents i
Contents
1 Introduction �������������������������������������������������� 1
2 Related work ������������������������������������������������� 2
3 Basic imaging equations ���������������������������������������� 3
4 Planar image mosaicing ���������������������������������������� 5
4.1 Local imageregistration �������������������������������������� 6
4.2 Globalimageregistration ������������������������������������� 8
4.3 Results �������������������������������������������������� 8
5 Panoramic image mosaicing �������������������������������������� 9
5.1 Results �������������������������������������������������� 11
6 Scenes with arbitrary depth �������������������������������������� 11
6.1 Formulation ����������������������������������������������� 13
6.2 Results �������������������������������������������������� 14
7 Full 3-D model recovery ���������������������������������������� 14
8 Applications �������������������������������������������������� 18
8.1 Planarmosaicingapplications ����������������������������������� 18
8.2 3-D modelbuilding applications ���������������������������������� 19
8.3 End-userapplications ��������������������������������������� 19
9 Discussion ��������������������������������������������������� 21
10 Conclusions �������������������������������������������������� 22
A 2-D projective transformations ������������������������������������� 28
ii LIST OFFIGURES
List of Figures
1 Rigid, affine,andprojective transformations�������������������������� 4
2 Whiteboardimagemosaicexample �������������������������������� 10
3 Panoramicimagemosaicexample(bookshelfandcluttereddesk) �������������� 12
4 Depthrecovery example:tablewith stacksof papers �������������������� 15
5 Depthrecovery example:outdoorscenewith trees ��������������������� 16
6 3-D modelrecoveryexample ����������������������������������� 17
1 Introduction 1
1 Introduction
Virtual reality is currently creatinga lot of excitementand interestin the computergraphics
community. Typical virtual reality systemsuseimmersive technologiessuchashead-mountedor
stereodisplaysanddatagloves. The intent is to convinceusersthat they areinteractingwith an
alternatephysicalworld, andalsooften to give insight into computersimulations,e.g.,for fluid
flow analysisor molecularmodeling[Earnshaw et al., 1993]. Realismandspeedof renderingare
thereforeimportantissues.
Anotherclassof virtual reality applicationsattemptsto recreatetrue reality asconvincingly
as possible. Examplesof suchapplicationsinclude flight simulators(which were amongthe
earliestusesof virtual reality), variousinteractive multi-playergames,andmedicalsimulators,
andvisualizationtools. Sincethereality beingrecreatedis usuallydistant,we usethetermtele-
reality in this paperto refer to suchvirtual reality scenariosbasedon real imagery.1 Tele-reality
applicationswhich could be built usingthe techniquesdescribedin this paperincludescanning
a whiteboardin your office to obtaina high-resolutionimage,walkthroughsof existing buildings
for re-modelingor selling,participatingin a virtual classroom,andbrowsing throughyour local
supermarketaislesfrom home(seeSection8.)
While mostfocusin virtual realitytodayis oninput/outputdevicesandthequalityandspeedof
rendering,perhapsthebiggestbottleneckstandingin thewayof widespreadtele-realityapplications
is theslow andtediousmodel-building process[Adam,1993]. This alsoaffectsmany otherareas
of computergraphics,suchascomputeranimation,specialeffects,andCAD. For smallobjects,
say0.2–2m,laser-basedscannersareagoodsolution.Thesescannersprovideregistereddepthand
coloredtexturemaps,usuallyin eitherrasteror cylindrical coordinates[CyberwareLaboratoryInc,
1990;Rioux andBird, 1993]. They have beenusedextensively in theentertainmentindustryfor
specialeffectsandcomputeranimation.Unfortunately, laser-basedscannersarefairly expensive,
limited in resolution(typically 512 256pixels),andmostimportantly, limited in range(e.g.,they
cannotbeusedto scanin abuilding.)
The image-basedrangingtechniqueswhich we develop in this paperhave the potential to
overcometheselimitations. Imaginewalking throughanenvironmentsuchasa building interior
andfilming a videosequenceof whatyou see.By registeringandcompositingthe imagesin the
1Theterm telepresenceis alsooftenused,especiallyfor applicationssuchastele-medicinewheretwo-wayinter-
activity (remoteoperation)is important[Earnshaw et al., 1993].
2 2 Relatedwork
video togetherinto large mosaicsof the scene,image-basedrangingcanachieve an essentially
unlimited resolution.2 Since the imagescan be acquiredusing any optical technology(from
microscopy, throughhand-heldvideocams,to satellitephotography),the rangeor scaleof the
scenebeingreconstructedis not an issue. Finally, asdesktopvideo becomesubiquitousin our
computingenvironments—initiallyfor videoconferencing,andlaterasanadvanceduserinterface
tool [Gold,1993;RehgandKanade,1994;BlakeandIsard,1994]—image-basedsceneandmodel
acquisitionwill becomeaccessibleto all users.
In this paper, we develop novel techniquesfor extractinglarge 2-D texturesand3-D models
from imagesequencesbasedon imageregistrationandcompositingtechniquesandalsopresent
somepotentialapplications. After a review of relatedwork (Section2) andof the basicimage
formationequations(Section3), we presentour techniquefor registeringpiecesof a flat (planar)
scene,which is thesimplestinterestingimagemosaicingproblem(Section4). We thenshow how
thesametechniquecanbeusedtomosaicpanoramicscenesobtainedby rotatingthecameraaround
its centerof projection(Section5). Section6 discusseshow to recover depthin scenes. Section
7 discussesthemostgeneralanddifficult problem,thatof building full 3-D modelsfrom multiple
images.Section8 presentssomenovel applicationsof thetele-realitytechnologydevelopedin this
paper. Finally, weclosewith acomparisonof ourwork with previousapproaches,andadiscussion
of thesignificanceof our results.
2 Related work
While 3-D modelsacquiredwith laser-basedscannersarenow commonlyusedin computerani-
mationsandfor specialeffects,3-D modelsderiveddirectly from still or videoimagesarestill a
rarity. Oneexampleof physically-basedmodelsderivedfrom imagesarethedancingvegetablesin
the“Cookingwith Kurt” video[TerzopoulosandWitkin, 1988]. Another, morerecentexample,is
the3-D modelof abuilding extractedfrom avideosequencesin [Azarbayejanietal., 1993],which
wasthencompositedwith a 3-D teapot.Somewhatrelatedaregraphicstechniqueswhich extract
camerapositioninformationfrom images,suchas[GleicherandWitkin, 1992].
2The traditionaluseof the term imagecompositing[PorterandDuff, 1984; Foley et al., 1990] is for blending
imageswhicharealreadyregistered.We usetheterm imagemosaicingin this paperfor ourwork to avoid confusion
with this previouswork,althoughcompositing(asin compositesketches)alsoseemsappropriate.
3 Basicimagingequations 3
Theextractionof geometricinformationfrom multiple imageshaslongbeenoneof thecentral
problemsin computervision[BallardandBrown,1982;Horn,1986]andphotogrammetry[Moffitt
and Mikhail, 1980]. However, many of the techniquesusedare basedon featureextraction,
producesparsedescriptionsof shape,andoften requiremanualassistance.Certaintechniques,
suchasstereo,producedepthor elevationmapswhich areinadequateto modeltrue3-D objects.
In computervision, the recoveredgeometryis normally usedeitherfor objectrecognitionor for
graspingandnavigation(robotics).Relatively little attentionhasbeenpaidto building 3-D models
registeredwith intensity (texture maps),which arenecessaryfor computergraphicsandvirtual
reality applications(but see[Mann, 1993] for recentwork which addressessomeof the same
problemsasthispaper.)
In computergraphics,compositingmultiple imagestreamstogetherto createlarger format
(Omnimax) imagesis discussedin [GreeneandHeckbert,1986]. However, in this application,
therelativepositionof thecameraswasknown in advance.Theregistrationtechniquesdeveloped
in this paperarerelatedto imagewarping [Wolberg, 1990]sinceoncethe imagesareregistered,
they canbewarpedinto acommonreferenceframebeforebeingcomposited.3 While mostcurrent
techniquesrequirethe manualspecificationof featurecorrespondences,somerecenttechniques
[Beymeret al., 1993]aswell asthe techniquesdevelopedin this papercanbeusedto automate
this process.Combinationsof local imagewarpingandcompositingarenow commonlyusedfor
specialeffectsunderthegeneralrubric of morphing[BeierandNeely, 1992]. Recently, morphing
basedon z-buffer depthsandcameramotion hasbeenappliedto view interpolationasa quick
alternativeto full 3-D rendering[ChenandWilliams, 1993]. Thetechniqueswedevelopin Section
6 canbe usedto computethe depthmapsnecessaryfor this approachdirectly from real-world
imagesequences.
3 Basic imaging equations
Many of theideasin this paperhave their simplestexpressionusingprojectivegeometry[Semple
andKneebone,1952]. However, ratherthanrelying on theseresults,we will usemorestandard
methodsandnotationsfrom computergraphics[Foley et al., 1990], andprove whatever simple
3Onecanview theimageregistrationtaskasaninversewarpingproblem,sincewearegiventwo imagesandasked
to recover theunknown warpingfunction.
4 3 Basicimagingequations
Figure1: Rigid, affine,andprojective transformations
resultswerequirein AppendixA. Throughout,wewill usehomogeneouscoordinatesto represent
points, i.e., we denote2-D points in the imageplaneas ������������ , with ���������������� being the
correspondingCartesiancoordinates[Foley etal., 1990]. Similarly, 3-D pointswith homogeneous
coordinates���������! "���� have Cartesiancoordinates�������#���������! $���� .Usinghomogeneouscoordinates,wecandescribetheclassof 2-Dplanartransformationsusing
matrixmultiplication%&&&' �"(�"(�)(* +++,.- %&&&' / 00
/01
/02/
10/
11/
12/20
/21
/22
* +++, %&&&' ���* +++, or 0 ( -21 2D 03� (1)
Thesimplesttransformationsin thisgeneralclassarepuretranslations,followedby translationsand
rotations(rigid transforms),plusscaling(similarity transforms),affinetransforms,andfull projec-
tive transforms.Figure1 showsasquareandpossiblerigid, affine,andprojectivetransformations.
Rigid andaffinetransformationshave thefollowing formsfor the 1 2D matrix,
1 rigid-2D - %&&&' cos4 5 sin 4 6�7sin 4 cos4 6�8
0 0 1
* +++, � 1 affine-2D - %&&&' / 00/
01/
02/10
/11
/12
0 0 1
* +++, �with 3 and6 degreesof freedom,respectively, while projectivetransformationshaveageneral1 2D
matrixwith 8 degreesof freedom.4
In 3-D, we have the samehierarchyof transformations,with rigid (6 degreesof freedom),
similarity (7 dof), affine(12dof), andfull projective(15dof) transforms.The 1 3D matricesin this
4Two 1 2D matricesareequivalentif they arescalarmultiplesof eachother. Weremovethisredundancy by setting922 : 1.
4 Planarimagemosaicing 5
caseare4 4. Of particularinterestaretherigid (Euclidean)transformation,; - %'=< >?"@1
*, � (2)
where < is a 3 3 orthonormalrotationmatrix and > is a 3-D translationvector, andthe 3 4
viewing matrix A -CB ˆA ?ED - %&&&' 1 0 0 0
0 1 0 0
0 0 1 �GF 0
* +++, � (3)
which projects3-D pointsthroughtheorigin ontoa 2-D projectionplanea distanceF alongthe axis[Foley etal., 1990]. The3 3 ˆA matrixcanbeageneralmatrix in thecasewheretheinternal
cameraparametersareunknown. ThelastcolumnofA
is alwayszerofor centralprojection.
Thecombinedequationsprojectinga 3-D world coordinateH - ���������! "����� ontoa 2-D screen
location 0 - ���"(I���J(K���L(K� canthusbewrittenas0 - AM; H -N1 camHO� (4)
where 1 cam is a 3 4 camera matrix. This equationis valid even if the cameracalibration
parametersand/orthecameraorientationareunknown.
4 Planar image mosaicing
The simplestpossiblesetof imagesto mosaicarepiecesof a planarscenesuchasa document,
whiteboard,or flat desktop.Imaginethatwe have a camerafixeddirectly over our desk. As we
slidea documentunderthecamera,differentportionsof thedocumentarevisible. Any two such
piecesarerelatedto eachotherby a translationanda rotation(2-D rigid transformation).
Now imaginethatwe arescanninga whiteboardwith a hand-heldvideocamera.Theclassof
transformationsrelatingtwo piecesof theboardis moregeneralin this case.It is easyto seethat
thisclassis thefamily of 2-D projective transformations(just imaginehow asquareor grid in one
imagecanappearin another.) Thesetransformationscanbecomputedwithout any knowledgeof
theinternalcameracalibrationparameters(suchasfocal lengthor opticalcenter)or of therelative
cameramotion betweenframes. The fact that 2-D projective transformationscaptureall such
possiblemappingsis abasicresultof projectivegeometry(seeAppendixA.)
6 4 Planarimagemosaicing
Giventhisknowledge,how dowegoaboutcomputingthetransformationsrelatingthevarious
scenepiecessothatwe canpastethemtogether?A varietyof techniquesarepossible,somemore
automatedthanothers.Forexample,wecouldmanuallyidentify fourormorecorrespondingpoints
betweenthe two views, which is enoughinformationto solve for theeightunknowns in the2-D
projective transformation. Or we could iteratively adjustthe relative positionsof input images
usingeithera blink comparatoror transparency [Carlbomet al., 1991]. Thesekinds of manual
approachesaretoo tediousto beusefulfor largetele-realityapplications.
4.1 Local image registration
Theapproachwe usein this paperis to directly minimize thediscrepancy in intensitiesbetween
pairsof imagesafterapplyingthetransformationwe arerecovering. Thishastheadvantageof not
requiringany easilyidentifiablefeaturepoints,andof beingstatisticallyoptimaloncewearein the
vicinity of thetruesolution[SzeliskiandCoughlan,1994]. Moreconcretely, supposewedescribe
our2-D transformationsas� (P - /0 � PJQ / 1 � PRQ / 2/6 � PRQ / 7 � PJQ 1
��� (P - /3 � PRQ / 4� PRQ / 5/6 � PRQ / 7� PRQ 1
� (5)
Our techniqueminimizesthesumof thesquaredintensityerrorsS -2T PVUXW ( ��� (P ��� (P �Y5 W ��� P ��� P ��Z 2 -[T P]\ 2P (6)
over all correspondingpairsof pixels ^ which areinsideboth imagesW ��������� and W (I���"(I���J(K� (pixels
which aremappedoutsideimageboundariesdo not contribute.) Oncewe have found the best
transformation1 2D, we canwarp image W ( into the referenceframeof W using 1`_ 12D [Wolberg,
1990]andthenblendthetwo imagestogether. To reducevisibleartifacts,weweightimagesbeing
blendedtogethermoreheavily towardsthecenter, usingabilinearweightingfunction.
To performtheminimization,we usetheLevenberg-Marquardtiterativenon-linearminimiza-
tionalgorithm[Pressetal., 1992]. Thisalgorithmrequiresthecomputationof thepartialderivatives
of \ P with respectto theunknown motionparametersa / 0 �b�b� / 7 c . Thesearestraightforwardto
compute,i.e., d \ Pd /0 - � Pe P d W (d � ( �f�b�g�E� d \ Pd /
7 - 5 � Pe P h � (P d W (d � ( Q � (P d W (d � (ji � (7)
4.1 Local imageregistration 7
wheree P is the denominatorin (5), and � d W (k� d �"(I� d W (l� d �"(m� is the imageintensitygradientof W (
at ���"(P ���"(P � . From thesepartials, the Levenberg-Marquardtalgorithm computesan approximate
Hessianmatrix n andtheweightedgradientvector o with componentsprqts -NT P d \ Pd /uq d \ Pd /=s � v q - 5 2 T Pw\ P d \ Pd /=q � (8)
andthenupdatesthemotionparameterestimatex by anamount∆ x - ��n Qzy|{ � _ 1 o , where yis a time-varyingstabilizationparameter[Presset al., 1992]. Theadvantageof usingLevenberg-
Marquardtover straightforwardgradientdescentis that it convergesin fewer iterations. The
completeregistrationalgorithmthusconsistsof thefollowing steps:
1. For eachpixel ^ at location ��� P ��� P � ,(a) computeits correspondingpositionin theotherimage ���}(P ���J(P � using(5);
(b) computetheerrorin intensitybetweenthecorrespondingpixels \ P - W ( ��� (P ��� (P �~5 W ��� P ��� P �andtheintensitygradient� d W (l� d �"(I� d W (m� d �J(m� ;
(c) computethepartialderivativeof \ P w.r.t. the /=q usingd \ Pd /=q - d W (d � ( d �"(d /uq Q d W (d � ( d �"(d /=qasin (7);
(d) addthepixel’s contributionto n and o asin (8).
2. Solve thesystemof equations��n Q�yR{ � ∆ x - o andupdatethemotionestimatex��l�I� 1� -x���� � Q ∆ x .
3. Checkthat theerror in (6) hasdecreased;if not, incrementy (asdescribedin [Presset al.,
1992])andcomputeanew ∆ x ,
4. Continueiteratinguntil the error is below a thresholdor a fixed numberof stepshasbeen
completed.
Thestepsin thisalgorithmaresimilarto theoperationsperformedwhenwarpingimages[Wolberg,
1990;Beier andNeely, 1992], with additionaloperationsfor correctingthe currentwarpingpa-
rametersbasedonthelocal intensityerrorandits gradients(similarto theprocessin [Gleicherand
Witkin, 1992]). For moredetailson theexactimplementation,see[SzeliskiandCoughlan,1994].
8 4 Planarimagemosaicing
4.2 Global image registration
Unfortunately, bothgradientdescentandLevenberg-Marquardtonly find locally optimalsolutions.
If themotionbetweensuccessive framesis large,we mustusea differentstrategy to find thebest
registration. We have implementedtwo differenttechniquesfor dealingwith this problem. The
first technique,which is commonlyusedin computervision, is hierarchical matching, which first
registerssmaller, subsampledversionsof theimageswheretheapparentmotionis smaller[Quam,
1984;Witkin etal., 1987;Bergenetal., 1992]. Motion estimatesfrom thesesmallercoarserlevels
are thenusedto initialize motion estimatesat finer levels, therebyavoiding the local minimum
problem(see[SzeliskiandCoughlan,1994] for details.) While this techniqueis not guaranteed
to find the correctregistration,we have observed empirically that it workswell whenthe initial
misregistrationis only a few pixels (the exact domainof convergencedependson the intensity
patternin theimage.)
For largerdisplacements,we usephasecorrelation [Kuglin andHines,1975;Brown, 1992].
This techniqueestimatesthe 2-D translationbetweena pair of imagesby taking 2-D Fourier
Transformsof eachimage, computingthe phasedifferenceat eachfrequency, performing an
inverseFourierTransform,andsearchingfor a peakin themagnitudeimage. In our experiments,
this techniquehasprovento work remarkablywell, providing goodinitial guessesfor imagepairs
whichoverlapby aslittle as50%. Thetechniquewill notwork,however, if theinter-framemotion
is not mostly translational(largerotationsor zooms),but this doesnot occurin practicewith our
currentimagesequences.
4.3 Results
To demonstratetheperformanceof our algorithm,we digitizedanimagesequencewith a camera
panningoverawhiteboard.Figure2ashowsthefinal mosaicof thewhiteboard,with thelocations
of theconstituentimagesin themosaicshown asblackoutlines.Figure2b is a portionof thefinal
mosaicwithouttheseoutlines.Thismosaicis 1300 2046pixels,basedoncompositing39NTSC
(640 480)resolutionimages.
To computethismosaic,we developedaninteractive imagemosaicingtool which letstheuser
coarselypositionsuccessiveframesrelativetoeachother. Thetoolalsohasanautomaticmosaicing
optionwhichcomputestheinitial roughplacementof eachimagewith respectto thepreviousone
5 Panoramicimagemosaicing 9
usingphasecorrelation.Our algorithmthenrefinesthelocationof eachimageby minimizing (6)
usingthecurrentmosaicthusfar as W ��������� andtheinput framebeingadjustedas W (I���"(����J(m� . The
imagesin Figure2wereautomaticallycompositedwithoutuserinterventionusingthemiddleframe
(centerof theimage)astheanchor image(nodeformations.)As we cansee,our techniqueworks
well on this example.Section9 discussessomeimagingartifactswhichcancausedifficulties.
5 Panoramic image mosaicing
Another imagemosaicingproblemwhich turnsout to be equallysimple to solve is panoramic
imagemosaicing.In this scenario,we rotatea cameraaroundits opticalcenterin orderto create
a full “viewing sphere”of images(in computergraphics,this representationis sometimescalled
anenvironmentmap[Greene,1986].) This is similar to theactionof panoramicstill photographic
cameraswherethe rotationof a cameraon top of a tripod is mechanicallycoordinatedwith the
film transport,theanamorphiclensusedin [Lippman,1980],andto “cinemain theround”movie
theaters[GreeneandHeckbert,1986]. In ourcase,however, wecanmosaicmultiple2-D imagesof
arbitrarydetailandresolution,andweneednotknow thecameramotion. Examplesof applications
includeconstructingtruescenicpanoramas(sayof theview at the rim of theGrandCanyon), or
limited virtual environments(recreatinga meetingroomor officeasseenfrom onelocation.)
It turnsout that imagestakenfrom the sameviewpoint (stationaryoptical center)arerelated
by 2-D projective transformations,just asin the planarscenecase(seeAppendixA for a quick
proof.) Intuitively, we cannottell therelative depthof pointsin thesceneaswe rotate(thereis no
motionparallax), so they mayaswell be locatedon any plane(in projective geometry, we could
saythey lie ontheplaneat infinity, � - 0.) For someapplications,thispropertyis usuallyviewed
asa deficiency, i.e., we cannotrecover scenedepthfrom rotationalcameramotion. For image
mosaicing,this is actuallyanadvantagesincewe just wantto compositea largesceneandbeable
to “look around”from astaticviewpoint.
More formally, the2-D transformationdenotedby 1 2D is relatedto theviewing matrices ˆAand ˆA ( andtheinter-view rotation < by
1 2D - ˆA ( < ˆA _ 1 (9)
(seeAppendixA.) In the caseof a calibratedcamera(known centerof projection),this hasa
10 5 Panoramicimagemosaicing
(a)
(b)
Figure2: Whiteboardimagemosaicexample
(a) mosaicwith componentlocationsshown as black outlines,(b) portion of mosaicin greater
detail.
5.1 Results 11
particularlysimpleform,
1 2D - %&&&' � 00 � 01 F � 02
� 10 � 11 F � 12
� 20 ��F|( � 21 �GF|(`F � 22 �GFR(* +++, � (10)
where F and FR( arethescaledfocal lengthsin thetwo views,and � q�s aretheentriesin therotation
matrix < .5 In this case,we only have to recover five independentparameters(or threeif the Fvaluesareknown) insteadof theusualeight.
How do we representa panoramicscenecompositedusingour techniques?Oneapproachis
to divide the viewing sphereinto several large,potentiallyoverlappingregions,andto represent
eachregion with a planeontowhich we pastetheimages[Greene,1986]. Anotherapproachis to
computethe relative positionof eachframerelative to somebaseframe,andto thenre-compute
anarbitraryview on thefly from all visible pieces,givena particularview direction < andzoom
factor F . In ourcurrentimplementation,weadopttheformerstrategy sinceit is asimpleextension
of theplanarcase.
5.1 Results
Figure3 showsamosaicof abookshelfandcluttereddesk,whichis obviouslyanon-planarscene.
The imageswereobtainedby tilting andpanninga video cameramountedon a tripod, without
any specialstepstakento ensurethat the rotationwasaroundthe truecenterof projection. The
completesceneis registeredquitewell.
6 Scenes with arbitrary depth
Mosaicingflat scenesor panoramicscenesmaybeinterestingfor certaintele-realityapplications,
e.g.,transmittingwhiteboardor documentcontents,or viewing anoutdoorpanorama.However,
for mostapplications,wemustrecover thedepthassociatedwith thesceneto give theillusion of a
3-D environment,e.g.,throughnearbyview synthesis[ChenandWilliams, 1993]. Two possible
approachesareto modelthesceneaspiecewise-planaror to recoverdense3-D depthmaps.
5At largefocal lengthsandsmallrotations,M2D resemblesa puretranslationmatrix.
12 6 Sceneswith arbitrarydepth
(a)
(c)
Figure3: Panoramicimagemosaicexample(bookshelfandcluttereddesk)
(a) mosaicwith componentlocationsshown as black outlines,(b) portion of mosaicin greater
detail.
6.1 Formulation 13
The first approachis to assumethat the sceneis piecewise-planar, asis the casewith many
man-madeenvironmentssuchas building exteriors and office interiors. The imagemosaicing
techniquedevelopedin Section4 canthenbeappliedto eachof theplanarregionsin the image.
Thesegmentationof eachimageinto itsplanarcomponentscaneitherbedoneinteractively(e.g.,by
drawing thepolygonaloutlineof eachregionto beregistered),or automaticallyby associatingeach
pixel with oneof severalglobalmotionhypotheses[JepsonandBlack,1993;WangandAdelson,
1993]. Oncetheindependentplanarpieceshave beencomposited,we could,in principle,recover
the relative geometryof thevariousplanesandthecameramotion (seeAppendixA.) However,
ratherthanpursuingthis approachin this paper, we will pursethesecond,moregeneralsolution
which is to recover a full depthmap,i.e., to infer themissing componentassociatedwith each
pixel in agivenimagesequence.
Whenthecameramotionis known, theproblemof depthmaprecovery is calledstereorecon-
struction(or multi-framestereoif morethantwo views areused[Matthieset al., 1989;Okutomi
andKanade,1993].) Thisproblemhasbeenextensively studiedin photogrammetryandcomputer
vision [Moffitt andMikhail, 1980;BarnardandFischler, 1982;Horn,1986;DhondandAggarwal,
1989]. Whenthe cameramotion is unknown, we have the moredifficult structure from motion
problem[Horn, 1986;Wenget al., 1993;Azarbayejaniet al., 1993;SzeliskiandKang,1994]. In
this section,we presentour solutionto this latterproblembasedon recoveringprojectivedepth,
which is particularlysimpleandrobustandfits in well with thematerialalreadydevelopedin this
paper. Readersinterestedin alternativetechniquesshouldconsulttheabove references.
6.1 Formulation
To formulatetheprojective structurefrom motionrecovery problem,we notethatthecoordinates
correspondingto apixel 0 with projectivedepth� in someotherframecanbewrittenas0 ( - A ( ; H - ˆA ( < ˆA _ 1 0 Q � ˆA ( > -N1 2D 0 Q � ˆ> � (11)
whereA
,;
, < , and > aredefinedin (2–3), and 1 2D and ˆ> arethe computedplanarprojection
matrix andepipoledirection(see(25) in AppendixA.) To recover theparametersin 1 2D andˆ>for eachframealongwith thedepthvalues� (whicharethesamefor all frames),weusethesame
Levenberg-Marquardtalgorithmasbefore.
14 7 Full 3-D modelrecovery
In moredetail,wewrite theprojectionequationas� (P - /0 � PRQ / 1� PRQ 6 0 � PRQ / 2/6 � PRQ / 7� PRQ 6 2� PRQ 1
��� (P - /3 � PRQ / 4� PRQ 6 1 � PRQ / 5/6 � PRQ / 7� PRQ 6 2� PRQ 1
� (12)
Wecomputethepartialderivativesof �"(P and �J(P w.r.t. the /=q and 6 q (whichweconcatenateinto the
motionvector x ) asbeforein (7.) We similarly computethepartialsof � (P and � (P with respectto� P , i.e., d �"(Pd � P - e P 6 0 5E6 2e 2P � d �J(Pd � P - e P 6 0 5E6 2e 2P � (13)
wheree P is thedenominatorin (12).
To estimatethe unknown parameters,we alternateiterationsof the Levenberg-Marquardtal-
gorithmover themotionparametersa / 0 �b�b�b����6 2 c in x andthedepthparametersab� P c , usingthe
partialderivativesdefinedabove to computetheapproximateHessianmatricesn andtheweighted
error vectors o asin (8). In our currentimplementation,in orderto reducethe total numberof
parametersbeingestimated,we representthe depthmapusinga tensor-productspline,andonly
recover thedepthestimatesat thesplinecontrolvertices(thecompletedepthmapis availableby
interpolation)[SzeliskiandCoughlan,1994].
6.2 Results
Figure4 showsanexampleof usingourprojectivedepthrecoveryalgorithm.Theimagesequence
wastakenby moving thecameraupandover thesceneof atablewith stacksof papers(Figure4a).
Theresultingdepth-mapisshownin Figure4basintensity-codedrangevalues. Figures4c–dshow
theoriginal intensityimagetexturemappedontothesurface(anexampleof view extrapolation),
and Figures4c–d show a set of grid-lines overlayedon the recoveredsurface. The shapeis
recoveredreasonablywell in areaswherethereis sufficient texture (uniform intensityareasgive
no depthcues),andtheextrapolatedview looksreasonable.Figure5ashows resultsfrom another
sequence,this time from an outdoorsceneof sometrees. Again, the geometryof the sceneis
recoveredreasonablywell, asindicatedin thedepthmapin Figure5b.
7 Full 3-D model recovery
Recoveringdepthmapsmaybeadequatefor many tele-realityapplications,e.g.,scanninglibrary
bookshelvesor supermarketaisles,but for otherapplications,e.g.,whenwe needto seetheback
7 Full 3-D modelrecovery 15
(a) (b)
(c) (d)
(e) (f)
Figure4: Depthrecoveryexample:tablewith stacksof papers
(a) input image,(b) intensity-codeddepthmap(darkis fartherback),(c–d)texture-mappedsurface
seenfrom novel viewpoints,(e–f)griddedsurface.
16 7 Full 3-D modelrecovery
(a) (b)
Figure5: Depthrecovery example:outdoorscenewith trees
(a) input image,(b) intensity-codeddepthmap(darkis fartherback).
sideof an object, full 3-D modelsmay be required. This is the mostdifficult imagemosaicing
problem,sincenot only do we have to recover depth,but we also have to merge (registerand
composite)multiple depthmaps,andrepresentobjectsgiven no a priori knowledgeabouttheir
roughshapeor topology.
For simpletopologiesandshapes,deformablephysically-basedmodelscando a goodjob of
recovering the unknown geometry[Terzopouloset al., 1987;PentlandandSclaroff, 1991]. For
generaltopologies,theproblemis moredifficult. Sincemany currentshaperecovery techniques
produceincompleteor sparsegeometricdescriptions,ageneral3-Dsurfaceinterpolationtechnique
mayberequiredtogenerateasmoothcontinuoussurface[Hoppeetal.,1992;SzeliskiandTonnesen,
1992].
Thereexist many techniquesfor extracting3-D shapefrom multiple views. For example,we
canrecoveravolumetricdescriptionfrom thebinarysilhouettesof anobjectagainstits background
[Szeliski,1993],computelocal optic flow (pixel motion)estimatesandconvert theseinto sparse
3-D point estimates[Szeliski,1991],or tracktheoccludingcontoursof anobjectto generate3-D
spacecurves[SzeliskiandWeiss,1993]. A completesurvey of 3-D shapeextractiontechniquesis
beyondthescopeof this paper. Instead,we presenttheresultsof two of ourpreviouslydeveloped
algorithmsappliedto animagesequenceof acuprotatingona calibratedturntable(Figure6a).
7 Full 3-D modelrecovery 17
(a) (b)
175
(c) (d)
Figure6: 3-D modelrecovery example
(a) input image,(b) octreerecoveredfrom silhouettes(c) 2-D edges,(d) 3-D extremalcontours
18 8 Applications
Ourfirst techniqueconvertseachimageinto abinarysilhouetteby differencingtheimagewith
an emptybackgroundimage. Eachcubein the octreevolumetric 3-D model is thenprojected
(usingthe known cameraposition) into the silhouette,andcubesthat fall outsidethe silhouette
areculled. Cubeswhich fall partially into thesilhouettearemarkedfor latersubdivision,andthe
processis repeatedaftereachcompleterevolutionatsuccessively finerresolutions[Szeliski,1993].
Theresultingoctreeis shown in Figure6b.
Our secondtechniqueextractssilhouette(extremal)edgesfrom theimagesequence,anduses
amulti-framestereoalgorithmto reconstructtheir3-D position(internalalbedoedgesarealsoex-
tractedandreconstructed)[SzeliskiandWeiss,1993].Figure6cshowsoneof theedgeimagesused
by thealgorithm,andFigure6dshown thecollectionof 3-D curvesestimatedby thealgorithm.
Theseexamplesdemonstratesomeof our algorithmsfor reconstructingan isolatedobject
undergoingknown motion. Similar techniquescanbeusedto solve the moregeneral3-D scene
recovery problemwherethe cameramotion is unknown. Methodsfor determiningthe motion
includethe projective motionalgorithmpresentedin the previous section,aswell astechniques
describedin [Horn, 1986;Wenget al., 1993;Azarbayejaniet al., 1993]. Comparedto the2-D
anddepth-mapmodelsdevelopedin Sections4–6,therecoveryof 3-D modelsis moredifficult and
lessrobust,but holdsthepromiseof morerealisticandmoregeneralobjectandscenemodels.
8 Applications
Givenautomatedtechniquesfor building 2-D and3-D scenesfrom videosequences,whatcanwe
do with thesemodels?In this section,we describea numberof potentialapplications,including
whiteboardanddocumentscanning,3-D modelacquisitionfor inverseCAD, modelacquisitionfor
computeranimationandspecialeffects,supermarketshoppingat home,interactive walkthroughs
of historicalbuildings,andlive tele-reality(telepresence)applications.
8.1 Planar mosaicing applications
Themoststraightforwardapplicationis scanningwhiteboardsor blackboardsasanaid to video-
conferencingor asan easyway to captureideas. Scanningcanproduceimagesof muchgreater
resolutionthansinglewide-anglelensshots. This facility couldbecombinedwith avideoprojec-
tion systemto createavirtual whiteboard similarto theDigitalDeskproposedby Wellner[Wellner,
8.2 3-D modelbuilding applications 19
1993](see[Gold, 1993]for otherexamplesof ubiquitouscomputingandaugmentedreality.) The
techniquesdevelopedin this paperenableany videocameraattachedto a computerto beused.In
specializedsituations,e.g.,in classrooms,acomputer-controlledcameracouldbeusedto perform
thescanning,removing theneedfor automaticregistrationof theimages.
Anotherobviousapplicationof this technologyis for documentscanning.Hand-heldscanners
currentlyperformthis functionquitewell. However, sincethey arebasedon linear CCDs,they
aresubjectto “skewing” problemsin eachstrip of the scannedimagewhich mustbe corrected
manually. Using2-D videoimagesasinput removessomeof theseproblems.A final globalskew
may still be neededto makethe documentsquare,but this can be performedautomaticallyby
detectingdocumentedgesor internalhorizontalandverticaledges(e.g.,columnborders).
8.2 3-D model building applications
The 3-D modelbuilding technology, while moredifficult to automatereliably, haseven greater
potential. For example,we could constructa 3-D fax which would scan3-D objectsat oneend
usinga videocameraandeithermechanicalor user-controlledmotion,anda3-D graphicsdisplay
at theotherend[Carlbometal., 1992]. This fax couldbeusedto transmit3-D modelsto a remote
sitefor viewing anddesignrevision,or to input roughCAD modelsfor reverseengineeringor clay
mock-upapplications.Of course,this technologycouldalsobecombinedwith stereolithography
or numericallycontrolledmilling to makethis a truesolid 3-D fax [Bresenham,1993].
Rapid3-D modelbuilding is also critical in the computeranimationandspecialeffects in-
dustries. Currently, it cantakedaysor weeksto build by handeachcomputermodelusedin a
feature-lengthfilm. Theability to build suchmodelsrapidlyfrom realobjectsorphysicalmock-ups
wouldbeof greatutility, especiallywhenfreedfrom therangeandresolutionlimitationsof active
rangefindingsystems.Thetechniquesweusefor computing3-D geometrycouldalsobemodified
to automaticallytrackthemotionsof objectsor actorsin ascene(this is calledmotioncapture, and
is currentlydoneusingspeciallyplacedmarkers.)
8.3 End-user applications
A moremass-marketapplicationishomeshoppingappliednottosingleobjects(eitherdemonstrated
by a modelor availablein a catalog),but to completestores,suchasyour local supermarket.This
20 8 Applications
hastheadvantageof having a familiar look andorganization,andenablesthepre-planningof your
next shoppingtrip. Theimagesof theaisles(with theircurrentcontents)canbedigitizedby rolling
a video camerathroughthe store. More detailedgeometricmodelsof individual items canbe
acquiredeitherby a 3-D modelbuilding process,or, in thefuture,directly from themanufacturer.
Theshoppercanthenstroll down theaisles,pick out individualitems,andlook at their ingredients
andprices.
A similarscenariois possiblewith otherbuildings. Architecturalwalkthroughshave longbeen
a part of computergraphics[Greenberg, 1974], but only for buildings currently underdesign.
Walkthroughof existing building canhaveanumberof applications.For example,interactive3-D
walkthroughsof your home,built by walking a video camerathroughthe roomsandmosaicing
the imagesequences,could be usedfor selling your house(an extensionof existing still-image
basedsystems),or for re-modelingor renovating its design. Walkthroughsof historic building
(e.g.,palacesor museums)canbe usedfor educationalandentertainmentpurposes.A museum
scenariomight includetheability to look at individual3-D objectssuchassculptures,andto bring
uprelatedinformationin a hypertext system[Miller etal., 1991].
Walkthroughs(or fly-throughs)canalsobedonein general(e.g.,outdoor)3-D scenes.An early
exampleof thiswastheMovie-Mapsproject[Lippman,1980]whichwasbasedonvideodisctech-
nology. Sincethis systemwasbasedon choosingfrom a numberof pre-storedmotionsequences,
the rangeof possibleviewpoints(anddetail) was limited. The basicMovie-Mapsideacan be
extendedby building true 3-D scenesfrom the input motion sequences. Such3-D tele-reality
modelshavethepotentialof beingof extremelyhighcomplexity andwill requiresolvingproblems
in representation,partial3-D models[ChenandWilliams, 1993],andswitchingbetweendifferent
levelsof resolution[FunkhouserandSequin,1993].
The ultimate in tele-realitysystemsis dynamictele-reality(sometimescalled telepresence),
which compositesvideo from multiple sourcein real-timeto createthe illusion of being in a
dynamic(andperhapsreactive) 3-D environment. An exampleof suchan applicationmight be
to view a 3-D versionof a concertwith controlover thecamerashots,evenbeingableto seethe
concertfrom themusicians’point of view. Otherexamplesmightbeto participateor consultin a
surgery from a remotelocation(tele-medicine), or to remotelyparticipatein a virtual classroom.
Building suchdynamic3-D modelsat frameratesis beyondtheprocessingpowerof today’shigh-
performancesuperscalarworkstations,but it couldbeachievedusingacollectionof suchmachines
9 Discussion 21
(or special-purposestereohardware).In suchapplications,weneedto usemultiplecameras,since
in moving sceneswe cannotrely on themotionparallaxfrom a singlecamerato give usreliable
shapeinformation.
9 Discussion
Imagemosaicingprovidesa powerful new way of creatingthe detailed3-D modelsandscenes
neededfor tele-realityapplications.By registeringmultiple imagestogether, wecancreatescenes
of extremelyhighresolutionandat thesametimerecoverpartialor full 3-D geometricinformation.
While we usetechniquesfrom computervision to performtheregistration,our focusis different:
traditionalvision techniquesaredesignedfor inspection,recognition,androbotcontrol,while our
techniquesaredesignedto producerealistic3-D modelsfor computergraphicsandvirtual reality
applications.
The approachwe use,namelydirect minimizationof intensitydifferencesbetweenwarped
images,hasa numberof advantagesover more traditional vision techniques,which are based
on trackingfeaturesfrom frameto frame[TomasiandKanade,1992;Azarbayejaniet al., 1993;
Szeliski and Kang, 1994]. Our techniquesproducedenseestimatesof shape,work in highly
texturedareaswherefeaturesmaynot bereliably observed,andmakestatisticallyoptimaluseof
all theinformation[SzeliskiandCoughlan,1994]. Ourapproachissimilarto [Bergenetal., 1992],
who alsouseintensitydifferences.However, we usefull 2-D projective modelsof motioninstead
of instantaneousquadraticflow fields,andwe alsousea projective formulationof structurefrom
motion,which eliminatestheneedfor calibratedcameras.It is alsovery similar to [Mann,1993],
whoalsouseplanarprojectivemotionestimates.Furthermore,[MannandPicard,1994]have also
demonstratedsuperresolutionresults,i.e., theability to obtainhigherresolutionimagesby simply
jittering thecameraratherthanpanning[Irani andPeleg, 1991].
While our techniqueshave workedwell in thescenesin which we have tried them,we must
becautiousabouttheir generalapplicability. Theintensity-basedtechniqueswe usearesensitive
to imageintensityvariation,suchas thosecausedby video cameragain control andvignetting
(darkeningof the cornersat wide lens apertures);working with band-passfiltered imagescan
remove mostof theseproblems.All vision techniquesarealsosensitive to geometricdistortions
(deviationsfrom thepinholemodel)in the optics,so carefulcalibrationis necessaryfor optimal
22 10 Conclusions
accuracy (theresultsin this paperwereobtainedwith uncalibratedcameras.)
Thedepthextractiontechniquesweuserely onthepresenceof texturein theimage.In areasof
sufficienttexture,it is still possiblefor theregistration/matchingalgorithmtocomputeanerroneous
depthestimate(suchgrosserrorsarelesscommonin activerangefinders.)Wheretextureis absent,
interpolationmustbe used,andthis canleadto erroneous(hallucinated)depthestimates.Other
visual cues,suchasoccludingcontoursor shading[Horn, 1986], canbe usedto mitigate these
problemsin somecases,but a general-purposereliablevision-basedrangingsystemremainsvery
muchanopenresearchproblem.
The size and structureof the modelsproducedfor tele-realityapplicationshave interesting
implicationsfor specializedgraphicshardwareandsoftwareinterfaces.Currentgraphicshardware
storestheintensityimages(texturemaps)separatelyfromthegeometry, whichisstoredaspolygons
usuallyspanningmany pixels. A moreappropriatearchitecturemightberegisteredmulti-resolution
color imagesanddepthmaps.View interpolation,ratherthan3-D rendering,couldthenbeused
to synthesizenovel 3-D views [Chen and Williams, 1993]. Thesekinds of architecturesand
algorithmspoint towardsa tighter coupling betweencomputergraphicsand imageprocessing
[Pickering,1993].
10 Conclusions
In thispaper,wehavepresentedahierarchyof scenemodels,rangingfrom2-Dplanarandpanoramic
mosaicsthroughfull 3-D objectmodels,anddevelopeda collectionof associatedscenerecovery
algorithms.Thecreationof realistichigh-resolution3-Dscenesfromvideoimageryopensupmany
new applicationsfor tele-realitytechnology. Theseincludeofficeapplicationssuchaswhiteboard,
document,andbookshelfscanning,simulatedmeetingspaces,engineeringapplicationssuchas
reverseengineeringand collaborative design,and computergraphicsapplicationssuchas 3-D
modelbuilding. Moreimportantly, this technologyenablesmassmarketapplicationssuchashome
shopping,education(the virtual classroom),and entertainment(virtual travel). Ultimately, as
processingspeedsandreconstructionalgorithmsimprove further, we will seedynamicreal-time
3-D sceneandmodelrecovery beingusedto provide anevenmoreexciting rangeof telepresence
andtele-realityapplications.
10 Conclusions 23
Acknowledgements
Bob Thomasintroducedme to the term tele-reality, which he usedto describethe live concert
telepresencescenario.JamesCoughlandevelopedthe imageregistrationcode. David Tonnesen,
DemetriTerzopoulos,SingBing Kang,andRichardWeisswerecollaboratorsonsomeof the3-D
reconstructionalgorithmsdescribedin this paper. William Hsu helpedmeproducethe resultsin
thispaper. Ingrid CarlbomandGudrunKlinker providedusefulcommentson themanuscript.
References
[CyberwareLaboratoryInc, 1990] CyberwareLaboratoryInc. 4020/RGB3D Scannerwith color
digitizer. Monterey, 1990.
[Adam,1993] J.A. Adam. Virtual reality is for real. Spectrum, :22–29,October1993.
[Azarbayejanietal., 1993] A. Azarbayejani,B. Horowitz, andA. Pentland.Recursiveestimation
of structureandmotion using relative orientationconstraints. In IEEE ComputerSociety
Conferenceon ComputerVision andPattern Recognition(CVPR’93), pages294–299,New
York, New York, June1993.
[BallardandBrown, 1982] D. H. Ballard and C. M. Brown. ComputerVision. Prentice-Hall,
EnglewoodClif fs, New Jersey, 1982.
[BarnardandFischler, 1982] S.T. BarnardandM. A. Fischler. Computationalstereo.Computing
Surveys, 14(4):553–572,December1982.
[BeierandNeely, 1992] T. BeierandS. Neely. Feature-basedimagemetamorphosis.Computer
Graphics(SIGGRAPH’92), 26(2):35–42,July1992.
[Bergenetal., 1992] J.R.Bergen,P. Anandan,K. J.Hanna,andR.Hingorani.Hierarchicalmodel-
basedmotionestimation.In SecondEuropeanConferenceon ComputerVision (ECCV’92),
pages237–252,Springer-Verlag,SantaMargheritaLiguere,Italy, May 1992.
[Beymeretal., 1993] D. Beymer, A. Shashua,andT. Poggio.ExampleBasedImageAnalysisand
Synthesis. A. I. Memo1431,MassachusettsInstituteof Technology, November1993.
[BlakeandIsard,1994] A. BlakeandM. Isard.3d position,attitude,andshapeinput usingvideo
trackingof handsandlips. ComputerGraphics(SIGGRAPH’94), July 1994.
24 10 Conclusions
[Bresenham,1993] J. Bresenham.Realvirtuality: Stereolithography– rapidprototypingin 3-D.
ComputerGraphics(SIGGRAPH’93), :377–378,August1993.
[Brown, 1992] L. G. Brown. A survey of imageregistrationtechniques.ComputingSurveys,
24(4):325–376,December1992.
[Carlbomet al., 1992] I. Carlbomandothers.Modelingandanalysisof empiricaldatain collab-
orativeenvironments.Communicationsof theACM, 35(6):74–84,April 1992.
[Carlbomet al., 1991] I. Carlbom,D. Terzopoulos,andK. M. Harris. Reconstructingandvisual-
izing modelsof neuronaldendrites.In N. M. Patrikalakis,editor, ScientificVisualizationof
PhysicalPhenomena, pages623–638,Springer-Verlag,New York, 1991.
[ChenandWilliams, 1993] S. Chenand L. Williams. View interpolationfor imagesynthesis.
ComputerGraphics(SIGGRAPH’93), :279–288,August1993.
[DhondandAggarwal,1989] U. R. Dhond and J. K. Aggarwal. Structurefrom stereo—are-
view. IEEE Transactionson Systems,Man, and Cybernetics, 19(6):1489–1510,Novem-
ber/December1989.
[Earnshaw etal., 1993] R. A. Earnshaw, M. A. Gigante,andH. Jones,editors. Virtual Reality
Systems. AcademicPress,London,1993.
[Faugeras,1992] O. D. Faugeras.What can be seenin threedimensionswith an uncalibrated
stereorig? In SecondEuropeanConferenceonComputerVision(ECCV’92), pages563–578,
Springer-Verlag,SantaMargheritaLiguere,Italy, May 1992.
[Foley etal., 1990] J.D. Foley, A. vanDam,S.K. Feiner, andJ.F. Hughes.ComputerGraphics:
PrinciplesandPractice. Addison-Wesley, Reading,MA, 2 edition,1990.
[FunkhouserandSequin,1993] T. A. FunkhouserandC. H. Sequin. Adaptive displayalgorithm
for interactive frameratesduringvisualizationof complex virtual environments.Computer
Graphics(SIGGRAPH’93), :247–254,August1993.
[GleicherandWitkin, 1992] M. GleicherandA. Witkin. Through-the-lenscameracontrol. Com-
puterGraphics(SIGGRAPH’92), 26(2):331–340,July 1992.
[Gold, 1993] R. Gold. Ubiquitouscomputingandaugmentedreality. ComputerGraphics(SIG-
GRAPH’93), :393–394,August1993.
[Greenberg, 1974] D. P. Greenberg. Computergraphicsin architecture. ScientificAmerican,
:98–106,May 1974.
10 Conclusions 25
[Greene,1986] N. Greene. Environmentmappingandother applicationsof world projections.
IEEE ComputerGraphicsandApplications, 6(11):21–29,November1986.
[GreeneandHeckbert,1986] N. GreeneandP. Heckbert.CreatingrasterOmnimaximagesfrom
multipleperspectiveviewsusingtheellipticalweightedaveragefilter. IEEEComputerGraph-
icsandApplications, 6(6):21–27,June1986.
[Hartley, 1994] R. I. Hartley. Euclideanreconstructionfrom uncalibratedviews. In IEEE Com-
puterSocietyConferenceon ComputerVision andPatternRecognition(CVPR’94), Seattle,
Washington,June1994.
[Hoppeetal., 1992] W. Hoppeandothers.Surfacereconstructionfrom unorganizedpoints.Com-
puterGraphics(SIGGRAPH’92), 26(2):71–78,July 1992.
[Horn, 1986] B. K. P. Horn. RobotVision. MIT Press,Cambridge,Massachusetts,1986.
[Irani andPeleg, 1991] M. IraniandS.Peleg. Improvingresolutionby imageregistration.Graph-
ical ModelsandImageProcessing, (3), May 1991.
[JepsonandBlack,1993] A. JepsonandM. J. Black. Mixture modelsfor optical flow computa-
tion. In IEEE ComputerSocietyConferenceon ComputerVision and Pattern Recognition
(CVPR’93), pages760–761,New York, New York, June1993.
[Kuglin andHines,1975] C. D. Kuglin andD. C. Hines. Thephasecorrelationimagealignment
method.In IEEE 1975Conferenceon CyberneticsandSociety, pages163–165,New York,
September1975.
[Lippman,1980] A. Lippman. Movie maps:An applicationof theopticalvideodiscto computer
graphics.ComputerGraphics(SIGGRAPH’80), 14(3):32–43,July1980.
[Mann,1993] S. Mann. Compositingmultiple picturesof the samescene: Generalizedlarge-
displacement8-parametermotion. In Proceedingsof the46thAnnualIS&T Conference, The
Societyof ImagingScienceandTechnology, Cambridge,Massachusetts,May 1993.
[MannandPicard,1994] S. Mann and R. W. Picard. The projective group model for multi-
frameresolutionenhancement.In First IEEEInternationalConferenceonImageProcessing
(ICIP’94), (submitted)1994.
[Matthieset al., 1989] L. H. Matthies,R. Szeliski,andT. Kanade.Kalmanfilter-basedalgorithms
for estimatingdepth from image sequences.International Journal of ComputerVision,
3:209–236,1989.
26 10 Conclusions
[Miller etal., 1991] G. Miller and others. The virtual museum: Interactive 3D navigation of
a multimediadatabase.Journal of Visualizationand ComputerAnimation, 3(3):183–198,
1991.
[Moffitt andMikhail, 1980] F. H. Moffitt andE. M. Mikhail. Photogrammetry. Harper& Row,
New York,3 edition,1980.
[OkutomiandKanade,1993] M. Okutomi and T. Kanade. A multiple baselinestereo. IEEE
TransactionsonPatternAnalysisandMachineIntelligence, 15(4):353–363,April 1993.
[PentlandandSclaroff, 1991] A. PentlandandS.Sclaroff. Closed-formsolutionsfor physically-
basedshapemodelingandrecognition.IEEETransactionsonPatternAnalysisandMachine
Intelligence, 13(7):715–729,July1991.
[Pickering,1993] W. R. Pickering.Merging 3-D graphicsandimaging– applicationsandissues.
ComputerGraphics(SIGGRAPH’93), :395–396,August1993.
[PorterandDuff, 1984] T. PorterandT. Duff. Compositingdigital images.ComputerGraphics
(SIGGRAPH’84), 18(3):253–259,July1984.
[Presset al., 1992] W. H. Press,B. P. Flannery, S.A. Teukolsky, andW. T. Vetterling.Numerical
Recipesin C: The Art of ScientificComputing. CambridgeUniversity Press,Cambridge,
England,secondedition,1992.
[Quam,1984] L. H. Quam. Hierarchical warp stereo. In Image UnderstandingWorkshop,
pages149–155,ScienceApplicationsInternationalCorporation,New Orleans,Louisiana,
December1984.
[RehgandKanade,1994] J.RehgandT.Kanade.Visualtrackingof highdofarticulatedstructures:
anapplicationto humanhandtracking. In Third EuropeanConferenceon ComputerVision
(ECCV’94), pages35–46,Springer-Verlag,Stockholm,Sweden,May 1994.
[Rioux andBird, 1993] M. RiouxandT. Bird. Whitelaser, syncedscan.IEEEComputerGraphics
andApplications, 13(3):15–17,May 1993.
[SempleandKneebone,1952] J.G. SempleandG. T. Kneebone.Algebraic ProjectiveGeometry.
ClarendonPress,Oxford,England,1952.
[Szeliski,1991] R.Szeliski.Shapefrom rotation.In IEEEComputerSocietyConferenceonCom-
puter Vision andPattern Recognition(CVPR’91), pages625–630,IEEE ComputerSociety
Press,Maui, Hawaii, June1991.
10 Conclusions 27
[Szeliski,1993] R. Szeliski. Rapidoctreeconstructionfrom imagesequences.CVGIP: Image
Understanding, 58(1):23–32,July1993.
[Szeliski,1994] R.Szeliski.A leastsquaresapproachto affineandprojectivestructureandmotion
recovery. 1994.In preparation.
[SzeliskiandCoughlan,1994] R. SzeliskiandJ.Coughlan.Hierarchicalspline-basedimagereg-
istration.In IEEEComputerSocietyConferenceonComputerVisionandPatternRecognition
(CVPR’94), Seattle,Washington,June1994.
[SzeliskiandKang,1994] R. SzeliskiandS. B. Kang. Recovering 3D shapeandmotion from
imagestreamsusingnonlinearleastsquares.Journal of Visual Communicationand Image
Representation, 5(1):10–28,March1994.
[SzeliskiandTonnesen,1992] R. SzeliskiandD. Tonnesen.Surfacemodelingwith orientedpar-
ticle systems.ComputerGraphics(SIGGRAPH’92), 26(2):185–194,July1992.
[SzeliskiandWeiss,1993] R. SzeliskiandR. Weiss.Robustshaperecovery from occludingcon-
toursusinga linearsmoother. In ImageUnderstandingWorkshop, pages939–948,Morgan
KaufmannPublishers,Washington,D. C., April 1993. A longerversionis availableasCRL
TR 93/7.
[TerzopoulosandWitkin, 1988] D. Terzopoulosand A. Witkin. Physically-basedmodelswith
rigid anddeformablecomponents.IEEE ComputerGraphicsandApplications, 8(6):41–51,
1988.
[Terzopoulosetal., 1987] D. Terzopoulos,A. Witkin, andM. Kass. Symmetry-seekingmodels
and 3D object reconstruction. International Journal of ComputerVision, 1(3):211–221,
October1987.
[TomasiandKanade,1992] C. TomasiandT. Kanade. Shapeandmotion from imagestreams
under orthography: A factorizationmethod. International Journal of ComputerVision,
9(2):137–154,November1992.
[WangandAdelson,1993] J. Y. A. WangandE. H. Adelson.Layeredrepresentationfor motion
analysis.In IEEEComputerSocietyConferenceonComputerVisionandPatternRecognition
(CVPR’93), pages361–366,New York, New York, June1993.
[Wellner, 1993] P. Wellner. Interactingwith paperon the DigitalDesk. Communicationsof the
ACM, 36(7):86–96,July 1993.
28 A 2-D projectivetransformations
[Wengetal., 1993] J. Weng, T. S. Huang, and N. Ahuja. Motion and Structure from Image
Sequences. Springer-Verlag,Berlin, 1993.
[Witkin et al., 1987] A. Witkin, D. Terzopoulos,and M. Kass. Signal matchingthroughscale
space.InternationalJournalof ComputerVision, 1:133–144,1987.
[Wolberg, 1990] G. Wolberg. Digital ImageWarping. IEEEComputerSocietyPress,LosAlami-
tos,California,1990.
A 2-D projective transformations
Planarsceneviewsarerelatedby projectivetransformations.This is truein themostgeneralcase,
i.e., for any 3-D to 2-D mapping, 0 -21 camH3� (14)
whereH is a3-D worldcoordinate,0 is the2-D screenlocation,and 1 cam is a3 4 camera matrix
definedin (4). Since1 cam is of rank3, we haveH -z1�� 0 Q�� x (15)
where 1 � is a left inverseof 1 cam and x is in thenull spaceof 1 cam, i.e., 1 camx - 0. The
equationof a planein world coordinatescanbewrittenasp � Q v�� Q]� -N� or ���GH - 0 (16)
from whichwe canconcludethat� @ 1 � 0 Q]� �u�Gx - 0 or � - 5���� @ 1 � 0��t�������Gx�� (17)
andhence H - � { 5[���u�Gx�� _ 1 xE� @ � 1 � 0 - ˜1 0O� (18)
Fromany otherviewpoint,wehave 0 ( -N1 (cam˜1 0 -N1 2D 0O� (19)
which is a2-D planarprojective transformation.
A 2-D projectivetransformations 29
For theremainderof thisappendix,it is convenientto assumethattheworld coordinatesystem
coincideswith thefirst cameraposition,i.e.,; - { and 1 cam - A - B ˆA ? D , andtherefore1 � - ˆA _ 1 and x - � 0 � 0 � 0 � 1 � @ in (15–18). An imagepoint 0 thencorrespondsto a world
coordinateH , H - %' ˆA _ 1 0� *, � (20)
where � is theunknown fourth coordinateof H (calledprojectivedepth) which controlshow far
thepoint is from theorigin.
For panoramicimagemosaicing,we rotatethe point H aroundthe cameraoptical centerto
obtain H ( - %'.< ??"@1
*, H - %'�< ˆA _ 1 0� *, � (21)
with a correspondingscreencoordinate0 ( - ˆA ( < ˆA _ 1 0 -N1 2D 03� (22)
Thus,themappingbetweenthetwo screencoordinatesystemscanbedescribedby a2-D projective
transformation.
For piecewise-planarscenes,thepoint H will lie onsomeplane� whoseequationwecanwrite
as � q �GH - � ˆ� q �b5 1 ���GH - 0.6 Then,(18) reducestoH q - %' {� @q *, ˆA _ 1 0 (23)
asthe3-D positiononplane� correspondingto thescreenposition 0 . In asecondcameraposition,
theworld coordinateshave undergonea EuclideantransformationH (q - ; H q with rotation < and
translation> . Thenew screencoordinatesaretherefore0 (q - A ( ; H q - ˆA ( � < Q > � @q � ˆA _ 1 0 -N1 q 0 (24)
Thus,weseethat2-D projectivemotionmatricesfor thevariousplanarpiecesareinterrelated,and
arein fact determinedby the planeequations( � q ), the rigid motion ( < and > ), andthe viewing
matrices( ˆA and ˆA ( ). For methodsfor recovering theseunknown parametersfrom the 1 q , see
[Faugeras,1992;Hartley, 1994].
6Weexcludeplanespassingthroughthefirst cameracenterin this formulation.
30 A 2-D projectivetransformations
For generaldepthrecovery, eachimagepoint 0 correspondsto a3-D point H with anunknown
projective depth � asgiven in (20). In someotherframe,whoserelative orientationto the first
frameis denotedby;
, we have0 ( - A ( ; H - ˆA ( < ˆA _ 1 0 Q � ˆA ( > -N1 2D 0 Q � ˆ> � (25)
Thus,themotionbetweenthe two framescanbedescribedby our familiar 2-D planarprojective
motion, plus an amountof motion proportionalto the projective depth along the direction ˆ>(which is calledthe epipole) [Faugeras,1992]. Note that the valuesof 1 2D, ˆ> , and � arenot
unique: thereis a scaleambiguitybetween� and ˆ> , andmultiplesof ˆ> can be addedto 1 2D
with theappropriateadjustmentsto � . To remove theseambiguities,we mustenforcetherigidity
constraints1 2D - ˆA ( < ˆA _ 1 [Szeliski,1994].