Superimposed video disambiguation for increased field of view


Roummel F. Marcia, Changsoon Kim, Cihat Eldeniz, Jungsang Kim, David J. Brady, and Rebecca M. Willett

Department of Electrical and Computer Engineering, Duke University,

Durham, NC 27708, USA

[email protected]

Abstract: Many infrared optical systems in wide-ranging applications such as surveillance and security frequently require large fields of view (FOVs). Often this necessitates a focal plane array (FPA) with a large number of pixels, which, in general, is very expensive. In a previous paper, we proposed a method for increasing the FOV without increasing the pixel resolution of the FPA by superimposing multiple sub-images within a static scene and disambiguating the observed data to reconstruct the original scene. This technique, in effect, allows each sub-image of the scene to share a single FPA, thereby increasing the FOV without compromising resolution. In this paper, we demonstrate the increase of FOVs in a realistic setting by physically generating a superimposed video from a single scene using an optical system employing a beamsplitter and a movable mirror. Without prior knowledge of the contents of the scene, we are able to disambiguate the two sub-images, successfully capturing both large-scale features and fine details in each sub-image. We improve upon our previous reconstruction approach by allowing each sub-image to have slowly changing components, carefully exploiting correlations between sequential video frames to achieve small mean errors and to reduce run times. We show the effectiveness of this improved approach by reconstructing the constituent images of a surveillance camera video.

© 2008 Optical Society of America

OCIS codes: (100.2000) Digital image processing; (100.7410) Wavelets; (110.1758) Computational imaging; (110.3010) Image reconstruction techniques; (110.4155) Multiframe image processing

References and links
1. Y. Hagiwara, "High-density and high-quality frame transfer CCD imager with very low smear, low dark current, and very high blue sensitivity," IEEE Trans. Electron Devices 43, 2122–2130 (1996).
2. H. S. P. Wong, R. T. Chang, E. Crabbe, and P. D. Agnello, "CMOS active pixel image sensors fabricated using a 1.8-V, 0.25-μm CMOS technology," IEEE Trans. Electron Devices 45, 889–894 (1998).
3. S. D. Gunapala, S. V. Bandara, J. K. Liu, C. J. Hill, S. B. Rafol, J. M. Mumolo, J. T. Trinh, M. Z. Tidrow, and P. D. Le Van, "1024 x 1024 pixel mid-wavelength and long-wavelength infrared QWIP focal plane arrays for imaging applications," Semicond. Sci. Technol. 20, 473–480 (2005).
4. S. Krishna, D. Forman, S. Annamalai, P. Dowd, P. Varangis, T. Tumolillo, A. Gray, J. Zilko, K. Sun, M. G. Liu, J. Campbell, and D. Carothers, "Demonstration of a 320x256 two-color focal plane array using InAs/InGaAs quantum dots in well detectors," Appl. Phys. Lett. 86, 193501 (2005).
5. R. Szeliski, "Image mosaicing for tele-reality applications," in Proc. IEEE Workshop on Applications of Computer Vision, pp. 44–53 (1994).


6. R. A. Hicks, V. T. Nasis, and T. P. Kurzweg, "Programmable imaging with two-axis micromirrors," Opt. Lett. 32, 1066–1068 (2007).
7. S. C. Park, M. K. Park, and M. G. Kang, "Super-resolution image reconstruction: a technical overview," IEEE Signal Process. Mag. 20, 21–36 (2003).
8. R. C. Hardie, K. J. Barnard, J. G. Bognar, E. E. Armstrong, and E. A. Watson, "High-resolution image reconstruction from a sequence of rotated and translated frames and its application to an infrared imaging system," Opt. Eng. 37, 247–260 (1998).
9. J. C. Gillett, T. M. Stadtmiller, and R. C. Hardie, "Aliasing reduction in staring infrared imagers utilizing subpixel techniques," Opt. Eng. 34, 3130–3137 (1995).
10. M. Irani and S. Peleg, "Improving resolution by image registration," CVGIP: Graph. Models Image Process. 53, 231–239 (1991).
11. R. F. Marcia, C. Kim, J. Kim, D. Brady, and R. M. Willett, "Fast disambiguation of superimposed images for increased field of view," accepted to Proc. IEEE Int. Conf. Image Proc. (ICIP 2008).
12. P. D. O'Grady, B. A. Pearlmutter, and S. T. Rickard, "Survey of sparse and non-sparse methods in source separation," Int. J. Imag. Syst. Tech. 15, 18–33 (2005).
13. A. M. Bronstein, M. M. Bronstein, M. Zibulevsky, and Y. Y. Zeevi, "Sparse ICA for blind separation of transmitted and reflected images," Int. J. Imag. Syst. Tech. 15, 84–91 (2005).
14. E. Be'ery and A. Yeredor, "Blind separation of superimposed shifted images using parameterized joint diagonalization," IEEE Trans. Image Process. 17, 340–353 (2008).
15. J. Bobin, J.-L. Starck, J. Fadili, and Y. Moudden, "Morphological diversity and source separation," IEEE Trans. Signal Process. 13, 409–412 (2006).
16. S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput. 20, 33–61 (1998).
17. M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, "Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems," IEEE Journal of Selected Topics in Signal Processing: Special Issue on Convex Optimization Methods for Signal Processing (to appear).
18. E. Candes and T. Tao, "Near optimal signal recovery from random projections: universal encoding strategies," (2006), to be published in IEEE Trans. Inf. Theory. http://www.acm.caltech.edu/~emmanuel/papers/OptimalRecovery.pdf.
19. D. L. Donoho and Y. Tsaig, "Fast solution of ℓ1-norm minimization problems when the solution may be sparse," preprint (2006).
20. R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. Ser. B 58, 267–288 (1996).
21. R. F. Marcia and R. M. Willett, "Compressive coded aperture video reconstruction," accepted to Proc. Sixteenth European Signal Processing Conference (EUSIPCO 2008).
22. "Benchmark Data for PETS-ECCV 2004," in Sixth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2004). URL http://www-prima.imag.fr/PETS04/caviar_data.html.
23. M. Duarte, M. Davenport, D. Takhar, J. Laska, T. Sun, K. Kelly, and R. Baraniuk, "Single-pixel imaging via compressive sampling," IEEE Signal Process. Mag. 25, 83–91 (2008).

1. Introduction

The performance of a typical imaging system is characterized by the resolution (the smallest feature that the system can resolve) and the field of view (FOV: the maximum angular extent that can be observed at a given instance). In most electronic imaging systems today, the detector element is a focal plane array (FPA) typically made out of semiconductor photodetectors [1, 2, 3, 4]. An FPA performs spatial sampling of the optical intensities at the image plane, with the maximum resolvable spatial frequency being inversely proportional to the center-to-center distance between pixels. Therefore, to obtain a high-resolution image with a given FPA, the optics must provide sufficient magnification, which limits the FOV. There are many applications where this trade-off between the resolution and FOV needs to be overcome. A good example is thermal imaging surveillance systems operating at the mid- and long-wave infrared wavelengths (3–20 μm). As the FPAs sensitive to this spectral range remain very expensive, techniques capable of achieving a wide FOV with a small-pixel-count FPA are desired.

Many techniques proposed to date to overcome the FOV-resolution trade-off are based on acquisition of multiple images and their subsequent numerical processing. For example, image mosaicing techniques increase the FOV while retaining the resolution by tiling multiple sequentially captured images corresponding to different portions of the overall FOV [5, 6]. When applied to a video system, these techniques require acquisition of all sub-images for each video frame in order to accurately capture relative motion between adjacent frames. This means that the scanning element must scan through the sub-images at a rate much faster than the video frame rate, which is challenging to implement using conventional video cameras. In another example, super-resolution techniques provide means to overcome the resolution limit imposed by the FPA pixel size. In this technique, multiple images are obtained from a single scene, with each image having a different sub-pixel displacement from the others [7, 8]. The sub-pixel displacements provide additional information about the scene as compared with a single image, which can be exploited to construct an image of the scene with resolution better than that imposed by the FPA pixel size. For these techniques to succeed, the displacements of the low-resolution images need to be known with sub-pixel accuracy, either by precise control of the hardware motion or by an accurate image registration algorithm [9, 10]. As the pixel size of FPAs continues to shrink, this requirement translates to micron-level control/registration, which is difficult to maintain in a realistic operating environment subject to vibrations and temperature variations.

Recently, we proposed a numerical method by which the FOV of an imaging system can be increased without compromising its resolution [11]. In our setup, a static scene to be imaged is partitioned into smaller scenes, which are imaged onto a single FPA to form a composite image. We developed an efficient video processing approach to separate the composite image into its constituent images, thus restoring the complete scene corresponding to the overall FOV. To make this otherwise highly ill-posed problem of disambiguating the image tractable, the superimposed sub-images are moved relative to one another between video frames. The disambiguation problem that we considered is similar to the blind source separation problem [12, 13, 14], where the manner in which the sub-images are superimposed is unknown. In our case, we control how the sub-images are superimposed by prescribing the relative motion between the sub-images. Incorporating this knowledge into our proposed optimization algorithms, we can successfully and efficiently differentiate the sub-images to accurately reconstruct the original scene.

In this paper, we significantly extend our previous results in three principal aspects. First, to demonstrate the increase of FOVs in a realistic setting, we physically generate a composite video from a scene using an optical system employing a beamsplitter and a movable mirror. Without considerable prior knowledge of the contents of the scene, we are able to separate the sub-images, successfully capturing both the large-scale features and fine details. Second, we show the effectiveness of the proposed approach by reconstructing, with small mean square errors, a dynamic scene where, in contrast to the previous demonstration, objects in the scene are moving. Third, we improve upon the previous computational methods by exploiting correlations between sequential video frames, particularly the sparsity in the difference between successive frames, leading to accurate solutions with reduced computational time.

The paper is organized as follows: In Sec. 2, we discuss the concept of our technique and give a detailed description of the proposed architecture. Sec. 3 shows how the video disambiguation problem can be formulated and solved using optimization techniques based on sparse representation algorithms. In Sec. 4, we describe both the physical and numerical experiments. We conclude with a summary of the paper in Sec. 5.

2. Proposed camera architecture for generating a superimposed video

Figure 1(a) schematically shows the basic concept of superimposition and disambiguation. In the superimposition process, multiple sub-images are merged to form a composite image (shown on the right side of Fig. 1(a)) in a straightforward manner; the intensity of each pixel in the composite image is the sum of the intensities of the corresponding pixels in the individual images. However, the inverse process – the disambiguation of the individual sub-images from this composite image – is more challenging. For this, we must determine how the intensity of each pixel in the composite image is distributed over the corresponding pixels in the individual sub-images so that the resulting reconstruction accurately represents the original scene. Our technique achieves this task by measuring a composite video sequence, where the position of each sub-image is slightly altered at each frame. It is the movement of these individual sub-images that allows disambiguation to succeed. For simplicity, we consider the example of superimposing only two sub-images in our experiments, but the approach we describe can be extended to more general cases.
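For concreteness, the pixel-wise summation of two sub-images with a per-frame shift can be simulated along the following lines. This is only a minimal NumPy sketch with hypothetical array sizes and shifts; in the actual system the superimposition is performed optically by the hardware of Sec. 2.

```python
import numpy as np

def superimpose(x1, x2, shift):
    """Composite frame: each pixel is the sum of the stationary left
    sub-image and the horizontally shifted right sub-image."""
    # np.roll wraps around at the boundary; in the real system the shift
    # instead exposes a "blind zone" (see Sec. 4).
    return x1 + np.roll(x2, shift, axis=1)

# Hypothetical 128x128 sub-images and a per-frame shift pattern.
rng = np.random.default_rng(0)
x1 = rng.random((128, 128))   # stationary left sub-image
x2 = rng.random((128, 128))   # right sub-image moved by the mirror
frames = [superimpose(x1, x2, s) for s in range(-5, 6)]
```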

Fig. 1. (a) Basic concept of superimposition and disambiguation. (b) Proposed camera architecture for superimposing two sub-images, viewed from the top. The scene is split into two halves, $x_t^{(1)}$ and $x_t^{(2)}$. The optical field from the left half propagates directly through the beamsplitter to hit the FPA in the camera. The optical field from the right half hits a movable mirror before propagating to the beamsplitter and being reflected to the FPA in the camera.

Superimposed images which are shifted relative to one another at different frames can easily be recorded using a simple camera architecture, depicted for two sub-images in Fig. 1(b). Constructed using beamsplitters and movable mirrors, the proposed assembly merges the sub-images into a single image and temporally varies the relative position of the two sub-images as they hit the detector. The optical field from the left half of the scene propagates directly through the beamsplitter and hits the FPA in the camera at the same relative position for every frame. The optical field from the right half of the scene, however, is reflected by a movable mirror followed by the beamsplitter before hitting the FPA. When the mirror, mounted on a linear stage, is moved, the right half of the scene is moved correspondingly. The image recorded by the FPA is then the sum of the stationary left sub-image and the right sub-image that is moved for each frame, resulting in a superimposed video sequence.


3. Mathematical model and computational approach for disambiguation

Let $\{x_t\}$ be a sequence of frames representing a slowly changing scene. The superimposition process (Fig. 1(a)) can be modeled mathematically at the $t$th frame as

$z_t = A_t x_t + \epsilon_t, \qquad (1)$

where $z_t \in \mathbb{R}^{m \times 1}$ is the observed composite image, $x_t \in \mathbb{R}^{n \times 1}$ is the (unknown) scene to be reconstructed, $A_t \in \mathbb{R}^{m \times n}$ is the projection matrix that describes the superimposition, and $\epsilon_t$ is noise at frame $t$. We assume in this paper that $\epsilon_t$ is zero-mean white Gaussian noise. The disambiguation problem is the inverse problem of solving for $x_t$ given the observations $z_t$ and the matrix $A_t$. In this setting, $n > m$, which makes Eq. (1) underdetermined. There are several techniques for approaching this ill-posed statistical inverse problem of disambiguating the sub-images, many of which exploit the sparsity of $x_t$ in one or more bases (cf. [15, 16, 17]). We note that Eq. (1) is different from the formulation in our previous paper [11] in that the scene is now dynamic, i.e., $x_t$ can have (small) changes for each $t$, whereas previously $x_t$ was static, i.e., $x_{t+1} = x_t$ for all $t$.

In the camera architecture described in Sec. 2, one sub-image is held stationary relative to the other. If $x_t = [x_t^{(1)};\, x_t^{(2)}]$ are the pixel intensities corresponding to the two images, then $A_t$ is the underdetermined matrix $[\,I \;\; S_t\,]$, where $I$ is the identity matrix and $S_t$ describes the movement of the second sub-image in relation to the first at the $t$th frame. Here, we assume that $x_t^{(1)}$ corresponds to the stationary sub-image while $x_t^{(2)}$ corresponds to the sub-image whose shifting is induced by the moving mirror (see Fig. 1(b)). Then the above system can be modeled mathematically as

$z_t = [\,I \;\; S_t\,] \begin{bmatrix} x_t^{(1)} \\ x_t^{(2)} \end{bmatrix} + \epsilon_t = \bar{S}_t W \theta_t + \epsilon_t, \qquad (2)$

where $\bar{S}_t = [\,I \;\; S_t\,]$. Here, we write $x_t = W \theta_t$, where $\theta_t$ denotes the vector of coefficients of the two sub-images in the wavelet basis and $W$ denotes the inverse wavelet transform. (We use the wavelet transform here because of its effectiveness with many natural images, but alternative bases could certainly be used depending on the setting.) We note that this formulation is slightly different from that found in our previous paper [11], where the coefficients for each sub-image are treated separately. In Eq. (2), $\theta_t$ contains the wavelet coefficients for the entire image $x_t$, as opposed to the concatenation of the wavelet coefficients of $x_t^{(1)}$ and $x_t^{(2)}$, resulting in a more seamless interface between the sub-images.
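The synthesis step $x_t = W\theta_t$ and the combined observation operator $\bar{S}_t W$ can be mimicked with any orthonormal wavelet transform. The sketch below uses PyWavelets purely as a stand-in for $W$; the wavelet family ("haar") and decomposition level are assumptions, not choices stated in the paper.

```python
import numpy as np
import pywt

def wavelet_analysis(x, wavelet="haar", level=3):
    """theta = W^{-1} x: stack the 2-D wavelet coefficients into a vector."""
    coeffs = pywt.wavedec2(x, wavelet, level=level)
    arr, slices = pywt.coeffs_to_array(coeffs)
    return arr.ravel(), (arr.shape, slices)

def wavelet_synthesis(theta, meta, wavelet="haar"):
    """x = W theta: undo the stacking and apply the inverse wavelet transform."""
    shape, slices = meta
    coeffs = pywt.array_to_coeffs(theta.reshape(shape), slices,
                                  output_format="wavedec2")
    return pywt.waverec2(coeffs, wavelet)

def forward(theta, meta, shift, wavelet="haar"):
    """z_t = [I  S_t] W theta: synthesize the scene, split it into its two
    halves, and add the shifted right half to the stationary left half."""
    x = wavelet_synthesis(theta, meta, wavelet)
    x1, x2 = np.hsplit(x, 2)
    return x1 + np.roll(x2, shift, axis=1)
```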

We formulate the reconstruction problem as a sequence of nonlinear optimization problems, minimizing the norm of the error $\|z_t - \bar{S}_t W \theta_t\|$ along with a regularization term $\tau\|\theta_t\|$, for some tuning parameter $\tau$, at each time frame, and using the computed minimum as the initial value for the following frame. Since the underlying inverse problem is underdetermined, the regularization term in the objective function is necessary to make the disambiguation problem well-posed. This formulation of the reconstruction problem is similar to the $\ell_2$-$\ell_1$ formulation of the compressed sensing problem [18, 19, 20] for suitably chosen norms: using the Euclidean norm for the error term gives the least-squares error, while using the one-norm for the regularization term induces sparsity in the solution. Sparse solutions in the wavelet domain provide accurate reconstructions of the original signal since the wavelet transform typically retains the majority of natural images' energy in a relatively small number of basis coefficients. To solve the problem of disambiguating two superimposed images, we thus formulate it as the nonlinear optimization problem

$\hat{\theta}_t = \arg\min_{\theta_t} \left\| z_t - \bar{S}_t W \theta_t \right\|_2^2 + \tau \left\| \theta_t \right\|_1. \qquad (3)$
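The paper solves problems of this form with the GPSR algorithm [17] (see Sec. 4). Purely to make the $\ell_2$-$\ell_1$ structure of (3) concrete, the following is a generic iterative soft-thresholding (ISTA) sketch, not the authors' solver; here A and At are hypothetical handles for $\bar{S}_t W$ and its adjoint, both mapping NumPy arrays to NumPy arrays.

```python
import numpy as np

def soft_threshold(v, t):
    """Component-wise soft thresholding, the proximal map of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(z, A, At, tau, step, n_iter=200, theta0=None):
    """Minimize ||z - A theta||_2^2 + tau*||theta||_1 by iterative soft
    thresholding; `step` should satisfy step <= 1/(2*||A||_2^2)."""
    theta = np.zeros_like(At(z)) if theta0 is None else theta0.copy()
    for _ in range(n_iter):
        grad = 2.0 * At(A(theta) - z)      # gradient of the quadratic data-fit term
        theta = soft_threshold(theta - step * grad, step * tau)
    return theta
```

The theta0 argument is where a warm start from the previous frame (the 1-Frame Method described below) would enter.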


If we solve the optimization problem (3) for each frame independently, the $\ell_1$ regularization term can lead to reasonably accurate solutions to an otherwise underdetermined and ill-posed inverse problem, particularly when the true scene is very sparse in the wavelet basis and significant amounts of computation time are devoted to each frame. However, when the scene is stationary or slowly varying relative to the frame rate of the imaging system, subsequent frames of observations can be used simultaneously to achieve significantly better solutions. We describe a family of methods for exploiting interframe correlations that differ in the number of frames solved simultaneously.

1-Frame Method. For a scene that changes only slightly from frame to frame, the reconstruction from a previous frame is often a good approximation to the following frame. In the 1-Frame Method, we use the solution $\hat{\theta}_t$ to the optimization problem (3) at the $t$th frame to initialize the optimization problem for the $(t+1)$th frame.

2-Frame Method. We can improve upon the 1-Frame Method by solving for multiple frames in each optimization problem. In the 2-Frame Method we solve for two successive frames simultaneously. However, rather than solving for $\theta_t$ and $\theta_{t+1}$, we solve for $\theta_t$ and $\Delta\theta_t \equiv \theta_{t+1} - \theta_t$ for two main reasons. First, for slowly changing scenes, $\theta_{t+1} \approx \theta_t$, and since both $\theta_{t+1}$ and $\theta_t$ are already sparse, $\Delta\theta_t$ is even sparser, making $\Delta\theta_t$ even more appropriate for the sparsity-inducing $\ell_2$-$\ell_1$ minimization. Second, solving for $\Delta\theta_t$ allows for coupling the frames in an otherwise separable objective function, leading to accurate solutions for both $\theta_t$ and $\Delta\theta_t$. The minimization problem can be formulated as follows:

$\hat{\theta}_t^{[2]} \equiv \begin{bmatrix} \hat{\theta}_t \\ \Delta\hat{\theta}_t \end{bmatrix} = \arg\min_{\theta_t,\, \Delta\theta_t} \left\| \begin{bmatrix} z_t \\ z_{t+1} \end{bmatrix} - \begin{bmatrix} \bar{S}_t & 0 \\ 0 & \bar{S}_{t+1} \end{bmatrix} \begin{bmatrix} W & 0 \\ W & W \end{bmatrix} \begin{bmatrix} \theta_t \\ \Delta\theta_t \end{bmatrix} \right\|_2^2 + \tau \left\| \begin{bmatrix} \theta_t \\ \Delta\theta_t \end{bmatrix} \right\|_1, \qquad (4)$

where $\bar{S}_i = [\,I \;\; S_i\,]$ for $i = t$ and $t+1$. The optimization problem for frame $t+1$ is then initialized using

$\begin{bmatrix} \theta_{t+1}^{(0)} \\ \Delta\theta_{t+1}^{(0)} \end{bmatrix} \equiv \begin{bmatrix} \hat{\theta}_t + \Delta\hat{\theta}_t \\ \Delta\hat{\theta}_t \end{bmatrix},$

which should already be a good approximation to the solution $\hat{\theta}_{t+1}^{[2]}$. Note that the formulation in (4) is different from that proposed in our previous paper [11], where $\theta_t$ corresponds to coefficients of static images, i.e., $\theta_t = \theta_{t+1}$, whereas here we allow for movements within each sub-image. In addition, since $\Delta\theta_t$ is significantly sparser than $\theta_t$, we use a different regularization parameter $\rho$ on $\|\Delta\theta_t\|_1$ to encourage very sparse $\Delta\theta_t$ solutions, which leads to the following optimization problem:

$\hat{\theta}_t^{[2]} = \arg\min_{\theta_t,\, \Delta\theta_t} \left\| \begin{bmatrix} z_t \\ z_{t+1} \end{bmatrix} - \begin{bmatrix} \bar{S}_t & 0 \\ 0 & \bar{S}_{t+1} \end{bmatrix} \begin{bmatrix} W & 0 \\ W & W \end{bmatrix} \begin{bmatrix} \theta_t \\ \Delta\theta_t \end{bmatrix} \right\|_2^2 + \tau \left\| \theta_t \right\|_1 + \rho \left\| \Delta\theta_t \right\|_1.$

In our experiments, we use $\rho = (1.0 \times 10^3)\,\tau$.
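To make the block structure of Eq. (4) concrete, the stacked two-frame prediction and the weighted objective can be sketched as below. The callables Wsyn, S_t, and S_t1 are hypothetical handles (e.g., in the style of the forward operator sketched earlier) for $W$, $\bar{S}_t$, and $\bar{S}_{t+1}$; this is an illustration of the objective only, not the authors' solver.

```python
import numpy as np

def two_frame_residuals(theta_t, dtheta_t, Wsyn, S_t, S_t1, z_t, z_t1):
    """Residuals of Eq. (4): frame t is predicted by W theta_t and frame t+1
    by W (theta_t + dtheta_t), each passed through its own [I  S] operator."""
    r_t = z_t - S_t(Wsyn(theta_t))
    r_t1 = z_t1 - S_t1(Wsyn(theta_t + dtheta_t))
    return r_t, r_t1

def two_frame_objective(theta_t, dtheta_t, Wsyn, S_t, S_t1, z_t, z_t1, tau, rho):
    """Weighted two-frame objective with separate penalties on theta and dtheta."""
    r_t, r_t1 = two_frame_residuals(theta_t, dtheta_t, Wsyn, S_t, S_t1, z_t, z_t1)
    return (np.sum(r_t ** 2) + np.sum(r_t1 ** 2)
            + tau * np.abs(theta_t).sum() + rho * np.abs(dtheta_t).sum())
```

The warm start for the next frame is then simply (theta_t + dtheta_t, dtheta_t), matching the initialization given above.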

4-Frame Method. The 4-Frame Method is very similar to the 2-Frame Method, but we solve for the coefficients using four successive frames instead of two, using the additional observation vectors $z_{t+2}$ and $z_{t+3}$ and observation matrices $S_{t+2}$ and $S_{t+3}$. By coupling more frames, the coefficients are required to satisfy more equations, leading to more accurate solutions. The drawback, however, is that the corresponding linear systems to be solved are larger and require more computation time. The corresponding minimization problem is given by

$\hat{\boldsymbol{\theta}}_t^{[4]} = \arg\min_{\boldsymbol{\theta}_t} \left\| \mathbf{z}_t - \mathbf{S}_t \mathbf{W} \boldsymbol{\theta}_t \right\|_2^2 + \tau \left\| \theta_t \right\|_1 + \rho \sum_{j=0}^{2} \left\| \Delta\theta_{t+j} \right\|_1, \qquad (5)$

where the minimizer $\hat{\boldsymbol{\theta}}_t^{[4]} = [\,\theta_t;\, \Delta\theta_t;\, \Delta\theta_{t+1};\, \Delta\theta_{t+2}\,]$, $\Delta\theta_i \equiv \theta_{i+1} - \theta_i$ for $i = t, \dots, t+2$, and

$\mathbf{z}_t = \begin{bmatrix} z_t \\ z_{t+1} \\ z_{t+2} \\ z_{t+3} \end{bmatrix}, \quad \mathbf{S}_t = \begin{bmatrix} \bar{S}_t & & & \\ & \bar{S}_{t+1} & & \\ & & \bar{S}_{t+2} & \\ & & & \bar{S}_{t+3} \end{bmatrix}, \quad \mathbf{W} = \begin{bmatrix} W & & & \\ W & W & & \\ W & W & W & \\ W & W & W & W \end{bmatrix}, \quad \boldsymbol{\theta}_t = \begin{bmatrix} \theta_t \\ \Delta\theta_t \\ \Delta\theta_{t+1} \\ \Delta\theta_{t+2} \end{bmatrix}.$

Here, $\bar{S}_i = [\,I \;\; S_i\,]$ for $i = t, \dots, t+3$. There is another formulation for simultaneously solving for four frames (see [21]); however, results from that paper indicate that solving (5) is more effective in generating accurate solutions. As in the 2-Frame Method, we place the same weights ($\rho = (1.0 \times 10^3)\,\tau$) on $\|\Delta\theta_i\|_1$ for $i = t, \dots, t+2$ to encourage very sparse solutions.

A general $n$-Frame Method can be defined likewise for simultaneously solving for $n$ frames. In our numerical experiments, we also use the 8- and 12-Frame Methods.
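The $n$-frame generalization only changes the block pattern: $\mathbf{W}$ is block lower-triangular, so frame $t+j$ is synthesized from $\theta_t$ plus the accumulated differences. A minimal sketch of that accumulation, with hypothetical handles Wsyn for $W$ and S_ops for the per-frame $\bar{S}_{t+j}$ operators:

```python
import numpy as np

def nframe_predictions(theta_t, dthetas, Wsyn, S_ops):
    """Predicted observations for n consecutive frames: frame t+j is
    S_{t+j} W (theta_t + dtheta_t + ... + dtheta_{t+j-1}), which is the
    block lower-triangular structure of the stacked W above."""
    preds, coeff = [], np.asarray(theta_t, dtype=float).copy()
    for j, S in enumerate(S_ops):
        if j > 0:
            coeff = coeff + dthetas[j - 1]   # accumulate frame-to-frame differences
        preds.append(S(Wsyn(coeff)))
    return preds
```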

4. Experimental methods

To demonstrate that the FOV of an optical system can be increased using the proposed superimposition and disambiguation technique, we perform two studies, one physical and one numerical. The physical experiment involves actually building the proposed camera architecture of Sec. 2 to obtain a composite video that is separated into the original scene. Here, since the scene is static, we use the most successful method in our previous paper [11] for disambiguating stationary scenes. The numerical experiment superimposes the two dynamic sub-images of a surveillance video. This experiment demonstrates that the two sub-images can be successfully disambiguated and that the slowly moving components of the original scene can be captured with the proposed approach by exploiting the inter-frame correlations.

In these experiments, we solve the optimization problems for the various proposed methods using the Gradient Projection for Sparse Reconstruction (GPSR) algorithm of Figueiredo et al. [17]. GPSR is a gradient-based optimization method that is very fast, accurate, and efficient. In addition, GPSR has a debiasing phase: upon solving the $\ell_2$-$\ell_1$ minimization problem, it fixes the non-zero pattern of the optimal $\theta_t$ and minimizes the $\ell_2$ term of the objective function, resulting in a minimal error in the reconstruction while keeping the number of non-zeros in the wavelet coefficients at a minimum. It has been shown to outperform many of the state-of-the-art codes for solving the $\ell_2$-$\ell_1$ minimization problem or its equivalent formulations.
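A debiasing step of this kind can be imitated by freezing the support of the $\ell_1$ solution and re-fitting only those coefficients by least squares. The sketch below is a generic stand-in (not the GPSR code) using SciPy's lsqr on a restricted linear operator; A and At are hypothetical handles for the observation operator and its adjoint, assumed to map flat vectors to flat vectors.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

def debias(theta_hat, z, A, At, tol=1e-8):
    """Keep the non-zero pattern of theta_hat fixed and minimize the
    quadratic data-fit term over the selected coefficients only."""
    support = np.abs(theta_hat) > tol

    def matvec(v):                  # A restricted to the support columns
        full = np.zeros_like(theta_hat)
        full[support] = v
        return A(full)

    def rmatvec(r):                 # adjoint restricted to the support
        return At(r)[support]

    A_s = LinearOperator((z.size, int(support.sum())),
                         matvec=matvec, rmatvec=rmatvec)
    theta_deb = theta_hat.copy()
    theta_deb[support] = lsqr(A_s, z)[0]
    return theta_deb
```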

Ideally, the extent of $[x_t^{(1)};\, x_t^{(2)}]$ should cover the entire scene at all frames. However, as $x_t^{(2)}$ moves, according to either the movement of the mirror in the optical experiment or the prescribed motion in the numerical experiment, there can be a portion of the scene that is not contained in the superimposed image at some frames, creating a "blind zone". For a frame in which the disambiguated image does not cover the entire scene, the result obtained for the blind zone from the previous frame is combined with the disambiguated image to reconstruct the entire scene.
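This blind-zone handling amounts to a masked merge of the current and previous reconstructions; a minimal sketch with hypothetical variable names (covered is a boolean mask of pixels observed in the current composite frame):

```python
import numpy as np

def fill_blind_zone(x_current, x_previous, covered):
    """Use the current disambiguated result where the composite frame covered
    the scene, and reuse the previous frame's reconstruction in the blind zone."""
    return np.where(covered, x_current, x_previous)
```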

The reconstruction videos for both the physical and numerical experiments are available at http://www.ee.duke.edu/nislab/videos/disambiguation under the names duke-earth-day.avi and surveillance.avi.


4.1. Optical experiment: Duke Earth Day

Fig. 2. (a) Original "Duke Earth Day" scene used in the experiment. The box with a solid red border represents the extent of $x^{(1)}$, which is stationary during the superimposition process. As the mirror moves in a circular motion in the $x$-$z$ plane shown in Fig. 1(b), the blue box with a solid border, which represents the moving boundary of $x^{(2)}$ at the object plane, oscillates between the left and right turning points, represented by the blue boxes with dashed and dotted borders, respectively. (b) Superimposed image (left panel) and reconstructed scene (right panel) when the moving boundary is near the mid-point of the oscillation in the superimposed video. (c) Superimposed image (left panel) and reconstructed scene (right panel) when the moving boundary is near the left turning point, where the sub-images are not completely disambiguated: the man in the hat and the banner (circled in yellow) partly appear in the left half of the disambiguated image.

As mentioned above, the system shown in Fig. 1(b) is capable of generating a composite video where the sub-image corresponding to the right half of the scene, $x^{(2)}$, is moved while that corresponding to the left half of the scene, $x^{(1)}$, remains still. In our experiment, the movement of $x^{(2)}$ was along the $x$-direction, with its position following a sinusoidal function of the frame index. This was achieved by moving the mirror with a motion controller along a circular path in the $x$-$z$ plane with a constant velocity; the displacement of the mirror along the $x$-direction causes $x^{(2)}$ to move in the same direction, whereas the motion of the mirror in the $z$-direction does not create any change in the composite video. To determine $S_t$ in Eq. (2) corresponding to a given circular movement of the mirror, we performed a calibration experiment where a scene with a white background contained a black dot on its right half. By tracking the dot in the recorded video, we verified that the movement of the dot in the video was indeed sinusoidal, and also determined its amplitude and period. In the actual experiment, the scene was replaced with a photograph ("Duke Earth Day") while leaving the rest of the system unaltered. Hence, the amplitude and period of the movement of $x^{(2)}$ are the same as those obtained in the calibration experiment. The phase of the sinusoidal movement was determined for each recording by calculating the mean square difference between adjacent frames to identify the frame at which $x^{(2)}$ moved to the farthest right (or left).
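The phase estimate described above can be reproduced by locating the frames at which adjacent composite frames differ the least, since the relative velocity of the sub-images vanishes at the turning points. A sketch, assuming video is a NumPy array of composite frames with shape (frames, rows, cols):

```python
import numpy as np

def turning_point_frame(video):
    """Mean-square difference between adjacent frames of the composite video;
    its minima flag the frames at which the moving sub-image reaches a
    turning point of its sinusoidal motion."""
    msd = np.mean((video[1:] - video[:-1]) ** 2, axis=(1, 2))
    return msd, int(np.argmin(msd))
```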

The results of the physical experiment are shown in Fig. 2. Figure 2(a) shows the original scene used in the experiment. Also shown is the extent of $x_t^{(1)}$, as well as that of $x_t^{(2)}$ at different phases of the sinusoidal oscillation. Figure 2(b) shows the superimposed image and the reconstructed scene when the moving boundary of $x_t^{(2)}$ shown in Fig. 2(a) was near the mid-point of its sinusoidal oscillation. In the right panel of Fig. 2(b), the main figures in the original scene are easily identified and the text is sufficiently clear to be read. Similar results were obtained at other phases of the oscillation except when the moving boundary was near either the left or right turning point. Since the relative velocity of the two sub-images approaches zero at these phases, $x_t^{(1)}$ and $x_t^{(2)}$ cannot be completely disambiguated; the disambiguation result shown in Fig. 2(c) reveals that some objects belonging in $x_t^{(2)}$, such as the man in the hat and the banner (circled in yellow), partly appear in the left half of the disambiguated image. As shown in Fig. 1(b), the optical path length for $x_t^{(2)}$ is longer than that for $x_t^{(1)}$, resulting in a small discrepancy in size. In our setup, $x_t^{(1)}$ is 7% larger than $x_t^{(2)}$. For this reason, the tiling of the disambiguated images appears less seamless.

4.2. Numerical experiment: Surveillance video

The video used in this numerical experiment is obtained from the Benchmark Data for PETS-ECCV 2004 [22]. Called Fight_OneManDown.mpg, it depicts two people fighting, with one man eventually falling down while the other runs away. The video was filmed using a wide-angle camera lens in the entrance lobby of the INRIA Labs at Grenoble, France. Originally in 384 × 288 pixel resolution, the color video is rescaled to 512 × 256 and converted to grayscale for ease of processing. This type of video is appropriate for our application since the scene of the lobby is relatively static, with only some small scene changes corresponding to people moving in the lobby. We only use the parts of the video where there is movement in both halves of the scene, to test whether our approach is able to assign each moving component to the proper sub-image (see objects circled in yellow in Fig. 3(a)). Zero-mean white Gaussian noise is added in our simulations. We add 10 frames between video frames using interpolation to simulate a faster video frame rate.
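The simulation preprocessing (noise injection and interpolated intermediate frames to emulate a higher frame rate) can be sketched as follows; resizing and grayscale conversion are omitted, and the noise level sigma is a placeholder rather than a value taken from the paper.

```python
import numpy as np

def interpolate_frames(f0, f1, n_between):
    """Linearly interpolate n_between intermediate frames between two video
    frames, emulating a higher frame rate."""
    alphas = np.linspace(0.0, 1.0, n_between + 2)[1:-1]
    return [(1.0 - a) * f0 + a * f1 for a in alphas]

def add_noise(frame, sigma, rng=None):
    """Add zero-mean white Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng() if rng is None else rng
    return frame + rng.normal(0.0, sigma, size=frame.shape)
```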

Fig. 3. (a) Original surveillance video with two moving components (circled in yellow), (b) the observed superimposition of the sub-images of the left and right halves of the scene, and (c) the reconstruction using the 8-Frame Method allowing 20 seconds of GPSR iterations.

For a fairer comparison between the various methods, we allowed the optimization algorithm to run for specified times (5 and 20 seconds). For example, the 2-Frame Method might lead to more accurate solutions than the 1-Frame Method since it solves for two frames simultaneously; however, it is allowed fewer GPSR iterations than the 1-Frame Method since the computational run time of each 2-Frame optimization iteration is longer than that of the 1-Frame Method. Figures 4(a) and 4(b) show the mean squared error (MSE) values for the various methods described in Sec. 3 with the two specified times (5 and 20 seconds) allotted for GPSR minimization with debiasing. We make the following observations. First, with the exception of the 1-Frame Method, every method benefits from allotting more time to solve the optimization problem (20 seconds vs. 5 seconds). The 1-Frame Method does not benefit from this increase in time allotment because GPSR often converges (i.e., one of the criteria for algorithm termination becomes satisfied) within the 5-second time frame. Second, solving for arbitrarily many frames does not necessarily lead to more accurate solutions. Even though the 12-Frame Method simultaneously solves for many more frames, its MSE performance is worse than that of the 8-Frame Method, especially if only 5 seconds are allotted for each frame. Figure 4(a) shows that the performance of the 12-Frame Method is even worse than that of the 1-Frame Method for most of the frames that were considered. This poor performance is mainly due to the fact that, because the resulting linear system to be solved for the 12-Frame Method is so large and the allotted time is so short, only three GPSR iterations were allowed for each frame in our reconstruction method, which is hardly sufficient for the GPSR iterates to be near the solution. However, when the allotted time is increased to 20 seconds, the MSE performance of the 12-Frame Method significantly improves. Third, the reconstructions of the initial frames for the 8-Frame Method are generally worse than those of the 2- and 4-Frame Methods, but the sharp decrease in MSE value for the 8-Frame Method indicates that the solutions from the previous frames are being used effectively to initialize the current frame optimization. Fourth, the relatively ragged behavior of the various methods in Fig. 4(a) compared to that in Fig. 4(b), especially for the 8- and 12-Frame Methods, can perhaps be attributed to the difference in time restriction: because of the relatively few GPSR iterations allowed within the 5-second time limit per frame, sufficiently good solutions, relative to the other frames, are not found in some instances. Qualitatively, the disambiguated reconstruction captures both large-scale features and fine details of the original scene. The overall structure of the lobby is correctly depicted, while small details such as those of the several kiosks and handrails, and motions such as those of the man's arms on the right half of the scene, are reproduced accurately. The ghosting on the bottom of the right half of the reconstruction results from the lack of contrast in regions of high pixel intensity values on the left half of the scene. Yet in spite of this ghosting, details such as the edges of the lobby floor tiles are still distinguishable in the areas where the ghosting occurs.

Fig. 4. MSE values for the 90 frames allowing 5 seconds (a) and 20 seconds (b) to solve the optimization problems for the different $n$-Frame Methods for the surveillance video.

4.3. Discussion

While we showed in this section that two superimposed sub-images (and in previous demonstrations, up to four sub-images [11]) can be successfully disambiguated using our proposed technique, we recognize that there are practical limitations to our method, especially in extending it to disambiguating more sub-images or resolving quickly moving objects. First, the problem of reconstructing a scene becomes even more ill-posed as the number of superimposed sub-images increases. Consequently, it is necessary to ensure that the movements of the sub-images relative to each other do not coincide and are fundamentally different. Thus, the corresponding hardware becomes more complex. However, we note that the camera architecture will only involve more beamsplitters and movable mirrors, which cost substantially less than a larger FPA, especially for a system operating at the mid- and long-wave infrared wavelengths. Second, the motion within each sub-image must remain sufficiently slow. The ability of the proposed techniques to disambiguate superimposed sub-images relies upon the temporal correlations of successive frames. Our reconstruction techniques succeed precisely because the frames are strongly correlated (i.e., the difference between consecutive frames is mostly zero in the wavelet domain) and this sparsity is exploited accordingly. The assumption that strong inter-frame correlations exist can be satisfied by increasing the video frame rate or limiting the application to scenes with relatively slowly moving components. For example, in some surveillance applications in which the activity of interest is very fast and transient, the proposed architecture might not be suitable.

Aside from these challenges associated with hardware and temporal correlations, the mathematical methods described in this paper should extend to handling much larger numbers of sub-images. For example, consider an extreme case in which there is one sub-image for each pixel in the high-resolution scene. In this setting, each observation frame would be the sum of a different subset of the pixels (sub-images) in the scene, because the shifting of sub-images would make some of them unobservable by the detector at different times. This model is highly analogous to the Rice "Single Pixel Camera" [23], in which each measurement (in time) is the sum of a random collection of pixels. Duarte et al. demonstrate that if sufficiently many measurements are collected over time using this setup, then a static scene can be reconstructed with high accuracy.

We note that the proposed approach is different from classical mosaicing as described in Sec. 1. For example, consider a scene with a quickly moving, transient object in one location. With classical mosaicing, this object would only be observed at half the frame rate (since the other half of the frame rate is used to observe the other half of the scene); however, it would be observed with high spatial resolution. In contrast, the proposed technique would observe every part of the image during each frame acquisition, resulting in very high temporal resolution for detecting transient objects. However, because the disambiguation procedure relies on temporal correlations, the spatial resolution of the reconstructed object would be relatively poor, i.e., it would look blurred.

5. Conclusions

In this paper, we propose a novel camera architecture for collecting high-resolution, wide field-of-view videos in settings such as infrared imaging where large focal plane arrays are unavailable. This architecture is mechanically robust and easy to calibrate. Associated with this architecture is a fast and accurate technique for disambiguating the composite video image consisting of the superimposition of multiple sub-images. We demonstrated the increase of FOVs in a realistic setting by physically generating a composite video from a single scene using an optical system employing a beamsplitter and a movable mirror and successfully disambiguating the video. Without prior knowledge of the contents of the scene, our approach was able to disambiguate the two sub-images, successfully capturing both large-scale features and fine details in each sub-image. Additionally, we improved upon our previous reconstruction approach by allowing each sub-image to have slowly changing components, carefully exploiting correlations between sequential video frames. Simulation results demonstrate that our optimization approach can reconstruct the constituent images and the moving components with small mean square errors, and that the errors improve by solving for multiple frames simultaneously.

Acknowledgments

The authors would like to thank Les Todd, assistant director of Duke Photography, for allowing the use of the "Duke Earth Day" photograph in our physical experiments. The authors were partially supported by DARPA Contract No. HR0011-04-C-0111, ONR Grant No. N00014-06-1-0610, and DARPA Contract No. HR0011-06-C-0109.
