  • March 2007 (vol. 8, no. 3), art. no. 0703-o3006 1541-4922 © 2007 IEEE Published by the IEEE Computer Society

    From the Editor: Distributed Multimedia Community Multi-View Video: Get Ready for Next-Generation Television Ishfaq Ahmad University of Texas at Arlington

    Many believe that multi-view video is poised to change how people watch television and that it could become a driving force in interactive multimedia entertainment, for both desktop and mobile environments. An MVV system acquires several video sequences of the same scene simultaneously from more than one angle and transports these streams remotely. Scenes can be displayed interactively, letting the user rotate the view from multiple angles as if it were 3D and enjoy the feeling of being in the scene. Owing to the massive amount of data involved and extensive processing requirements, real-time MVV processing presents research issues that lie at the frontier of video coding, image processing, computer vision, and display technologies. Building a complete end-to-end MVV system also hinges on several additional technologies, such as real-time acquisition, transmission, and display of dynamic scenes that users can view interactively on conventional screens. Several research groups around the world are actively researching MVV.

    Applications

    MVV technology could lead to exciting new applications in areas such as education, medicine, surveillance, communication, and entertainment. It could also lead to a mass-media shake-up and the birth of a new industry, especially in the mobile domain. Furthermore, researchers will also need to examine surround sound with a fresh perspective to accompany the new video style. MVV can also profoundly affect telecommunication, given that telecommunication's ultimate goal is highly effective interpersonal information exchange.

    For instance, media sports coverage technology keeps evolving. In the past, only a few TV channels aired the games that interested people. Now audio and video coverage can be delivered over the Internet or broadcast in HDTV format. Technology has always dazzled sports fans. Instant replays, introduced in the early 1960s, added a new dimension that in-stadium fans couldn't see, and miniature cameras let viewers see what referees see on the field. As MVV technology matures, we can expect a revolution in coverage of sports, including car racing, soccer, football, and basketball. With multiple cameras capturing and broadcasting the scene live to viewers and letting them rotate the viewing angle, sports viewing could become a whole new concept.

    Current videoconferencing systems provide a fixed view of the remote scene, so they don't give you the feeling of being there. Multi-view video could have a broad impact on such systems. One important feature of future communications will be interactivity with stereoscopic and 3D vision, which makes you feel more as if you're present in the scene. In a videoconferencing scenario, participants at different geographical sites could meet virtually and see one another in free viewpoint video or 3DTV style.

    Surveillance and remote monitoring of important sites, such as critical infrastructures, traffic, parking lots, and banks, could also benefit from this technology because it can provide coverage of very large areas from multiple angles. Other potential application areas include entertainment (such as concerts, multiuser games, and movies), education (such as digital libraries and archives, training and instruction manuals with real video, and surgeon training), culture (such as zoos, aquariums, and museums), and archiving (such as scientific archives, national treasures, and traditional entertainment).

    IEEE Distributed Systems Online (vol. 8, no. 3), art. no. 0703-o3006 1

  • Research issues

    An MVV system consists of components for data acquisition, compression, and delivery. The acquisition component captures videos from multiple cameras and obtains the acquisition parameters. The processing part analyzes the acquired data, extracts features from it, and compresses it for delivery and storage. On the receiving side, decoding and display devices reconstruct the view in either two or three dimensions, depending on the device's capabilities.
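    The acquisition-processing-delivery pipeline described above can be sketched in skeleton form as follows. All names and data formats here are illustrative placeholders, not part of any real MVV system; the encoder is a stub where a real coder would exploit inter-view redundancy:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    camera_id: int
    timestamp: float   # capture time in seconds
    data: bytes        # raw pixel data (placeholder)

def acquire(num_cameras: int, t: float) -> list:
    """Acquisition: one frame per camera, captured at the same instant."""
    return [Frame(c, t, b"") for c in range(num_cameras)]

def compress(frames: list) -> bytes:
    """Processing: jointly encode all views (stub; a real coder would
    remove inter-view redundancy here before delivery or storage)."""
    return b"|".join(f.data for f in frames)

def decode_and_display(bitstream: bytes, view_angle: float) -> int:
    """Receiver: reconstruct the views; rendering for the requested
    angle on a 2D or 3D display is stubbed out."""
    views = bitstream.split(b"|")
    return len(views)

frames = acquire(num_cameras=8, t=0.0)
bitstream = compress(frames)
```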

    Video acquisition and representation

    For MVV content generation, numerous scene-acquisition methods are possible. The scene-modeling and real-time processing requirements and the available bandwidth for video transmission determine the variation in the number, type, and placement of cameras. For instance, for model-based representation, good-quality 3D video can be rendered using the input from only a limited number of cameras. Image-based correspondence techniques, however, might require a large number of input streams but little processing. Some video-acquisition schemes require static background capture before introducing the scene's dynamic parts.

    Estimating the setup's extrinsic and intrinsic parameters might require camera calibration. You can classify acquisition setups on the basis of camera placement geometry, camera type (stationary or moving), distance from the objects of interest, and synchrony of video acquisition. Other parameters, such as intrinsic parameters of different camera types, also distinguish different setups. On the basis of the acquisition system setup, MVV scenarios fall into different categories. The camera configuration can be parallel,1 convergent, or a combination of both.2 Convergent configurations are generally used with model-based representations of the dynamic scenes captured.3 Other capturing systems also exist.4-7
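    To illustrate how the intrinsic and extrinsic parameters mentioned above work together, the sketch below projects a 3D world point into a camera's image plane under the standard pinhole model. The matrix values are arbitrary examples, not from the article:

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def project(point_w, K, R, t):
    """Project a 3D world point to pixel coordinates.

    K    -- 3x3 intrinsic matrix (focal lengths, principal point)
    R, t -- extrinsic rotation (3x3) and translation (3-vector)
    """
    # World -> camera coordinates: Xc = R * Xw + t
    pc = [a + b for a, b in zip(mat_vec(R, point_w), t)]
    # Camera -> image: x = K * Xc, then divide by depth
    x = mat_vec(K, pc)
    return (x[0] / x[2], x[1] / x[2])

# Example: identity rotation, camera at origin,
# focal length 800 pixels, principal point (320, 240)
K = [[800, 0, 320], [0, 800, 240], [0, 0, 1]]
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0, 0, 0]
u, v = project([0.5, 0.25, 2.0], K, R, t)
# A point 2 m in front of the camera lands at pixel (520.0, 340.0)
```

    Calibration is the inverse problem: recovering K, R, and t for each camera from observations of a known pattern.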

    In an MVV system, the video streams must be synchronized to ensure that all the cameras' shutters open at the same instant when they're sampling the scene from different angles.3,8 Video captured from different cameras is used together with timing information to create novel views in multi-view video. The input from the cameras can be synchronized using external sources, such as a light flash at periodic intervals.4 External synchronization can slow down the frame rate considerably.
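    A software-side complement to hardware synchronization is to pair frames across streams by timestamp. The sketch below (a hypothetical helper, not from the article) matches each frame of a reference camera with the closest-in-time frame from a second camera, discarding pairs that differ by more than a tolerance:

```python
def align_streams(ref_times, other_times, tolerance=0.005):
    """For each reference timestamp, find the closest timestamp in the
    other stream; drop pairs further apart than `tolerance` seconds."""
    pairs = []
    for t in ref_times:
        closest = min(other_times, key=lambda s: abs(s - t))
        if abs(closest - t) <= tolerance:
            pairs.append((t, closest))
    return pairs

# Camera B runs about 2 ms behind camera A; all frames still pair up
a = [0.000, 0.040, 0.080]   # 25-fps reference stream
b = [0.002, 0.042, 0.081]
pairs = align_streams(a, b)
# -> [(0.0, 0.002), (0.04, 0.042), (0.08, 0.081)]
```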

    One way of representing multi-view video is to use 2D video plus a disparity map and 3D structure. MPEG-4 multiview coding8 proposed using video streams and a disparity map. Various rendering methods can be used with this scheme on the client side. The blue-c project at ETH (Eidgenössische Technische Hochschule) Zürich has used a 3D hierarchical data point representation.4 It allowed efficient spatial coding into different data streams (tree structure, color, position, and normal information) and temporal coding using update, insert, and delete operators.
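    To make the video-plus-disparity idea concrete, the toy sketch below warps a single 1D scanline toward a virtual viewpoint by shifting each pixel a fraction of its disparity. This is a deliberately simplified version of depth-image-based rendering; real renderers also handle occlusion ordering and fill the holes the warp leaves behind:

```python
def render_view(scanline, disparity, alpha):
    """Warp one scanline toward a virtual viewpoint.

    alpha=0 reproduces the source view; alpha=1 approximates the
    neighboring view.  Holes left by the warp stay as None.
    """
    out = [None] * len(scanline)
    for x, (value, d) in enumerate(zip(scanline, disparity)):
        target = x + round(alpha * d)
        if 0 <= target < len(out):
            out[target] = value
    return out

line = [10, 20, 30, 40]
disp = [2, 2, 0, 0]   # the left two pixels are closer, so they shift more
mid = render_view(line, disp, 0.5)
# -> [None, 10, 30, 40]: a hole opens at position 0, and pixel 30
# overwrites 20 (a real renderer would resolve this by depth order)
```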

    Multi-view video processing

    MVV compression involves more than just independently compressing multiple streams; the compressed data must also retain the inter-view information without which scene reconstruction wouldn't be possible. Traditional 2D video-coding standards, such as MPEG and H.26x, exploit the human eye's characteristics, including its sensitivity to color. 2D video coding also takes advantage of the motion as well as the spatial and statistical redundancies in video data. In general, MVV is reconstructed from multiple 2D video sequences. More than one view's video sequence must be transmitted or stored, leading to a massive amount of data.

    MVV compression algorithms should reduce redundancy in information from multiple views as much as possible to provide a high degree of compression, subject to distortion and resource constraints. The redundancy in MVV streams consists of

    - intraframe redundancy (spatial): intraframe prediction coding;
    - interframe redundancy (temporal): motion-compensated prediction coding;
    - inter-view redundancy (geometrical): disparity-compensated prediction coding;
    - transform redundancy (frequency): DCT (discrete cosine transform) or wavelet transform coding; and
    - redundancy of the human visual system: scalable coding.
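    Disparity-compensated prediction, which targets the inter-view redundancy listed above, works much like motion compensation: for each block in one view, the encoder searches the neighboring view for the best-matching block and codes only the offset plus the residual. A minimal 1D block-matching sketch (illustrative only, using sum of absolute differences as the match cost):

```python
def best_disparity(block, reference, max_disp=8):
    """Find the horizontal offset in `reference` that best predicts
    `block`, minimizing the sum of absolute differences (SAD)."""
    n = len(block)
    best = (None, float("inf"))
    for d in range(0, min(max_disp, len(reference) - n) + 1):
        sad = sum(abs(b - r) for b, r in zip(block, reference[d:d + n]))
        if sad < best[1]:
            best = (d, sad)
    return best

right_view = [0, 0, 50, 60, 70, 0, 0, 0]
left_block = [50, 60, 70]   # the same object appears shifted by 2 pixels
d, sad = best_disparity(left_block, right_view)
# -> offset 2 with zero residual energy, so only the offset need be coded
```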


  • 3D video compression has the following additional requirements:

    Visual quality. Decompressed data should provide good visual quality. Criteria include subjective quality (that is, how it looks to the human visual system), objective quality, and quality consistency among views (that is, the data should provide perceptually similar visual quality over different views that will be presented in the same time frame).

    Synthesizability for reconstructed video. Decompressed data should support robust generation of a virtual or interpolated view. So, camera calibration information and the depth/disparity map should be compressed along with view data.

    Compatibility. The scheme should be compatible with current and future video standards.1

    Low delay. The compression algorithms should provide low delay for real-time applications. Such delays include encoding and decoding delays, view change delays, and end-to-end delay.

    Camera motion. The scheme should support encoding of video sequences subject to camera motion.

    Scalability. This includes signal-to-noise ratio scalability, spatial scalability, temporal scalability, complexity scalability, view scalability, and scalability on a multitude of terminals and under different network conditions.

    Networking and transportation

    Delivering MVV to end users will pose serious networking challenges, involving protocols, quality of service, channel-delay management, and error concealment and recovery. Depending on their environments and requirements, MVV systems can be built on different architectures (see figure 1).


  • Figure 1. Various multi-view video system architectures: (a) distributed-acquisition and distributed-viewers model (DADV); (b) local-acquisition and local-viewers model (LALV, or Saito's model);6 (c) distributed-acquisition and local-viewers model (DALV, or Heinrich-Hertz Institute model);9 (d) local-acquisition and distributed-viewers model (LADV, or University of Central Florida model).10

    Projects

    Because multi-view video is a new and widely applicable research area with a broad range of open problems, numerous related research efforts are under way worldwide. In Europe, the Digital Stereoscopic Imaging and Application (DISTIMA) project addressed the production, presentation, coding, and transmission of digital stereoscopic video signals over integrated broadband communications networks. Another European research project, the Package for New Operational Autostereoscopic Multiview System (PANORAMA), aimed to facilitate the hardware and software development of an MVV autostereoscopic telecommunication system. The Advanced Three-dimensional Television System Technology (ATTEST) project aims to design an entire 3D-video chain, including content creation, coding, transmission, and display. Mitsubishi Electric Research Laboratories, Carnegie Mellon University's computer vision lab, Kyoto University, the Heinrich Hertz Institute in Germany, and the blue-c project are pursuing similar endeavors.


  • Outlook

    A workgroup of the International Organization for Standardization's Moving Picture Experts Group (MPEG) has been exploring 3D audiovisual technology. The 3DAV group has discussed various applications and technologies in relation to the term multi-view video. A multi-view profile (MVP) is available in the MPEG-2 standard, defined in 1996 as an amendment for stereoscopic TV. The MVP extends the well-known hybrid coding toward exploitation of inter-view/channel redundancies by implicitly defining disparity-compensated prediction; however, it doesn't support interactivity. MPEG-4 version 2 includes the Multiple Auxiliary Component (MAC), defined in 2001. MAC's basic idea is that grayscale shape can be used not only to describe a video object's transparency but also in a more general way. MACs are defined for a video object plane on a pixel-by-pixel basis and contain data related to the video object, such as disparity, depth, and additional texture. Since 2003, MPEG has also accelerated its work on MVV coding standards. The Multiview Video Coding (MVC) initiative has passed MPEG's call-for-proposals stage. The proposals were based on the H.264/AVC video coding standard, so MVC is currently being developed and standardized as an extension of that standard in an ad hoc group on MVC within the Joint Video Team (JVT). For more information on the MPEG standardization efforts, see www.chiariglione.org/mpeg/working_documents.htm.

    MVV-based products are expected to appear in two to three years. Watching television passively might soon be a thing of the past.

    References

    1. W. Matusik and H. Pfister, "3D TV: A Scalable System for Real-Time Acquisition, Transmission, and Autostereoscopic Display of Dynamic Scenes," ACM Trans. Graphics, vol. 23, no. 3, 2004, pp. 814-824.

    2. T. Kanade, H. Saito, and S. Vedula, "The 3D Room: Digitizing Time-Varying 3D Events by Synchronized Multiple Video Streams," tech. report CMU-RI-TR-98-34, Carnegie Mellon Univ., 1998.

    3. T. Kanade, P.W. Rander, and P.J. Narayanan, "Virtualized Reality: Constructing Virtual Worlds from Real Scenes," IEEE Multimedia, vol. 4, no. 1, 1997, pp. 34-47.

    4. M. Gross et al., "blue-c: A Spatially Immersive Display and 3D Video Portal for Telepresence," Proc. ACM Int'l Conf. Computer Graphics and Interactive Techniques (SIGGRAPH '03), ACM Press, 2003, pp. 819-827.

    5. S. Moezzi et al., "Immersive Video," Proc. Virtual Reality Ann. Int'l Symp. (VRAIS '96), IEEE CS Press, 1996, pp. 17-24.

    6. H. Saito, S. Baba, and T. Kanade, "Appearance-Based Virtual View Generation from Multicamera Videos Captured in the 3-D Room," IEEE Trans. Multimedia, vol. 5, no. 3, 2003, pp. 303-316.

    7. A. Smolic and D. McCutchen, "3DAV Exploration of Video-Based Rendering Technology in MPEG," IEEE Trans. Circuits and Systems for Video Technology, vol. 14, no. 3, 2004, pp. 348-356.

    8. H. Cheng, "Temporal Registration of Video Sequences," Proc. 2003 IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 3, IEEE Press, 2003, pp. III-489-III-492.

    9. K. Mueller et al., "Coding of 3D Meshes and Video Textures for 3D Video Objects," Proc. Picture Coding Symp. (PCS '04), 2004.

    10. O. Javed and M. Shah, "Tracking and Object Classification for Automated Surveillance," Proc. 7th European Conf. Computer Vision (ECCV '02), Springer, 2002, pp. 343-357.

    Ishfaq Ahmad is a professor of computer science and engineering at the University of Texas at Arlington. Contact him at [email protected].


  • Related Links

    - DS Online's Distributed Multimedia Community
    - "Automated Visual Surveillance in Realistic Scenarios," IEEE Multimedia
    - "TAVERNS: Visualization and Manipulation of GIS Data in 3D Large Screen Immersive Environments," Proc. ICAT '06

    Cite this article: Ishfaq Ahmad, "Multiview Video: Get Ready for Next-Generation Television," IEEE Distributed Systems Online, vol. 8, no. 3, 2007, art. no. 0703-o3006.
