Introduction
In recent years, there has been an increasing interest in generating
free-viewpoint video sequences from multiple camera views. Apart
from purely image-based approaches, free-viewpoint video can be
computed by extracting geometry and texture information from a set
of concentric views of the same object. Today, the robust generation
and transmission of free-viewpoint video is still a challenging
problem.
Our research on free-viewpoint or 3D video systems is motivated
by our interest in novel immersive projection and acquisition environments
for tele-presence. At ETH, we developed the blue-c (http://blue-c.ethz.ch/),
two networked virtual reality portals, each consisting of a CAVE-like
environment augmented by an array of cameras and an active lighting
system. Thus, blue-c combines the simultaneous acquisition of multiple
video streams with advanced 3D projection technology. Both portals
can be used for 3D video acquisition, and in full operation mode
they enable networked collaborative applications enhanced by 3D
video-conferencing features. Even though most efforts in rendering
and compression of 3D data focus on meshes, we opted for point samples
as the basic primitive in our 3D video representation. A general
approach to dynamic objects must allow for topology changes, yet a
topology change on a 3D mesh is a costly operation that is hard to
achieve in real time. Furthermore, point samples can be considered
a straightforward generalization of 2D video pixels into 3D space.
Concretely, our 3D video system comprises 16 camera nodes, which
acquire images and perform 2D image processing.
The resulting information is streamed to a reconstruction node,
which computes the actual 3D representation of the observed object.
Camera and reconstruction nodes are at the same physical location
and are connected in a local area network. The 3D video data is
then streamed to a rendering node, which, in a real-world tele-presence
application, runs at a remote location.
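The data flow described above can be sketched in a few lines. This is a minimal illustration, not the actual blue-c implementation: the `PointSample` fields and the per-node functions are hypothetical stand-ins for the real 2D processing and reconstruction stages.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical point-sample primitive: a 2D video pixel generalized
# into 3D space, carrying position, surface normal, and color.
@dataclass
class PointSample:
    position: tuple  # (x, y, z) in world coordinates
    normal: tuple    # unit surface normal
    color: tuple     # (r, g, b)


def camera_node(camera_id: int, image) -> dict:
    # A camera node would run 2D image processing here (e.g.
    # silhouette extraction); we only tag the data with its source.
    return {"camera": camera_id, "features": image}


def reconstruction_node(streams: List[dict]) -> List[PointSample]:
    # Placeholder reconstruction: in the real system the incoming 2D
    # information is merged into a point-based 3D model of the object.
    return [
        PointSample((float(s["camera"]), 0.0, 0.0), (0.0, 0.0, 1.0), (128, 128, 128))
        for s in streams
    ]


# 16 camera nodes stream to one reconstruction node; the resulting
# 3D representation would then be streamed to a (remote) renderer.
streams = [camera_node(i, image=None) for i in range(16)]
model = reconstruction_node(streams)
```

The sketch only fixes the topology of the pipeline; the acquisition, reconstruction, and streaming stages are left as stubs.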
For free-viewpoint video, a scene is typically captured by
N cameras. Different camera configurations are possible, e.g. parallel,
convergent, or divergent views; in general, any combination of these
(a multiple-view setting) can be used. From these views a 3D video
object is created which comprises
shape and appearance. The shape can be described by, e.g., polygon
meshes, point samples, implicit surfaces, depth images, or layered
depth images. The appearance data is mapped onto the shape and allows
the 3D video object to be seamlessly blended into new 2D or 3D video
content. Appearance is typically described by a series of video
streams, comprising textures and other surface properties. The 3D
video object can be composited into existing content, or it can be
viewed interactively from different directions and under different
illumination. 3D video objects can be used for broadcast, storage
(e.g., DVD), and interactive online applications.
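A 3D video object, as characterized above, pairs a shape description with appearance data. The container below is a hypothetical sketch of that pairing, assuming a point-sample shape per frame and one appearance (texture) stream per camera; meshes, implicit surfaces, or layered depth images would occupy the same shape slot.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical 3D video object: per-frame shape plus per-camera
# appearance streams that are mapped onto the shape at render time.
@dataclass
class VideoObject3D:
    shape_frames: List[list]        # one shape (e.g. a point cloud) per frame
    appearance_streams: List[list]  # one texture stream per camera

    def frame_count(self) -> int:
        return len(self.shape_frames)


# Two frames of shape data, acquired by 16 cameras (empty placeholders).
obj = VideoObject3D(
    shape_frames=[[], []],
    appearance_streams=[[] for _ in range(16)],
)
```

Keeping shape and appearance separate mirrors the description in the text: the appearance streams can be re-mapped onto the shape when the object is blended into new 2D or 3D content or viewed under different conditions.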