Vision Algorithms: Theory and Practice -- Abstracts


Reconstruction, Structure and Motion I


Generalized Voxel Coloring
W. Bruce Culbertson, Thomas Malzbender, Greg Slabaugh (Hewlett-Packard, Palo Alto)

Image-based reconstruction from randomly scattered views is a challenging problem. We present a new algorithm that extends Seitz and Dyer's Voxel Coloring algorithm. Unlike their algorithm, ours can use images from arbitrary camera locations. The key problem in this class of algorithms is that of identifying the images from which a voxel is visible. Unlike Kutulakos and Seitz's Space Carving technique, our algorithm solves this problem exactly and the resulting reconstructions yield better results in our application, which is synthesizing new views. One variation of our algorithm minimizes color consistency comparisons; another uses less memory and can be accelerated with graphics hardware. We present efficiency measurements and, for comparison, we present images synthesized using our algorithm and Space Carving.
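A minimal sketch of the color-consistency test at the core of this family of algorithms, in Python (the data layout, the per-channel threshold, and the visible_pixels helper are our own illustrative assumptions, not the authors' implementation):

    import numpy as np

    def consistent(pixel_colors, threshold=10.0):
        # pixel_colors: (n, 3) array of RGB samples, one per image in
        # which the voxel is visible. The voxel is kept only if those
        # samples agree; the threshold is a free parameter.
        if len(pixel_colors) < 2:
            return True  # seen by too few images to refute
        return np.all(np.std(pixel_colors, axis=0) < threshold)

    def carve(voxels, visible_pixels):
        # One carving sweep: discard color-inconsistent voxels.
        # visible_pixels(v) must return the colors of the pixels from
        # which voxel v is visible -- computing this visibility exactly
        # is the key problem the paper addresses.
        return [v for v in voxels if consistent(visible_pixels(v))]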

Keywords: scene reconstruction, image-based rendering and modeling, multi-view analysis and modeling, new view synthesis, voxel coloring, space carving


Direct Recovery of Planar-Parallax from Multiple Frames
Michal Irani, P. Anandan, Meir Cohen (The Weizmann Inst, Israel and Microsoft Research)

In this paper we present an algorithm that estimates dense planar-parallax motion from multiple uncalibrated views of a 3D scene. This generalizes the ``plane + parallax'' recovery methods to more than two frames. Our algorithm uses the exact equations relating the parallax motion of pixels across multiple frames (relative to a planar surface) to the 3D scene structure and camera epipoles. The parallax field and the epipoles are estimated directly from image brightness variations across multiple frames (without pre-computing correspondences). Our algorithm thus fits within the larger family of {\em direct methods} for recovering structure from motion (SFM). However, so far the implementations of all these methods have relied on the {\em instantaneous approximation} to image motion. By using {\em exact equations}, our algorithm extends this family of methods to a wider class of problems and scenarios.
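For reference, the two-frame ``plane + parallax'' decomposition that the paper generalizes to multiple frames is commonly written as (our notation, assuming a nonzero forward translation $t_z$):

\[ \mathbf{u} = \mathbf{u}_\pi + \boldsymbol{\mu}, \qquad \boldsymbol{\mu} = \gamma\,\frac{t_z}{d_\pi}\,(\mathbf{p}_w - \mathbf{e}), \qquad \gamma = \frac{H}{Z}, \]

where $\mathbf{u}_\pi$ is the image motion induced by the homography of the reference plane, $\mathbf{p}_w$ is the image point warped by that homography, $\mathbf{e}$ is the epipole, $d_\pi$ is the distance to the plane, and the projective structure $\gamma$ is the ratio of the point's height $H$ above the plane to its depth $Z$.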

Keywords: Direct (gradient-based) methods, Plane+Parallax, Multi-frame analysis, Correspondence estimation, Structure from motion.


Optimization Criteria and Geometric Algorithms for Motion and Structure Estimation
Jana Kosecka, Yi Ma, Shankar Sastry (UC Berkeley)

Recent efforts to study the standard formulation of motion and structure recovery have focused on the sensitivity and robustness of existing techniques. While many cogent observations have been made and verified experimentally, many statements do not hold in general settings, which makes a comparison of existing techniques difficult. With the ultimate goal of clarifying these issues, we study the main aspects of the problem: the choice of objective function and the choice of optimization technique.

We propose a new ``optimal triangulation'' procedure, formulated as a constrained optimization problem, which clearly reveals the relationships among different objective functions, such as ``(normalized) epipolar constraints'', ``reprojection error'' and ``triangulation''. Regardless of the choice of objective function, the optimization problems all inherit the same unknown parameter space, the so-called ``essential manifold''. We extend new optimization techniques on Riemannian manifolds and show that they are directly applicable to optimization problems on the essential manifold.
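As a brief reminder of the underlying geometry (standard notation, not a contribution of the paper): for calibrated views related by rotation $R$ and translation $T$, corresponding image rays $\mathbf{x}_1, \mathbf{x}_2$ satisfy the epipolar constraint

\[ \mathbf{x}_2^{\top} E\, \mathbf{x}_1 = 0, \qquad E = \widehat{T} R, \]

where $\widehat{T}$ is the skew-symmetric matrix with $\widehat{T}\mathbf{v} = T \times \mathbf{v}$. The set of all such matrices $E$ (up to scale) is precisely the essential manifold over which the objective functions above are optimized.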

Using these analytical results, we provide a clear account of the sensitivity and robustness of the proposed linear and nonlinear optimization techniques, and study the analytical and practical equivalence of different objective functions. The geometric characterization of critical points of a function defined on the essential manifold, together with the simulation results, clarifies the difference between the effect of the bas-relief ambiguity and other types of local minima, leading to a consistent interpretation of simulation results over a large range of signal-to-noise ratios and a variety of configurations. Furthermore, we justify the choice of linear techniques for (re)initialization and for detection of incorrect local minima.

Keywords: motion and structure recovery, optimal triangulation, essential manifold, geometric optimization, sensitivity and robustness issues


Structure and Motion II


Coordinate-frame Independence in Optimization Algorithms for 3D Vision
Philip McLauchlan (University of Surrey, Guildford, UK)

We attack the problem of coordinate frame dependence and gauge freedoms in structure from motion. We formulate a bundle adjustment algorithm whose results are independent of both the coordinate frame chosen to represent the scene and the ordering of the images. This method is more efficient than existing approaches to the problem in photogrammetry.

We demonstrate that to achieve coordinate-frame-independent results, (i) rotations should be represented by quaternions or local rotation parameters, not angles, and (ii) the translation vector describing the camera/scene motion should be represented in scene 3D coordinates, not camera 3D coordinates; these two representations are normally treated as interchangeable. The algorithm allows 3D point and line features to be reconstructed. Implementation is via the efficient recursive partitioning algorithm common in photogrammetry.
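A minimal sketch of the kind of local rotation update advocated in (i), assuming a quaternion representation (illustrative Python, not the paper's code):

    import numpy as np

    def quat_mul(q, r):
        # Hamilton product q*r, quaternions stored as (w, x, y, z).
        w1, x1, y1, z1 = q
        w2, x2, y2, z2 = r
        return np.array([
            w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2])

    def apply_local_update(q, w):
        # Update rotation q by small local parameters w (an axis-angle
        # 3-vector): q <- exp(w/2) * q, then renormalize. This avoids
        # the singularities that angle parametrizations suffer from.
        theta = np.linalg.norm(w)
        if theta < 1e-12:
            dq = np.array([1.0, 0.0, 0.0, 0.0])
        else:
            axis = w / theta
            dq = np.concatenate([[np.cos(theta/2)],
                                 np.sin(theta/2) * axis])
        q_new = quat_mul(dq, q)
        return q_new / np.linalg.norm(q_new)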

Keywords: structure from motion, image sequence analysis, 3D vision


Error Characterization of the Factorization Approach to Shape and Motion Recovery
Zhaohui Sun, Visvanathan Ramesh and A. Murat Tekalp (Siemens Corporate Research, Princeton, NJ)

This paper focuses on error characterization of the factorization approach to shape and motion recovery from image sequences, based on matrix perturbation theory. Given the 2-D projections of a set of points across multiple image frames and small perturbations of the image coordinates, first-order perturbation and covariance matrices for 3-D affine/Euclidean shape and motion are derived and validated against the ground truth. Propagating the small perturbation and the covariance matrix provides a better understanding of the factorization approach and its results, gives error sensitivity information for 3-D affine/Euclidean shape and motion subject to small image errors, and suggests ways for potential performance improvement. Experimental results are presented to support the analysis and show how the error analysis and error measures can be used.
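The starting point of such an analysis can be written schematically as follows (standard factorization notation; the paper carries the propagation through to full covariance matrices):

\[ W = M S, \qquad (M + \delta M)(S + \delta S) = W + \delta W \;\Rightarrow\; \delta W \approx \delta M\, S + M\, \delta S \]

to first order, where $W$ is the $2F \times P$ measurement matrix of $P$ points tracked over $F$ frames, $M$ is the motion factor and $S$ the shape factor; matrix perturbation theory then expresses $\delta M$ and $\delta S$, and hence their covariances, linearly in the image perturbation $\delta W$.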

Keywords: Performance evaluation and analysis, shape representation and recovery, structure from motion, the factorization approach


Uncertainty Modeling for Optimal Structure from Motion
Daniel D. Morris, Kenichi Kanatani, Takeo Kanade (Carnegie Mellon and Gunma Universities)

The parameters estimated by Structure from Motion contain inherent indeterminacies. In particular, shape and motion parameters are only recovered up to a similarity transformation. Past work on uncertainty modeling has imposed coordinate and scale constraints to eliminate these indeterminacies. In this paper our approach is instead to develop a covariance-based uncertainty representation that includes a model of these indeterminacies without the need to apply constraints. To do this we define the normal form for a covariance matrix which captures the essential geometric uncertainty in the parameters along with the indeterminacies. The indeterminacies imply that Structure from Motion over-parametrizes the problem and that the parameters do not have direct physical meaning. We show how to obtain physically meaningful quantities and how to obtain their covariances by transforming our normal covariance matrix.
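Schematically (our paraphrase of the idea, not the paper's exact notation): if $\mathbf{g}_1, \dots, \mathbf{g}_7$ span the tangent directions of the similarity-transformation indeterminacy (rotation, translation, scale) in parameter space, the normal form is the covariance matrix whose null space is exactly that gauge subspace,

\[ \Sigma_{\mathrm{normal}}\, \mathbf{g}_k = 0, \qquad k = 1, \dots, 7, \]

so the essential geometric uncertainty is represented without fixing a coordinate frame or scale; covariances of physically meaningful quantities then follow by transforming $\Sigma_{\mathrm{normal}}$ through the Jacobians of those quantities.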

Keywords: Uncertainty modeling, Indeterminacy, Computer Vision, Maximum Likelihood, Invariants


Technical Review #1


Bundle Adjustment for Structure and Motion
Andrew Fitzgibbon, Richard Hartley, Philip McLauchlan and Bill Triggs

The recovery of camera positions and scene geometry from multiple images is a subject of much current interest in computer vision. Many applications require, as a basis, accurate and reliable estimates of structure and motion. However, best practice is not widely agreed, and the broad photogrammetry literature is often overlooked.

This technical briefing considers the most widely used method, nonlinear minimization of a statistically motivated image-based error function, known in photogrammetry as ``(Ray) Bundle Adjustment''. We define the problem, discussing the relative strengths and weaknesses of bundle and other methods. This includes techniques that are often proposed as alternatives but are in fact special cases or suboptimal variants.
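In its most common form (a standard formulation, summarized here for reference), bundle adjustment is the nonlinear least-squares problem

\[ \min_{\{P_i\},\{X_j\}} \sum_{i,j} \rho\big( \| \mathbf{x}_{ij} - \pi(P_i, X_j) \|_{\Sigma_{ij}^{-1}}^2 \big), \]

where $\pi(P_i, X_j)$ is the predicted projection of 3D point $X_j$ in camera $P_i$, $\mathbf{x}_{ij}$ is the measured image point with covariance $\Sigma_{ij}$, and $\rho$ is an optional robust kernel. Each Levenberg-Marquardt step solves the normal equations $(J^{\top} \Sigma^{-1} J + \lambda I)\,\delta = -J^{\top} \Sigma^{-1} \mathbf{r}$, whose sparse block structure is what makes large problems tractable.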

Topics covered include:

The session will conclude with recommended best practices for several common situations.

Error Modelling, Correspondence, Tracking


Bootstrapping a Heteroscedastic Regression Model with Application to 3D Rigid Motion Evaluation
Bogdan Matei and Peter Meer (CAIP Center, Piscataway, NJ)

The bootstrap is a numerical technique, with solid theoretical foundations, for obtaining statistical measures of the quality of an estimate using only the available data. Performance assessment through bootstrap provides the same or better accuracy than the traditional error propagation approach, most often without requiring tedious analytical derivations. Many computer vision tasks require solving a regression problem in which the measurement errors are point dependent. Such regression problems are called heteroscedastic and appear in the linearization of the quadratic forms encountered in ellipse fitting and epipolar geometry, in camera calibration, and in 3D rigid motion estimation. The performance of these complex vision tasks is difficult to evaluate analytically; we therefore propose the use of bootstrap. For the results derived by bootstrap to be valid, the data must be independent and identically distributed (i.i.d.). Since in heteroscedastic regression the data are not i.i.d., we first derive a noise process with an estimator that takes the heteroscedasticity into account, and then satisfy the i.i.d. condition by a whiten-color cycle. The proposed technique is illustrated for 3D rigid motion estimation. In real applications the 3D data is usually extracted from stereo images, and the geometry of the cameras, combined with the error in the image points, yields an uncertainty in the reconstructed 3D measurements that is much larger along the depth direction and increases with the distance from the cameras. Experiments with real and synthetic data show, besides the importance of taking the heteroscedasticity into account, the validity of bootstrap as an evaluation tool.
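A minimal sketch of a residual bootstrap in Python (illustrative only; the heteroscedastic estimator, the whitening/coloring of the residuals, and the resample_fit helper are problem-specific assumptions, not the authors' implementation):

    import numpy as np

    def bootstrap_covariance(data, residuals, resample_fit, B=500, rng=None):
        # Generic residual bootstrap of an estimator's covariance.
        # residuals: i.i.d. residuals -- here assumed already whitened,
        # as the paper's whiten-color cycle requires.
        # resample_fit(data, res): re-runs the estimator on data
        # perturbed by resampled residuals; returns a parameter vector.
        rng = rng or np.random.default_rng()
        boot = []
        for _ in range(B):
            res_star = rng.choice(residuals, size=len(residuals),
                                  replace=True)
            boot.append(resample_fit(data, res_star))
        boot = np.asarray(boot)
        # Spread of the B re-estimates approximates the covariance
        # of the original estimate.
        return np.cov(boot, rowvar=False)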

Keywords: Performance Evaluation, Bias Reduction, Motion Analysis, Bootstrap, Heteroscedastic Errors-in-Variables Regression


Characterizing the Performance of Multiple-image Point-correspondence Algorithms using Self-Consistency
Yvan G. Leclerc, Q.-Tuan Luong, Pascal Fua (SRI International)

A new approach to characterizing the performance of point-correspondence algorithms is presented. Instead of relying on any ``ground truth'', it uses the self-consistency of the outputs of an algorithm independently applied to different sets of views of a static scene. It allows one to evaluate algorithms for a given class of scenes, as well as to estimate the accuracy of every element of the output of the algorithm for a given set of views. Experiments demonstrating the usefulness of the methodology are presented.

Keywords: performance characterization, correspondence algorithms, stereo


A General Method for Feature Matching and Model Extraction
Clark F. Olson (Jet Propulsion Laboratory, Pasadena)

Popular algorithms for feature matching and model extraction fall into two broad categories: generate-and-test and Hough transform variations. However, both methods suffer from significant problems in practical implementations. Generate-and-test methods are sensitive to noise in the data; they often fail when the generated model fit is poor due to error in the selected features. Hough transform variations are somewhat less sensitive to noise, but implementations for complex problems suffer from large time and space requirements and from the detection of false positives. This paper describes a general method for solving problems in which a model is extracted from or fit to data; it draws benefits from both generate-and-test methods and those based on the Hough transform, yielding a method superior to both. The main component of the method is the subdivision of the problem into many subproblems. This allows efficient generate-and-test techniques to be used, including the use of randomization to limit the number of subproblems that must be examined. However, the subproblems are solved using pose-space analysis techniques similar to the Hough transform, which lowers the sensitivity of the method to noise. This strategy is easy to implement and results in practical algorithms that are efficient and robust. We apply the method to object recognition, geometric primitive extraction, robust regression, and motion segmentation.
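A schematic of the randomized-subproblem strategy in Python (our own sketch; solve_pose_space stands in for the Hough-like pose-space analysis of each subproblem and is a hypothetical helper, not the paper's code):

    import random

    def match_model(features, n_subproblems, subproblem_size,
                    solve_pose_space):
        # Randomly subdivide matching into small subproblems, as in
        # generate-and-test, but solve each one with a Hough-like
        # pose-space analysis, which is less sensitive to noise.
        # Randomization limits how many subproblems must be examined.
        best, best_score = None, float("-inf")
        for _ in range(n_subproblems):
            subset = random.sample(features, subproblem_size)
            model, score = solve_pose_space(subset, features)
            if score > best_score:
                best, best_score = model, score
        return best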

Keywords: Feature matching, object recognition, robust regression, motion segmentation, Hough transform, generate-and-test


A Sampling Algorithm for Tracking Multiple Objects
Hai Tao, Harpreet S. Sawhney, Rakesh Kumar (Princeton and Sarnoff Corporation)

Tracking multiple objects moving in cluttered scenes is an important yet difficult research issue in computer vision. One problem is the possible addition, deletion, and occlusion of objects. The recently proposed CONDENSATION algorithm and its variants enable the estimation of arbitrary multi-modal posterior distributions that potentially represent multiple tracked objects. However, the specific state representation adopted in the earlier work does not explicitly support counting, addition, deletion, and occlusion of objects. Furthermore, the representation may increasingly bias the posterior density estimates towards objects with dominant likelihoods as the estimation progresses over many frames.

In this paper, a novel formulation and an associated CONDENSATION-like sampling algorithm are proposed for the detection and tracking of multiple objects that explicitly support counting, addition, and deletion of objects. We represent all objects in an image, together with their transformation parameters, as an object configuration. The a posteriori distribution of all possible configurations is explored and maintained using sampling techniques. The dynamics of configurations allow the addition and deletion of objects. With the help of domain information, occlusions can also be handled. Furthermore, in order to deal with the potentially high-dimensional sample space of configurations, an efficient hierarchical algorithm is proposed that decomposes the sampling process into two stages. Promising comparative experimental results of CONDENSATION and the new algorithm on both synthetic and real data are presented.
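The flavor of one configuration-sampling iteration can be sketched as follows (highly simplified Python; the configuration dynamics, the likelihood model, and the hierarchical two-stage decomposition are where the paper's actual contribution lies):

    import random

    def condensation_step(particles, weights, dynamics, likelihood, n):
        # One CONDENSATION-like iteration over object *configurations*.
        # Each particle is a configuration: a list of (object, params).
        # 1. Resample configurations in proportion to their weights.
        resampled = random.choices(particles, weights=weights, k=n)
        # 2. Propagate: per-object motion plus birth/death moves,
        #    which is what lets the filter track a varying number
        #    of objects.
        predicted = [dynamics(cfg) for cfg in resampled]
        # 3. Reweight by the image likelihood of each configuration.
        new_weights = [likelihood(cfg) for cfg in predicted]
        total = sum(new_weights) or 1.0
        return predicted, [w / total for w in new_weights]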

Keywords: Flexible tracking, Sampling methods, Problem modelling and parametrization, Efficient algorithm design and data structures, Multiple models, Bias reduction


Invited Talk #1


Computer-Vision for the Post-Production World: Facts and Challenges through the REALViZ Experience
Luc Robert, REALViZ S.A.

In the world of film-making, computer vision can bring a great deal of automation to a variety of labour-intensive tasks. In this presentation, I will first give an overview of the products developed at REALViZ, their algorithmic bases, and their impact on post-production work. I will then discuss what I believe are the main challenges in making vision algorithms actually useful for CG artists, the problems which are considered solved in this domain, and the important open problems. The talk will be illustrated with a number of practical examples and demos.


Technical Review #2


Direct vs. Feature Based Methods in Motion Analysis
P. Anandan, Michal Irani, Philip Torr and Andrew Zisserman

The past 15-20 years have seen the emergence of two classes of techniques for attacking problems in visual motion and multi-view image analysis, such as image alignment, moving object detection, multiple motion analysis, and the recovery of 3D structure from motion or multi-view images. The first class, which may be called "feature-based" analysis, breaks the problem into the following three stages: (i) feature extraction, (ii) feature correspondence estimation and tracking using multi-view relations, and (iii) analysis of the multi-view correspondences for estimating camera pose changes (or motion), 3D structure, plane recovery and alignment, moving object detection, etc. The second class of methods, which may be called "direct" methods, does not make such a separation, but instead works directly with spatio-temporal image brightness (or color) variations to simultaneously recover the scene parameters and establish pixel correspondences across the images.

This technical review session will briefly describe the progress in both classes of methods. We will aim to identify the scenarios and applications for which each method has been successful and where they have had limitations. While superficially these two classes of techniques look very different (e.g. dense vs. sparse shape recovery, small vs. large baseline, constrained vs. unconstrained correspondence estimation), closer scrutiny of the particular techniques under each class reveals that there is a continuum between the two classes. This space of techniques will also be discussed with the aim of shedding light on solutions to practical issues that arise in motion and multi-view image analysis.


Tracking, Representations and Geometry


Real-time tracking of complex structures for visual servoing
Tom Drummond and Roberto Cipolla (University of Cambridge, UK)

This paper presents a visual servoing system which incorporates a novel three-dimensional model-based tracking system. This tracking system extends constrained active contour tracking techniques into three dimensions, placing them within a Lie algebraic framework. This is combined with modern graphical rendering technology to create a system which can track complex three-dimensional structures in real time at video frame rate (25 Hz). The system is based on an internal CAD model of the object to be tracked, which is rendered using binary space partition trees to perform hidden line removal. The visible features are identified on-line at each frame and are tracked in the video feed. Analytical and statistical edge saliency are then used as a means of increasing the robustness of the tracking system.
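The Lie-algebraic pose update at the heart of such trackers can be written (in standard form, not necessarily the authors' exact notation) as

\[ M \leftarrow \exp\Big( \sum_{i=1}^{6} \alpha_i G_i \Big)\, M, \]

where $M \in SE(3)$ is the current pose, the $G_i$ are the six generator matrices of the Lie algebra of rigid motions, and the coefficients $\alpha_i$ are re-estimated at each frame, typically from the normal distances between rendered model edges and edges measured in the image.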

Keywords: Visual tracking, Visual servoing, Real-time


Point- and Line-Based Parameterized Image Varieties for Image-Based Rendering
Yakup Genc and Jean Ponce (The Beckman Institute, Urbana, IL)

This paper generalizes the parameterized image variety approach to image-based rendering proposed in [ICCV98] so it can handle both points and lines in a unified setting. We show that the set of all images of a rigid set of $m$ points and $n$ lines observed by a weak perspective camera forms a six-dimensional variety embedded in $R^{2(m+n)}$. A parameterization of this variety by the image positions of three reference points is constructed via least squares techniques from point and line correspondences established across a sequence of images. It is used to synthesize new pictures without any explicit 3D model. Preliminary experiments with real image sequences are presented.

Keywords: Image Synthesis, Image-Based Rendering, Motion Analysis, Model Acquisition


Projective Reconstruction from N views having one view in common
Martin Urban, Tomas Pajdla, Vaclav Hlavac (Prague, Czech Republic)

Projective reconstruction recovers 3D points in projective space P^3 from their projections in several 2D images.

There are two basic approaches to projective reconstruction. The first, introduced by Hartley, is based on the estimation of epipoles via matching tensors and the subsequent estimation of projection matrices directly from image data. So far, this approach has been known only for two, three, or four views. The second, presented by Sturm and Triggs, is based on factorization of the so-called joint image matrix. This method computes a projective reconstruction from N views; however, correspondences across all N views are required.

We introduce a method for projective reconstruction from N views. The method is based on the concatenation of trifocal constraints and requires correspondences only across triplets of views. The algorithm has a structure analogous to Hartley's approach and relies on linear estimates only. The method is not symmetrical with respect to the input data: one of the captured images is selected as a reference image and plays a special role. The method can be viewed as a generalization of Hartley's algorithm, or as a particular application of Triggs' closure relations.

The accuracy and stability of the proposed algorithm with respect to pixel errors were tested, and experiments on real data are also presented.

Keywords: Multi-view Geometry, Projective Reconstruction, Matching Constraints and Tensors


Circular Motion Recovery from Image Profiles
P. R. S. Mendonca, K-Y. K. Wong, R. Cipolla (University of Cambridge, UK)

Topic: Multi-camera geometry, pose estimation.

This paper addresses the problem of motion recovery from image profiles in the important case of turntable sequences. No correspondences between points or lines are used. Symmetry properties of surfaces of revolution are exploited to obtain, in a robust and simple way, the image of the rotation axis of the sequence and the homography relating epipolar lines. These, together with geometric constraints for images of rotating objects, are used to obtain the epipoles and, consequently, the full epipolar geometry of the camera system.

This sequential approach (image of the rotation axis, then homography, then epipoles) avoids many of the problems usually found in other algorithms for motion recovery from profiles. In particular, the search for the epipoles, by far the most critical step in the estimation of the epipolar geometry, is carried out as a one-dimensional optimization problem with a smooth, unimodal cost function. The initialization of the parameters is trivial for all three stages of the algorithm. After the estimation of the epipolar geometry, the motion is recovered using the fixed intrinsic parameters of the camera, obtained either from a calibration grid or from self-calibration techniques.

Results from real data are presented, demonstrating the efficiency and practicality of the algorithm.

Keywords: Image profiles, circular motion, epipolar geometry.


Stereo


An Experimental Comparison of Stereo Algorithms
Richard Szeliski (Microsoft) and Ramin Zabih (Cornell)

While many algorithms for computing stereo correspondence have been proposed, there has been very little work on experimentally evaluating algorithm performance, especially using real (rather than synthetic) imagery. In this paper we propose an experimental comparison of several different stereo algorithms. We use real imagery and explore two different methodologies with different strengths and weaknesses. Our first methodology relies upon ground truth; here we make use of a stereo pair from the University of Tsukuba with dense ground truth computed by hand. Our second methodology relies upon the notion of prediction error, which is the ability of a disparity map to predict an (unseen) third image, taken from a known camera position with respect to the input pair. We present preliminary results for both correlation-style stereo methods and techniques based on energy minimization. Our results suggest that the two methodologies give qualitatively consistent results.
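The prediction-error methodology can be sketched in a few lines of Python (illustrative only; the forward warp below ignores occlusions and resampling issues, and the rectified-camera setup and error norm are our own simplifying assumptions):

    import numpy as np

    def prediction_error(left, disparity, third, scale):
        # Score a disparity map by how well it predicts an unseen
        # third view: warp the left image by the disparity, rescaled
        # for the third camera's baseline, then compare.
        # Lower RMS error = better disparity map.
        h, w = left.shape
        predicted = np.zeros_like(left)
        xs = np.arange(w)
        for y in range(h):
            # shift each pixel by its (rescaled) disparity; collisions
            # are resolved arbitrarily in this simplified sketch
            x_new = np.clip((xs + scale * disparity[y]).astype(int),
                            0, w - 1)
            predicted[y, x_new] = left[y, xs]
        return np.sqrt(np.mean((predicted.astype(float) - third) ** 2))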

Keywords: quantitative comparisons; algorithm performance; stereo matching


Invited Talk #2


Annotation of Video by Alignment to Reference Imagery
Keith Hanna, Sarnoff Corporation

Video as an entertainment or information source in consumer, military, and broadcast television applications is widespread. Typically, however, the video is simply presented to the viewer with only minimal manipulation. Examples include chroma-keying (often used in news and weather broadcasts), where specific color components are detected and used to control the video source. In the past few years, increases in computational power and the advent of digital video have meant that more complex manipulations can be performed. In this paper we present some highlights of our work in annotating video by aligning features extracted from the video to a reference set of features.

Video insertion and annotation require manipulation of the video stream to composite synthetic imagery and information with real video imagery. The manipulation may involve only the 2D image space or the 3D scene space. The key problems to be solved are: (i) indexing and matching to determine the location of insertion, (ii) stable and jitter-free tracking to compute the time variation of the camera, and (iii) seamlessly blended insertion for an authentic viewing experience. We highlight our approach to these problems through three example scenarios: (i) 2D synthetic pattern insertion in live video, (ii) annotation of aerial imagery through geo-registration with stored reference imagery and annotations, and (iii) 3D object insertion in video of a 3D scene.

