SURFACES WITH OCCLUSIONS
FROM LAYERED STEREO
a dissertation
submitted to the department of computer science
and the committee on graduate studies
of stanford university
in partial fulfillment of the requirements
for the degree of
doctor of philosophy
Michael H. Lin
December 2002
© Copyright by Michael H. Lin 2003
All Rights Reserved
I certify that I have read this dissertation and that, in
my opinion, it is fully adequate in scope and quality as a
dissertation for the degree of Doctor of Philosophy.
Carlo Tomasi (Principal Advisor)
I certify that I have read this dissertation and that, in
my opinion, it is fully adequate in scope and quality as a
dissertation for the degree of Doctor of Philosophy.
Christoph Bregler
I certify that I have read this dissertation and that, in
my opinion, it is fully adequate in scope and quality as a
dissertation for the degree of Doctor of Philosophy.
Dwight Nishimura (Electrical Engineering)
Approved for the University Committee on Graduate
Studies:
Abstract
Stereo, or the determination of 3D structure from multiple 2D images of a scene,
is one of the fundamental problems of computer vision. Although steady progress
has been made in recent algorithms, producing accurate results in the neighborhood
of depth discontinuities remains a challenge. Moreover, among the techniques that
best localize depth discontinuities, it is common to work only with a discrete set of
disparity values, hindering the modeling of smooth, non-fronto-parallel surfaces.
This dissertation proposes a three-axis categorization of binocular stereo algo-
rithms according to their modeling of smooth surfaces, depth discontinuities, and
occlusion regions, and describes a new algorithm that simultaneously lies in the most
accurate category along each axis. To the author’s knowledge, it is the first such
algorithm for binocular stereo.
The proposed method estimates scene structure as a collection of smooth surface
patches. The disparities within each patch are modeled by a continuous-valued spline,
while the extent of each patch is represented via a labeled, pixelwise segmentation of
the source images. Disparities and extents are alternately estimated by surface fitting
and graph cuts, respectively, in an iterative, energy minimization framework. Input
images are treated symmetrically, and occlusions are addressed explicitly. Boundary
localization is aided by image gradients.
Qualitative and quantitative experimental results are presented, which demon-
strate that, for scenes consisting of smooth surfaces, the proposed algorithm signifi-
cantly improves upon the state of the art, more accurately localizing both the depth
of surface interiors and the position of surface boundaries. Finally, limitations of the
proposed method are discussed, and directions for future research are suggested.
Acknowledgements
I would like to thank my advisor, Professor Carlo Tomasi, for bestowing upon
me his unwavering support, guidance, generosity, patience, and wisdom, even during
unfortunate circumstances of his own.
I would also like to thank the members of my reading and orals committees—
Professors Chris Bregler, Dwight Nishimura, Amnon Shashua, Robert Gray, and
Steve Rock—for their thought-provoking questions and encouraging comments on
my work.
I am indebted to Stan Birchfield, whose research and whose prophetic words led
me to the subject of this dissertation.
Thanks also to Burak Gokturk, Hector Gonzalez-Banos, and Mark Ruzon for their
assistance in the preparation of my thesis defense, and to the other members of the
Stanford Vision Lab and Robotics Lab.
I am grateful to Daniel Scharstein and Olga Veksler, whose constructive critiquing
of preliminary versions of this manuscript helped to shape its final form.
Finally, I would like to thank my family and friends, and especially Lily Kao, who
have made the process of completing this work much more pleasant.
Contents
Abstract
Acknowledgements
1 Introduction
1.1 Foundations of Stereo
1.2 Solving the Correspondence Problem
1.3 Surface Interiors vs. Boundaries
1.4 A Categorization of Stereo Algorithms
1.4.1 Continuity
1.4.2 Discontinuity
1.4.3 Uniqueness
1.5 A Brief Survey of Stereo Methods
1.5.1 Pointwise Color Matching
1.5.2 Windowed Correlation
1.5.3 Regularization
1.5.4 Cooperative Methods
1.5.5 Dynamic Programming
1.5.6 Graph-Based Methods
1.5.7 Layered Methods
1.6 Our Proposed Approach
1.7 Outline of Dissertation
2 Preliminaries
2.1 Design Principles
2.2 Mathematical Abstraction
2.3 Desired Properties
2.3.1 Consistency
2.3.2 Smoothness
2.3.3 Non-triviality
2.4 Energy Minimization
3 Surface Fitting
3.1 Defining Surface Smoothness
3.2 Surfaces as 2D Splines
3.3 Surface Non-triviality
3.4 Surface Smoothness
3.5 Surface Consistency
3.6 Surface Optimization
4 Segmentation
4.1 Segmentation by Graph Cuts
4.2 Segmentation Non-triviality
4.3 Segmentation Smoothness
4.4 Segmentation Consistency
4.5 Segmentation Optimization
5 Integration
5.1 Segmentation Consistency, Revisited
5.2 Overall Optimization
5.2.1 Iterative Descent
5.2.2 Merging Surfaces
5.2.3 Initialization
5.2.4 Post-Processing
6 Experimental Results
6.1 Quantitative Evaluation Metric
6.2 Quantitative Results
6.2.1 “Map”
6.2.2 “Venus”
6.2.3 “Sawtooth”
6.2.4 “Tsukuba”
6.3 Qualitative Results
6.3.1 “Cheerios”
6.3.2 “Clorox”
6.3.3 “Umbrella”
7 Discussion and Future Work
7.1 Efficiency
7.2 Theory vs. Practicality
7.3 Generality
A Image Interpretation
A.1 Interpolation
A.2 Certainty
Bibliography
List of Tables
2.1 Contributions to energy
5.1 Our overall optimization algorithm
5.2 Our post-processing algorithm
6.1 Layout for figures of complete results
List of Figures
1.1 The geometry of an ideal pinhole camera
1.2 The geometry of triangulation
1.3 The geometry of the epipolar constraint
1.4 An example of a smooth surface
1.5 An example of discontinuities with occlusion
6.1 “Map” results
6.2 “Map” error distributions
6.3 “Venus” results
6.4 “Venus” error distributions
6.5 “Sawtooth” results
6.6 “Sawtooth” error distributions
6.7 “Tsukuba” results
6.8 “Tsukuba” error distributions
6.9 “Cheerios” results
6.10 “Clorox” results
6.11 “Umbrella” results
Chapter 1
Introduction
Ever since antiquity, people have wondered: How does vision work? How is it
that we “see” a three-dimensional world? For nearly two millennia, the generally-
accepted theory (proposed by many, including Euclid [ca. 325-265 BC] and Ptolemy
[ca. AD 85-165]) was that people’s eyes send out probing rays which “feel” the world.
This notion persisted until the early 17th century, when, in 1604, Kepler published
the first theoretical explanation of the optics of the eye [45], and, in 1625, Scheiner
observed experimentally the existence of images formed at the rear of the eyeball.
Those discoveries emphasized a more focussed question: How does depth perception
work? How is it that, from the two-dimensional images projected on the retina, we
perceive not two, but three dimensions in the world?
With the advent of computers in the mid-20th century, a broader question arose:
how can depth perception be accomplished in general, whether by human physi-
ology or otherwise? Aside from possibly elucidating the mechanism of human depth
perception, a successful implementation of machine depth perception could have
many practical applications, from terrain mapping and industrial automation to
autonomous navigation and real-time human-computer interaction. In a sense, Euclid
and Ptolemy had the right idea: methods using active illumination (including sonar,
radar, and laser rangefinding) can produce extremely accurate depth information and
are relatively easy to implement. Unfortunately, such techniques are invasive and have
limited range, and thus have a restricted application domain; purely passive methods
would be much more generally applicable. So how can passive depth perception be
accomplished? In particular, how can images with only two spatial dimensions yield
information about a third?
Static monocular depth cues (including occlusion, relative size, and vertical
placement within the field of view, but excluding binocular disparity and motion
parallax) are very powerful and often more than sufficient; after all, we are typically
able to perceive depth even when limited to using only one eye from only one view-
point. Inspired by that human ability, much research has investigated how depth
information can be inferred from a single “flat” image (e.g., Roberts [59] in 1963).
However, most such monocular approaches depend heavily upon strong assumptions
about the scene, and for general scenes, such knowledge has proven to be very difficult
to instill in a computer.
Julesz [42, 43] demonstrated in 1960 that humans can perform binocular depth
perception in the absence of any monocular cues. This discovery led to a prolifer-
ation of research on the extraction of 3D depth information from two 2D images
of the same static scene taken from different viewpoints. In this dissertation, we
address this problem of computational binocular stereopsis, or stereo: recovering
three-dimensional structure from two color (or intensity) images of a scene.
1.1 Foundations of Stereo
The foundations for reconstructing a 3D model from multiple 2D images are
correspondence and triangulation. To understand how this works, let us first take a
look at the reverse process, of the formation of 2D images from a 3D scene. Suppose
we have an ideal pinhole camera (Figure 1.1), with center of projection P and image
plane Π. Then the image p of a world point X is located at the intersection of Π with
the line through P and X. Conversely, given the center of projection P, if
we are told that p is the image of some world point X, we know that X is located
somewhere along the ray from P through p.
Figure 1.1: The geometry of an ideal pinhole camera: ray from center of projection P through world point X intersects image plane Π at image point p.
Although this alone is not enough to determine the 3D location of X, if there is
a second camera with center of projection Q, whose image of the same world point
X is q, then we additionally know that X is located somewhere along the ray from
Q through q (Figure 1.2). That is, barring degenerate geometry, the position of
X is precisely the intersection of the rays $\overrightarrow{Pp}$ and $\overrightarrow{Qq}$. This well-known process of
triangulation, or position estimation by the intersection of two back-projected rays, is
what enables precise 3D localization, and thus forms the theoretical basis for stereo.
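As a minimal sketch of image formation and its inverse, the following Python code projects a world point into two pinhole cameras and then triangulates it back. The setup is an assumption made for illustration only: both cameras look along the +z axis, the image plane lies at distance `f` in front of each center of projection, and the function names `project` and `triangulate` are hypothetical. Because real rays rarely intersect exactly, the sketch returns the midpoint of the shortest segment joining the two rays, which coincides with the intersection when one exists.

```python
def project(X, C, f=1.0):
    """3D position of the image of world point X for a pinhole camera centered at C.

    Assumes the camera looks along +z with its image plane at distance f.
    """
    dx, dy, dz = (X[i] - C[i] for i in range(3))
    return (C[0] + f * dx / dz, C[1] + f * dy / dz, C[2] + f)


def triangulate(P, p, Q, q):
    """Midpoint of the shortest segment between the rays P->p and Q->q."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    u = [b - a for a, b in zip(P, p)]  # direction of the ray from P through p
    v = [b - a for a, b in zip(Q, q)]  # direction of the ray from Q through q
    w = [a - b for a, b in zip(P, Q)]  # offset between the two centers
    a, b, c = dot(u, u), dot(u, v), dot(v, v)
    d, e = dot(u, w), dot(v, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("degenerate geometry: the rays are parallel")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray P->p
    t = (a * e - b * d) / denom  # parameter of the closest point on ray Q->q
    on_first = [P[i] + s * u[i] for i in range(3)]
    on_second = [Q[i] + t * v[i] for i in range(3)]
    return [(on_first[i] + on_second[i]) / 2 for i in range(3)]


P, Q = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)  # centers of projection
X = (0.5, 0.2, 4.0)                      # world point
p, q = project(X, P), project(X, Q)      # its two images
X_recovered = triangulate(P, p, Q, q)
```

With exact image points, `X_recovered` agrees with `X` up to floating-point rounding; the interesting behavior appears once `p` or `q` is perturbed.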
However, note that the preceding description of triangulation requires the
positions of two image points p and q which are known to be images of the same world
point X. Moreover, reconstructing the entire scene would require knowing the
positions of all such pairs (p, q) that are images of some pairwise-common world
point. Determining this pairing is the correspondence problem, and is a prerequisite
to performing triangulation. Correspondence is much more difficult than
triangulation, and thus forms the pragmatic basis for stereo.
Figure 1.2: The geometry of triangulation: world point X lies at intersection of rays from centers of projection P, Q through respective image points p, q.
Although stereo is not difficult to understand in theory, it is not easy to solve
in practice. Mathematically, stereo is an ill-posed, inverse problem: its solution
essentially involves inverting a many-to-one transformation (image formation in this
case), and thus is underconstrained. Specifically, triangulation is very sensitive to
input perturbations: small changes in the position of one or both image points can
lead to arbitrarily large changes in the position of the triangulated world point. In
other words, any uncertainty in the result of correspondence can potentially yield
a virtually unbounded uncertainty in the result of triangulation. This difficulty is
further exacerbated by the fact that correspondence is often prone to small errors.
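To make this sensitivity concrete, consider the special case of two identical cameras with parallel optical axes, displaced only sideways. The symbols below are assumptions introduced for illustration and are not notation from this chapter: f is the focal length, B is the distance between the two centers of projection, d is the difference in horizontal image coordinate between the two corresponding image points, and Z is the depth of the triangulated point. Under these standard textbook assumptions, depth and its first-order response to an error δd are:

```latex
Z = \frac{f B}{d},
\qquad
\delta Z \approx \frac{\partial Z}{\partial d}\,\delta d
         = -\frac{f B}{d^{2}}\,\delta d
         = -\frac{Z^{2}}{f B}\,\delta d .
```

For a fixed correspondence error δd, the depth error therefore grows quadratically with depth, and without bound as d approaches zero.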
1.2 Solving the Correspondence Problem
Thus we see that accurately solving the correspondence problem is the key to
accurately solving the stereo problem. In effect, we have narrowed our question
from “How can passive binocular stereopsis be done?” to “How can passive binocular
correspondence be done?”
The fundamental hypothesis behind multi-image correspondence is that the
appearance of any sufficiently small region in the world changes little from image
to image. In general, “appearance” might emphasize higher-level descriptors over
raw intensity values, but in its strongest sense, this hypothesis would mean that the
color of any world point remains constant from image to image. In other words, if
image points p and q are both images of some world point X, then the color values
at p and q are equal. This color constancy (or brightness constancy in the case of
grayscale images) hypothesis is in fact true with ideal cameras if all visible surfaces
in the world are perfectly diffuse (i.e., Lambertian). In practice, given photometric
camera calibration and typical scenes, color constancy holds well enough to justify
its use by most algorithms for correspondence.
The geometry of the binocular imaging process also significantly prunes the set of
possible correspondences, from lying potentially anywhere within the 2D image, to
lying necessarily somewhere along a 1D line embedded in that image. Suppose that
we are looking for all corresponding image point pairs (p, q) involving a given point
q (Figure 1.3). Then we know that the corresponding world point X, of which q is
an image, must lie somewhere along the ray through q from the center of projection
Q. The image of this ray $\overrightarrow{Qq}$ in the other camera’s image plane Π lies on a line l
that is the intersection of Π with the plane spanned by the points P, Q, q. Because
X lies on $\overrightarrow{Qq}$, its projection p on Π must lie on the corresponding epipolar line l.
(When corresponding epipolar lines lie on corresponding scanlines, the images are
said to be rectified; the difference in coordinates of corresponding image points is
called the disparity at those points.) This observation, that given one image point, a
matching point in the other image must lie on the corresponding epipolar line, is called
the epipolar constraint. Use of the epipolar constraint requires geometric camera
calibration, and is what typically distinguishes stereo correspondence algorithms from
other, more general correspondence algorithms.
Figure 1.3: The geometry of the epipolar constraint: image point p corresponding to image point q must lie on epipolar line l, which is intersection of image plane Π with plane spanned by q and centers of projection P, Q.
Based on color constancy and the epipolar constraint, correspondence might
proceed by matching every point in one image to every point with exactly the same
color in its corresponding epipolar line. However, this is obviously flawed: there would
be not only missed matches at the slightest deviation from color constancy, but also
potentially many spurious matches from anything else that happens to be the same
color. Moreover, with real cameras, sensor noise and finite pixel sizes lead to
additional imprecision in solving the correspondence problem. It is apparent that color
constancy and the epipolar constraint are not enough to determine correspondence
with sufficient accuracy for reliable triangulation. Thus, some additional constraint
is needed in order to reconstruct a meaningful three-dimensional model. What other
information can we use to solve the correspondence problem?
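The failure of this naive scheme can be sketched in a few lines of Python. The sketch assumes rectified images, so that corresponding epipolar lines are corresponding scanlines, and it represents each pixel color by a single integer; the function name `exact_color_matches` and the sample scanlines are hypothetical.

```python
def exact_color_matches(left_row, right_row):
    """Map each left-scanline position to all right-scanline positions of identical color."""
    return {
        x: [x2 for x2, c2 in enumerate(right_row) if c2 == c]
        for x, c in enumerate(left_row)
    }


# One scanline from each image; the value 50 was imaged as 51 in the right view.
left_row = [10, 10, 50, 90, 10]
right_row = [10, 51, 90, 10, 10]
matches = exact_color_matches(left_row, right_row)
```

Here `matches[2]` is empty, a missed match caused by a deviation of a single intensity level, while `matches[0]` contains three candidates, since every pixel of color 10 matches every other.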
Marr and Poggio [52] proposed two such additional rules to guide binocular corre-
spondence: uniqueness, which states that “each item from each image may be assigned
at most one disparity value,” and continuity, which states that “disparity varies
smoothly almost everywhere.”
In explaining the uniqueness rule, Marr and Poggio specified that each “item
corresponds to something that has a unique physical position,” and suggested that
detected features such as edges or corners could be used. They explicitly cautioned
against equating an “item” with a “gray-level point,” describing a scene with trans-
parency as a contraindicating example. However, this latter interpretation, that each
image location be assigned at most one disparity value, is nonetheless very prevalent
in practice; only a small number of stereo algorithms (such as [71]) attempt to find
more than one disparity value per pixel. This common simplification is in fact justi-
fiable, if pixels are regarded as point samples rather than area samples, under the
assumption that the scene consists of opaque objects: in that case, each image point
receives light from, and is the projection of, only the one closest world point along
its optical ray.
In explaining the continuity rule, Marr and Poggio observed that “matter is
cohesive, it is separated into objects, and the surfaces of objects are generally smooth
compared with their distance from the viewer” [52]. These smooth surfaces, whose
normals vary slowly, generally meet or intersect in smooth edges, whose tangents
vary slowly [36]. When projected onto a two-dimensional image plane, these three-
dimensional features result in smoothly varying disparity values almost everywhere
in the image, with “only a small fraction of the area of an image . . . composed of
boundaries that are discontinuous in depth” [52]. In other words, a reconstructed
disparity map can be expected to be piecewise smooth, consisting of smooth surface
patches separated by cleanly defined, smooth boundaries.
These two rules further disambiguate the correspondence problem. Together
with color constancy and the epipolar constraint, uniqueness and continuity typically
provide sufficient constraints to yield a reasonable solution to the stereo correspon-
dence problem.
1.3 Surface Interiors vs. Boundaries
A closer look at the continuity rule shows a clear distinction between the interiors
and the boundaries of surfaces: depth is smooth at the former, and non-smooth at the
latter. This bifurcation of continuity into two complementary aspects is often reflected
in the design of stereo algorithms, because, as noted by Belhumeur [7], “depth, surface
orientation, occluding contours, and creases should be estimated simultaneously.”
On the one hand, it is important to recover surface interiors by estimating their
depth and orientation, because such regions typically constitute the vast majority of
the image area. On the other hand, it is important to recover surface boundaries by
estimating occluding contours and creases, because boundaries typically are the most
salient image features. (In fact, much of the earliest work on the three-dimensional
interpretation of images focussed on line drawing interpretation, in which boundaries
are the only image features.) Birchfield [13] in particular emphasized the importance
of discontinuities, which generally coincide with surface boundaries.
Moreover, in general, neither surface interiors nor surface boundaries can be unam-
biguously derived solely from the other. For example, given only a circular boundary
that is discontinuous in depth, the interior could be either a fronto-parallel circle or
the front hemisphere of a ball; this difficulty is inherent in the sparseness of bound-
aries. Conversely, given the disparity value at every pixel in an image, while one could
threshold the difference between neighboring values to detect depth discontinuities,
the selection of an appropriate threshold would be tricky at best; detecting creases by
thresholding second-order differences would be even more problematic. This difficulty
is inherent in the discrete nature of pixel-based reconstructions that do not otherwise
indicate the presence or absence of discontinuities.
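The difficulty of choosing such a threshold can be illustrated with a small Python sketch. The two disparity profiles and the function name `detect_jumps` are hypothetical: one row samples a steep but smooth slanted surface, and the other contains a genuine but small depth discontinuity.

```python
def detect_jumps(row, threshold):
    """Indices i at which neighboring disparities row[i] and row[i + 1] differ by more than threshold."""
    return [i for i in range(len(row) - 1) if abs(row[i + 1] - row[i]) > threshold]


slanted = [0.5 * x for x in range(6)]     # smooth surface, neighbor differences of 0.5
stepped = [1.0, 1.0, 1.0, 1.3, 1.3, 1.3]  # true discontinuity of size 0.3 after index 2

low = (detect_jumps(slanted, 0.2), detect_jumps(stepped, 0.2))
high = (detect_jumps(slanted, 0.4), detect_jumps(stepped, 0.4))
```

A threshold low enough to find the true step also flags every pixel of the slanted surface, and a threshold between 0.3 and 0.5 misses the true step while still flagging the slanted surface, so in this example no single threshold separates the two cases.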
Furthermore, not only should surface interiors and surface boundaries each be
estimated directly, but because boundaries and interiors are interdependent, with the
former bordering on the latter, they should in fact be estimated cooperatively within
a single algorithm, rather than independently by separate algorithms.
In other words, a stereo algorithm should explicitly and simultaneously consider
both of the two complementary aspects of the continuity rule: smoothness over surface
interiors, and discontinuity across surface boundaries.
1.4 A Categorization of Stereo Algorithms
Many stereo algorithms are based upon the four constraints listed in Section 1.2:
the epipolar constraint, color constancy, continuity, and uniqueness. Of these, the
first two are relatively straightforward, but the manner in which the last two
are applied varies greatly [4, 22, 27, 65]. We propose a three-axis categorization
of binocular stereo algorithms according to their interpretations of continuity and
uniqueness, where we subdivide continuity according to the discussion of Section 1.3.
In the following subsections, we list last, for all three axes, that category which we
consider to be the most preferable.
Figure 1.4: An example of a smooth surface (left and right images shown).
1.4.1 Continuity
The first axis describes the modeling of continuity over disparity values within
smooth surface patches.
As an example of a smooth surface patch, consider a slanted plane, with left and
right images as shown in Figure 1.4. Then the true disparity over the surface patch
would also be a slanted plane (i.e., a linear function of the x and y coordinates within
the image plane).
How might a stereo algorithm model the disparity along this slanted plane, or
along any one smooth surface in general? Using this example as an illustration, we
propose to categorize smooth surface models into three broad groups; most models
used by prior stereo algorithms fall into one of these categories.
Constant
In these most restricted models, every point within any one smooth surface patch
is assigned the same disparity value. This value is usually chosen from a finite, pre-
determined set of possible disparities, such as the set of all integers within a given
range, or the set of all multiples of a given fraction (e.g., 1/4 or 1/2) within a given
range. Examples of prior work in this category include traditional sum-of-squared-
differences correlation, as well as [17, 30, 44, 47, 52].
Applied to our example, these models would likely recover several distinct fronto-
parallel surfaces. This would be a poor approximation to the true answer. While one
could lump together multiple constant-disparity surfaces to simulate slanted or curved
surfaces, such a grouping would likely contain undesirably large internal jumps in
disparity, especially in textureless regions. It would be desirable to be able to represent
directly, not only fronto-parallel surfaces, but also slanted and curved surfaces.
Discrete
In these intermediate models, disparities are again limited to discrete values,
but with multiple distinct values permitted within each surface patch. Surface
smoothness in this context means that within each surface, neighboring pixels should
have disparity values that are numerically as close as possible to one another. In
other words, intra-surface discontinuities are expected to be small. For identical
discretization, the “smooth” surfaces expressible by these models are a strict superset
of those expressible by “Constant” models. Examples of prior work in this category
include [7, 41, 60, 86].
Applied to our example, this category would improve upon the previous by
shrinking the jumps in disparity to the resolution of the disparity discretization.
However, it would be even better if the jumps were completely removed. Although
one could fit a smooth surface to the discretized data from these models, such a fit
would still be subject to error; e.g., if our slanted plane had a disparity range less than
one discretization unit, these models would likely recover a fronto-parallel surface.
Real
In these most general models, disparities within each smooth surface patch vary
smoothly over the real numbers (or some computer approximation thereof). This
category can be thought of as the limit of the “Discrete” category as the discretization
step approaches zero. Various interpretations of smoothness can be used; most try to
minimize local first- or second-order differences in disparity. Examples of prior work
in this category include [1, 3, 11, 70, 75].
Applied to our example, in the absence of other sources of error, these models
should correctly find the true disparity. Therefore, among these three categories for
modeling smooth surfaces, we find this one to be the most preferable, because it
allows for the greatest precision in estimating depth.
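The three categories can be compared on a one-dimensional slice of the slanted-plane example. The following Python sketch is a simplification under stated assumptions: the plane parameters, the unit discretization step, and the helper names are hypothetical, and the "Constant" model is reduced to a single value for the whole patch rather than several fronto-parallel pieces.

```python
def true_disparity(x, slope=0.3, offset=2.0):
    """Disparity of a slanted plane: a linear function of the image coordinate x."""
    return offset + slope * x


def quantize(d, step=1.0):
    """Nearest member of the discrete disparity set {..., -step, 0, step, ...}."""
    return round(d / step) * step


def max_error(estimate, truth):
    return max(abs(e - t) for e, t in zip(estimate, truth))


xs = range(8)
truth = [true_disparity(x) for x in xs]

constant = [quantize(sum(truth) / len(truth))] * len(truth)  # one value for the patch
discrete = [quantize(d) for d in truth]                      # one discrete value per pixel
real = list(truth)                                           # continuous-valued model

errors = {
    "constant": max_error(constant, truth),
    "discrete": max_error(discrete, truth),
    "real": max_error(real, truth),
}

# A plane whose disparity range is under one step collapses to a fronto-parallel
# surface under the discrete model.
shallow = [quantize(true_disparity(x, slope=0.05, offset=2.1)) for x in xs]
```

In this sketch the constant model is off by more than one disparity unit at the ends of the patch, the discrete model by at most half a unit, and the real-valued model not at all; `shallow` contains a single distinct value.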
Figure 1.5: An example of discontinuities with occlusion. Top: left and right images. Bottom: left and right disparity maps (dark gray = small disparity; light gray = large disparity; white = no disparity).
1.4.2 Discontinuity
The second axis describes the treatment of discontinuities at the boundaries of
smooth surface patches.
As an example of a scene with boundaries discontinuous in depth, consider a small,
fronto-parallel square “floating” in front of a larger one, with left and right images as
shown in Figure 1.5. Then the true disparity for this scene would be a small square
of larger disparity inside a larger square of smaller disparity, with step edges along
the top and bottom of the smaller square.
How might a stereo algorithm model the disparity across this depth discontinuity?
Using this example as an illustration, we propose to categorize discontinuity models
into four broad groups; most models used by prior stereo algorithms fall into one of
these categories. Specifically, the penalty associated with a discontinuity is examined
as a function of the size of the jump of the discontinuity.
Free
In this category, discontinuities are not specifically penalized. That is, with all else
being equal, no preference is given for continuity in the final, reconstructed disparity
map. In particular, these methods often fail to resolve the ambiguity caused by
periodic textures or textureless regions. Examples of prior work in this category
include traditional sum-of-squared-differences correlation, as well as [44, 52, 74, 84, 86].
Applied to our example, these models would likely produce frequent, scattered
errors throughout the interior of the squares: the cross-hatched pattern is perfectly
periodic, so the identity of the single, best match would likely be determined by
random perturbations.
Infinite
In this category, discontinuities are penalized infinitely; i.e., they are disallowed.
The entire image is treated as one “smooth” surface. That is, the entire image, as
a unit, is subject to the chosen model of smooth surface interiors; “almost every-
where” continuity in fact applies everywhere. The recovered disparity map is smooth
everywhere, although potentially not uniformly so. (Note, however, that the surface
smoothness model may itself allow small discontinuities within a single surface.)
Examples of prior work in this category include [1, 5, 37, 70].
Applied to our example, these models would not separate the foreground and
background squares, but would instead connect them by smoothing over their common
boundary. The width of the blurred boundary can vary depending on the specific
algorithm, but typically, the boundary will be at least several pixels wide.
Convex
In this category, discontinuities are allowed, but a penalty is imposed that is a
finite, positive, convex function of the size of the jump of the discontinuity. Typically,
that convex cost function is either the square or the absolute value of the size of the
jump. The resulting discontinuities often tend to be somewhat blurred, because the
cost of two adjacent discontinuities is no more than that of a single discontinuity of
the same total size. Examples of prior work in this category include [41, 60, 75].
Applied to our example, these models would likely separate the foreground and
background squares successfully. However, at the top and bottom edges of the smaller
square, where there is just a horizontal line, these models might output a disparity
value in between those of the foreground and background.
Non-convex
In this category, discontinuities are allowed, but a penalty is imposed that is a
non-convex function of the size of the jump of the discontinuity. One common choice
for that non-convex cost function is the Potts energy [57], which assesses a constant
penalty for any non-zero discontinuity, regardless of size. The resulting discontinuities
usually tend to be fairly clean, because the cost of two adjacent discontinuities is
generally more than that of a single discontinuity of the same total size. Examples
of prior work in this category include [7, 21, 24, 30].
Applied to our example, these models would likely separate the foreground and
background squares successfully. Moreover, the recovered disparity values would
likely contain only two distinct depths: those of the foreground and the background.
Therefore, among these four categories for modeling discontinuities, we find this one
to be the most preferable, because it reconstructs boundaries the most cleanly, with
minimal warping of surface shape.
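The difference between the convex and non-convex categories can be checked with a little arithmetic. The following is a minimal sketch, not taken from any of the cited algorithms: the function names and the unit Potts penalty are illustrative assumptions. It compares the cost of one discontinuity with the cost of two adjacent discontinuities of half the size.

```python
def quadratic(jump):
    """Convex penalty: square of the jump size."""
    return jump ** 2


def absolute(jump):
    """Convex penalty: absolute value of the jump size."""
    return abs(jump)


def potts(jump, penalty=1.0):
    """Non-convex penalty: constant cost for any non-zero jump.

    The value of `penalty` is an arbitrary choice for illustration.
    """
    return penalty if jump != 0 else 0.0


def split_vs_single(cost, total_jump):
    """Return (cost of two half-size jumps, cost of one full-size jump)."""
    half = total_jump / 2.0
    return cost(half) + cost(half), cost(total_jump)


if __name__ == "__main__":
    for name, cost in [("quadratic", quadratic),
                       ("absolute", absolute),
                       ("potts", potts)]:
        split, single = split_vs_single(cost, 4)
        print(name, "split:", split, "single:", single)
```

Under the two convex penalties, splitting the jump never costs more than keeping it whole, so a blurred boundary is not discouraged; under the Potts penalty, splitting doubles the cost, which favors a single clean step.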
1.4.3 Uniqueness
The third axis describes the application of uniqueness, especially to the occlusions
that accompany depth discontinuities.
As an example of a scene with occlusions, consider again the example of the
floating square shown in Figure 1.5. The true disparity for this scene would be a
small square of larger disparity inside a larger square of smaller disparity, with an
occlusion region of no disparity to the left or right side of the smaller square in the left
or right images, respectively. The occlusion region is the portion of the background
square which, in the other image, is occluded by the foreground square. Points
within the occlusion region have no disparity because they do not have corresponding
points in the other image, and we define disparity using correspondence (as opposed
to inverse depth). In general, disparity discontinuities are always accompanied by
occlusion regions [30], except when the boundary lies along an epipolar line. This is
a consequence of uniqueness and symmetry.
How might a stereo algorithm model the disparity in this scene, or in others with
discontinuities and occlusions? Using this example as an illustration, we propose to
categorize occlusion models into four broad groups; most models used by prior stereo
algorithms fall into one of these categories.
Transparent
In this category, uniqueness is not assumed; these models allow for transparency.
Our “floating squares” example does not exhibit transparency, so these models would
be of little benefit for it; however, for scenes that do exhibit transparency, these models
would be essential for adequate reconstruction. Furthermore, natural scenes often
contain fine details (such as tree branches against the sky) that are only a few pixels
in width; because of pixelization, these images effectively contain transparency as
well [62]. Unfortunately, stereo reconstruction with transparency is a very challenging
problem with few existing solutions; one such example of prior work is [71].
One-way
In this category, uniqueness is assumed within a chosen reference image, but
not considered within the other. That is, each location in the reference image
is assigned at most one disparity, but the disparities at multiple locations in the
reference image may point to the same location in the other image. Typically, each
location in the reference image is assigned exactly one disparity, and occlusion rela-
tionships are ignored. That is, these models generally search for correspondences
within occlusion regions as well as within non-occlusion regions. Examples of prior
work in this category include traditional SSD correlation, as well as [11, 21, 44, 84].
Applied to our example, these models would likely find the correct disparity within
the unoccluded regions. Points within the occluded regions will typically be assigned
some intermediate disparity value between those of the foreground and background.
Note that such an assignment would result in the occluded point and a different,
unoccluded point both being paired with the same point in the other image. This
“collision,” or failure of reciprocal uniqueness, is an undesirable yet readily detectable
condition; these models allow it and are thus less than ideal.
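To illustrate why such collisions are readily detectable, here is a minimal sketch for a single scanline with integer disparities. The function name `find_collisions` and the convention that reference column x is matched to column x + d in the other image are assumptions made for this example only.

```python
from collections import defaultdict


def find_collisions(disparities):
    """Find columns of the other image that are matched more than once.

    disparities[x] is the integer disparity assigned to reference column x;
    the matched column in the other image is taken to be x + disparities[x].
    Returns a dict mapping each multiply-matched column to the list of
    reference columns that claim it.
    """
    hits = defaultdict(list)
    for x, d in enumerate(disparities):
        hits[x + d].append(x)
    return {target: sources
            for target, sources in hits.items()
            if len(sources) > 1}


if __name__ == "__main__":
    # Hypothetical scanline: background at disparity 0, foreground at -2,
    # and two occluded columns (2 and 3) given the intermediate value -1.
    scanline = [0, 0, -1, -1, -2, -2, 0, 0]
    print(find_collisions(scanline))
```

In this hypothetical scanline, each occluded column ends up sharing its match with an unoccluded column, which is exactly the failure of reciprocal uniqueness described above.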
Asymmetric Two-way
In this category, uniqueness is encouraged for both images, but the two images
are treated unequally. That is, reasoning about occlusion is done, and the occlusions
that accompany depth discontinuities are qualitatively recovered, but there is still
one chosen reference image, resulting in asymmetries in the reconstructed result.
Examples of prior work in this category include [3, 17, 52, 75, 86].
Applied to our example, these models would likely find the correct disparity within
the unoccluded regions, and most of the occlusion region would be marked as such.
However, some occluded pixels near the edge of the occlusion region might mistakenly
be assigned to the nearer surface; the outline of the reconstructed smaller square would
likely look different on its left versus right edges.
Symmetric Two-way
In this category, uniqueness is enforced in both images symmetrically; detected
occlusion regions are marked as being without correspondence. Examples of prior
work in this category include [7, 30, 41, 47].
Applied to our example, these models would likely find the correct disparity (or
lack thereof) everywhere, barring other sources of error. Therefore, among these four
categories for modeling uniqueness and occlusions, we find this one to be the most
preferable for our purposes, because it encourages the greatest precision in localizing
boundaries in image space by fully utilizing the assumption of uniqueness.
1.5 A Brief Survey of Stereo Methods
Section 1.1 explained that stereo correspondence is fundamentally an undercon-
strained, inverse problem. Section 1.2 proposed that it is fairly straightforward to
impose color constancy and the epipolar constraint (by matching each image point
with every image point of the same color in the corresponding epipolar line), but
observed that additional constraints are necessary. Section 1.2 concluded by claiming
that uniqueness and continuity generally suffice as those further constraints, without
discussing how they are applied.
This section motivates and reviews a few selected approaches to stereo that have
been used in the past, discusses how they use uniqueness and continuity, and explains
how they fit within our three-axis categorization.
1.5.1 Pointwise Color Matching
Under the assumption of opacity, uniqueness implies that each image point may
correspond only to at most one other image point. Thus, the simplest way to apply
uniqueness on top of color constancy and the epipolar constraint would be to match
each image point with the one image point of the most similar color in the corre-
sponding epipolar line. This naive technique might work well in an ideal, Lambertian
world in which every world point has a unique color, but in practice, the discretized
color values of digital images can cause problems.
As an extreme example, let us consider a binary random-dot stereogram, in which
each image consists of pixels which are randomly and independently black or white.
With pointwise correspondence, there will be a true match at the correct disparity,
but there will also be a 50% chance of a false match at any incorrect disparity. This
is because looking at the color at a single point does not provide enough information
to uniquely identify that point.
Thus we see that even with the use of color constancy, uniqueness, and the
epipolar constraint, without continuity, direct pointwise stereo correspondence is still
ambiguous, in that false matches may appear as good as the correct match.
1.5.2 Windowed Correlation
With the use of continuity, which implies that neighboring image points will likely
have similar disparities, one can pool information among neighboring points to reduce
ambiguity and false matches. This is the basic idea behind windowed correlation, on
which many early stereo methods are based.
Classical windowed correlation consists of comparing a fixed-size window of pixels,
rather than individual pixels, and choosing the disparity that yields the best match
over the whole window. Windowed correlation is simple, efficient, and effective at
reducing false matches. For example, let us reconsider the aforementioned random-
dot stereogram. Whereas matching individual pixels gave a false match rate of 50%,
with a 5 × 5 window of pixels for correlation, the probability of an exact false match
would be reduced to $2^{-25}$, or under 1 in 33 million.
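The arithmetic behind these figures can be reproduced exactly. This is a small sketch under the stated assumption that pixels are independent and equally likely to be black or white; the function name is hypothetical.

```python
from fractions import Fraction


def exact_false_match_probability(width, height):
    """Chance that a width x height window of independent, fair binary
    pixels matches exactly at one particular incorrect disparity."""
    return Fraction(1, 2) ** (width * height)


if __name__ == "__main__":
    single_pixel = exact_false_match_probability(1, 1)
    window = exact_false_match_probability(5, 5)
    print("single pixel:", single_pixel)
    print("5 x 5 window: 1 in", window.denominator)
```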
However, for a window to match exactly, the disparity within the window must
be constant. Otherwise, no disparity will yield a perfect match, and the algorithm
will pick whichever disparity gives the smallest mismatch, which may or may not
be the disparity at the center of the window. In other words, windowed correlation
methods depend on the implicit assumption that disparities are “locally” constant;
these methods work best where that is indeed the case.
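As a concrete illustration of classical windowed correlation, the following sketch matches one pixel on a single scanline using a sum-of-squared-differences score. It is a simplified, one-dimensional stand-in, not the implementation of any cited method; the function names and the convention that column x matches column x + d are assumptions.

```python
def ssd(left, right, x, d, half):
    """Sum of squared differences between the window centered at x in
    `left` and the window centered at x + d in `right`.

    `half` is the window half-width, so the window has 2 * half + 1 pixels.
    """
    total = 0
    for offset in range(-half, half + 1):
        diff = left[x + offset] - right[x + d + offset]
        total += diff * diff
    return total


def best_disparity(left, right, x, candidates, half):
    """Pick the candidate disparity with the smallest window mismatch.

    Candidates whose window would fall outside `right` are skipped; the
    caller must keep the window around x inside `left`.
    """
    valid = [d for d in candidates
             if x + d - half >= 0 and x + d + half < len(right)]
    return min(valid, key=lambda d: ssd(left, right, x, d, half))


if __name__ == "__main__":
    right = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
    # Hypothetical left scanline: the right one shifted by two columns,
    # so the true disparity is -2 wherever the window has constant disparity.
    left = [0, 0] + right[:-2]
    print(best_disparity(left, right, 6, range(-4, 1), half=1))
```

Because the whole window in this example has a single disparity, the correct candidate yields a mismatch of zero; a window straddling two disparities would have no such perfect match.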
The meaning of “locally” above is determined by the size and shape of the corre-
lation window. However, choosing the configuration of the window is not easy. On
the one hand, if the window is too small, spurious matches will remain, and many
incorrect matches will look “good”; larger windows are better at reducing ambiguity
by minimizing false matches. On the other hand, if the window is too large, it will
be unlikely to contain a single disparity value, and even the correct match will look
“bad”; smaller windows are more likely to contain only a single disparity value, and
thus less likely to reject true matches.
To some extent, this problem of choosing a window size can be alleviated by using
adaptive windows, which shift and/or shrink to avoid depth discontinuities, while
remaining larger away from discontinuities. Kanade and Okutomi [44] use explicit,
adaptive, rectangular windows. Hirschmüller [35] uses adaptive, piecewise square
windows. Scharstein and Szeliski [63] use implicit “windows” formed by iterative,
nonlinear diffusion.
In practice, windowed correlation techniques work fairly well within smooth,
textured regions, but tend to blur across any discontinuities. Moreover, they generally
perform poorly in textureless regions, because they do not specifically penalize discon-
tinuities in the recovered depth map. That is, although these methods assume conti-
nuity by their use of windows, they do not directly encourage continuity in the case
of ambiguous matches.
1.5.3 Regularization
One way to enforce continuity, even in the presence of ambiguous matches, is
through the use of regularization. Generically, regularization is a technique for stabi-
lizing inverse problems by explicitly quantifying smoothness and adding it as one
more simultaneous goal to optimize. Applied to stereo, it treats disparity as a real-
valued function of image location, defines a functional measuring the “smoothness” of
such a disparity function, and tries to maximize that functional while simultaneously
maximizing color constancy. Such “smoothness” can be quantified in many ways, but
in general, nearby image locations should have similar disparities.
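For concreteness, one common illustrative form of such a regularized objective is written below as an energy to be minimized, which is equivalent to maximizing smoothness and color constancy together. The symbols $I_L$ and $I_R$ (left and right images), the weight $\lambda$, and the particular first-order quadratic smoothness term are assumptions chosen for illustration; the methods cited in this section use a variety of other smoothness terms.

```latex
E(d) \;=\;
\underbrace{\iint \bigl\| I_L(x, y) - I_R\bigl(x + d(x, y),\, y\bigr) \bigr\|^2 \, dx \, dy}_{\text{color constancy}}
\;+\;
\lambda \underbrace{\iint \bigl\| \nabla d(x, y) \bigr\|^2 \, dx \, dy}_{\text{smoothness}}
```

A larger $\lambda$ favors smoother disparity functions at the expense of fidelity to the image data.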
Horn and Schunck [38] popularized both the use of regularization on otherwise
underconstrained image correspondence problems, and the use of variational methods
to solve the resulting energy minimization problems in a continuous domain. Horn
[37], Poggio et al. [56], and Barnard [5] suggested several formulae for quantifying
smoothness, all of which impose uniform smoothness and forbid discontinuities.
Terzopoulos [77] and Lee and Pavlidis [49] investigated regularization with discontinuities.
Rivera and Marroquín [58] formulated a higher-order, edge-preserving method
that does not penalize constant, non-zero slopes.
Computationally, regularization tends to yield challenging nonlinear optimization
problems that, without fairly sophisticated optimization algorithms, can be highly
dependent on good initial conditions. Often, multiscale or multigrid methods are
needed [76], but Akgul et al. [1] presented an alternate method of ensuring reliable
convergence, by starting with two initial conditions, and evolving them cooperatively
until they coincide. Allowing discontinuities further complicates the optimization
process; Blake and Zisserman [16] propose a method for optimizing certain specific
models of regularization with discontinuities.
Aside from optimization challenges, the primary weakness of regularization
methods is that they do not readily allow occlusions to be represented. Every image
point is forced to have some disparity; no point can remain unmatched, as would be
required for proper occlusions under the constraint of uniqueness.
1.5.4 Cooperative Methods
We have seen that neither windowed correlation nor regularization methods
support proper occlusions: they try to find unique disparity values for one reference
image, but without checking for “collisions” in the other image. In effect, uniqueness,
although assumed, is not enforced; only one-way uniqueness is applied.
Inspired by biological nervous systems, cooperative methods directly implement
the assumptions of continuity and two-way uniqueness in an iterative, locally
connected, massively parallel system. These techniques operate directly in the space
of correspondences (referred to as the matching score volume by [85], and as the
disparity-space image, or DSI, by [17, 40, 65]), rather than in image space, evolving a
3D lattice of continuous-valued weights via mutual excitation and inhibition.
This space of possible correspondences can be parameterized in several ways.
Typically, (x, y, d) is used, with (x, y) representing position in the chosen reference
image, and d representing disparity. Assuming rectified input images, however,
an alternate, symmetric parameterization is (xl, xr, y). Qualitatively, a weight at
(xl, xr, y) in such a coordinate system represents the likelihood that (xl, y) in the left
image and (xr, y) in the right image correspond to one another (i.e., are images of
the same, physical world point).
Initially, this matching score volume is populated with local similarity measures,
typically obtained via correlation with small windows. Subsequently, the weights are
updated in parallel as follows: if a weight at (xl, xr, y) is large, then for uniqueness,
weights at (x, xr, y) and (xl, x, y) are inhibited, and for continuity, weights at any
other, non-inhibited points “near” (xl, xr, y) are excited. Upon convergence of this
relaxation algorithm, these real-valued weights are compared with one another and
thresholded to determine final correspondences (or the lack thereof).
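The flavor of this update can be sketched for a single scanline in the symmetric (xl, xr) parameterization. The linear update rule, the gain values, and the two-neighbor excitation region below are all assumptions invented for illustration; they are deliberately simpler than the update rules of [52], [85], or [86].

```python
def cooperative_step(w, excite_gain=0.2, inhibit_gain=0.3):
    """One parallel update of a single-scanline matching score slice.

    w[xl][xr] is the current weight for pairing left column xl with right
    column xr. All new weights are computed from the old ones, then
    clamped to the range [0, 1].
    """
    n_left, n_right = len(w), len(w[0])
    new_w = [[0.0] * n_right for _ in range(n_left)]
    for xl in range(n_left):
        for xr in range(n_right):
            # Continuity: neighbors at the same disparity (xr - xl) excite.
            excite = sum(
                w[xl + o][xr + o]
                for o in (-1, 1)
                if 0 <= xl + o < n_left and 0 <= xr + o < n_right
            )
            # Uniqueness: competing candidates sharing xl or xr inhibit.
            inhibit = sum(w[xl][x] for x in range(n_right) if x != xr)
            inhibit += sum(w[x][xr] for x in range(n_left) if x != xl)
            value = w[xl][xr] + excite_gain * excite - inhibit_gain * inhibit
            new_w[xl][xr] = min(1.0, max(0.0, value))
    return new_w


if __name__ == "__main__":
    # Hypothetical initial scores: true matches on the diagonal, plus one
    # moderately strong false match at (xl, xr) = (1, 3).
    w = [[1.0 if xl == xr else 0.0 for xr in range(4)] for xl in range(4)]
    w[1][3] = 0.6
    w = cooperative_step(w)
    print(w[1][3], w[1][1])
```

In this toy input, the false match has no support at its own disparity and competes with two strong true matches, so it is suppressed, while the mutually supporting true matches remain at full strength.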
Different cooperative algorithms use different models of excitation corresponding
to different models of smooth surfaces. Marr and Poggio [52] use a fixed, 2D excitation
region for “Constant” surfaces; moreover, their algorithm is only defined for binary-
valued (e.g., black and white) input images. Zitnick and Kanade [86] use a fixed, 3D
excitation region for “Discrete” surfaces; their algorithm is designed for real-valued
images.
In practice, [86] can give very good results, but because it uses a fixed window for
excitation, boundaries can be rounded or blurred (analogous to classical windowed
correlation). To improve boundary localization, Zhang and Kambhamettu [85] use a
variable, 3D excitation region that is dependent on an initial color segmentation of
the input images; the idea is that depth discontinuities will likely correlate well with
monocular color edges.
Regarding convergence, because the cooperative update is local in nature, accurate
results depend upon good initialization. In particular, although a limited number of
false matches can start with a good initial score, true matches must start with a
good initial score. These methods support discontinuities with non-convex penalties;
two-way uniqueness is encouraged, generally asymmetrically.
1.5.5 Dynamic Programming
Like cooperative methods, dynamic programming methods also operate in a
discretized disparity space in order to encourage bidirectional uniqueness along
with continuity. However, while cooperative methods are iterative and find a
locally optimal set of real-valued weights that must then be thresholded, dynamic
programming is non-iterative and finds a globally optimal set of binary weights that
directly translate into the presence or absence of each candidate correspondence.
Thus, dynamic programming methods for stereo are much faster, and apparently also
more principled, than their cooperative counterparts.
However, the downsides to dynamic programming are twofold. First, dynamic
programming can only optimize one scanline at a time. Many desired interactions
among scanlines, such as those required for continuity between scanlines, require less
principled, ad hoc post-processing, usually with no guarantee of optimality. Second,
dynamic programming depends upon the validity of the ordering constraint, which
states that in each pair of corresponding scanlines, corresponding pixels appear in the
same order in the left and right scanlines. Because the ordering constraint is a gener-
alization that is not always true [28], and because optimizing scanlines independently
can be rather prone to noise, dynamic programming is better suited for applications
where speed is an important consideration.
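The following sketch shows the general shape of scanline matching by dynamic programming under the ordering constraint. It is a generic alignment with a squared-difference match cost and a constant per-pixel occlusion cost, both of which are assumptions for illustration; it does not reproduce the cost function of any particular method cited here.

```python
def match_scanlines(left, right, occlusion_cost):
    """Globally optimal order-preserving matching of two scanlines.

    Each pixel is either matched (cost: squared intensity difference) or
    left unmatched (cost: occlusion_cost). Returns the minimum total cost
    and the list of matched (left_index, right_index) pairs.
    """
    n, m = len(left), len(right)
    # cost[i][j]: cheapest alignment of left[:i] with right[:j].
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * occlusion_cost
    for j in range(1, m + 1):
        cost[0][j] = j * occlusion_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = cost[i - 1][j - 1] + (left[i - 1] - right[j - 1]) ** 2
            skip_left = cost[i - 1][j] + occlusion_cost
            skip_right = cost[i][j - 1] + occlusion_cost
            cost[i][j] = min(match, skip_left, skip_right)

    # Trace back from the end to recover which pixels were matched.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        match = cost[i - 1][j - 1] + (left[i - 1] - right[j - 1]) ** 2
        if cost[i][j] == match:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + occlusion_cost:
            i -= 1  # left pixel i - 1 is unmatched
        else:
            j -= 1  # right pixel j - 1 is unmatched
    pairs.reverse()
    return cost[n][m], pairs


if __name__ == "__main__":
    total, pairs = match_scanlines([10, 20, 30, 40], [20, 30, 40, 50], 5)
    print(total, pairs)
```

Because matched pairs are always consumed in increasing order of both indices, the result respects the ordering constraint by construction, and every pixel takes part in at most one correspondence.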
Various dynamic programming approaches differ in their treatment of continuity.
Baker and Binford [2] and Ohta and Kanade [54] impose continuity simply and
directly, by first matching edges, then interpolating over the untextured regions.
Unfortunately, such interpolation does not preserve sharp discontinuities.
Taking the opposite approach, Intille and Bobick [17, 40] do not use continuity at
all, relying upon “ground control points” and the ordering constraint to obviate the
need for any external smoothness constraints. Their asymmetric method uses neither
intra- nor inter-scanline smoothness, and treats each scanline independently. The
method of Geiger, Ladendorf, and Yuille [30] also treats scanlines independently, but
supposes that disparities are piecewise constant along each scanline, and symmetri-
cally enforces a strict correspondence between discontinuities in one image and occlu-
sions in the other.
In contrast, both Cox et al. [24] and Belhumeur and Mumford [8] impose 2D conti-
nuity through inter-scanline constraints. Cox et al. [24] count the total number of
depth discontinuities (horizontal plus vertical), and specify that this number should
be minimized as a subordinate goal; they suggest either one or two passes of dynamic
programming as efficient methods for approximating this minimization. Belhumeur
and Mumford [8] also require the minimization of the number of pixels at which
discontinuities are present, but Belhumeur [6, 7] generalizes the notion of disconti-
nuity, counting both step edges and crease edges. Belhumeur formulates a symmetric
energy functional that incorporates this count, and proposes that it be minimized
with iterated stochastic dynamic programming.
Aside from depending upon the ordering constraint, all of these methods have
discrete, pixelized approaches to continuity that are at most one-dimensional. These
limitations are the primary weaknesses of dynamic programming for stereo.
1.5.6 Graph-Based Methods
As do dynamic programming methods, graph-based methods leverage combina-
torial optimization techniques for their power, but unlike dynamic programming
methods, graph-based methods are able to optimize continuity over the entire
2D image, instead of only along individual 1D scanlines. Graph-based methods are
based upon efficient algorithms [32, 46] for calculating the minimum-cost cut (or,
equivalently, the maximum flow [29]) through a network graph.
There are two general flavors of graph-based stereo methods. One flavor computes
the global minimum of a convex energy functional with a single minimum-cost cut;
typically, the cost of a discontinuity is a linear function of its size. Roy and Cox [60]
propose one such method, which discards the ordering constraint used in dynamic
programming in favor of a local coherence constraint. Their method uses an undirected
graph built upon a chosen reference image, and finds exactly one disparity for
each pixel therein. Ishikawa and Geiger [41] propose another such method, which
retains the ordering constraint, and furthermore distinguishes among ordinary, edge,
and junction pixels. Their method uses a directed graph, and symmetrically enforces
two-way uniqueness. Both of these methods tend to produce discontinuities that
are somewhat blurred, because they are incapable of using non-convex, sub-linear
penalties for discontinuities.
The other flavor of graph-based methods computes a strong local minimum of
a non-convex energy functional with iterated minimum-cost cuts. Boykov, Veksler,
and Zabih [18] developed one such optimization technique that is applicable to an
extremely wide variety of non-convex energies. Boykov et al. [19] subsequently
developed another such optimization technique that is somewhat less widely appli-
cable, but which produces results that are provably within a constant factor of being
globally optimal [79]. Boykov et al. [20, 21] apply these techniques to stereo, again
using an undirected graph built upon a chosen reference image, and finding exactly
one disparity for each pixel therein. Kolmogorov and Zabih [47] build more complex
graphs; their method enforces symmetric, two-way uniqueness, but is limited to
constant-disparity continuity.
In general, graph-based methods are not only quite powerful, but also fairly effi-
cient for what they do. For our purposes, their main weakness is their restriction to
computing discrete-valued disparities, due to their inherently combinatorial nature.
1.5.7 Layered Methods
Like regularization with discontinuities, layered models [25, 26, 80] estimate real-
valued disparities while allowing discontinuities, producing piecewise-smooth surface
reconstructions. However, while the former methods represent all surface patches
together with a single function mapping image location to disparity value, layered
methods separately represent each surface patch with its own such function, and
combine them into a single depth map through the use of support maps, which define
the image regions in which each surface is “active.”
The primary consequence of this representational enrichment is that when
combining support maps, there need not be exactly one “active” surface per pixel.
In particular, it is trivial to represent pixels at which no surface is active. This lets
layered methods readily model occlusion regions, enabling them to consider two-way
uniqueness.
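A minimal sketch of this representation, restricted to one scanline, is given below. The function name `compose_layers`, the use of Python sets as support maps, and the example surfaces are hypothetical choices made for illustration.

```python
def compose_layers(layers, width):
    """Combine per-surface disparity functions through their support maps.

    `layers` is a list of (disparity_fn, support) pairs, where `support`
    is the set of columns at which that surface is active. Returns, for
    each column, the list of disparities of all surfaces active there;
    an empty list means that no surface is active at that column.
    """
    active = [[] for _ in range(width)]
    for disparity_fn, support in layers:
        for x in support:
            active[x].append(disparity_fn(x))
    return active


if __name__ == "__main__":
    # Hypothetical scene: a fronto-parallel background and a slanted
    # foreground; columns 2 and 3 belong to no surface (occluded).
    background = (lambda x: -1.0, {0, 1, 6, 7})
    foreground = (lambda x: -3.0 + 0.25 * x, {4, 5})
    for x, ds in enumerate(compose_layers([background, foreground], 8)):
        print(x, ds if ds else "no active surface")
```

Each surface keeps its own real-valued disparity function over the whole scanline, while the support maps alone decide where it is visible.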
Another consequence of modeling surfaces with layers is that each surface gets
its own identity, independent of its support map. This allows image regions to be
grouped semantically, rather than merely topologically. In other words, layered
models have the advantage of being able to model hidden connections among visible
surface patches that are separated by occluding objects. For example, consider a
scene seen through a chain-link fence. With standard regularization, either the fence
must be ignored, or the remainder of the scene must be cut into many independent
pieces. With a layered model, the fence and the background can both survive intact.
Baker, Szeliski, and Anandan [3] developed a layered method for stereo recon-
struction that is based upon minimizing the resynthesis error obtained by comparing
input images warped according to the recovered depth map. Their method models
the disparity of each surface patch as a plane with small, local deviations. Their
theory includes transparency, but their implementation uses asymmetric, two-way
uniqueness. Like windowed correlation methods, they achieve some degree of conti-
nuity by spatially blurring the match likelihood.
Birchfield and Tomasi [11] developed a method that models each surface as a
connected, slanted plane. They estimate the assignment of pixels to surfaces with the
graph-based techniques of [18], and favor placing boundaries along intensity edges,
again yielding exactly one disparity value for each image location. Among prior work,
their algorithm is the most similar to ours.
1.6 Our Proposed Approach
In Section 1.4, we proposed a three-axis categorization of binocular stereo algo-
rithms according to their treatment of continuity and uniqueness. In the remainder
of this dissertation, we propose an algorithm that simultaneously lies in the most
preferable category along all three axes: real-valued disparities, non-convex disconti-
nuity penalties, and symmetric two-way occlusions. To the author’s knowledge, ours
is the first such algorithm for binocular stereo.
We contend that, for scenes consisting of smooth surfaces, our algorithm improves
upon the current state of the art, achieving both more accurate localization in depth
of surface interiors via subpixel disparity estimation, and more accurate localization
in the image plane of surface boundaries via the symmetric treatment of images with
proper handling of occluded regions.
1.7 Outline of Dissertation
In Chapter 2, we describe our mathematical model of the stereo problem and solu-
tions thereof. In Chapters 3 and 4, we describe surface fitting and boundary local-
ization, respectively. In Chapter 5, we describe the interaction between surface fitting
and boundary localization, and give the overall optimization algorithm. In Chapter 6,
we present some promising qualitative and quantitative experimental results. Finally,
in Chapter 7, we offer a few concluding remarks.
Chapter 2
Preliminaries
In this chapter, we develop a mathematical abstraction of the stereo problem.
This abstract formulation is defined within a continuous domain; discretization of
the problem for computational feasibility will be discussed in subsequent chapters.
2.1 Design Principles
Because the stereo problem is so sensitive to perturbations, in order to get best
results, it is especially important that the algorithm be designed to minimize the
unnecessary introduction and propagation of errors. To this end, we follow two
guiding principles: least commitment, and least discretization.
In the computation of our final answer, we would like to make the best possible
use of all available data. This means that, at any particular stage in the computation,
we would like to be the least committed possible to any particular interpretation of
the data. Because of this, it is better to match images directly, instead of matching
only extracted features (such as edges and corners): we don’t want to discard the
dense image data so early in the process. Similarly, it is better to directly estimate
subpixel disparities, rather than fit smooth surfaces to pre-calculated, integer-only
disparities, for the same reason.
In a similar spirit, we also would like to avoid rounding errors as much as possible,
so our computations are done in a continuous space as much as possible. Most basi-
cally, our algorithm estimates floating point disparity values defined on a continuous
domain; these disparity values are only discretized in our implementation by finite
machine precision. In addition, since we are trying to recover subpixel (non-integer)
disparity values, we need to match image appearance at inter-pixel image positions.
This means that we must define image appearance at inter-pixel image positions; that
is, we must interpolate the input images. We describe how we do this in Appendix A.
2.2 Mathematical Abstraction
As motivated in Section 1.5, in order to place in the most preferable category
along each of our three proposed axes, we use a layered model [25, 26, 80] to represent
possible solutions to the stereo problem.
Our stereo algorithm follows the common practice of assuming that input images
have been normalized with respect to both photometric and geometric calibration.
In particular, we assume that the images are rectified. Let
$\mathcal{I} = \{\, p = (x, y, t) \,\} = \mathbb{R} \times \mathbb{R} \times \{\text{‘left’}, \text{‘right’}\}$
be the space of image locations, and let
$I : \mathcal{I} \mapsto \mathbb{R}^m$
be the given input image pair. Typically, $m = 3$ (with the components of $\mathbb{R}^m$
representing red, green, and blue) for color images, and $m = 1$ for grayscale images, but our
algorithm does not depend on the semantic interpretation of $\mathbb{R}^m$; any feature space
can be used. Note that the image space is defined to be continuous, not discrete; we
discuss this matter further in Appendix A.
Our abstract model of a hypothesized solution consists of a labeling (or segmentation)
f, which assigns each point of the two input images to zero or one of N surfaces,
plus N disparity maps d[k], each of which assigns a disparity value to each point of
the two input images:
[segmentation] $f : \mathcal{I} \mapsto \{0, 1, \ldots, N\}$
[disparity map] $d[k] : \mathcal{I} \mapsto \mathbb{R}$ for $k \in \{1, 2, \ldots, N\}$
In other words, these functions are the independent unknowns that are to be esti-
mated.
The segmentation function f specifies to which one of N surfaces, if any, each
image location “belongs.” We take belonging to mean the existence of a world point
which (a) projects to the image location in question, and (b) is visible in both images.
For each surface, the signed disparity function d[k] defines the correspondence (or
matching) function m[k] between image locations:
$m[k] : \mathcal{I} \mapsto \mathcal{I}$
$m[k](x, y, t) = \bigl( x + d[k](x, y, t),\; y,\; \neg t \bigr)$
where ¬‘left’ = ‘right’ and vice versa. That is, for each surface k, m[k] maps each
location in one image to the corresponding location in the other image. Note that, for
all k, d[k] and m[k] are both defined for all (x, y, t), regardless of the value of f(x, y, t).
Furthermore, for standard camera configurations, d[k] will generally be positive in the
right image and negative in the left image, if it represents a real surface.
Thus, the interpretation of this model is:
for all p:
f(p) = k with k > 0 ⇒ p corresponds to m[k](p)
f(p) = 0 ⇒ p corresponds to no location in the other image
That is, a hypothesized solution specifies a set of correspondences between left and
right image locations, where each image location is a member of at most one corre-
spondence.
2.3 Desired Properties
Given this abstract representation of a solution, how can we evaluate any
particular hypothesized solution? What are some conditions that characterize a
“good” solution? We propose three desired properties: consistency, smoothness, and
non-triviality.
2.3.1 Consistency
Correspondence of image locations should be bidirectional. In other words, if
points p and q are images of the same world point, then each corresponds to the
other; otherwise, neither corresponds to the other. If a hypothesized solution were to
say that p corresponds to q but that q does not correspond to p, it would make no
sense; we call such a solution inconsistent.
Within each surface, this translates into a constraint on m[k] that:
for all k, p: m[k](m[k](p)) = p (2.1)
which is equivalent to a constraint on each d[k]. In particular, for each k, given one of
d[k](·, ·, ‘left’) or d[k](·, ·, ‘right’), the other is uniquely determined. This reflects
the notion that d[k](x, y, ‘left’) and d[k](x, y, ‘right’) are two representations of
the same surface.
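The equivalence can be made explicit by substituting the definition of m[k] into Equation (2.1). The following is a short derivation sketch that uses only the definitions given above, with p = (x, y, t):

```latex
\begin{aligned}
m[k](m[k](x, y, t))
  &= m[k]\bigl(x + d[k](x, y, t),\; y,\; \neg t\bigr) \\
  &= \bigl(x + d[k](x, y, t) + d[k](x + d[k](x, y, t),\, y,\, \neg t),\; y,\; t\bigr).
\end{aligned}
```

Requiring this to equal (x, y, t) leaves the single scalar condition d[k](p) + d[k](m[k](p)) = 0: the disparity at a point and the disparity at its match must be equal in magnitude and opposite in sign.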
Regarding segmentation, we also have the constraint on f that
for all p: f(p) = k with k > 0 ⇒ f(m[k](p)) = k (2.2)
Ideally, these consistency constraints should be satisfied exactly, but for computa-
tional purposes, we merely attempt to maximize consistency.
2.3.2 Smoothness
Continuity dictates that a recovered disparity map should be piecewise smooth,
consisting of smooth surface patches separated by cleanly defined, smooth boundaries.
Thus, in trying to estimate the best reconstruction, we would like to maximize the
“smoothness,” both of the surface shapes defined by d[k], and of the boundaries
defined by f .
Because the disparity maps d[k] are continuous-valued functions, they are
amenable to the usual meaning of smoothness. We take smoothness of d[k] to mean
differentiability, with the magnitude of higher derivatives being relatively small.
Because the segmentation function f can only take on the integer values 0, ..., N,
it is piecewise constant, with line-like boundaries separating those pieces. We take
smoothness of f to mean simplicity of these boundaries, with the total boundary
length being relatively small.
2.3.3 Non-triviality
Good solutions should conform to, and explain, rather than ignore, the input
data as much as possible. For example, any two input images could be interpreted as
views of two painted, planar surfaces, each presented to one camera. Such a trivial
interpretation, yielding no correspondence for any image location, would be valid but
undesirable. In general, we expect that a correspondence exists for “most” image
locations; i.e., we expect that the segmentation function f is “mostly” non-zero:
for most p: f(p) > 0
Moreover, although color constancy is sometimes violated (e.g., due to specular-
ities), and smoothness and consistency are needed to fill in the gap, a solution that
supposes a perfectly smooth and consistent surface, at the expense of violating color
constancy everywhere, is also not desirable. In other words, we expect that color
constancy holds for “most” image locations:
for most p where f(p) > 0: I(m[f(p)](p)) = I(p)
Intuitively, using the language of differential equations, consistency and
smoothness provide the homogeneous terms that result in the general solution,
while non-triviality provides the non-homogeneous terms that result in the particular
solution.
                 disparity maps   segmentation
non-triviality   E_match_I        E_unassigned
smoothness       E_smooth_d       E_smooth_f
consistency      E_match_d        E_match_f
Table 2.1: Contributions to energy.
2.4 Energy Minimization
Now that we have defined the form of a solution, and stated its desired properties,
how do we find the best solution?
We formalize the stereo problem in the framework of energy minimization. In
general, energy minimization approaches split a problem into two parts: defining
the cost, or energy, of all hypothesized solutions [31], and finding the best solution
by minimizing that energy. This separation is advantageous because it facilitates the
use of general-purpose minimization techniques, enabling more focus upon the unique
aspects of the specific application.
For our application, we formulate six energy terms, corresponding to each of the
three desired properties, applied to both surface interiors and surface boundaries (see
Table 2.1). These terms are developed in the next two chapters; total energy is the
sum of these terms.
Chapter 3
Surface Fitting
In this chapter, we consider a restricted subproblem. Rather than simultaneously
estimating both the 3D shape (given by d[k]) and the 2D support (given by f) of each
surface, we consider the problem of estimating 3D shape, given 2D support. That
is, supposing that the segmentation f is known, how can the disparity maps d[k] be
found? Using this context, we explain our model of smooth surfaces; formulate the
three energy terms that encourage surface non-triviality, smoothness, and consistency;
and discuss the minimization of these energy terms.
3.1 Defining Surface Smoothness
Fitting a smooth surface to sparse and/or noisy data is a classic mathematical
problem. Sometimes there are obvious gaps in a data set, and one would like to fill
them in; other times, the data set is complete but is a mixture of signal and noise,
and one would like to extract the signal. In either of these cases, the key to
determining the solution is the exact specification of the smoothness that one expects
to find. What are some ways to define, and subsequently impose, surface smoothness?
Perhaps the simplest way to impose surface smoothness is to decree that the
surface belong to some pre-defined class of known “smooth” surfaces, for example
planar or quadric surfaces. This approach is extremely efficient: because qualified
surfaces can be described by a small number of parameters (e.g., horizontal tilt,
vertical tilt, and perpendicular distance for planar surfaces), there are only a few
degrees of freedom, and it is relatively easy to find the optimal surface. Moreover,
this approach can also be quite robust and tolerant of perturbations, for the same
reason. Thus, when they adequately approximate true scene geometry, and efficiency
is of concern, global parametric models can be a good choice (e.g., [3, 11]).
Unfortunately, true scene geometry often contains local details that cannot be
represented by a global parametric model. Low-order parametric models simply lack
sufficient degrees of freedom, and generally smooth over all but the coarsest details.
High-order global parametric models are less well-behaved, and can produce recon-
structions with large-amplitude ringing. Thus, when the reconstruction of accurate
details is required, global parametric models are generally unsuitable.
At the other end of the spectrum are regularized, pixel-level lattices, where the
complete disparity map is specified not by a few global parameters, but instead by
the individual disparity values at every pixel. Each of these individual disparities is
a separate degree of freedom, so fine detail can be represented faithfully, but some
degree of interdependence between them is imposed, to encourage solutions that are
“regular,” or smooth, in some sense. This is generally accomplished by defining a
quantitative measure of departure from smoothness, and “weakly” minimizing it while
simultaneously achieving the primary goal (typically color constancy). By varying the
type and the strength of the regularization, one can usually balance the competing
requirements for accurate details and overall smoothness.
However, a pixel-level lattice alone is obviously discrete, specifying disparity values
only at its grid points, while the underlying surface has a value at every point. In
order to determine the disparity at subpixel positions, one must therefore interpolate
between pixel positions. Moreover, the method of interpolation should be smooth
enough not to interfere with the chosen form of regularization; for example, one would
not use nearest-neighbor interpolation while minimizing quadratic variation. On the
other hand, smoothly interpolating between pixels amounts to fitting a spline surface
to the grid of data points, with one “control point” per pixel, and with regularization
acting upon the control points.
We generalize this idea of a regularized spline by allowing the spacing of the
control point grid to vary; for example, there could be one spline control point at
every n-th pixel in each direction. This model subsumes both global parametric
models and pixel-level lattices: the former correspond to splines where one patch
covers the entire image, and the latter correspond to splines where there is a control
point at every pixel.
3.2 Surfaces as 2D Splines
We model the disparity map of each surface as a bicubic B-spline. This gives us
the flexibility to represent a wide range of slanted or curved surfaces with subpixel
disparity precision, while ensuring that disparity values and gradients vary smoothly
over the surface. In addition, splines define analytic expressions for disparity values
and gradients everywhere, and in particular at subpixel positions, as required by our
abstract mathematical model.
The control points of the bicubic B-spline are placed on a regular grid with fixed
image coordinates (but variable disparity value). The resulting spline surface can be
thought of as a linear combination of shifted basis functions, with shifts constrained
to the grid.
Mathematically, we restrict each d[k] to take the form of a bicubic spline with
control points on a fairly coarse, uniform rectangular grid:
\[ d[k](x, y, t) = \sum_{i,j} D[k][i, j, t] \cdot b(x - in,\; y - jn) \tag{3.1} \]
where b is the bicubic basis function, D is the lattice of control points, and n is the
spacing thereof.
In general, the spacing of the grid of spline control points should be fine enough
so that surface shape details can be recovered, but not much finer than that so that
computational costs are not inflated unnecessarily. In our experiments, for each view
(left and right) of each hypothesized surface, we use a fixed, 5×5 grid of control points,
giving a 2×2 grid of spline patches that divide the image into equal quadrants.
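As an illustration of Equation (3.1), the following minimal Python sketch evaluates a spline disparity map for one view of one surface. It assumes that b is the tensor product of the standard uniform cubic B-spline basis, scaled by the spacing n, and that the 5×5 control points carry indices -1 to 3 so that the image spans [0, 2n] in each direction; neither detail is stated in the text, and the function names are hypothetical.

```python
def cubic_bspline(u):
    """Uniform cubic B-spline basis function, supported on |u| < 2."""
    u = abs(u)
    if u < 1.0:
        return 2.0 / 3.0 - u * u + 0.5 * u ** 3
    if u < 2.0:
        return (2.0 - u) ** 3 / 6.0
    return 0.0


def spline_disparity(D, x, y, n):
    """Evaluate Equation (3.1) at (x, y) for one view of one surface.

    D maps control-point indices (i, j) to disparity values, and n is the
    control-point spacing in pixels. The basis b is ASSUMED to be the
    tensor product cubic_bspline(x / n) * cubic_bspline(y / n).
    """
    total = 0.0
    for (i, j), value in D.items():
        total += value * cubic_bspline((x - i * n) / n) * cubic_bspline((y - j * n) / n)
    return total


# A 5x5 control grid (indices -1..3, an assumed layout) gives 2x2 patches
# covering the square [0, 2n] x [0, 2n].
n = 100.0
indices = [(i, j) for i in range(-1, 4) for j in range(-1, 4)]
flat = {(i, j): 7.0 for (i, j) in indices}                     # fronto-parallel plane
slanted = {(i, j): 0.02 * (i * n) + 3.0 for (i, j) in indices}  # plane slanted in x

print(spline_disparity(flat, 37.5, 150.0, n))
print(spline_disparity(slanted, 37.5, 150.0, n))
```

Because cubic B-splines reproduce constant and linear functions, control values sampled from a plane yield exactly that plane, which is why both fronto-parallel and slanted planar surfaces fall within this model.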
3.3 Surface Non-triviality
This energy term, often called the “data term” in other literature, expresses
the basic assumption of color constancy, penalizing any deviation therefrom. There
are many possible ways to quantify deviation from color constancy; commonly used
measures include the absolute difference, squared difference, truncated quadratic, and
other robust norms [14, 34, 39]. For simplicity with differentiability, we use a scaled
sum of squared differences:
\[ E_{\mathrm{match\_I}} = \sum_p \begin{cases} g\bigl(I(m[k](p)) - I(p);\; A(p)\bigr) & \text{if } f(p) = k \text{ with } k > 0, \\ 0 & \text{otherwise,} \end{cases} \tag{3.2} \]
where g(v; A) = v^T · A · v, and where A(p) is a space-variant measure of certainty
defined in Appendix A. Qualitatively, A(p) normalizes for the local contrast of I
around p.
Note that, ideally, E_match_I would be defined as an integral over all p ∈ I.
However, for computational convenience, we approximate the integral with a finite
sum over discrete pixel positions, for this and other energy terms. This is a reasonable
approximation if the summand is spatially smooth.
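The finite-sum approximation of Equation (3.2) can be sketched for a single grayscale scanline as follows. This is only an illustration under simplifying assumptions that are not part of the text: intensities are scalars, a scalar weight stands in for the matrix A(p) of Appendix A, and the other image is sampled at subpixel positions by linear interpolation with clamping at the borders. All names are hypothetical.

```python
import math


def sample(row, x):
    """Linearly interpolate a scanline at subpixel position x (clamped to the row)."""
    x = min(max(x, 0.0), len(row) - 1.0)
    i = min(int(math.floor(x)), len(row) - 2)
    a = x - i
    return (1.0 - a) * row[i] + a * row[i + 1]


def match_energy(src, dst, d, f, k, weight):
    """Finite-sum form of E_match_I for one scanline of one view.

    src, dst: intensities of this view and of the other view;
    d: disparity of surface k at each pixel of src;
    f: label of each pixel; weight: scalar stand-in for A(p).
    """
    total = 0.0
    for x in range(len(src)):
        if f[x] == k:  # unassigned pixels and other surfaces contribute 0
            diff = sample(dst, x + d[x]) - src[x]
            total += weight[x] * diff * diff
    return total


left = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
right = [15.0, 25.0, 35.0, 45.0, 55.0, 65.0]  # the left ramp shifted by 1.5 pixels
labels = [0, 0, 1, 1, 1, 1]                   # first two pixels unassigned
ones = [1.0] * 6

exact = match_energy(left, right, [-1.5] * 6, labels, 1, ones)
off_by_half = match_energy(left, right, [-1.0] * 6, labels, 1, ones)
print(exact, off_by_half)
```

The correct subpixel disparity of -1.5 gives zero energy, whereas the nearest-integer-plus-half-pixel error of -1.0 is penalized at every assigned pixel.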
3.4 Surface Smoothness
Since we expect smooth surfaces to be more likely to occur, we would like to
quantify and penalize any deviation from smoothness. Although our spline model
already ensures some degree of surface smoothness, this inherent smoothness is limited
to a spatial scale not much larger than that of the spline control point grid. On the
other hand, we would like the option to encourage additional smoothness on a more
global scale; hence we impose an additional energy term.
We take the class of perfectly smooth surfaces to be the set of planar surfaces
(including both fronto-parallel and slanted planes). The usual measures of deviation
from planarity are the squared Laplacian:
\[ E = (u_{xx} + u_{yy})^2 \]
and the quadratic variation, or thin plate spline bending energy [16, 33, 77]:
\[ E = u_{xx}^2 + 2u_{xy}^2 + u_{yy}^2 \]
However, these measures have the disadvantage of using second derivatives, which
can be susceptible to noise, and which tend to place too much emphasis on high-
frequency, local deviations, relative to low-frequency, global deviations. Instead, in
addition to restricting d[k] to take the form of a spline, we add an energy term which,
loosely speaking, is proportional to the global “variance” of the surface slope:
\[ E_{\mathrm{smooth\_d}}[k] = \lambda_{\mathrm{smooth\_d}} \cdot \sum_p \bigl\| \nabla d[k](p) - \operatorname{mean}(\nabla d[k]) \bigr\|^2 \]
where the summation and the mean are both taken over all discrete pixel positions p,
independent of the segmentation f . This energy term attempts to quantify deviations
from global planarity.
In our experiments, this energy term is given a very small weight, and mainly
serves to accelerate the convergence of numerical optimization by shrinking the
nullspace of the total energy function. This term does not prevent surfaces from
being non-planar.
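To make the "variance of the surface slope" concrete, here is a minimal Python sketch of this energy term on a sampled disparity map. It approximates the gradient with forward differences, which is an assumption made for brevity; with the spline model, the gradient would instead be available analytically. The function name is hypothetical.

```python
def slope_variance_energy(d, lam):
    """Sum of squared deviations of the gradient from its mean, times lam.

    d is a 2D list d[y][x] of disparities for one view of one surface.
    Gradients are forward differences (an assumption of this sketch).
    """
    grads = []
    for y in range(len(d) - 1):
        for x in range(len(d[0]) - 1):
            grads.append((d[y][x + 1] - d[y][x], d[y + 1][x] - d[y][x]))
    mean_x = sum(gx for gx, _ in grads) / len(grads)
    mean_y = sum(gy for _, gy in grads) / len(grads)
    return lam * sum((gx - mean_x) ** 2 + (gy - mean_y) ** 2 for gx, gy in grads)


# A slanted plane has constant slope, so its energy is zero;
# a curved surface has varying slope, so its energy is positive.
plane = [[0.5 * x - 0.25 * y + 3.0 for x in range(6)] for y in range(5)]
bowl = [[0.1 * (x - 2.5) ** 2 for x in range(6)] for y in range(5)]

print(slope_variance_energy(plane, 1.0))
print(slope_variance_energy(bowl, 1.0))
```

Note that the slanted plane is not penalized at all, in contrast to a first-derivative penalty, which would favor fronto-parallel surfaces.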
3.5 Surface Consistency
For perfect consistency, a surface should have left and right views that coincide
exactly with one another, as specified in Equation (2.1). In some prior work, this
constraint has been enforced directly through specified update equations, without
an explicit energy term being given [51]. However, with left and right views simul-
taneously constrained each to have the form of Equation (3.1), exact coincidence is
generally no longer possible. That is, a surface shape that conforms to Equation (3.1)
in one view, will no longer necessarily conform to Equation (3.1) when warped to the
other view. Therefore, to allow but discourage any non-coincidence, we propose the
energy term
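\[ E_{\mathrm{match\_d}}[k] = \lambda_{\mathrm{match\_d}} \cdot \sum_p \bigl( m[k](m[k](p)) - p \bigr)^2 \]
or equivalently,
\[ E_{\mathrm{match\_d}}[k] = \lambda_{\mathrm{match\_d}} \cdot \sum_p \bigl( d[k](p) + d[k](m[k](p)) \bigr)^2 \]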
which, intuitively, measures the distance between the surfaces defined by the left
and right views. Again, the summation is taken over all discrete pixel positions p,
independent of the segmentation f .
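The second form of this energy can be sketched for one pair of corresponding scanlines as follows. The sketch assumes that the disparity of the other view is sampled by linear interpolation with clamping at the image border; in the spline model it would be evaluated analytically at the subpixel position. The function names are hypothetical.

```python
import math


def sample(row, x):
    """Linearly interpolate a scanline at subpixel position x (clamped to the row)."""
    x = min(max(x, 0.0), len(row) - 1.0)
    i = min(int(math.floor(x)), len(row) - 2)
    a = x - i
    return (1.0 - a) * row[i] + a * row[i + 1]


def consistency_energy(d_left, d_right, lam):
    """Sum of (d(p) + d(m(p)))^2 over the pixels of both views, times lam."""
    total = 0.0
    for d_here, d_there in ((d_left, d_right), (d_right, d_left)):
        for x, d in enumerate(d_here):
            # m(p) lies at x + d in the other view; a consistent surface
            # has the opposite disparity there.
            residual = d + sample(d_there, x + d)
            total += residual * residual
    return lam * total


consistent = consistency_energy([-2.0] * 8, [2.0] * 8, lam=1.0)
mismatched = consistency_energy([-2.0] * 8, [2.5] * 8, lam=1.0)
print(consistent, mismatched)
```

In the mismatched case the two views disagree by half a pixel everywhere, so each of the 16 pixels contributes 0.25.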
3.6 Surface Optimization
Given a particular k, this chapter’s subproblem is to minimize (or reduce) total
energy by varying d[k], while holding f and the remaining d[j] constant. Total energy
is a sum of six terms; in this chapter, three of them were shown to depend smoothly
on d[k]. In Chapter 4, the two terms E_unassigned and E_smooth_f are shown to
depend only on f, and thus can be considered constant for the present subproblem.
In Chapter 5, the remaining term E_match_f is shown to depend smoothly on d[k].
Therefore, the total energy as a function of d[k] is differentiable, and can be minimized
with standard gradient-based numerical methods.
For convenience, we use Matlab’s optimization toolbox. The specific algorithm
chosen is a trust region method with a 2D quadratic subproblem. This is a greedy
descent method; at each step, it minimizes a quadratic model within a 2D trust
region spanned by the gradient and the Newton direction. Experimentally, this algo-
rithm exhibited more reliable convergence than the quasi-Newton methods with line
searches, and although it requires the calculation of the Hessian, in our implemen-
tation, that expense is relatively small compared to the total computational require-
ments of solving the stereo problem.
In this chapter, we have shown how to minimize total energy by varying each d[k]
individually. As long as f remains fixed, there is nothing to be gained by varying all
d[k] simultaneously. For each k > 0, we call minimizing over d[k] a surface-fitting step,
and consider it to be a building block towards a complete algorithm for minimizing
total energy.
Chapter 4
Segmentation
In this chapter, we consider a restricted subproblem. Rather than simultaneously
estimating both the 3D shape (given by d[k]) and the 2D support (given by f) of
each surface, we consider the problem of estimating 2D support, given 3D shape.
That is, supposing that the disparity maps d[k] are known for all surfaces, how can
the segmentation f be found? Using this context, we explain our model of segmen-
tation; formulate the three energy terms that encourage segmentation non-triviality,
smoothness, and consistency; and discuss the minimization of these energy terms.
4.1 Segmentation by Graph Cuts
Boykov, Veksler, and Zabih [21] showed that certain labeling problems can be
formulated as energy minimization problems and solved efficiently by repeatedly using
maximum flow techniques to find minimum-cost cuts of associated network graphs.
Generally speaking, such problems seek to assign one of a given set of labels to each
of a given set of items, simultaneously optimizing for not only items’ individual pref-
erences for each label, but also interactions between pairs of items, where interacting
items additionally prefer to be assigned similar labels, for some measure of similarity.
Formally, let L be a finite set of labels, P be a finite set of items, and N ⊆ P × P
be the set of interacting pairs of items. The methods of [21] find a labeling f that
assigns exactly one label f_p ∈ L to each item p ∈ P, subject to the constraint that
an energy function of the form
\[ E(f) = \sum_{(p,q) \in N} V_{p,q}(f_p, f_q) + \sum_{p \in P} D_p(f_p) \tag{4.1} \]
be minimized. Individual energies D_p should be nonnegative but can otherwise be
arbitrary; interaction energies V_{p,q} should be either semi-metric or metric, where V is
a semi-metric if it is symmetric, nonnegative, and zero only for identical labels, and where V
is a metric if it additionally satisfies the triangle inequality. Kolmogorov and Zabih
[48] generalize these results, deriving necessary and sufficient conditions on the form
of the energy E in order for it to be minimizable with graph cut methods.
Given an energy function in the form of Equation (4.1) satisfying the relevant
conditions, the methods of [21] are extremely effective at finding a minimizing
labeling, both in terms of computational complexity (which does grow with the sizes
of both L and P) and in terms of the optimality of the final solution. For this reason,
we have chosen to use these methods to solve our segmentation subproblem.
This generic formulation of an energy-minimizing labeling problem maps to our
formulation of the stereo problem as follows: the labels are the integers 0, ..., N that
are the possible values of the segmentation function f , and the items are the pixels
of each input image. This is in contrast to [21], in which the items are the pixels
of a single reference image, and to [47], in which the items are pairs of potentially
corresponding pixels. In our formulation, the individual preferences stem from testing
color constancy at varying disparities, and the interactions stem from the expectations
of smoothness and consistency.
Although graph cut methods could conceivably be used to estimate a layered
model with transparency, in such a case, each image point could receive contributions
from any subset of the set of all layers, making the number of possible labels expo-
nential in the number of layers. Unfortunately, this would make computational costs
prohibitively high for any significant number of layers, so for feasibility, our algorithm
assumes that all objects are restricted piecewise to be either completely opaque or
completely invisible.
Similarly, as implied by our principle of least discretization, ideally we would be
able to model the location of surface boundaries to arbitrary precision. However,
because graph cut methods treat each item as being indivisible, any increase in
precision would require corresponding increases in the number of items and hence
in computational complexity. In fact, our algorithm should require only minor modi-
fications in order to estimate boundary locations with any given (but fixed) subpixel
precision. However, the usefulness of high precision is questionable, since accuracy
would nonetheless be limited by the pixelization of the input images. Because of both
this and the computational costs, in our algorithm, pixels are prohibited from being
split spatially among several surfaces, but instead are constrained to be indivisible,
forcing surface boundaries to lie on pixel boundaries.
In representing the continuous-domain segmentation function f with a finite
number of unknowns on a discrete grid of pixels, we essentially perform nearest-
neighbor interpolation:
f(x, y, t) = F (round(x), round(y), t)
where F is defined on an integer lattice.
We now further explain the “preferences” of individual pixels and the nature of
pairwise interactions, and define the corresponding energy terms to be minimized.
4.2 Segmentation Non-triviality
The primary goal of the segmentation subproblem is to assign each pixel to the
surface it fits best. This is accomplished by minimizing the deviation from color
constancy,
\[ E_{\mathrm{match\_I}} = \sum_p \begin{cases} g\bigl(I(m[k](p)) - I(p);\; A(p)\bigr) & \text{if } f(p) = k \text{ with } k > 0, \\ 0 & \text{otherwise.} \end{cases} \]
This is identical to Equation (3.2), only now, we consider it as a functional of f with
m[k] being constant, instead of vice versa.
Note that we allow f(p) = 0 for some p, to model occluded pixels which should
remain unmatched and thus unassigned to any surface. However, the astute reader
will notice that since g(·) is nonnegative, E_match_I is trivially minimized by f(p) ≡ 0.
To discourage solutions with a large number of unassigned pixels, we add a fixed
penalty for each unassigned pixel in order to try to minimize the total area of unas-
signed regions:
\[ E_{\mathrm{unassigned}} = \sum_p \begin{cases} \lambda_{\mathrm{unassigned}} & \text{if } f(p) = 0, \\ 0 & \text{otherwise,} \end{cases} \]
where λ_unassigned is a constant. While it is not uncommon among stereo algorithms
to have an occlusion penalty such as this one, it should be noted that this term is
not solely for handling occlusions; for example, it also limits the influence of gross
outliers in the input image data.
Thus, the underlying segmentation problem, for the moment ignoring smoothness
and consistency, is to find the labeling f that minimizes the sum E_match_I +
E_unassigned. Put into the form of Equation (4.1), this corresponds to the following
definition of individual pixel preferences:
D_p(k) = g(I(m[k](p)) − I(p); A(p)) for k > 0,
D_p(0) = λ_unassigned.
4.3 Segmentation Smoothness
In addition to minimizing pointwise costs, we would also like to encourage a
simple segmentation with “smooth” boundaries of surface extents. There are several
attributes which can be used to formalize this notion, including boundary length and
curvature [16, 53]. We choose to minimize boundary length without separate regard
for boundary curvature, because it is simpler to optimize, and works fairly well in
practice.
In addition to this a priori expectation of simple boundaries, there is also an expec-
tation that boundaries will generally be correlated with monocular image features
(called “static cues” in the terminology of [21]). That is, when viewed in a two-
dimensional image, the boundary of a surface will statistically look more like an edge
than average. Thus, we would like to reward the placement of boundaries at edge-like
image locations. Again, there are many ways to estimate edge likelihood, ranging
from thresholded gradient magnitudes to color distribution distances [61]; we use a
function of gradients and local contrast. This measure of edge likelihood at each point
is then used to adjust the cost per unit length of boundaries passing through that
point.
There is one more issue to consider: which boundaries do we want to minimize?
Intuitively, minimizing the length of the boundary of any particular region will tend
to shorten or remove any protrusions or indentations that are long and thin. This
makes sense for regions that correspond to surfaces, because such narrow structures
are generally less common than wider ones. However, occlusion regions are very
typically long and thin; minimizing their boundary length would greatly hinder their
accurate recovery.
To encourage a simple segmentation, we therefore would like to minimize the
total length of the boundaries of each surface, with adjustments made to consider
monocular cues. We define this energy term for each surface k > 0 accordingly:
\[ E_{\mathrm{smooth\_f}}[k] = \sum_{p \text{ adjacent to } q} \begin{cases} w_s(p, q) & \text{if } f(p) = k \ \mathrm{xor}\ f(q) = k, \\ 0 & \text{otherwise,} \end{cases} \]
where adjacency is according to 4-connectedness within each image.
The value of the weighting function w_s(p, q) should decrease as monocular cues
become more indicative of an edge, but the minimum and maximum values of w_s(p, q)
should not be too far apart, since monocular cues should not override the goal of
minimizing boundary length. We define the weighting function as follows:
\[ w_s(p, q) = \lambda_{\mathrm{smooth\_f}} \cdot \left( 1 + e^{-\left( (\nabla I)^T \cdot A \cdot (\nabla I) \right) / \tau} \right) \]
for p adjacent to q, where λ_smooth_f and τ are constants, and ∇I and A are both
evaluated at the subpixel position (p + q)/2.
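The behavior of this weighting function can be illustrated with a small Python sketch. It assumes that the exponent is the quadratic form of the image gradient with the matrix A, treats the gradient as a 2-vector of a grayscale image, and uses an identity matrix as a stand-in for A; the constants are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def quadratic_form(v, A):
    """Compute v^T A v for a vector v and a square matrix A."""
    return sum(v[r] * A[r][c] * v[c] for r in range(len(v)) for c in range(len(v)))


def boundary_weight(grad, A, lam, tau):
    """Cost per unit boundary length: lam * (1 + exp(-(grad^T A grad) / tau))."""
    return lam * (1.0 + math.exp(-quadratic_form(grad, A) / tau))


identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for A; the real A is defined in Appendix A
flat_region = boundary_weight((0.0, 0.0), identity, lam=3.0, tau=4.0)
weak_edge = boundary_weight((2.0, 0.0), identity, lam=3.0, tau=4.0)
strong_edge = boundary_weight((20.0, 0.0), identity, lam=3.0, tau=4.0)
print(flat_region, weak_edge, strong_edge)
```

Under these assumptions the weight falls monotonically from 2·λ_smooth_f in featureless regions toward λ_smooth_f at strong edges, so the maximum and minimum never differ by more than a factor of two.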
Put into the form of Equation (4.1), E_smooth_f[k] corresponds to this penalty
function for intra-image, pairwise interactions:
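\[ V_{p,q}(f_p, f_q) = w_s(p, q) \cdot \sum_{k>0} T(f_p = k \ \mathrm{xor}\ f_q = k) = w_s(p, q) \cdot \begin{cases} 0 & \text{if } f_p = f_q, \\ 1 & \text{if } f_p \ne f_q \text{ with } f_p = 0 \text{ or } f_q = 0, \\ 2 & \text{if } f_p \ne f_q \text{ with } f_p > 0 \text{ and } f_q > 0, \end{cases} \]
for p adjacent to q, where T(·) equals 1 if its argument is true, and equals 0 otherwise.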
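The two forms of this penalty, and the fact that it is a metric on the label set, can be checked by brute force for a small number of surfaces. The following Python sketch does so for the unweighted penalty (the factor w_s(p, q) is a positive constant for any given pair, so it does not affect the metric properties); the function names and the choice of four surfaces are illustrative only.

```python
from itertools import product


def boundary_count(fp, fq, n_surfaces):
    """Sum over k > 0 of T(fp = k xor fq = k)."""
    return sum((fp == k) != (fq == k) for k in range(1, n_surfaces + 1))


def closed_form(fp, fq):
    """The three-case form of the same penalty."""
    if fp == fq:
        return 0
    if fp == 0 or fq == 0:
        return 1
    return 2


def is_metric(V, labels):
    """Check symmetry, positivity for distinct labels, and the triangle inequality."""
    for a, b in product(labels, repeat=2):
        if V(a, b) != V(b, a) or V(a, b) < 0 or (V(a, b) == 0) != (a == b):
            return False
    return all(V(a, c) <= V(a, b) + V(b, c) for a, b, c in product(labels, repeat=3))


n_surfaces = 4
labels = list(range(n_surfaces + 1))  # 0 means "unassigned"
forms_agree = all(
    boundary_count(a, b, n_surfaces) == closed_form(a, b)
    for a, b in product(labels, repeat=2)
)
metric_ok = is_metric(closed_form, labels)
print(forms_agree, metric_ok)
```

The triangle inequality holds because a boundary between two distinct surfaces (cost 2) is never more expensive than passing through the unassigned label (cost 1 + 1).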
4.4 Segmentation Consistency
The energy terms we have so far given in this chapter together encourage color
constancy and continuity. However, two-way uniqueness has yet to be enforced.
For perfect consistency, the segmentation f should satisfy Equation (2.2). To
quantify and discourage any segmentation inconsistencies, we add an energy term for
each surface k > 0:
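\[ E_{\mathrm{match\_f}}[k] \approx \sum_p \begin{cases} \lambda_{\mathrm{match\_f}} & \text{if } f(p) = k \ \mathrm{xor}\ f(m[k](p)) = k, \\ 0 & \text{otherwise,} \end{cases} \]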
which approximates the area of inconsistent regions, where (2.2) does not hold. As
before, this term should ideally be defined with an integral, but in this case, a naive
finite sum is not an adequate substitute, as shall be explained in Chapter 5.
Put into the form of Equation (4.1), E_match_f[k] corresponds to this penalty
function for inter-image, pairwise interactions:
\[ V_{p,q}(f_p, f_q) = \sum_{k>0} w_c[k](p, q) \cdot T(f_p = k \ \mathrm{xor}\ f_q = k), \]
where, naively,
\[ w_c[k](p, q) \approx \lambda_{\mathrm{match\_f}} \cdot \bigl( T(m[k](p) = q) + T(m[k](q) = p) \bigr), \]
for p and q in corresponding scanlines.
4.5 Segmentation Optimization
This chapter’s subproblem is to minimize (or reduce) total energy by varying f ,
while holding all d[k] constant. Total energy is a sum of six terms, two of which
(E_smooth_d and E_match_d) are independent of f. In this chapter, the remaining
four terms are written in the form of Equation (4.1); moreover, the penalty functions
V_{p,q} can be verified to be metrics. Therefore, the total energy as a function of f can
be optimized with graph cut methods.
We use a modified version of the expansion algorithm of [21]. This greedy algo-
rithm is built from expansion moves, and gets its power from the generality of such
moves: an expansion move on a label k finds the best configuration reachable by rela-
beling any subset of pixels with k. Our modification is to precede each expansion with
a contraction of the same label, which strictly enlarges the set of reachable configura-
tions. We call such a contraction-expansion pair on any one label, a segmentation step,
and consider it to be a building block towards a complete algorithm for minimizing
total energy.
Chapter 5
Integration
In this chapter, we consider the general problem of simultaneously determining
surface shape in the form of disparity maps, and surface support in the form of
segmentation, when both are initially unknown. First, we consider the interaction
between surface fitting in a continuous domain and segmentation in a discrete domain,
and discuss the integration of the two into a single, well-behaved energy function.
Then, we describe a complete algorithm for minimizing the total energy.
5.1 Segmentation Consistency, Revisited
Let us take another look at the issue of segmentation consistency, and how it
causes an interdependence between the disparity maps d[k] and the segmentation f .
Consider one pair of corresponding scanlines l and r, and one particular surface
k. How are the labelings f(l) and f(r) and the disparity maps d[k](l) and d[k](r) of
these scanlines related to one another?
For simplicity, suppose that
(x ≤ x_l) ≡ (f(x, y_0, ‘left’) = k),
(x ≤ x_r) ≡ (f(x, y_0, ‘right’) = k);
that is, the boundary of surface k intersects the corresponding scanlines once each,
with the same orientation. Let us call x_r − x_l the “boundary disparity” d_b: it is the
disparity between the left and right labeling boundaries. Let us call d[k](x_l, y_0, ‘left’)
the “surface disparity” d_s at the boundary: it is the disparity given by the surface
disparity map. Then it follows from Equation (2.2) that
x_l − x_r = d[k](x_r, y_0, ‘right’),
d_b = x_r − x_l = d[k](x_l, y_0, ‘left’) = d_s.
In other words, the segmentation consistency constraint requires that the boundary
disparity and surface disparity be equal.
To achieve this equality in an energy minimization framework, let us define
E_0 = h_0(d_b − d_s) where h_0(Δd) = |Δd|.
Then, if arbitrary disparities are allowed, exact agreement between boundary
disparity and surface disparity follows from minimizing this energy. This is in
fact a special case of minimizing the area of the inconsistent region as described
in Section 4.4.
However, because of pixel-wise segmentation, arbitrary boundary disparities are
in fact not allowed; only integer boundary disparities are possible. If exact agreement
were still enforced, this would imply that surface disparities at the boundary would
also be restricted to be integral. Such a restriction would be undesirable, because
one should not expect the position of objects in the world to be correlated with an
arbitrary discretization of images into pixels.
Instead of exact agreement, then, only nearest-integer agreement between
boundary disparity and surface disparity should be encouraged. That is, since for
any surface disparity d_s, the closest possible boundary disparity is d_b = round(d_s),
it follows that any surface disparity within ±1/2 pixel of a given boundary disparity
should be considered equally good. This can be accomplished with a modified energy
function:
E = h(d_b − d_s) where h(Δd) = max(1/2, |Δd|).
This energy for an isolated boundary is in turn generalized for arbitrary segmen-
tations as
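\[ E_{\mathrm{match\_f}}[k] = \sum_{p,q} \begin{cases} \lambda_{\mathrm{match\_f}} \cdot h\bigl(|m[k](p) - q|\bigr) & \text{if } f(p) = k \ \mathrm{xor}\ f(q) = k, \\ 0 & \text{otherwise,} \end{cases} \]
where p and q are on conjugate epipolar lines, and where
\[ h(\Delta d) = \begin{cases} \tfrac{1}{2} & \text{for } |\Delta d| \le \tfrac{1}{2}, \\ \tfrac{3}{4} - \tfrac{|\Delta d|}{2} & \text{for } \tfrac{1}{2} < |\Delta d| < \tfrac{3}{2}, \\ 0 & \text{for } |\Delta d| \ge \tfrac{3}{2}. \end{cases} \]
Our implementation modifies h by rounding its “corners” (at |Δd| = 1/2 and |Δd| = 3/2)
so that total energy remains differentiable with respect to d[k].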
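For reference, the piecewise weighting function h can be written directly in Python. This sketch implements only the sharp-cornered form given above; the corner rounding used in the implementation is not specified here and is therefore not reproduced. The function name is hypothetical.

```python
def pair_weight(delta):
    """Piecewise-linear h: plateau of 1/2, linear ramps, zero beyond 3/2."""
    a = abs(delta)
    if a <= 0.5:
        return 0.5
    if a < 1.5:
        return 0.75 - 0.5 * a
    return 0.0


# The plateau makes all offsets within half a pixel equally good,
# and the ramps join the plateau and the zero region continuously.
samples = [pair_weight(x) for x in (0.0, 0.5, 1.0, 1.5, -1.0, 2.0)]
print(samples)
```

The function is continuous at both corners, since 3/4 − 1/4 = 1/2 at |Δd| = 1/2 and 3/4 − 3/4 = 0 at |Δd| = 3/2, so only the derivative needs smoothing.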
Put into the form of Equation (4.1), this new E_match_f[k] again corresponds to
\[ V_{p,q}(f_p, f_q) = \sum_{k>0} w_c[k](p, q) \cdot T(f_p = k \ \mathrm{xor}\ f_q = k) \]
as specified in Section 4.4, but now with
\[ w_c[k](p, q) = \lambda_{\mathrm{match\_f}} \cdot \bigl( h(m[k](p) - q) + h(m[k](q) - p) \bigr) \]
again for p and q in corresponding scanlines.
5.2 Overall Optimization
We have now defined each of the six energy terms in Table 2.1; total energy is
the sum of these terms. We have also defined surface-fitting steps and segmentation
steps, which are building blocks for the minimization of total energy. How are these
building blocks put together to form a complete algorithm for overall optimization?
We list the overall algorithm in Table 5.1, and explain its details in the remainder
of this section.
1. Initialize hypothesis with fronto-parallel surfaces at integer disparity; set f ≡ 0.
2. Repeat:
   (a) Alternately apply segmentation and surface-fitting steps until progress becomes negligible.
   (b) For each hypothesized surface:
       • Attempt to merge it.
       until either some merge succeeds or all merges fail.
   until all merges fail.
3. Optionally post-process to “fill in” unmatched regions.
Table 5.1: Our overall optimization algorithm.
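The nested loop structure of step 2 in Table 5.1 can be sketched in Python as follows. This shows control flow only: initialization (step 1) and post-processing (step 3) are omitted, and every callable, the dictionary-based state, and the tolerance are hypothetical stand-ins rather than the actual building blocks.

```python
def optimize(state, energy, segmentation_step, surface_fitting_step, try_merge, tol=1e-6):
    """Control-flow sketch of step 2; all callables are hypothetical stand-ins."""
    while True:
        # Step 2(a): alternate the two building blocks until energy stops dropping.
        previous = energy(state)
        while True:
            state = surface_fitting_step(segmentation_step(state))
            current = energy(state)
            if previous - current <= tol:
                break
            previous = current
        # Step 2(b): try to merge each surface; go back to 2(a) after the first success.
        for k in list(state["surfaces"]):
            merged = try_merge(state, k)
            if merged is not None:
                state = merged
                break
        else:
            return state  # all merges failed


# Toy stand-ins, purely to exercise the control flow.
def toy_energy(s):
    return len(s["surfaces"]) + s["slack"]


def toy_step(s):
    return {"surfaces": s["surfaces"], "slack": 0.5 * s["slack"]}


def toy_merge(s, k):
    # Pretend that only the highest-numbered surface can be merged away.
    if len(s["surfaces"]) > 1 and k == s["surfaces"][-1]:
        return {"surfaces": s["surfaces"][:-1], "slack": 1.0}
    return None


result = optimize({"surfaces": [1, 2, 3], "slack": 8.0},
                  toy_energy, toy_step, toy_step, toy_merge)
print(result["surfaces"])
```

In the toy run, each successful merge sends the procedure back to the alternating descent, and the loop ends only when a full pass over the remaining surfaces produces no merge.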
5.2.1 Iterative Descent
Total energy is a multi-variable functional of the unknowns {f, d[1], ..., d[N]}
(each of which is a function itself); each building block attempts to reduce total energy
by changing one of these unknowns. As commonly occurs in multi-variable optimization,
these unknowns are coupled. In particular, the segmentation f is coupled
with the disparity maps {d[1], ..., d[N]}, in that if total energy is sequentially optimized,
first with respect to f, then with respect to {d[1], ..., d[N]}, the end result
will generally not remain optimal with respect to f.
This coupling between disparity maps and segmentation is caused by the
requirement for segmentation consistency: as explained in the previous section, there
must be agreement between “surface disparity” as specified by d[k], and “boundary
disparity” as specified by f . To elaborate, let us again consider the previous section’s
example of a single boundary on a pair of scanlines. Suppose that, for this boundary,
ds = db = d0 for some d0. Then, in order to maintain consistency, adjusting either
d[k] or f alone must leave this disparity unchanged; only by simultaneously adjusting
both d[k] and f can this disparity be changed without violating consistency. Unfor-
tunately, neither surface-fitting steps nor segmentation steps allow such simultaneous
adjustment.
Because of this coupling, small violations of segmentation consistency must
temporarily be allowed as the solution evolves toward an optimum. This is accomplished
by ensuring that $\lambda_{\text{match\_f}}$ is not too large. In this case, however, segmentation
consistency nonetheless hinders evolution at boundaries, as surface disparities
and boundary disparities are constrained to remain close to one another while being
restricted to move only one at a time. Thus it is necessary to alternate iteratively
between surface fitting and segmentation, to allow boundaries to evolve satisfactorily.
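The effect of coupling on alternating descent can be illustrated with a toy analogy, which is not our stereo energy: two scalars x and y (standing in for surface and boundary disparity) are each pulled toward a common target, and a term lam·(x − y)² (standing in for the consistency weight) ties them together. The function name and all parameters below are assumptions made for illustration.

```python
def sweeps_to_converge(lam, target=1.0, tol=1e-6, max_sweeps=100000):
    """Alternately minimize (x-t)^2 + (y-t)^2 + lam*(x-y)^2 over x, then over y.

    Returns the number of x-then-y sweeps needed to bring both within tol of t.
    """
    x = y = 0.0
    for n in range(1, max_sweeps + 1):
        x = (target + lam * y) / (1.0 + lam)  # exact minimizer in x with y fixed
        y = (target + lam * x) / (1.0 + lam)  # exact minimizer in y with x fixed
        if abs(x - target) < tol and abs(y - target) < tol:
            return n
    return max_sweeps
```

Each half-step shrinks the remaining error by a factor of lam / (1 + lam), so a large coupling weight makes every individual move small and many alternations are needed. With lam = 0 the variables are uncoupled and a single sweep suffices.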
With this in mind, our general strategy for overall optimization is simple: given
an initial hypothesis, repeatedly try all possible building blocks for the reduction
of total energy, until none gives any further improvement. Using the surface-fitting
and segmentation steps defined in Chapters 3 and 4 as our only building blocks, this
works well in practice as long as the initial hypothesis is somewhat close to the final
solution, with the surfaces of each roughly in one-to-one correspondence. Otherwise,
if the initial hypothesis does not match reality, several scenarios can occur.
5.2.2 Merging Surfaces
If the initial hypothesis contains enough distinct surfaces, those initial surfaces
will generally select among themselves in the process of “competing” for represen-
tation of the actual surfaces, with the stronger ones (with better initial fit) naturally
“pushing away” the weaker ones (with poorer initial fit). It is possible that this situ-
ation will end with one hypothesized surface on each actual surface, with the “extra”
hypothesized surfaces having been naturally driven to extinction (i.e., their support
in image space becomes the empty set); this outcome yields the correct solution. It
is also possible that two or more hypothesized surfaces will end in a deadlock over a
single actual surface. This corresponds to a local minimum of the energy functional.
Because of this possibility for deadlock, we consider another building block, called
a merge step. A merge step must first be preceded by the saving of a checkpoint
of the current state. A merge step then begins with the forceful extinction of a
selected, hypothesized surface. This involves relabeling all of that surface’s supporting
pixels with f = 0, and removing its disparity map from all subsequent calculations.
At this point, the total energy will most likely have increased drastically, due to
$E_{\text{unassigned}}$. The “orphaned” pixels are then immediately redistributed among the
remaining surfaces by a series of segmentation steps. Further surface-fitting and
segmentation steps are subsequently taken, until either the total energy falls below
that of the saved checkpoint, in which case the merge succeeds and the checkpoint is
discarded, or the total energy plateaus above that of the checkpoint, in which case
the merge fails and the checkpoint is restored.
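The checkpoint-and-restore logic of a merge step can be sketched as follows. This is an illustrative Python outline under stated assumptions: `merge_step`, `remove_surface`, and `descend` are hypothetical names, the state is any object that `copy.deepcopy` can duplicate, and `descend` is assumed to run segmentation and surface-fitting steps until total energy plateaus.

```python
import copy


def merge_step(state, k, energy, remove_surface, descend):
    """Speculatively extinguish surface k; keep the result only if energy drops.

    Returns (new_state, succeeded). The caller's state is never modified.
    """
    checkpoint = copy.deepcopy(state)             # save the current state
    baseline = energy(checkpoint)
    trial = remove_surface(copy.deepcopy(state), k)  # relabel k's pixels to f = 0
    trial = descend(trial)                        # redistribute orphans, then refine
    if energy(trial) < baseline:
        return trial, True                        # merge succeeds; checkpoint dropped
    return checkpoint, False                      # merge fails; checkpoint restored
```

Working on deep copies keeps the sketch simple at the cost of memory; an implementation with large disparity maps might instead save only the parts of the state that a merge can change.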
Merge steps are taken whenever no significant progress can be achieved through
surface-fitting and segmentation steps alone, in order to reduce the chances of getting
“trapped” in a poor local minimum. However, because merge steps generally require
a large amount of speculative computation, this is the only time they are taken.
If the initial hypothesis contains too few surfaces, another scenario can arise. If
one hypothesized surface comes to span several actual surfaces, and there are no extra
hypothesized surfaces to “take over” support, it is possible that none of our building
blocks will be able to remedy the situation. In such a scenario, our algorithm will
produce an incorrect answer.
5.2.3 Initialization
In light of this vulnerability, it would be desirable to choose an initial hypothesis
where no hypothesized surface spans more than one actual surface. This could be
achieved by a sufficiently dense “seeding” over the areas of the left and right images,
where the initial disparity of a seed can be determined by, say, windowed corre-
lation. Density is sufficient if every actual surface gets at least one seed; otherwise,
any unseeded surfaces would immediately “grab” a hypothesized surface, leading
straight to the described vulnerability. Unfortunately, if there are any small objects
in the scene, sufficient density by this standard would require a prohibitively large
number of surfaces.
1. Take result of main, energy-minimization algorithm.
2. Repeat a few times:
   (a) Find all pixels in violation of segmentation consistency. Reassign them to f = 0.
   (b) For all pixels where f = 0:
       i. Look at the labels of the nearest pixels to the left and right where f > 0.
       ii. Reassign to whichever label corresponds to the smaller disparity.
   separated by:
   (a) Re-estimate disparity maps from segmentation.
   (b) Re-estimate segmentation from disparity maps.
   using modified parameters $\lambda_{\text{match\_f}} = 0$ and $\lambda_{\text{unassigned}} \gg 1$.
Table 5.2: Our post-processing algorithm.
Because of this difficulty, our chosen method of initialization aims only to ensure
that every actual surface be “covered” by some initial surface, regardless of whether
that initial surface also covers a different actual surface. Our algorithm requires that,
alongside the input image pair, a range of possible disparities also be specified; the
initial hypothesis is then formed by placing one fronto-parallel surface at every integer
disparity within that range, with all pixels initially unassigned (f ≡ 0). This strategy
works in practice because of two observations. First, left and right image regions will
generally appear to match one another within a disparity range of ±1 around the true
disparity, in that such a match would have lower energy than remaining unmatched.
Second, given a stereo pair of images, the set of possible disparities for any object
therein is usually relatively small.
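The initialization just described can be sketched in a few lines of Python. The function name and data layout are assumptions made for illustration: each surface is reduced to a single constant disparity (our implementation represents disparity maps with splines), and the segmentation is a nested list in which 0 means unassigned.

```python
def initial_hypothesis(width, height, d_min, d_max):
    """One fronto-parallel surface per integer disparity; all pixels unassigned."""
    # Surface k (k = 1, 2, ...) starts at constant disparity d_min + k - 1.
    disparities = {
        k: float(d) for k, d in enumerate(range(d_min, d_max + 1), start=1)
    }
    # Label 0 plays the role of f = 0 (unassigned) at every pixel.
    f = [[0] * width for _ in range(height)]
    return disparities, f
```

For a disparity range of 0 to 15, for example, this yields sixteen initial surfaces, regardless of how many actual surfaces the scene contains.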
5.2.4 Post-Processing
For some applications, it might be desirable to estimate a disparity value for every
pixel, whether occluded in the opposite image or not. For such occasions, we give
an ad hoc method of post-processing to assign a disparity value to any pixels that
are marked as “occluded” by our main algorithm. This post-processing algorithm is
listed in Table 5.2.
This procedure is a departure from the energy minimization framework, and in
general will not converge if iterated indefinitely. However, in our experiments, using
two or three iterations seemed to produce reasonably accurate results in practice.
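Step 2(b) of Table 5.2 can be sketched for a single scanline as follows. This is a hedged illustration, not our implementation: the name `fill_scanline` is hypothetical, `disparity(k, x)` is an assumed callable giving the disparity of surface k at column x, and ties are broken in favor of the left neighbor.

```python
def fill_scanline(labels, disparity):
    """Assign each unlabeled pixel (label 0) the nearer-left or nearer-right
    label, whichever has the smaller disparity at that pixel."""
    n = len(labels)
    left = [0] * n    # nearest label > 0 at or to the left of x
    right = [0] * n   # nearest label > 0 at or to the right of x
    last = 0
    for x in range(n):
        if labels[x] > 0:
            last = labels[x]
        left[x] = last
    last = 0
    for x in range(n - 1, -1, -1):
        if labels[x] > 0:
            last = labels[x]
        right[x] = last
    out = list(labels)
    for x in range(n):
        if labels[x] == 0:
            candidates = [k for k in (left[x], right[x]) if k > 0]
            if candidates:  # leave the pixel unassigned if the scanline has no labels
                out[x] = min(candidates, key=lambda k: disparity(k, x))
    return out
```

Choosing the smaller disparity assigns an unmatched pixel to the farther of its two neighboring surfaces, which is the usual expectation for a region hidden behind a nearer object.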
Chapter 6
Experimental Results
In Chapters 2 through 5, we presented a new algorithm for binocular stereopsis,
and compared portions of its design to those of other stereo algorithms. We have
implemented this algorithm using a combination of Matlab and C, and tested it
on several non-synthetic stereo pairs available online [12, 50, 64]. In this chapter, we
present the results of our experiments on these images, which together span a range of
attributes (color and grayscale; well textured and untextured) and contain a variety
of features (slanted planes and curved surfaces; small step edges, large step edges,
and crease edges), testing the generality of our algorithm. We compare our results
both quantitatively and qualitatively to those achieved by other algorithms.
6.1 Quantitative Evaluation Metric
Before we can make a quantitative comparison, we need a method for evaluating
the accuracy of a stereo reconstruction. Szeliski and Zabih [73] propose two such
methods. If ground truth is available, one can directly compare estimated and true
disparity maps. If ground truth is not available, one can instead use the estimated
disparity map to warp a reference image of the scene into a novel view, and compare
the resulting image to the actual image from the novel viewpoint. The latter method is
useful in such applications as image-based rendering, where small errors in textureless
regions are relatively unimportant, but the former method is more appropriate when
accuracy in the scene structure itself is the desired goal.
Adopting the method of direct comparison with ground truth, Scharstein and
Szeliski [64, 65] provide sample stereo pairs with ground truth, propose a metric for
comparing results with ground truth, and tabulate results for twenty algorithms.
To evaluate the accuracy of our algorithm, we use their method and data as well,
to facilitate comparison with other algorithms. We note that their ground truth
disparities are based on depth and not necessarily correspondence, for they include
values for the entire reference (left) image, including at positions corresponding to
regions not visible in the other (right) image.
Scharstein and Szeliski [65] evaluate results by measuring the percentage of “bad”
pixels within various subsets of the entire reference image, where a bad pixel is one
whose estimated and ground truth disparities differ by more than a given threshold.
Some of the various subsets focus on particularly challenging image regions, including
the set of pixels near any discontinuity, and the set of pixels where texture is weak or
nonexistent; these allow one to evaluate the performance of an algorithm specifically
in such difficult situations. To evaluate overall accuracy, however, one should include
in the comparison as much of the reference image as feasible. The favored overall
measure in [65] does indeed correspond to their most comprehensive image subset,
which contains all pixels except border pixels within ten pixels of the image edge, and
occluded pixels not visible in the right image according to the given ground truth.
Scharstein and Szeliski explain the exclusion of “occluded” pixels as follows [65]:
We exclude the occluded regions for now since few of the algorithms in this
study explicitly model occlusions, and most perform quite poorly in these
regions. As algorithms get better at matching occluded regions, however,
we will likely focus more on the total matching error. . . .
Given that estimating depth where no correspondence exists is fundamentally a
different problem from estimating an existing correspondence, we agree that finding
the “correct” disparity in the absence of correspondence is in some sense less critical
than doing so in the presence thereof. However, we believe that it is nonetheless
undesirable for a stereo algorithm to return incorrect disparity values in such an
occlusion region, as such a result could easily convey a misleading impression of
reality.
Instead, it would be best if an algorithm reported such occlusion regions as being
without correspondence, or as having an unknown or undefined disparity. Unfortu-
nately, appropriately handling such results with “holes” would require a more sophis-
ticated, and more difficult to interpret, evaluation metric. Therefore, in our appli-
cation of Scharstein and Szeliski’s evaluation metric, we choose to include occluded
and non-occluded pixels equally, and exclude only border pixels.
Scharstein and Szeliski exclusively use a fixed disparity error threshold of plus
or minus one pixel to classify disparity estimates as “good” or “bad.” This makes
sense with the selection of algorithms in their study, because few of them compute
subpixel disparities. However, since we do compute subpixel disparities, and would
like to evaluate the accuracy thereof, we consider a range of thresholds instead. In
Section 6.2, we plot the fraction of “bad” pixels as a function of this threshold to give
a somewhat more complete picture of the accuracy of any particular result. These
plots are essentially cumulative histograms of disparity error.
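The measure as we apply it (all pixels except a border, evaluated over a range of thresholds) can be sketched as follows. The function names and the nested-list image representation are assumptions for illustration; the default border width of ten pixels follows the description of [65] above.

```python
def percent_bad(estimated, truth, threshold, border=10):
    """Percentage of non-border pixels whose absolute disparity error
    exceeds the threshold."""
    rows, cols = len(truth), len(truth[0])
    bad = total = 0
    for y in range(border, rows - border):
        for x in range(border, cols - border):
            total += 1
            if abs(estimated[y][x] - truth[y][x]) > threshold:
                bad += 1
    return 100.0 * bad / total if total else 0.0


def error_curve(estimated, truth, thresholds, border=10):
    """(threshold, percent bad) pairs: a cumulative view of the disparity error."""
    return [(t, percent_bad(estimated, truth, t, border)) for t in thresholds]
```

For instance, `error_curve(est, gt, [0.25, 0.5, 1.0, 1.5])` would give four points of the kind plotted against error threshold in Section 6.2.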
The described error measure evaluates dense depth maps that have exactly one
disparity value per pixel. It does not handle pixels without an assigned disparity
value, nor does it pay any attention to labeled discontinuities, both of which our
algorithm produces. Regarding the latter, we shall resort to qualitative observa-
tions, but regarding the former, we apply some simple post-processing (described in
Subsection 5.2.4) to “fill in” the missing disparities, before evaluating the results with
this error measure.
6.2 Quantitative Results
We tested our algorithm on the four stereo pairs used in Scharstein and Szeliski
[65], available online at [64].
For these four stereo pairs, we obtained results for our algorithm using two sets of
parameters. One set prefers a more coarse segmentation, by giving a larger weight to
$E_{\text{smooth\_f}}$ relative to the other energy terms; the other prefers a more fine segmentation,
by giving a smaller weight to $E_{\text{smooth\_f}}$. All other parameters are unchanged
between the two sets.
(empty) | left input image | right input image
ground truth disparity with occlusions (left image; grayscale) | estimated disparity with occlusions (left image; grayscale) | estimated disparity with occlusions (right image; colored)
ground truth disparity without occlusions (left image; grayscale) | estimated disparity without occlusions (left image; grayscale) | estimated disparity without occlusions (right image; colored)
disparity difference without occlusions (left image; colored) | estimated segmentation without occlusions (left image; grayscale) | estimated segmentation without occlusions (right image; grayscale)
Table 6.1: Layout for figures of complete results; see text for descriptions.
Due to space considerations, we show complete results for only one of the two
sets of parameters for each stereo pair, in Figures 6.1, 6.3, 6.5, and 6.7. Table 6.1
shows the layout of these figures. In this table, estimated results with occlusions
refer to those obtained by the main, energy minimization algorithm without any
post-processing, while estimated results without occlusions refer to those obtained
after post-processing. Ground truth disparities without occlusions refer to the original
disparity maps provided by [64], while ground truth disparities with occlusions are
masked by the binary occlusion maps also provided by [64]. Grayscale disparity maps
use the same scaling used in [64]. Color-coded disparity maps use a hue-modulated
color map, to highlight isocontours and smaller disparity differences. Each hue cycle
corresponds to a disparity increment of one pixel; as with the grayscale maps, lighter
shades of any particular hue are closer to the viewer. Finally, the color-coded disparity
difference images show positive and negative errors in red and blue, respectively.
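A hue-modulated color map with these two properties (one hue cycle per pixel of disparity, lighter shades for larger disparity) could be built as in the sketch below. This is one possible construction and not necessarily the map used for our figures: the function name, the lightness range of 0.25 to 0.75, and the choice of hue origin are all assumptions.

```python
import colorsys
import math


def disparity_to_rgb(d, d_min, d_max):
    """Map a disparity to RGB: hue cycles once per pixel of disparity,
    and lightness increases with disparity (closer to the viewer)."""
    hue = d - math.floor(d)                    # fractional part, in [0, 1)
    span = d_max - d_min
    nearness = (d - d_min) / span if span > 0 else 0.5
    nearness = min(max(nearness, 0.0), 1.0)    # clamp out-of-range disparities
    lightness = 0.25 + 0.5 * nearness          # stays within [0.25, 0.75]
    return colorsys.hls_to_rgb(hue, lightness, 1.0)
```

Because hue depends only on the fractional part of the disparity, lines of constant hue trace disparity isocontours, while lightness disambiguates which cycle a pixel belongs to.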
However, in the graphs of “bad” pixels versus disparity error threshold
(Figures 6.2, 6.4, 6.6, and 6.8), we summarize results for both sets of parameters
(labeled “coarse” and “fine”). We also compare our algorithm with the four algo-
rithms which appear to be the most accurate among the nineteen remaining algo-
rithms tabulated in [65].
6.2.1 “Map”
This grayscale stereo pair (Figure 6.1) shows two highly textured, moderately
slanted, planar surfaces; a very simple boundary separates the two. The disparity
difference between the two surfaces is relatively large, resulting in a significant
occlusion region on either side of the foreground surface.
Because texture is present throughout the images, and because the structure of
the scene is so simple, it is relatively easy to determine the correct correspondence for
image points for which a correspondence exists. However, the large occlusion region
emphasizes any errors an algorithm might make in treating image points for which
a correspondence does not exist. This is in fact what is observed for many of the
algorithms in [65]: results are computed with quite good accuracy for most of the
image, but errors in the occlusion region severely hurt overall performance.
Our algorithm correctly handles the large occlusion area. Before post-processing,
the energy minimization algorithm alone takes a small “bite” out of the left edge of
the foreground surface; a closer inspection of the input images suggests that this is
likely because, at this location, the right image appears slightly darker than the left.
In any case, post-processing patches this gap in the foreground surface, and places
the final boundary everywhere within one pixel of its correct location.
6.2.2 “Venus”
This color stereo pair (Figure 6.3) shows five slanted planes with varying amounts
of texture, including some regions with virtually no texture. Two of the surfaces are
joined by a crease edge; the remaining boundaries are all step edges.
Although the textureless regions of this stereo pair do cause difficulties for some of
the algorithms in [65], the most frequent error seems to occur in a region where texture
is in fact present but aliased. At high magnification, it can be seen that toward the
left center of the input images, the horizontal dotted lines consist of dots whose size
appears to vary slightly between images. This discrepancy is apparently significant
enough to overcome the continuity constraint in a majority of the algorithms tabulated
in [65]. However, our algorithm is not fooled by this aliasing, most likely due to the
consideration of uncertainty as described in Section A.2.
Figure 6.1: “Map” stereo pair: results for “coarse” parameter set. Key: see Table 6.1.
Figure 6.2: “Map” stereo pair: error distributions for our algorithm and [9, 35, 47, 66].
[Plot: % Bad Pixels versus Error Threshold (pixels). Curves: Lin Tomasi (coarse); Lin Tomasi (fine); Kolmogorov Zabih; Birchfield Tomasi (1998); Hirschmuller; Shao.]
Regarding disparity estimation, our algorithm does very well, as can be seen from
Figure 6.4. The largest error occurs at the corner of the V-shaped depth discontinuity,
where our penalty for boundary length causes the tip of the “V” to be missed. This
type of behavior is a typical result of minimizing boundary length without regard
for boundary curvature and junctions.
Regarding segmentation, our algorithm recovers only four distinct surfaces,
missing the vertical crease in the sports page. This is likely because the area of
the rightmost pane is too small, compared to the length of the crease; again, the
penalty for boundary length dominates.
Figure 6.3: “Venus” stereo pair: results for “coarse” parameter set. Key: see Table 6.1.
Figure 6.4: “Venus” stereo pair: error distributions for our algorithm and [11, 35, 47, 69].
[Plot: % Bad Pixels versus Error Threshold (pixels). Curves: Lin Tomasi (coarse); Lin Tomasi (fine); Kolmogorov Zabih; Birchfield Tomasi (1999); Hirschmuller; Sun Shum Zheng.]
6.2.3 “Sawtooth”
This color stereo pair (Figure 6.5) shows three slanted planes with varying amounts
of texture; boundaries are all discontinuous in depth, and consist of many straight
line segments joined by relatively sharp angles.
As with the “Venus” stereo pair, our algorithm tends to truncate some of these
angles, only more severely so for this stereo pair. Note that a large fraction of the
area of these erroneous regions corresponds to occlusion regions not visible in the
right image. Our algorithm does not truncate the upward-pointing tips, which are
completely visible in both images.
6.2.4 “Tsukuba”
This color stereo pair (Figure 6.7), courtesy of Y. Ohta and Y. Nakamura of
the University of Tsukuba, shows a lab scene consisting of various planar, smoothly
curved, and non-smooth objects. Object boundaries are relatively complex, with
several long and thin structures. These narrow structures (e.g., the tripod legs and
handle, and the lamp arm and cord) are problematic for many of the algorithms in
[65], tending to be lost because their area is insufficient to support their boundary
length.
Figure 6.5: “Sawtooth” stereo pair: results for “coarse” parameter set. Key: see Table 6.1.
Figure 6.6: “Sawtooth” stereo pair: error distributions for our algorithm and [11, 19, 35, 47].
[Plot: % Bad Pixels versus Error Threshold (pixels). Curves: Lin Tomasi (coarse); Lin Tomasi (fine); Kolmogorov Zabih; Birchfield Tomasi (1999); Hirschmuller; Boykov et al. (expansion).]
Our algorithm also tends to over-simplify these extended boundaries, even with
the parameter set that prefers a finer segmentation. However, it should be noted
that the results of our main algorithm only, without post-processing, are significantly
more accurate than those obtained after post-processing: it is the post-processing
that causes the tripod handle to be missed and the lamp arm to be filled in.
It is also notable that while the given ground truth represents all surfaces as
being fronto-parallel at integer disparity, our algorithm produces curved surfaces. In
particular, our algorithm models the entire head as one curved surface, with the nose
and chin being closest to the camera, and the left and right sides of the head being
farther by approximately one half pixel of disparity.
Figure 6.7: “Tsukuba” stereo pair: results for “fine” parameter set. Key: see Table 6.1.
Figure 6.8: “Tsukuba” stereo pair: error distributions for our algorithm and [18, 19, 47, 69].
[Plot: % Bad Pixels versus Error Threshold (pixels). Curves: Lin Tomasi (coarse); Lin Tomasi (fine); Kolmogorov Zabih; Sun Shum Zheng; Boykov et al. (swap); Boykov et al. (expansion).]
6.3 Qualitative Results
Among the four stereo pairs used in the benchmark by Scharstein and Szeliski, all
but one are full-color. Furthermore, among the three pairs that consist of extended
smooth surfaces, there is a total of one crease edge. To verify both that our algorithm
can recover crease edges, and also that it does not need color, we therefore tested it
on two of the grayscale stereo pairs used in Birchfield and Tomasi [11, 12].
These first two stereo pairs below, courtesy of Birchfield [12], show varying
amounts of texture, and are each well approximated by five slanted planes. Most of
the surface boundaries are crease edges, and those that are step edges have disparity
jumps of only a few pixels, so there are relatively few occluded pixels overall.
The original versions of these two stereo pairs, as they appear in [12], show both
minor geometric distortion, and minor photometric variations between the left and
right images. Here, we use modified versions, from which the photometric variations
have been mostly removed. Because of this, our results are not necessarily directly
comparable to those presented in [11]. Note that the geometric distortion was left in
place; this manifests itself in the apparent curvature of the floor, as recovered by our
algorithm.
Although we were unable to locate ground truth disparities for these scenes, we
present the results of our algorithm for qualitative evaluation. Since ground truth is
unavailable, Figures 6.9 and 6.10 for these images are laid out according to the two
rightmost columns of Table 6.1.
6.3.1 “Cheerios”
In this stereo pair (Figure 6.9), disparity edges are fairly well marked by intensity
edges. Birchfield and Tomasi’s multiway cut algorithm [11] does very well on these
images; its primary error is the splitting of the upper-left surface of books into two
nearly coplanar pieces. Our algorithm also does fairly well, but in contrast, makes
essentially the opposite error: the Cheerios box is represented with only one surface.
This error is analogous to that which occurred on the sports page of the “Venus”
stereo pair.
6.3.2 “Clorox”
In this stereo pair (Figure 6.10), disparity edges are less well marked by
intensity edges; furthermore, there are distracting, strong intensity edges that do not
accompany disparity edges. Birchfield and Tomasi’s multiway cut algorithm [11] fares
more poorly, deceived by the misleading intensity edges into misplacing the crease
edges there. Our algorithm does not have this problem, producing results similar to
those obtained on the “Cheerios” images. Our relative immunity to such deception
is likely due in part to the contrast-normalized edge-weighting function described in
Section 4.3.
6.3.3 “Umbrella”
Among the six stereo pairs presented so far, only the “Tsukuba” set depicts a few
curved surfaces, all of which are fairly small. Okutomi et al. [55] exhibit an image
set that features a larger curved surface, but that surface is densely textured, which
simplifies the recovery of its shape. To verify that our algorithm can reconstruct
curved surfaces in the absence of dense texture, we therefore tested it on a stereo
pair of our own creation. Although we were unable to obtain accurate ground truth
disparities for this scene, we present the results of our algorithm on this scene for
qualitative evaluation.
Figure 6.9: “Cheerios” stereo pair. Key: see Table 6.1.
Figure 6.10: “Clorox” stereo pair. Key: see Table 6.1.
This stereo pair (Figure 6.11, also laid out according to the two rightmost columns
of Table 6.1) shows five surfaces. The carpeted floor has some fine-grained, low-
contrast, stochastic texture, and is planar with a large disparity gradient. Both
checkerboard patterns can be considered to have high-contrast, quasi-periodic texture;
like the floor, the right checkerboard is also planar, but the left checkerboard, mounted
on a sheet of poster board, is slightly warped. The rear surface is a large, unmarked,
virtually textureless sheet of cardboard that is more severely warped. The red and
white “Stanford” umbrella is strongly curved, and rests on the floor, but does not
contact the rear sheet of cardboard; it is composed of fairly large, virtually textureless
panels joined together by high-contrast color edges that are not disparity edges. The
combination of these features makes this stereo pair particularly challenging.
To create this stereo pair, we manually arranged these objects, along with a
third checkerboard pattern much closer to the viewer. We photographed the scene
from several angles with a 4-megapixel, Bayer-pattern CCD, digital still camera,
and chose two viewpoints whose images were visually in epipolar alignment. We
corrected for lens distortion using intrinsic camera parameters obtained from addi-
tional photographs of a checkerboard pattern, and did stereo rectification using
extrinsic camera parameters derived from the checkerboard patterns in this scene.
Finally, we cropped and resized the images down to a manageable number of pixels,
in the process removing the third checkerboard pattern and reducing the artifacts
caused by the Bayer-pattern CCD.
Although we do not have results by other algorithms for this stereo pair, we note
that few of the algorithms tabulated in [65] are capable of representing smoothly
curved surfaces with subpixel disparity values, and among those, fewer still readily
reproduce sharp discontinuities in the disparity map.
Figure 6.11: “Umbrella” stereo pair. Key: see Table 6.1.
Our algorithm correctly segments the scene into five smooth surfaces, each of
which is represented by a real-valued disparity map that contains no kinks or creases,
even in the presence of strong color edges that suggest otherwise. Our algorithm
places boundaries accurately at crease edges as well as at edges accompanied by
significant occlusion regions (i.e., to the left and right of the umbrella). Our algorithm
qualitatively recovers the warped shape of the background and the curvature of the
umbrella, both with very little help from texture.
Chapter 7
Discussion and Future Work
The quantitative and qualitative results presented suggest that for scenes
consisting of smooth surfaces, our algorithm produces very accurate reconstructions,
with subpixel disparity values and explicit and precise localization of boundaries,
whether the surfaces are planar or curved, textured or untextured, high-contrast
or low-contrast, color or grayscale. Despite this achievement, however, there is
nonetheless much room for improvement.
7.1 Efficiency
The results presented in Chapter 6 each took the current implementation on the
order of 2 to 8 hours to produce on a 450 MHz UltraSparc II. The vast majority of this
time was spent on the calculation of minimum-cost graph cuts. As the current imple-
mentation uses Goldberg’s push-relabel maximum flow code [32], which is optimized
for worst-case graphs, the simplest way to obtain a significant speedup in software
would be to replace that code with Kolmogorov’s maximum flow code [46], which
is optimized for typical-case graphs that arise from problems in computer vision.
Kolmogorov’s code is estimated to be about three times as fast as Goldberg’s code
for vision problems [83].
Perhaps the least scalable part of the current implementation is the optimization
involved in surface fitting, which requires the calculation of the Hessian of the energy
function. It would be more efficient, especially for larger numbers of surface spline
control points, to use optimization techniques that do not require an exact Hessian
at every step. For example, one could use Lancelot [23], a Fortran package for
large scale, structured, non-linear optimization problems. Alternatively, based upon
the structure of our energy terms, one could derive customized update equations for
refining the disparity maps, using fast, stable evolution methods such as described
in [81].
7.2 Theory vs. Practicality
In the development and testing of our algorithm, we wanted to ensure that our
definition of total energy “does the right thing” theoretically, independent of the
technique used to minimize it. That is, we first need to know how to decide when one
proposed solution is better than another, before we can consider how to find the best
solution. In order to isolate as much as possible the effects of our energy formulation
from the effects of any particular optimization technique, then, we have chosen to
minimize total energy as thoroughly as possible. This mindset is another reason for
our current implementation’s long running times.
Our addition of contractions to the expansion move algorithm of [19] is one such
decision favoring more complete optimization over shorter running times. A standard
expansion move takes an initial labeling and improves upon it; if the initial labeling
is already near optimal, little work needs to be done, and the expansion move can
be done very quickly. With our mandatory, preceding contraction, however, even if
the initial labeling is near optimal, the expansion move must always undo the effects
of the contraction. Thus, the enlargement of the move space comes at a significant
computational cost.
Another such decision is the omission of heuristic speedups. During our experi-
mentation with various optimization schemes, it was discovered that good solutions
could be obtained much more quickly, simply by enforcing a certain minimum support
map area for each hypothesized surface, and immediately eliminating any whose area
drops below that threshold. However, we discarded this heuristic, because using it
would not only break the paradigm of energy minimization, but also require the
determination of another parameter (i.e., the minimum area).
In both of these cases, we have chosen theoretical superiority over shorter running
times. However, we have not investigated the practical consequences of such superi-
ority. It would be instructive to explore the effects of these alternatives on the overall
optimization algorithm, in order to determine whether these tradeoffs are worthwhile.
7.3 Generality
The most limiting aspect of our current implementation is its model of surfaces.
Although our model works quite well for most of the results that we have presented,
it can be overly restrictive for scenes whose surfaces are less smooth. To be able
to handle surfaces with more shape detail, our implementation should use a much
finer grid for the control points of the splines that define surface shape. This would
likely require a more refined model of surface smoothness; for example, one could
do a multiscale decomposition of the surface shape, and penalize the energy in each
frequency band with a different weight, depending on what scale of detail one expects
to find in the surfaces.
Another limitation of the current algorithm occurs in some cases when a single
hypothesized surface spans what is actually two (or more) distinct surfaces: if there
are no nearby hypothesized surfaces to compete for one or the other actual surface,
it is possible that the two surfaces may remain joined in the final solution. That
is, although our algorithm does attempt to ensure that no merging of surfaces could
result in a decrease of total energy, it does not do the same for the splitting of
surfaces. It would be useful if some method were devised for automatic splitting as
well as automatic merging of surfaces; perhaps the multiscale segmentation algorithm
of Sharon et al. [67] could be used for this purpose.
Among the results presented in Section 6.2, most of our method’s errors can be
attributed to the over-dominance of the boundary length energy term E smooth f . In
order to improve the recovery of long, narrow structures, one could try to make better
use of monocular edge cues: currently, the cost of a proposed boundary edgel does not
depend upon its orientation; it might help to favor edgels that are parallel, over those
that are perpendicular, to any edge-like monocular features. To implement such an
oriented weighting within E smooth f , one could borrow techniques from anisotropic
diffusion [15, 82].
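As a minimal sketch of what such an oriented weighting might look like, the following function discounts the cost of a boundary edgel when it runs parallel to a local image edge, that is, perpendicular to the image gradient. The function name, the `strength` parameter, and the particular cosine-squared form are all assumptions made for illustration; they are not taken from the energy terms defined in this work, nor from the anisotropic diffusion schemes of [15, 82].

```python
import math

def oriented_edgel_cost(base_cost, edgel_dir, gradient, strength=0.5):
    """Hypothetical orientation-dependent cost for one boundary edgel.

    edgel_dir: (tx, ty) direction along which the edgel runs.
    gradient:  (gx, gy) image gradient at the edgel (normal to the image edge).
    strength:  assumed discount in [0, 1] applied to fully edge-parallel edgels.
    """
    tx, ty = edgel_dir
    gx, gy = gradient
    t_norm = math.hypot(tx, ty)
    g_norm = math.hypot(gx, gy)
    if t_norm == 0.0 or g_norm == 0.0:
        # No usable orientation information: fall back to the unoriented cost.
        return base_cost
    cos_theta = (tx * gx + ty * gy) / (t_norm * g_norm)
    # 1 when the edgel is parallel to the image edge, 0 when perpendicular.
    parallel_to_edge = 1.0 - cos_theta * cos_theta
    return base_cost * (1.0 - strength * parallel_to_edge)
```

With a horizontal gradient (a vertical image edge), a vertical edgel receives the discounted cost, while a horizontal edgel keeps the full cost.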
Finally, we note that many of the parameters of our algorithm, controlling such
things as coarseness of segmentation and amount of surface shape detail, do not have
to be constant, but can in fact vary from surface to surface, and even within the same
surface. If, in addition to the disparity maps and segmentation, these parameters
themselves could be estimated separately and adaptively for each surface, we believe
that our algorithmic framework would be capable of producing accurate results for a
much wider variety of scenes.
Appendix A
Image Interpretation
Input images are given as a grid of pixels, each of which has a certain color value.
From these images, we calculate floating point disparity values, which are able to
represent a continuous range of depths without discretization. Because disparity is
defined in terms of the locations of corresponding image points, however, this
implies the correspondence of image points that lie between pixel positions. How can
we evaluate the validity of such subpixel correspondences?
A.1 Interpolation
For pixel-to-pixel correspondence, the cost (or energy) of a particular hypothesized
match is generally some function of the color difference between the two pixels. If one
pixel is fixed in location, and we are trying to determine which other pixel matches it
best, this matching cost can also be viewed as a function of disparity, with the cost
being defined for integer disparity values. This matching cost function can then be
interpolated at subpixel disparity values, and the location of its minimum used as the
estimated subpixel disparity.
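One common way to carry out this kind of interpolation is to fit a parabola through the cost at the best integer disparity and its two neighbors, and to take the vertex of that parabola as the subpixel estimate. The sketch below assumes this three-point parabolic fit; the function name `subpixel_disparity` and the list-based cost representation are illustrative choices only.

```python
def subpixel_disparity(costs):
    """Estimate the subpixel minimum of a cost sampled at integer disparities.

    costs[d] is the matching cost at integer disparity d (d = 0, 1, 2, ...).
    """
    d = min(range(len(costs)), key=costs.__getitem__)
    if d == 0 or d == len(costs) - 1:
        # The minimum lies on the border, so no parabola can be fitted.
        return float(d)
    c_minus, c0, c_plus = costs[d - 1], costs[d], costs[d + 1]
    curvature = c_minus - 2.0 * c0 + c_plus
    if curvature <= 0.0:
        # Flat or degenerate neighborhood: keep the integer estimate.
        return float(d)
    # Vertex of the parabola through the three samples.
    return d + 0.5 * (c_minus - c_plus) / curvature
```

When the true cost is exactly quadratic the fit recovers its minimum exactly; for more complex cost functions the result is only an approximation.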
Among prior work, it is not uncommon to estimate subpixel disparities by
interpolating the energy function in this way [68, 72, 78]. Such approaches are
computationally efficient, but their accuracy in practice is limited, and their
justification in theory is unclear. The energy “landscape” as a function of disparity is
typically rather complex; sampling it at integer positions may not provide adequate
resolution for accurate interpolation.
In fact, since the matching cost depends on pixel values, matching cost as a
function of disparity should be at least as complex as color as a function of image
location. This observation suggests that, for estimating subpixel disparities, inter-
polating the image function should give better results than interpolating the energy
function. This is why we perform matching on interpolated images, as implied in
Equation (3.2).
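To illustrate the idea of matching on interpolated images, the following sketch evaluates a matching cost directly at real-valued disparities by resampling one scanline. It is not a reproduction of Equation (3.2): it assumes 1-D grayscale scanlines, a squared-difference cost, the sign convention that a left pixel at x matches the right image at x - d, a brute-force search over candidate disparities, and linear interpolation as a simple stand-in for the interpolation kernel described next. All function names are hypothetical.

```python
import math

def sample_linear(row, x):
    """Linearly interpolate a 1-D intensity row at real-valued position x."""
    x = min(max(x, 0.0), len(row) - 1.0)          # clamp to the valid range
    i = min(int(math.floor(x)), len(row) - 2)     # left neighbor index
    t = x - i
    return (1.0 - t) * row[i] + t * row[i + 1]

def matching_cost(left, right, x, d):
    """Squared difference between left[x] and the right row sampled at x - d."""
    diff = left[x] - sample_linear(right, x - d)
    return diff * diff

def best_disparity(left, right, x, d_max, step=0.05):
    """Search candidate disparities 0, step, 2*step, ... up to d_max."""
    n = int(round(d_max / step))
    candidates = [k * step for k in range(n + 1)]
    return min(candidates, key=lambda d: matching_cost(left, right, x, d))
```

Because the cost is defined for any real-valued d, it could equally be minimized with a continuous optimizer rather than the exhaustive search used here.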
We require that the image interpolation kernel have finite support, for computa-
tional efficiency. We also require that it be twice differentiable, for more convenient
numerical optimization. Cubic interpolation can satisfy either one of these require-
ments alone, but not both simultaneously; instead, we use a separable quartic interpo-
lation. This interpolation kernel is similar to that of finite-support cubic interpolation,
but with slightly smaller sidelobes.
A.2 Certainty
In actuality, we do not know the underlying continuous image to be twice differ-
entiable, or to satisfy any other specific property; any attempt to calculate an inter-
polated value is merely an educated guess with some degree of inherent uncertainty.
However, just as the value itself can be estimated, so can its uncertainty. This can
be useful when it is necessary to determine whether two such interpolated values
match one another: the larger their uncertainty, the larger the acceptable difference
between them. This idea is reflected in the sampling insensitive pixel dissimilarity
measure developed by Birchfield and Tomasi [10] and further investigated by Szeliski
and Scharstein [72]: in their measures, uncertainty is effectively proportional to the
size of the interval spanned by the neighboring values between which interpolation is
being done.
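A minimal 1-D sketch of the sampling-insensitive dissimilarity of [10] is given below, written from the published description of that measure: each pixel is compared against the interval spanned by the other image's value at the candidate position and its linearly interpolated values half a pixel to either side. The function names and the border handling (replicating the edge pixel) are assumptions of this sketch.

```python
def _half_sample_range(row, x):
    """Min and max of row over x and the half-pixel positions x - 0.5, x + 0.5.

    Half-pixel values are linear interpolations; borders replicate the edge pixel.
    """
    centre = row[x]
    left = row[max(x - 1, 0)]
    right = row[min(x + 1, len(row) - 1)]
    minus = 0.5 * (centre + left)
    plus = 0.5 * (centre + right)
    return min(minus, centre, plus), max(minus, centre, plus)

def _one_sided(a, xa, b, xb):
    """Distance from the value a[xa] to the interval spanned by b around xb."""
    lo, hi = _half_sample_range(b, xb)
    return max(0.0, a[xa] - hi, lo - a[xa])

def bt_dissimilarity(left, x_left, right, x_right):
    """Symmetric dissimilarity: the smaller of the two one-sided distances."""
    return min(_one_sided(left, x_left, right, x_right),
               _one_sided(right, x_right, left, x_left))
```

For example, with `left = [0, 10, 20]` and `right = [0, 14, 20]`, the center pixels differ by 4 in raw intensity, but the dissimilarity is zero because 10 falls inside the interval spanned around the right pixel. The wider that interval, the larger the intensity difference that is tolerated.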
For our purposes, however, we would like estimated uncertainty to be differentiable
with respect to position. Note, also, that even exactly at a discrete pixel position,
there is some uncertainty in the value of the image, due for example to sensor noise
and quantization. These considerations and others motivate the following definition
of certainty that our algorithm uses.
Let $I : \mathcal{I} \mapsto \mathbb{R}^m$ be the image function, and $\mathbf{I}$ be the $m \times m$ identity matrix. Let
$x^2$ be shorthand for the outer product $x x^T$. Let $G_\sigma * I$ represent the convolution of
$I$ with a Gaussian of standard deviation $\sigma$. Then we define certainty to be
$$A_{\sigma,\varepsilon} = \left[ \varepsilon \mathbf{I} + G_\sigma * (I^2) - (G_\sigma * I)^2 \right]^{-1}$$
where $\sigma$ and $\varepsilon$ are small constants. Intuitively, $A(p)$ is the inverse of the covariance
of the color values in $I$ within a small, Gaussian-weighted neighborhood around $p$.
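The definition above can be sketched for the simplest case of a grayscale image (m = 1) along a single scanline, where the covariance reduces to a scalar variance and the matrix inverse to a reciprocal. The truncation of the Gaussian at three standard deviations, the border replication, the default parameter values, and all function names are assumptions of this sketch rather than details given in the text.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights, truncated at about 3 sigma (assumed)."""
    radius = max(1, int(math.ceil(3.0 * sigma)))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(row, kernel):
    """Convolve row with a symmetric kernel, replicating border samples."""
    r = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - r, 0), n - 1)
            acc += w * row[j]
        out.append(acc)
    return out

def certainty(row, sigma=1.0, eps=1e-3):
    """Scalar certainty: 1 / (eps + G*(I^2) - (G*I)^2) at every position."""
    kernel = gaussian_kernel(sigma)
    mean = smooth(row, kernel)
    mean_sq = smooth([v * v for v in row], kernel)
    # max(..., 0.0) guards against tiny negative variances from rounding.
    return [1.0 / (eps + max(ms - m * m, 0.0)) for m, ms in zip(mean, mean_sq)]
```

In a constant region the local variance vanishes and the certainty approaches 1/eps, whereas near an intensity step the variance grows and the certainty drops accordingly.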
Bibliography
[1] Yusuf Sinan Akgul and Chandra Kambhamettu. Recovery and tracking of
continuous 3D surfaces from stereo data using a deformable dual-mesh. In
Proceedings of the IEEE International Conference on Computer Vision, pages
765–772, 1999.
[2] H. Harlyn Baker and Thomas O. Binford. Depth from edge and intensity-based
stereo. In Proceedings of the International Joint Conferences on Artificial Intel-
ligence, pages 631–636, 1981.
[3] S. Baker, R. Szeliski, and P. Anandan. A layered approach to stereo recon-
struction. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 434–441, 1998.
[4] S. T. Barnard and M. A. Fischler. Computational stereo. Computing Surveys,
14:553–572, 1982.
[5] Stephen Barnard. Stochastic stereo matching over scale. International Journal
of Computer Vision, 3(1):17–32, 1989.
[6] Peter N. Belhumeur. A binocular stereo algorithm for reconstructing sloping,
creased, and broken surfaces in the presence of half-occlusion. In Proceedings of
the IEEE International Conference on Computer Vision, pages 431–438, 1993.
[7] Peter N. Belhumeur. A Bayesian approach to binocular stereopsis. International
Journal of Computer Vision, 19:237–260, 1996.
[8] Peter N. Belhumeur and David A. Mumford. A Bayesian treatment of the stereo
correspondence problem using half-occluded regions. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 506–512, 1992.
[9] Stan Birchfield and Carlo Tomasi. Depth discontinuities by pixel-to-pixel stereo.
In Proceedings of the IEEE International Conference on Computer Vision, pages
1073–1080, 1998.
[10] Stan Birchfield and Carlo Tomasi. A pixel dissimilarity measure that is insen-
sitive to image sampling. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 20(4):401–406, 1998.
[11] Stan Birchfield and Carlo Tomasi. Multiway cut for stereo and motion with
slanted surfaces. In Proceedings of the IEEE International Conference on
Computer Vision, pages 489–495, 1999.
[12] Stan Birchfield and Carlo Tomasi. Multiway cut for stereo and motion with
slanted surfaces. Online at http://vision.stanford.edu/~birch/multiwaycut/,
2002.
[13] Stanley Thomas Birchfield. Depth and Motion Discontinuities. PhD thesis,
Stanford University, 1999.
[14] M. J. Black and A. Rangarajan. On the unification of line processes, outlier
rejection, and robust statistics with applications in early vision. International
Journal of Computer Vision, 19(1):57–91, 1996.
[15] M. J. Black, G. Sapiro, D. H. Marimont, and D. J. Heeger. Robust anisotropic
diffusion. IEEE Transactions on Image Processing, 7:421–432, 1998.
[16] Andrew Blake and Andrew Zisserman. Visual Reconstruction. MIT Press,
Cambridge, MA, 1987.
[17] Aaron F. Bobick and Stephen S. Intille. Large occlusion stereo. International
Journal of Computer Vision, 33(3):181–200, 1999.
[18] Yuri Boykov, Olga Veksler, and Ramin Zabih. Markov random fields with efficient
approximations. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 648–655, June 1998.
[19] Yuri Boykov, Olga Veksler, and Ramin Zabih. Approximate energy minimization
with discontinuities. In Proceedings of the IEEE International Workshop on
Energy Minimization Methods in Computer Vision, pages 205–220, July 1999.
[20] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy mini-
mization via graph cuts. In Proceedings of the IEEE International Conference
on Computer Vision, pages 377–384, September 1999.
[21] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy mini-
mization via graph cuts. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 23(11), 2001.
[22] Lisa Gottesfeld Brown. A survey of image registration techniques. Computing
Surveys, 24(4):325–376, December 1992.
[23] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. LANCELOT: A Fortran Package
for Large-Scale Nonlinear Optimization (Release A). Springer Verlag, 1992.
[24] Ingemar J. Cox, Sunita L. Hingorani, Satish B. Rao, and Bruce Maggs. A
maximum likelihood stereo algorithm. Computer Vision and Image Under-
standing, 63(3):542–567, May 1996.
[25] Trevor Darrell and Alex Pentland. Robust estimation of a multi-layer motion
representation. In Proceedings of the IEEE Workshop on Visual Motion, pages
173–178, 1991.
[26] Trevor Darrell and Alex Pentland. Cooperative robust estimation using layers
of support. IEEE Transactions on Pattern Analysis and Machine Intelligence,
17(5):474–487, 1995.
[27] U. R. Dhond and J. K. Aggarwal. Structure from stereo – a review. IEEE
Transactions on Systems, Man, and Cybernetics, 19(6):1489–1510, 1989.
[28] U. R. Dhond and J. K. Aggarwal. Stereo matching in the presence of narrow
occluding objects using dynamic disparity search. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(7):719–724, 1995.
[29] L. R. Ford and D. R. Fulkerson. Flows in Networks. Princeton University Press,
Princeton, NJ, 1962.
[30] D. Geiger, B. Ladendorf, and A. Yuille. Occlusions and binocular stereo. Inter-
national Journal of Computer Vision, 14(3):211–226, 1995.
[31] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6:721–741, 1984.
[32] Andrew Goldberg. Network optimization library. Online at
http://www.avglab.com/andrew/soft.html, 2002.
[33] W. E. L. Grimson. From Images to Surfaces: A Computational Study of the
Human Early Visual System. MIT Press, Cambridge, MA, 1981.
[34] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust
Statistics: The Approach Based on Influence Functions. John Wiley and Sons,
New York, NY, 1986.
[35] Heiko Hirschmüller, Peter R. Innocent, and Jon Garibaldi. Real-time correlation-
based stereo vision with reduced border errors. International Journal of
Computer Vision, 47:229–246, 2002.
[36] W. Hoff and N. Ahuja. Surfaces from stereo: Integrating feature matching,
disparity estimation, and contour detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 11(2):121–136, 1989.
[37] B. K. P. Horn. Robot Vision. MIT Press, Cambridge, MA, 1986.
[38] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelli-
gence, 17:185–203, 1981.
[39] P. J. Huber. Robust Statistics. John Wiley and Sons, New York, NY, 1981.
[40] Stephen S. Intille and Aaron F. Bobick. Disparity-space images and large
occlusion stereo. In Proceedings of the European Conference on Computer Vision,
volume 2, pages 179–186, 1994.
[41] Hiroshi Ishikawa and Davi Geiger. Occlusions, discontinuities, and epipolar
lines in stereo. In Proceedings of the European Conference on Computer Vision,
volume 1, pages 232–249, 1998.
[42] B. Julesz. Binocular depth perception of computer-generated patterns. Bell
System Technical Journal, 39:1125–1162, 1960.
[43] B. Julesz. Foundations of Cyclopean Perception. University of Chicago Press,
Chicago, 1971.
[44] Takeo Kanade and M. Okutomi. A stereo matching algorithm with an adaptive
window: Theory and experiment. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 16(9):920–932, 1994.
[45] Johannes Kepler. Ad Vitellionem paralipomena, quibus Astronomiae pars optica
traditur. Frankfurt, 1604.
[46] Vladimir Kolmogorov. maxflow and match software. Online at
http://www.cs.cornell.edu/People/vnk/software.html, 2002.
[47] Vladimir Kolmogorov and Ramin Zabih. Computing visual correspondence with
occlusions using graph cuts. In Proceedings of the IEEE International Conference
on Computer Vision, pages 508–515, 2001.
[48] Vladimir Kolmogorov and Ramin Zabih. What energy functions can be mini-
mized via graph cuts? In Proceedings of the European Conference on Computer
Vision, volume 3, pages 65–81, 2002.
[49] David Lee and Theo Pavlidis. One dimensional regularization with discon-
tinuities. IEEE Transactions on Pattern Analysis and Machine Intelligence,
10(6):822–829, November 1988.
[50] Michael H. Lin and Carlo Tomasi. Surfaces with occlusions from layered stereo.
Online at http://robotics.stanford.edu/~michelin/layered_stereo/, 2002.
[51] A. Luo and H. Burkhardt. An intensity-based cooperative bidirectional stereo
matching with simultaneous detection of discontinuities and occlusions. Inter-
national Journal of Computer Vision, 15:171–188, 1995.
[52] D. Marr and T. Poggio. Cooperative computation of stereo disparity. Science,
194:283–287, 1976.
[53] D. Mumford and J. Shah. Boundary detection by minimizing functionals. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 22–26, 1985.
[54] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic
programming. IEEE Transactions on Pattern Analysis and Machine Intelligence,
7(2):139–154, 1985.
[55] Masatoshi Okutomi, Yasuhiro Katayama, and Setsuko Oka. A simple stereo algo-
rithm to recover precise object boundaries and smooth surfaces. International
Journal of Computer Vision, 47:261–273, 2002.
[56] Tomaso Poggio, Vincent Torre, and Christof Koch. Computational vision and
regularization theory. Nature, 317:314–319, 1985.
[57] R. Potts. Some generalized order-disorder transformation. In Proceedings of the
Cambridge Philosophical Society, volume 48, pages 106–109, 1952.
[58] Mariano Rivera and Jose L. Marroquín. Adaptive rest condition potentials:
Second order edge-preserving regularization. In Proceedings of the European
Conference on Computer Vision, volume 1, pages 113–127, 2002.
[59] L. G. Roberts. Machine Perception of Three-Dimensional Solids. PhD thesis,
Massachusetts Institute of Technology, Department of Electrical Engineering,
1963.
[60] S. Roy and I. Cox. A maximum-flow formulation of the N-camera stereo
correspondence problem. In Proceedings of the IEEE International Conference on
Computer Vision, pages 492–499, 1998.
[61] Mark Ruzon and Carlo Tomasi. Color edge detection with the compass operator.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 160–166, 1999.
[62] Mark Ruzon and Carlo Tomasi. Alpha estimation in natural images. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 18–25, 2000.
[63] Daniel Scharstein and Richard Szeliski. Stereo matching with nonlinear diffusion.
International Journal of Computer Vision, 28(2):155–174, 1998.
[64] Daniel Scharstein and Richard Szeliski. The Middlebury stereo vision page.
Online at http://www.middlebury.edu/stereo/, 2002.
[65] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense
two-frame stereo correspondence algorithms. International Journal of Computer
Vision, 47:7–42, 2002.
[66] Juliang Shao. Generation of temporally consistent multiple virtual camera views
from stereoscopic image sequences. International Journal of Computer Vision,
47:171–180, 2002.
[67] Eitan Sharon, Achi Brandt, and Ronen Basri. Fast multiscale image segmen-
tation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2000.
[68] Masao Shimizu and Masatoshi Okutomi. Precise sub-pixel estimation on area-
based matching. In Proceedings of the IEEE International Conference on
Computer Vision, volume 1, pages 90–97, 2001.
[69] Jian Sun, Heung-Yeung Shum, and Nan-Ning Zheng. Stereo matching using
belief propagation. In Proceedings of the European Conference on Computer
Vision, volume 2, pages 510–524, 2002.
[70] Richard Szeliski and James Coughlan. Spline-based image registration. Interna-
tional Journal of Computer Vision, 22(3):199–218, 1997.
[71] Richard Szeliski and P. Golland. Stereo matching with transparency and matting.
International Journal of Computer Vision, 32(1):45–61, 1999.
[72] Richard Szeliski and Daniel Scharstein. Symmetric sub-pixel stereo matching. In
Proceedings of the European Conference on Computer Vision, volume 2, pages
525–540, 2002.
[73] Richard Szeliski and Ramin Zabih. An experimental comparison of stereo algo-
rithms. In Proceedings of the IEEE Workshop on Vision Algorithms, pages 1–19,
1999.
[74] Hai Tao and Harpreet S. Sawhney. Global matching criterion and color segmen-
tation based stereo. In Proceedings of the IEEE Workshop on the Application of
Computer Vision, pages 246–253, 2000.
[75] Hai Tao, Harpreet S. Sawhney, and Rakesh Kumar. A global matching framework
for stereo computation. In Proceedings of the IEEE International Conference on
Computer Vision, volume 1, pages 532–539, 2001.
[76] Demetri Terzopoulos. Image analysis using multigrid relaxation methods. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 8(2):129–139, 1986.
[77] Demetri Terzopoulos. Regularization of inverse visual problems involving discon-
tinuities. IEEE Transactions on Pattern Analysis and Machine Intelligence,
8(4):413–424, 1986.
[78] Q. Tian and M. N. Huhns. Algorithms for subpixel registration. Computer
Vision, Graphics, and Image Processing, 35:220–233, 1986.
[79] Olga Veksler. Efficient Graph-Based Energy Minimization Methods in Computer
Vision. PhD thesis, Cornell University, 1999.
[80] J. Wang and E. Adelson. Layered representation for motion analysis. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 361–366, 1993.
[81] J. Weickert, B.M.T.H. Romeny, and M.A. Viergever. Efficient and reliable
schemes for nonlinear diffusion filtering. IEEE Transactions on Image Processing,
7(3):398–410, March 1998.
[82] Joachim Weickert. Anisotropic Diffusion in Image Processing. Teubner-Verlag,
Stuttgart, Germany, 1998.
[83] Ramin Zabih. Personal communication, 2002.
[84] Ramin Zabih and John Woodfill. Non-parametric local transforms for computing
visual correspondence. In Proceedings of the European Conference on Computer
Vision, volume 2, pages 151–158, 1994.
[85] Ye Zhang and Chandra Kambhamettu. Stereo matching with segmentation-
based cooperation. In Proceedings of the European Conference on Computer
Vision, volume 2, pages 556–571, 2002.
[86] C. Lawrence Zitnick and Takeo Kanade. A cooperative algorithm for stereo
matching and occlusion detection. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(7):675–684, July 2000.