Computational Investigation of Feature Extraction and Image
Organization
DISSERTATION
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University
By
Xiuwen Liu, B.Eng., M.S., M.S.
* * * * *
The Ohio State University
1999
Dissertation Committee:
Prof. DeLiang L. Wang, Adviser
Prof. Song-Chun Zhu
Prof. Anton F. Schenk
Prof. Alan J. Saalfeld
Approved by
Adviser
Department of Computer and Information Science
ABSTRACT
This dissertation investigates computational issues of feature extraction and image organization at different levels. Boundary detection and segmentation are studied extensively for range, intensity, and texture images. We develop a range image segmentation system using a LEGION network based on a similarity measure derived from estimated surface properties. We also propose a nonlinear smoothing algorithm based on local coupling structures, which exhibits distinctive temporal properties such as quick convergence.
We propose spectral histograms, consisting of marginal distributions of the responses to a chosen bank of filters, as a generic feature vector, motivated by evidence that early stages of human visual processing can be modeled using local spatial/frequency representations. Spectral histograms are studied extensively in texture modeling, classification, and segmentation. Experiments in texture synthesis and classification demonstrate that spectral histograms provide a sufficient and unified feature for capturing the perceptual appearance of textures, and they significantly improve classification performance on challenging texture images. We also propose a model for texture discrimination based on spectral histograms which matches existing psychophysical data. A new energy functional for image segmentation is proposed, and, given region features, an iterative and deterministic segmentation algorithm is derived. Satisfactory results
are obtained for natural texture images using spectral histograms. We also develop a novel algorithm which automatically identifies homogeneous texture features in input images. By incorporating texture structures, we achieve accurate texture boundary localization through a new distance measure. Through extensive experiments, we demonstrate that spectral histograms provide a generic feature which can be used effectively to solve fundamental vision problems.
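To make the feature concrete: a spectral histogram concatenates the normalized marginal distributions of the responses of a filter bank. The sketch below is a minimal illustration, assuming circular boundary handling and an illustrative bank of intensity and difference (gradient) filters; the filter bank and bin counts here are placeholders, not the bank used in Chapter 4.

```python
import numpy as np

def spectral_histogram(image, filters, bins=8):
    """Concatenate the normalized marginal histograms of the filter responses."""
    features = []
    for f in filters:
        # Circular convolution via FFT (one of several possible boundary rules).
        resp = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(f, image.shape)))
        hist, _ = np.histogram(resp, bins=bins)
        features.append(hist / hist.sum())  # marginal distribution of this filter
    return np.concatenate(features)

# Illustrative bank: intensity filter plus horizontal and vertical difference filters.
bank = [np.array([[1.0]]),
        np.array([[-1.0, 1.0]]),
        np.array([[-1.0], [1.0]])]
```

Two images are then compared through a distance between their spectral histograms rather than pixel by pixel, which is what makes the feature insensitive to small spatial shifts.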
Perceptual organization is studied based on a novel and biologically plausible boundary-pair representation. A network is developed which can simulate many perceptual phenomena through temporal dynamics. The boundary-pair representation provides a unified account of edge-based and surface-based representations.
A prototype system for automated feature extraction from remote sensing images is developed. By combining the advantages of the learning-by-example method and a locally coupled network, we show that a generic feature extraction system is feasible. The system is tested by extracting hydrographic features from large images of natural scenes.
ACKNOWLEDGMENTS
I express my gratitude to my advisor, Prof. DeLiang Wang, who not only generously gives his time and energy, but also teaches me fundamental principles that are essential for my scientific career. He not only shares many scientific insights and ideas, but also takes every chance to improve my skills in presentation and communication. I would also like to thank Prof. Song-Chun Zhu for sharing his time and ideas with me. I have benefited greatly from his computational perspective on vision problems.
I would like to thank my colleagues in the Department of Computer and Information Science, the Department of Civil and Environmental Engineering and Geodetic Science, and the Center for Mapping for providing an excellent environment for doing research.
I am especially grateful to Dr. John D. Bossler for providing me with opportunities to work on challenging yet fruitful problems. I would also like to thank Dr. Anton F. Schenk, Dr. Alan J. Saalfeld, Dr. J. Raul Ramirez, Dr. Joseph C. Loon, Dr. Ke Chen, Dr. Shannon Campbell, and many other faculty members and colleagues for their strong support. I would also like to express my thanks to my colleagues in the Vision Club at The Ohio State University, Dr. James Todd, Dr. Delwin Lindsey, and Dr. Tjeerd Dijkstra, for stimulating discussions. Many thanks go to my teammates, Dr. Erdogan Cesmeli, Mingying Wu, and Qiming Luo, for their help and insightful discussions. A Presidential Fellowship from The Ohio State University, which allowed me to focus on my dissertation work in the last year of my Ph.D. study, is gratefully acknowledged.
I would like to thank my Lord Jesus Christ for His wonderful guidance, arrangements, and the opportunities He gives especially to me. I would like to express my sincere gratitude for the strong support from my family. My mother-in-law takes good care of our family so that both my wife and I can focus on our studies. My wife Xujing provides a comfortable and reliable home for me. Without her support and encouragement, it would have been impossible for me to finish my study. I thank my daughter Teng-Teng for the joy we have together and for her support. I thank my family in China, my sisters and brothers, for their encouragement, understanding, and support.
VITA
August 14, 1966 . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - Hebei Province, China
July, 1989 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.Eng. Computer Science, Tsinghua University, Beijing, China
August, 1989 - February, 1993 . . . . . . . . . . . . . . Assistant Lecturer, Tsinghua University, Beijing, China
March, 1995 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.S. Geodetic Science and Surveying, The Ohio State University
June, 1996 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.S. Computer & Information Science, The Ohio State University
PUBLICATIONS
Journal Articles
X. Liu and J. R. Ramirez, “Automated vectorization and labeling of very large hypsographic map images using a contour graph.” Surveying and Land Information Systems, vol. 57(1), pp. 5-10, 1997.
X. Liu and D. L. Wang, “Range image segmentation using an oscillatory network.” IEEE Transactions on Neural Networks, vol. 10(3), pp. 564-573, 1999.
X. Liu, D. L. Wang, and J. R. Ramirez, “Boundary detection by contextual nonlinear smoothing.” Pattern Recognition, 1999.
Conference Papers
Y. Li, B. Zhang, and X. Liu, “A robust motion planner for assembly robots.” In Proceedings of the IEEE International Conference on Robotics and Automation, vol. 3, p. 1016, 1993.
X. Liu and D. L. Wang, “Range image segmentation using an oscillatory network.” In Proceedings of the 1997 IEEE International Conference on Neural Networks, vol. 3, pp. 1656-1660, 1997.
J. J. Loomis, X. Liu, Z. Ding, K. Fujimura, M. L. Evans, and H. Ishikawa, “Visualization of plant growth.” In Proceedings of the 1997 IEEE Conference on Visualization, pp. 475-478, 1997.
X. Liu and J. R. Ramirez, “Automatic extraction of hydrographic features in digital orthophoto images.” In Proceedings of GIS/LIS’1997, pp. 365-373, 1997.
X. Liu, D. L. Wang, and J. R. Ramirez, “Extracting hydrographic objects from satellite images using a two-layer neural network.” In Proceedings of the 1998 International Joint Conference on Neural Networks, vol. 2, pp. 897-902, 1998.
X. Liu, D. L. Wang, and J. R. Ramirez, “A two-layer neural network for robust image segmentation and its application in revising hydrographic features.” International Archives of Photogrammetry and Remote Sensing, vol. 32, part 3/1, pp. 464-472, 1998.
X. Liu, D. L. Wang, and J. R. Ramirez, “Oriented Statistical Nonlinear Smoothing Filter.” In Proceedings of the 1998 International Conference on Image Processing, vol. 2, pp. 848-852, 1998.
X. Liu, “A prototype system for extracting hydrographic regions from Digital Orthophoto Quadrangle images.” In Proceedings of GIS/LIS’1998, pp. 382-393, 1998.
X. Liu and D. L. Wang, “A boundary-pair representation for perception modeling.” In Proceedings of the 1999 International Joint Conference on Neural Networks, 1999.
X. Liu and D. L. Wang, “Modeling perceptual organization using temporal dynamics.” In Proceedings of the 1999 International Joint Conference on Neural Networks, 1999.
Technical Reports
J. J. Loomis, Z. Ding, X. Liu, K. Fujimura, and H. Ishikawa, “Flexible Object Reconstruction from Temporal Image Series.” Technical Report OSU-CISRC-5/96-TR30, Department of Computer and Information Science, The Ohio State University, 1996.
X. Liu and D. L. Wang, “Range Image Segmentation Using a LEGION Network.” Technical Report OSU-CISRC-10/96-TR49, Department of Computer and Information Science, The Ohio State University, 1996.
X. Liu, D. L. Wang, and J. R. Ramirez, “Boundary Detection by Contextual Nonlinear Smoothing.” Technical Report OSU-CISRC-7/98-TR21, Department of Computer and Information Science, The Ohio State University, 1998.
K. Chen, D. L. Wang, and X. Liu, “Weight adaptation and oscillatory correlation for image segmentation.” Technical Report OSU-CISRC-8/98-TR37, Department of Computer and Information Science, The Ohio State University, 1998.
X. Liu, K. Chen, and D. L. Wang, “Extraction of hydrographic regions from remote sensing images using an oscillator network with weight adaptation.” Technical Report OSU-CISRC-4/99-TR12, Department of Computer and Information Science, The Ohio State University, 1999.
FIELDS OF STUDY
Major Field: Computer and Information Science
Studies in:
Perception and Neurodynamics: Prof. DeLiang L. Wang
Machine Vision: Prof. Song-Chun Zhu
Digital Photogrammetry: Prof. Anton F. Schenk
Geographic Information Systems: Prof. Alan J. Saalfeld
TABLE OF CONTENTS
Page
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Chapters:
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2. Range Image Segmentation Using a Relaxation Oscillator Network . . . . 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Overview of the LEGION Dynamics . . . . . . . . . . . . . . . . . 13
2.2.1 Single Oscillator Model . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Emergent Behavior of LEGION Networks . . . . . . . . . . 15
2.3 Similarity Measure for Range Images . . . . . . . . . . . . . . . . 20
2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 Parameter Selection . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.3 Comparison with Existing Approaches . . . . . . . . . . . . 33
2.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1 Biological Plausibility of the Network . . . . . . . . . . . . 35
2.5.2 Comparison with Pulse-Coupled Neural Networks . . . . . . 36
2.5.3 Further Research Topics . . . . . . . . . . . . . . . . . . . . 38
3. Boundary Detection by Contextual Nonlinear Smoothing . . . . . . . . . 40
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Contextual Nonlinear Smoothing Algorithm . . . . . . . . . . . . 45
3.2.1 Design of the Algorithm . . . . . . . . . . . . . . . . . . . . 45
3.2.2 A Generic Nonlinear Smoothing Framework . . . . . . . . . 49
3.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.1 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . 51
3.3.2 Numerical Simulations . . . . . . . . . . . . . . . . . . . . 54
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.1 Results of the Proposed Algorithm . . . . . . . . . . . . . . 58
3.4.2 Comparison with Nonlinear Smoothing Algorithms . . . . . 65
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4. Spectral Histogram: A Generic Feature for Images . . . . . . . . . . . . . 75
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 Spectral Histograms . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.1 Properties of Spectral Histograms . . . . . . . . . . . . . . 85
4.2.2 Choice of Filters . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Texture Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3.1 Comparison with Heeger and Bergen’s Algorithm . . . . . . 96
4.4 Texture Classification . . . . . . . . . . . . . . . . . . . . . . . . 101
4.4.1 Classification at Fixed Scales . . . . . . . . . . . . . . . . . 104
4.4.2 Classification at Different Scales . . . . . . . . . . . . . . . 105
4.4.3 Image Classification . . . . . . . . . . . . . . . . . . . . . . 108
4.4.4 Training Samples and Generalization . . . . . . . . . . . . 111
4.4.5 Comparison with Existing Approaches . . . . . . . . . . . . 113
4.5 Content-based Image Retrieval . . . . . . . . . . . . . . . . . . . 115
4.6 Comparison of Statistic Features . . . . . . . . . . . . . . . . . . 121
4.7 A Model for Texture Discrimination . . . . . . . . . . . . . . . . 124
4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5. Image Segmentation Using Spectral Histograms . . . . . . . . . . . . . . 131
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.2 Formulation of Energy Functional for Segmentation . . . . . . . . 134
5.3 Algorithms for Segmentation . . . . . . . . . . . . . . . . . . . . 135
5.4 Segmentation with Given Region Features . . . . . . . . . . . . . 139
5.4.1 Segmentation at a Fixed Integration Scale . . . . . . . . . . 140
5.4.2 Segmentation with Multiple Scales . . . . . . . . . . . . . . 150
5.4.3 Region-of-interest Extraction . . . . . . . . . . . . . . . . . 153
5.5 Automated Seed Selection . . . . . . . . . . . . . . . . . . . . . . 156
5.6 Localization of Texture Boundaries . . . . . . . . . . . . . . . . . 160
5.7 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6. Perceptual Organization Based on Temporal Dynamics . . . . . . . . . . 169
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.2 Figure-Ground Segregation Network . . . . . . . . . . . . . . . . 172
6.2.1 Boundary-Pair Representation . . . . . . . . . . . . . . . . 172
6.2.2 Incorporation of Gestalt Rules . . . . . . . . . . . . . . . . 175
6.2.3 Temporal Properties of the Network . . . . . . . . . . . . . 177
6.3 Surface Completion . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 179
6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
7. Extraction of Hydrographic Regions from Remote Sensing Images Using an Oscillator Network with Weight Adaptation . . . . . . . . . . . . 188
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.2 Weight Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.3 Automated Seed Selection . . . . . . . . . . . . . . . . . . . . . . 200
7.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 202
7.4.1 Parameter Selection . . . . . . . . . . . . . . . . . . . . . . 203
7.4.2 Synthetic Image . . . . . . . . . . . . . . . . . . . . . . . . 203
7.4.3 Hydrographic Region Extraction from DOQQ Images . . . . 204
7.5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
8. Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 221
8.1 Contributions of Dissertation . . . . . . . . . . . . . . . . . . . . 221
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
8.2.1 Correspondence Through Spectral Histograms . . . . . . . . 222
8.2.2 Integration of Bottom-up and Top-down Approaches . . . . 223
8.2.3 Psychophysical Experiments . . . . . . . . . . . . . . . . . 227
8.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
LIST OF TABLES
Table Page
3.1 Quantitative comparison of boundary detection results shown in Figure 3.15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2 Quantitative comparison of boundary detection results shown in Figure 3.16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 L1-norm distance of the spectral histograms and RMS distance between images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.2 Classification errors of methods shown in [108] and our method . . . 115
4.3 Comparison of texture discrimination measures . . . . . . . . . . . . 128
7.1 Comparison of error rates using neural network classification and the proposed method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
LIST OF FIGURES
Figure Page
1.1 A texture image and the corresponding numerical arrays. (a) A texture image with size 128 × 64. (b) A small portion with size 40 × 30 of (a) centered at pixel (64, 37), which is on the boundary between the two texture regions. (c) Numerical values of (b). To save space, the values are displayed in hexadecimal format. . . . . . . . . . . . . . . . . . . 2
1.2 Demonstration of nonlinearity for texture images. (a) A regular texture image. (b) The image in (a) circularly shifted left and downward by 2 pixels in each direction. (c) The pixel-by-pixel average of (a) and (b). The relative variance defined in (3.20) between (a) and (b) is 137, and between (a) and (c) is 69. The distance between the spectral histograms defined in Chapter 4 is 1.288 between (a) and (b) and 38.5762 between (a) and (c). . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 A stable limit cycle for a single relaxation oscillator. The thick solid line represents the limit cycle and thin solid lines stand for nullclines. Arrows indicate the different traveling speeds resulting from fast and slow time scales. The following parameter values are used: ε = 0.02, β = 0.1, γ = 3.0, and a constant stimulus I = 1.0. . . . . . . . . . . 15
2.2 The temporal activities of the excitatory unit of a single oscillator for different γ values. Other parameters are the same as for Figure 2.1. (a) γ = 3.0. (b) γ = 40.0. . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Architecture of a two-dimensional LEGION network with eight-nearest-neighbor coupling. An oscillator is indicated by an empty ellipse, and the global inhibitor is indicated by a filled circle. . . . . . . . . . . . . 19
2.4 Illustration of LEGION dynamics. (a) An input image consisting of seven geometric objects, with 40 × 40 pixels. (b) The image in (a) corrupted by adding 10% noise, which is presented to a 40 × 40 LEGION network. (c) A snapshot of the network activity at the beginning. (d)-(j) Subsequent snapshots of the network activity. In (c)-(j), the grayness of a pixel is proportional to the corresponding oscillator’s activity and black pixels represent oscillators in the active phase. The parameter values for this simulation are as follows: ε = 0.02, β = 0.1, γ = 20.0, θx = −0.5, θp = 7.0, θz = 0.1, θ = 0.8, and Wz = 2.0. . . . . . . . . 21
2.5 Temporal evolution of the LEGION network. The upper seven plots show the combined temporal activities of the seven oscillator blocks representing the corresponding geometric objects. The eighth plot shows the temporal activities of all the stimulated oscillators which correspond to the background. The bottom one shows the temporal activity of the global inhibitor. The simulation took 20,000 integration steps using a fourth-order Runge-Kutta method to solve the differential equations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
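The oscillator traces in Figures 2.1, 2.2, and 2.5 can be reproduced numerically. The sketch below assumes the standard Terman-Wang relaxation oscillator, dx/dt = 3x − x³ + 2 − y + I and dy/dt = ε(γ(1 + tanh(x/β)) − y), with the parameter values listed in Figure 2.1; a simple forward-Euler step stands in for the fourth-order Runge-Kutta integration mentioned in Figure 2.5.

```python
import numpy as np

def simulate_oscillator(eps=0.02, beta=0.1, gamma=3.0, I=1.0,
                        dt=0.01, steps=40000):
    """Integrate one relaxation oscillator with forward Euler."""
    x, y = -2.0, 0.0            # start in the silent phase
    xs = np.empty(steps)
    for t in range(steps):
        dx = 3.0 * x - x ** 3 + 2.0 - y + I                  # fast excitatory unit
        dy = eps * (gamma * (1.0 + np.tanh(x / beta)) - y)   # slow recovery unit
        x, y = x + dt * dx, y + dt * dy
        xs[t] = x
    return xs
```

With a constant stimulus the trajectory relaxes onto the limit cycle of Figure 2.1, alternating between an active phase (x near 2) and a silent phase (x near −2).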
2.6 Segmentation result of the LEGION network for the range image of a column. (a) The input range image. (b) The background region. (c)-(f) The four segmented regions. (g) The overall segmentation result represented by a gray map. (h) The corresponding intensity image. (i) The 3-D construction model. As in Figure 2.4, black pixels in (b)-(f) represent oscillators that are in the active phase. . . . . . . . . . . . 28
2.7 Segmentation results of the LEGION network for range images. In each row, the left frame shows the input range image, the middle one shows the segmentation result represented by a gray map, and the right one shows the 3-D construction model for comparison purposes. . . . 30
2.8 Segmentation results of the LEGION network for several more range images. See the caption of Figure 2.7 for arrangement. . . . . . . . . 31
2.9 Two examples with thin regions. The global inhibition and potential threshold are tuned to get the results shown here. See the caption of Figure 2.7 for arrangement. . . . . . . . . . . . . . . . . . . . . . . . 32
2.10 A hierarchy obtained from multiscale segmentation. The top is the input range image, and each segmented region is further segmented by increasing the level of global inhibition. As in Figure 2.6, black pixels represent active oscillators, corresponding to the popped-up region. See Figure 2.6(i) for the corresponding 3-D model. . . . . . . . . . . . . . 39
3.1 An example with non-uniform boundary gradients and substantial noise. (a) A noise-free synthetic image. Gray values in the image: 98 for the left ‘[’ region, 138 for the square, 128 for the central oval, and 158 for the right ‘]’ region. (b) A noisy version of (a) with Gaussian noise of σ = 40. (c) Local gradient map of (b) using the Sobel operators. (d)-(f) Smoothed images from an anisotropic diffusion algorithm [106] at 50, 100, and 1000 iterations. (g)-(i) Corresponding edge maps of (d)-(f) respectively using the Sobel edge detector. . . . . . . . . . . . 44
3.2 Illustration of the coupling structure of the proposed algorithm. (a) Eight oriented windows and a fully connected window defined on a 3 × 3 neighborhood. (b) A small synthetic image patch of 6 × 8 pixels. (c) The resulting coupling structure for (b). There is a directed edge from (i1, j1) to a neighbor (i0, j0) if and only if (i1, j1) contributes to the smoothing of (i0, j0) according to equations (3.12) and (3.9). Each circle represents a pixel, where the inside color is proportional to the gray value of the corresponding pixel. Ties in (3.9) are broken according to the left-right and top-down preference of the oriented windows in (a). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Temporal behavior of the proposed algorithm with respect to the amount of noise. Six noisy images are obtained by adding zero-mean Gaussian noise with σ of 5, 10, 20, 30, 40, and 60, respectively, to the noise-free image shown in Figure 3.1(a). The plot shows the deviation from the ground truth image with respect to iterations for the noise-free image and the six noisy images. . . . . . . . . . . . . . . . . . . . . . 56
3.4 Relative variance of the proposed algorithm for the noise-free image shown in Figure 3.1(a) and four noisy images with zero-mean Gaussian noise with σ of 5, 20, 40, and 60, respectively. . . . . . . . . . . . . . 57
3.5 Relative variance of the proposed algorithm for the real images shown in Figures 3.9-3.12. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.6 The oriented bar-like windows used throughout this chapter for synthetic and real images. The size of each kernel is approximately 3 × 10 pixels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 The smoothed images at the 11th iteration and detected boundaries for three synthetic images obtained by adding the specified Gaussian noise to the noise-free image shown in Figure 3.1(a). The top row shows the input images, the middle row the smoothed images at the 11th iteration, and the bottom row the boundaries detected using the Sobel edge detector. (a) Gaussian noise with σ = 10. (b) Gaussian noise with σ = 40. (c) Gaussian noise with σ = 60. . . . . . . . . . . 61
3.8 The smoothed image at the 11th iteration and detected boundaries for a synthetic image with corners. (a) Input image. (b) Smoothed image. (c) Boundaries detected. . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.9 The smoothed image at the 11th iteration and detected boundaries for a grocery store advertisement. Details are smoothed out while major boundaries and junctions are preserved accurately. (a) Input image. (b) Smoothed image. (c) Boundaries detected. . . . . . . . . . . . . . 62
3.10 The smoothed image at the 11th iteration and detected boundaries for a natural satellite image with several land use patterns. The boundaries between different regions are formed from noisy segments due to the coupling structure. (a) Input image. (b) Smoothed image. (c) Boundaries detected. . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.11 The smoothed image at the 11th iteration and detected boundaries for a woman image. While the boundaries between large features are preserved and detected, detail features such as facial features are smoothed out. (a) Input image. (b) Smoothed image. (c) Boundaries detected. 64
3.12 The smoothed image at the 11th iteration and detected boundaries for a texture image. The boundaries between different textured regions are formed while details due to textures are smoothed out. (a) Input image. (b) Smoothed image. (c) Boundaries detected. . . . . . . . . 65
3.13 Deviations from the ground truth image for the four nonlinear smoothing methods. Dashed line: the SUSAN filter [117]; dotted line: the Perona-Malik model [105]; dash-dotted line: the Weickert model of edge-enhancing anisotropic diffusion [137]; solid line: the proposed algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.14 Relative variance of the four nonlinear smoothing methods. Dashed line: the SUSAN filter [117]; dotted line: the Perona-Malik diffusion model [105]; dash-dotted line: the Weickert model [137]; solid line: the proposed algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.15 Smoothing results and detected boundaries of the four nonlinear methods for the synthetic image shown in Figure 3.7(a). Here the noise is not large and all of the methods perform well in preserving boundaries. 70
3.16 Smoothing results and detected boundaries of the four nonlinear methods for the synthetic image with substantial noise shown in Figure 3.7(b). The proposed algorithm generates sharper and better-connected boundaries than the other three methods. . . . . . . . . . 72
3.17 Smoothing results and detected boundaries for the natural scene satellite image shown in Figure 3.10. The smoothed image of the proposed algorithm is at the 11th iteration while the smoothed images of the other three methods are chosen manually. While the other three methods generate similar fragmented boundaries, the proposed algorithm forms the boundaries between different regions due to its coupling structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1 Basis functions of the Fourier transform in time and frequency domains with their Fourier transforms. (a) An impulse and its Fourier transform. (b) A sinusoid function and its Fourier transform. . . . . 79
4.2 A texture image with its Gabor filter response. (a) Input texture image. (b) A Gabor filter, which is truncated to save computation. (c) The filter response obtained through convolution. . . . . . . . . . . . 81
4.3 A texture image and its spectral histograms. (a) Input image. (b) A Gabor filter. (c) The histogram of the filter. (d) Spectral histograms of the image. There are eight filters, including the intensity filter, gradient filters Dxx and Dyy, four LoG filters with T = √2/2, 1, 2, and 4, and a Gabor filter Gcos(12, 150). There are 8 bins in the histograms of the intensity and gradient filters and 11 bins for the other filters. . 84
4.4 Gibbs sampler for texture synthesis. . . . . . . . . . . . . . . . . . . . 88
4.5 Texture image synthesis by matching observed statistics. (a) Observed texture image. (b) Initial image. (c) Synthesized image after 14 sweeps. (d) The total matched error with respect to sweeps. . . . . . . . . . . 90
4.6 Temporal evolution of a selected filter for texture synthesis. (a) A Gabor filter. (b) The histograms of the Gabor filter. Dotted line: observed histogram, which is covered by the histogram after 14 sweeps; dashed line: initial histogram; dash-dotted line: histogram after 2 sweeps; solid line: histogram after 14 sweeps. (c) The error of the chosen filter with respect to the sweeps. (d) The error between the observed histogram and the synthesized one after 14 sweeps. Here the error is multiplied by 1000. . . . . . . . . . . . . . . . . . . . . . . . 91
4.7 More texture synthesis examples. The left column shows the observed images and the right column shows the images synthesized within 15 sweeps. In (b), due to local minima, there are local regions which are not perceptually similar to the observed image. . . . . . . . . . . . . 92
4.8 Real texture images of regular patterns with images synthesized after 20 sweeps. (a) An image of a leather surface. The total matched error after 20 sweeps is 0.082. (b) An image of a pressed calf leather surface. The total matched error after 20 sweeps is 0.064. . . . . . . . . . . . 94
4.9 Texture synthesis for an image with different regions. (a) The observed texture image. This image is not a homogeneous texture image and consists mainly of two homogeneous regions. (b) The initial image. (c) Synthesized image after 100 sweeps. Even though the spectral histogram of each filter is matched well, the error is still large compared to other images; for the intensity filter in particular, the error is still about 7.44%. The synthesized image is perceptually similar to the observed image except for the geometrical relationships among the homogeneous regions. (d) The matched error with respect to the sweeps. Because the observed image is not homogeneous, the synthesis algorithm converges more slowly compared with Figure 4.5(d). . . . . 95
4.10 A synthesis example for a synthetic texton image. (a) The original synthetic texton image with size 128 × 128. (b) The synthesized image with size 256 × 256. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
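The sweep-based synthesis summarized in Figures 4.5-4.10 repeatedly perturbs pixels so that the synthesized image's statistics approach the observed ones. As a toy stand-in for the Gibbs sampler of Figure 4.4, the sketch below greedily matches only an intensity histogram; this is a hypothetical simplification, since the actual algorithm matches the full spectral histogram under an annealing schedule.

```python
import numpy as np

def synthesize(observed, shape=(16, 16), levels=8, sweeps=20, seed=0):
    """Greedy pixel-wise synthesis toward the observed intensity histogram."""
    rng = np.random.default_rng(seed)
    target, _ = np.histogram(observed, bins=levels, range=(0, levels))
    target = target / target.sum()

    def err(im):
        h, _ = np.histogram(im, bins=levels, range=(0, levels))
        return np.abs(h / h.sum() - target).sum()  # L1 histogram error

    img = rng.integers(0, levels, size=shape)
    for _ in range(sweeps):                          # one sweep visits every pixel
        for i in range(shape[0]):
            for j in range(shape[1]):
                old, e_old = img[i, j], err(img)
                img[i, j] = rng.integers(0, levels)  # propose a new gray level
                if err(img) > e_old:                 # keep only non-worsening moves
                    img[i, j] = old
    return img, err(img)
```

Starting from a random image, the matched error decreases over sweeps, mirroring the error curves in Figures 4.5(d) and 4.9(d).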
4.11 A synthesis example for an image consisting of two regions. (a) The original synthetic image with size 128 × 128, consisting of two intensity regions. (b) The synthesized image with size 256 × 256. . . . . . . . . 97
4.12 A synthesis example for a face image. (a) Lena image with size 347 × 334. (b) The synthesized image with size 256 × 256. . . . . . . . . . 97
4.13 The synthesized images of the 40 texture images shown in Figure 4.16. Here the same filters and cooling schedule are used for all the images. 98
4.14 Synthesized images from different initial images for the texture image shown in Figure 4.3(a). (a)-(c) The left column is the initial image and the right column is the synthesized image after 20 sweeps. (d) The matched error with respect to the number of sweeps. . . . . . . . . . 100
4.15 Synthesized images from Heeger and Bergen’s algorithm and the matched spectral histogram error for the image shown in Figure 4.3(a). (a) Synthesized image at 3 iterations. (b) Synthesized image at 10 iterations. (c) Synthesized image at 100 iterations. (d) The L1-norm error between the observed spectral histogram and the synthesized one. 102
4.16 Forty texture images used in the classification experiments. The input image size is 256 × 256. . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.17 The divergence between the feature vectors of the images in the texture image database shown in Figure 4.16. (a) The cross-divergence matrix shown in numerical values. (b) The numerical values displayed as an image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.18 (a) The classification error for each image in the texture database along with the ratio between the maximum and minimum divergence shown in (b) and (c) respectively. (b) The maximum divergence of the spectral histograms from the feature vector of each image. (c) The minimum divergence between each image and the other ones. . . . . 107
4.19 The classification error of the texture database with respect to the scale for feature extraction. . . . . . . . . . . . . . . . . . . . . . . . 108
4.20 (a) Image “Hexholes-2” from the texture database. (b) The classification error rate for the image. (c) The ratio between maximum divergence and minimum cross divergence with respect to scales. . . . . . . 109
4.21 (a) Image “Woolencloth-2” from the texture database. (b) The classification error rate for the image. (c) The ratio between maximum divergence and minimum cross divergence with respect to scales. . . . . . . 110
4.22 (a) A texture image consisting of five texture regions from the texture database. (b) Classification result using spectral histograms. (c) Divergence between the spectral histograms and the feature vector of the assigned texture image. (d) The ground truth segmentation of the image. (e) Misclassified pixels, shown in black. . . . . . . 112
4.23 (a) The classification error for each image in the database at integration scale 35 × 35. (b) The classification error at different integration scales. In both cases, solid line – training is done using half of the samples; dashed line – training is done using all the samples. . . . . . . 113
4.24 The classification error with respect to the ratio of testing samples to training samples. Solid line – integration scale 35 × 35; dashed line – integration scale 23 × 23. . . . . . . 114
4.25 A group of 10 texture images used in [108]. Each image is 256× 256. 116
4.26 A group of 10 texture images used in [108]. Each image is 256× 256. 117
4.27 Image retrieval result from a 100-image database using a given image patch based on spectral histograms. (a) Input image patch with size 35 × 35. (b) The sorted matched error for the 100 images in the database. (c) The first nine images with the smallest errors. . . . . . . 119
4.28 Image retrieval result from a 100-image database using a given image patch based on spectral histograms. (a) Input image patch with size 53 × 53. (b) The sorted matched error for the 100 images in the database. (c) The first nine images with the smallest errors. . . . . . . 120
4.29 Classification error in percentage of the texture database for different features. Solid line: spectral histogram of eight filters including intensity, gradients, LoG with two scales, and Gabor with three different orientations. Dotted line: Mean value of the image patch. Dashed line: Weighted sum of the mean and variance values of the image patch. The weights are determined to achieve the best result for window size 35 × 35. Dash-dotted line: Intensity histogram of image patches. . . . . . . 122
4.30 Classification error in percentage of the texture database for different filters. Solid line: spectral histogram of eight filters including intensity, gradients, LoG with two scales, and Gabor with three different orientations. Dotted line: Gradient filters Dxx and Dyy. Dashed line: Laplacian of Gaussian filters LoG(√2/2), LoG(1), and LoG(2). Dash-dotted line: Six cosine Gabor filters with T = 4 and six orientations θ = 0, 30, 60, 90, 120, and 150. . . . . . . 123
4.31 Classification error in percentage of the texture database for different distance measures. Solid line: χ² statistic. Dotted line: L1-norm. Dashed line: L2-norm. Dash-dotted line: Kullback-Leibler divergence. . . . . . . 124
4.32 Ten synthetic texture pairs scanned from Malik and Perona [87]. The size is 136 × 136. . . . . . . 125
4.33 The averaged texture gradient for selected texture pairs. (a) The texture pair (+ O) as shown in Figure 4.32. (b) The texture gradient averaged along each column for (a). The horizontal axis is the column number and the vertical axis is the gradient. (c) The texture pair (R-mirror-R). (d) The averaged texture gradient for (c). . . . . . . 126
4.34 Comparison of texture discrimination measures. Dashed line - psychophysical data from Krose [69]; dotted line - prediction of Malik and Perona’s model [87]; solid line - prediction of the proposed model based on spectral histograms. . . . . . . 128
5.1 Gray-level image with two regions with similar means but different variances. . . . . . . 136
5.2 Examples of asymmetric windows. The solid cross is the central pixel. (a) Square windows. (b) Circular windows. . . . . . . 138
5.3 Gray-level image segmentation using spectral histograms. The integration scale W(s) for spectral histograms is a 15 × 15 square window, λΓ = 0.2, and λB = 3. Two features are given at (32, 64) and (96, 64). (a) A synthetic image with size 128 × 128. The image is generated by adding zero-mean Gaussian noise with different σ’s at the left and right regions. (b) Initial classification result. (c) Final segmentation result. The segmentation error is 0.00% and all the pixels are segmented correctly. . . . . . . 141
5.4 The histogram and derived probability model of the χ²-statistic for the given region features. Solid lines stand for the left region and dashed lines stand for the right region. (a) The histogram of the χ²-statistic between the given feature and the computed ones at a coarser grid. (b) The derived probability model for the left and right regions. . . . . . . 142
5.5 A row from the image shown in Figure 5.3 and the result using the derived probability model. In (b) and (c), solid lines stand for the left region and dashed lines stand for the right region. (a) The 64th row from the image. (b) The probability of the two given regional features using asymmetric windows when estimating the spectral histogram. The edge point is correctly located between columns 64 and 65. (c) Similar to (b) but using windows centered at the pixel to compute the spectral histogram. Labels between columns 58 and 65 cannot be decided, because the computed spectral histograms within that interval do not belong to either region. . . . . . . 143
5.6 Classification result based on the χ²-statistic for the row shown in Figure 5.4(a). Solid lines stand for the left region and dashed lines stand for the right region. (a) χ²-statistic from the two given regional features using asymmetric windows when estimating the spectral histogram. If we use the minimum distance classifier, the edge point will be located between columns 65 and 66, where the true edge point should be between columns 64 and 65. (b) Similar to (a) but using windows centered at the pixel to compute the spectral histogram. The edge point is localized between 61 and 62. . . . . . . 144
5.7 Gray-level image segmentation using spectral histograms. W(s) is a 15 × 15 square window, λΓ = 0.2, and λB = 5. Two features are given at (32, 64) and (96, 45). (a) A synthetic image with size 128 × 128. The image is generated by adding zero-mean Gaussian noise with different σ’s at the two different regions. Here the boundary is ’S’ shaped to test the segmentation algorithm in preserving boundaries. (b) Initial classification result. (c) Final segmentation result. . . . . . . 145
5.8 Texture image segmentation using spectral histograms. W(s) is a 29 × 29 square window, λΓ = 0.2, and λB = 2. Features are given at pixels (32, 32) and (96, 32). (a) A texture image consisting of two texture regions with size 128 × 64. (b) Initial classification result. (c) Final segmentation result. . . . . . . 145
5.9 Texture image segmentation using spectral histograms. W(s) is a 29 × 29 square window, λΓ = 0.2, and λB = 3. (a) A texture image consisting of two texture regions with size 128 × 64. (b) Initial classification result. (c) Final segmentation result. . . . . . . 146
5.10 Texture image segmentation using spectral histograms. W(s) is a 35 × 35 square window, λΓ = 0.4, and λB = 3. Four features are given at (32, 32), (32, 96), (96, 32), and (96, 96). (a) A texture image consisting of four texture regions with size 128 × 128. (b) Initial classification result. (c) Final segmentation result. . . . . . . 147
5.11 Texture image segmentation using spectral histograms. W(s) is a 35 × 35 square window, λΓ = 0.4, and λB = 3. Four features are given at (32, 32), (32, 96), (96, 32), and (96, 96). (a) A texture image consisting of four texture regions with size 128 × 128. (b) Initial classification result. (c) Final segmentation result. . . . . . . 147
5.12 Texture image segmentation using spectral histograms. W(s) is a 29 × 29 square window, λΓ = 0.2, and λB = 3. Four features are given at (32, 32), (32, 96), (96, 32), and (96, 96). (a) A texture image consisting of four texture regions with size 128 × 128. (b) Initial classification result. (c) Final segmentation result. . . . . . . 148
5.13 Texture image segmentation using spectral histograms. W(s) is a 35 × 35 square window, λΓ = 0.4, and λB = 3. Four features are given at (32, 32), (32, 96), (96, 32), and (96, 96). (a) A texture image consisting of four texture regions with size 128 × 128. (b) Initial classification result. (c) Final segmentation result. . . . . . . 148
5.14 A challenging example for texture image segmentation. W(s) is a 35 × 35 square window, λΓ = 0.4, and λB = 20. Two features are given at (160, 160) and (252, 250). (a) Input image consisting of two texture images, where the boundary cannot be localized clearly because of their similarity. The size of the image is 320 × 320 pixels. (b) Initial classification result. (c) Final segmentation result. . . . . . . 149
5.15 Another challenging example for texture segmentation. W(s) is a 35 × 35 square window, λΓ = 0.4, and λB = 20. Two features are given at (160, 160) and (252, 250). (a) Input image consisting of two texture images, where the boundary cannot be localized clearly because of their similarity. The size of the image is 320 × 320 pixels. (b) Initial classification result. (c) Final segmentation result. . . . . . . 149
5.16 Segmentation for a texton image with oriented short lines. W(s) is a 35 × 35 square window, λΓ = 0.4, and λB = 10. Two features are given at (185, 67) and (180, 224). (a) The input image with size 402 × 302 pixels. (b) The initial classification result. (c) The segmentation result using spectral histograms. (d) The initial classification result using two Gabor filters Gcos(10, 30) and Gcos(10, 60). (e) The segmentation result using two Gabor filters. The result is improved significantly. . . . . . . 150
5.17 Segmentation results at different integration scales. Parameters λΓ = 0.4 and λB = 4 are fixed. (a) The input image. (b) The percentage of mis-classified pixels. . . . . . . 151
5.18 Segmentation results using different segmentation scales for the image shown in Figure 5.17(a). In each sub-figure, the left shows the initial classification result and the right shows the segmentation result. Parameters λΓ = 0.4 and λB = 4 are fixed. (a) W(s) is a 1 × 1 square window. (b) W(s) is a 3 × 3 square window. (c) W(s) is a 5 × 5 square window. (d) W(s) is a 7 × 7 square window. . . . . . . 152
5.19 A texture image with a cheetah. The feature vector is calculated at pixel (247, 129) at scale 19 × 19, λΓ = 0.2, and λB = 2.5. To demonstrate the accuracy of the results, the classification and segmentation results are embedded into the original image by lowering the intensity values of the background region by a factor of 2. (a) The input image with size 324 × 486. (b) The initial classification result using 8 filters. (c) The final segmentation result using 8 filters. (d) The initial classification result using 6 filters consisting of Dxx, Dyy, LoG(√2/2), LoG(1), LoG(2), and LoG(3). (e) The final segmentation result corresponding to (d). . . . . . . 154
5.20 An indoor image with a sofa. The feature vector is calculated at pixel (146, 169) at scale 35 × 35, λΓ = 0.2, and λB = 3. (a) Input image with size 512 × 512. (b) Initial classification result. (c) Final segmentation result. (d) Segmentation result if we assume there is another region feature given at (223, 38). . . . . . . 155
5.21 Texture image segmentation with representative pixels identified automatically. W(s) is a 29 × 29 square window, W(a) is a 35 × 35 square window, λC = 0.1, λA = 0.2, λB = 2.0, λΓ = 0.2, and TA = 0.08. (a) Input texture image, which is shown in Figure 5.8. (b) Initial classification result. Here the representative pixels are detected automatically. (c) Final segmentation result. . . . . . . 158
5.22 Texture image segmentation with representative pixels identified automatically. W(s) is a 29 × 29 square window, W(a) is a 43 × 43 square window, λC = 0.4, λA = 0.4, λB = 5.0, λΓ = 0.4, and TA = 0.30. (a) Input texture image, which is shown in Figure 5.10. (b) Initial classification result. Here the representative pixels are detected automatically. (c) Final segmentation result. . . . . . . 158
5.23 Texture image segmentation with representative pixels identified automatically. W(s) is a 29 × 29 square window, W(a) is a 43 × 43 square window, λC = 0.1, λA = 0.2, λB = 5.0, λΓ = 0.4, and TA = 0.20. (a) Input texture image, which is shown in Figure 5.11. (b) Initial classification result. Here the representative pixels are detected automatically. (c) Final segmentation result. . . . . . . 159
5.24 Texture image segmentation with representative pixels identified automatically. (a) Input texture image, which is shown in Figure 5.12. (b) Initial classification result. Here the representative pixels are detected automatically. (c) Final segmentation result. . . . . . . 159
5.25 Texture image segmentation with representative pixels identified automatically. W(s) is a 29 × 29 square window, W(a) is a 43 × 43 square window, λC = 0.1, λA = 0.2, λB = 5.0, λΓ = 0.4, and TA = 0.20. (a) Input texture image, which is shown in Figure 5.13. Here the representative pixels are detected automatically. (c) Final segmentation result. . . . . . . 160
5.26 (a) A texture image with size 256 × 256. (b) The segmentation result using spectral histograms. (c) Wrongly segmented pixels of (b), represented in black with respect to the ground truth. The segmentation error is 6.55%. (d) Refined segmentation result. (e) Wrongly segmented pixels of (d), represented in black as in (c). The segmentation error is 0.95%. . . . . . . 161
5.27 (a) A synthetic image with size 128 × 128, as shown in Figure 5.7(a). (b) The segmentation result using spectral histograms, as shown in Figure 5.7(c). (c) Refined segmentation result. . . . . . . 163
5.28 (a) A texture image with size 256 × 256. (b) The segmentation result using spectral histograms. (c) Refined segmentation result. . . . . . . 163
5.29 (a) A texture image with size 256 × 256. (b) The segmentation result using spectral histograms. (c) Refined segmentation result. . . . . . . 164
5.30 Distance between scales for different regions. (a) Input image. (b) The distance between different integration scales for the left region at pixel (32, 64). (c) The distance between different integration scales for the right region at pixel (96, 64). . . . . . . 165
5.31 A natural image with a zebra. λΓ = 0.2 and λB = 5.5. (a) The input image. (b) The segmentation result with one feature computed at (205, 279). (c) The segmentation result with one feature computed at (308, 298). (d) The combined result from (b) and (c). . . . . . . 167
6.1 On- and off-center cell responses. (a) Input image. (b) On-center cell responses. (c) Off-center cell responses. (d) Binarized on- and off-center cell responses. White regions represent on-center response regions and black off-center regions. . . . . . . 173
6.2 The figure-ground segregation network architecture for Figure 6.1(a). Nodes 1, 2, 3, and 4 belong to the white region; nodes 5, 6, 7, and 8 belong to the black region; nodes 9 and 10, and 11 and 12, belong to the left and right gray regions respectively. Solid lines represent excitatory coupling while dashed lines represent inhibitory connections. . . . . . . 174
6.3 Temporal behavior of each node in the network shown in Figure 6.2. Each plot shows the status of the node with respect to time. The dashed line is 0.5. . . . . . . 178
6.4 Surface completion results for Figure 6.1(a). (a) White region. (b) Gray region. (c) Black region. . . . . . . 180
6.5 Layered representation of surface completion for the results shown in Figure 6.4. . . . . . . 180
6.6 Images with virtual contours. (a) Kanizsa triangle. (b) Woven square. (c) Double Kanizsa. . . . . . . 181
6.7 Surface completion results for the corresponding images in Figure 6.6. . . . . . . 182
6.8 Images with virtual contours. (a) Kanizsa triangle. (b) Four crosses. (c) Overlapping rectangular bars. . . . . . . 183
6.9 Surface completion results for the corresponding images in Figure 6.8. . . . . . . 183
6.10 Images with virtual contours. (a) Original pacman image. (b) Mixed pacman image. (c) Alternate pacman image. . . . . . . 183
6.11 Layered representation of surface completion for the corresponding images shown in Figure 6.10. . . . . . . 184
6.12 Bregman and real images. (a) and (b) Examples by Bregman [9]. (c) A grocery store image. . . . . . . 185
6.13 Surface completion results for images shown in Figure 6.12. . . . . . . 185
6.14 Bistable perception. (a) Face-vase input image. (b) Faces as figures. (c) Vase as figure. . . . . . . 186
6.15 Temporal behavior of the system for Figure 6.14(a). Dotted lines are 0.5. . . . . . . 187
7.1 Classification result of a noisy synthetic image using a three-layer perceptron. (a) The input image with size 230 × 240. (b) The ground truth image. (c) Positive and negative training samples. Positive examples are shown as white and negative ones as black. (d) Classification result from a three-layer perceptron. . . . . . . 190
7.2 Lateral connection evolution through weight adaptation, illustrated using the 170th row from the image shown in Figure 7.1(a). (a) The original signal. (b) Initial connection weights. (c) Connection weights after 40 iterations. (d) Corresponding smoothed signal. . . . . . . 194
7.3 Architecture and local features for the seed selection neural network. . . . . . . 202
7.4 Segmentation result using the proposed method for a synthetic image. (a) A synthetic image as shown in Figure 7.1(a). (b) The segmentation result from the proposed method. Here Wz = 0.25 and θp = 100. . . . . . . 204
7.5 A DOQQ image with size 6204 × 7676 pixels of the Washington East, D.C.-Maryland area. . . . . . . 207
7.6 Seed pixels obtained by applying a trained three-layer perceptron to the DOQQ image shown in Figure 7.5. Seed pixels are marked as white and superimposed on the original image. The network is trained using 19 positive and 28 negative samples, where each sample is a 31 × 31 window. . . . . . . 208
7.7 Extracted hydrographic regions from the DOQQ image shown in Figure 7.5. Hydrographic regions are marked as white and superimposed on the original image to show the accuracy of the extracted result. Here Wz = 0.15 and θp = 4000. . . . . . . 209
7.8 A ground truth generated by manually placing seeds based on the corresponding 1:24,000 USGS topographic map and DOQQ image. The result was manually edited. . . . . . . 210
7.9 Hydrographic region extraction result for an aquatic garden area with manually placed seed pixels. Because no reliable seed region is detected, this aquatic region, which is very similar to soil regions, is not extracted from the DOQQ image as shown in Figure 7.7. Extracted regions are marked as white and superimposed on the original image. . . . . . . 211
7.10 Extraction result for an image patch from Figure 7.5. (a) The input image. (b) The seed points from the neural network. (c) A topographic map of the area. Here the map is scanned from the paper version and not warped with respect to the image. (d) Extracted result from the proposed method. Extracted regions are represented by white and superimposed on the original image. . . . . . . 213
7.11 A DOQQ image with size 5802 × 7560 pixels of the Damascus, Pennsylvania-New York area. . . . . . . 215
7.12 Extracted hydrographic regions from the DOQQ image shown in Figure 7.11. The extracted regions are represented by white pixels and superimposed on the original image. . . . . . . 216
7.13 A ground truth generated based on a 1:24,000 USGS topographic map and DOQQ image. . . . . . . 217
8.1 A stereo image pair and correspondence using the spectral histogram. (a) The left image. (b) The right image. (c)-(e) The matching results of marked pixels in the left image. In each row, the left shows the marked pixel, the middle shows the probability of being a match in the paired image, and the right shows the high probability area in the paired image. . . . . . . 224
8.2 Comparison between an edge detector and the spectral histogram using a natural image of a giraffe. (a) The input image with size 300 × 240. (b) The edge map from a Canny edge detector [13]. (c) The initial classification result using the method presented in Chapter 5. A spectral histogram is extracted at pixel (209, 291) and the segmentation scale is 29 × 29. (d) The initial classification is embedded in the input image to show the boundaries. . . . . . . 226
CHAPTER 1
INTRODUCTION
1.1 Motivations
“Vision is the process of discovering from images what is present in the world and
where it is” (Marr [88], p. 3). Because the act of seeing appears so simple, however, the underlying difficulties of visual information processing were not realized until Marr’s pioneering work on computational vision. According to Marr, the ultimate task of any computer vision system is essentially to “transform” an array of input numerical values into a meaningful description of what a human normally perceives. Figure 1.1(a) shows an image consisting of two texture regions. The texture regions, however, are not at all obvious from the numerical values, a small portion of which is shown in Figure 1.1(c).
While the true “transformation” employed by humans is not known, any algorithm for solving a vision problem attempts to approximate the transformation based on assumptions and constraints specific to the problem to be solved. Existing approaches can thus be categorized according to the problems they address and the assumptions they make. Because of the complexity of the vision process, four problems are widely studied in computer vision, relatively independently of one another: edge detection, stereo
(a) (b)
67 6d 68 65 6a 73 6c 6a 69 75 77 76 74 78 78 7f 7b 7c 7d 7e 87 7b 86 bd a6 85 8e b4 8d 95 8c 82 89 83 8a 84 89 81 94 91 8d
69 6b 63 6f 6c 69 6b 6c 70 6e 6c 6b 74 71 75 71 73 6d 93 8b 7e 7f 81 b7 8f 93 51 c9 a9 c0 80 81 7d 80 7e 8e 7e 8d 85 87 8d
6b 71 6c 6e 6b 6e 67 67 66 67 bb ee ac 81 b8 b6 9c 57 7a 8c 5d 72 84 d0 8b ab b2 8d d0 bd 8f 84 7a 86 7c 85 87 8e 86 92 94
6a 6c 62 6d 6a 6d 6e 73 6c 69 b8 9e a6 6b 4e a5 87 94 51 a7 79 7e 59 ad 80 7e 7a 70 b9 7b 91 85 7d 87 81 89 86 8e 88 91 91
63 5d 69 75 68 6a 64 67 b0 e5 ce 5c 8e 7d 64 6c 98 bd 88 9a 96 92 7e 3d 9b 8c 92 97 87 94 cc 7d 85 7c 87 8d 91 83 87 96 95
6b 65 6a 6e 67 6c 78 94 dd 9e e4 65 84 85 5e 6b 87 b9 a4 ae 9b 9f 98 83 8b 92 5f a6 7f a8 ab 7c 81 84 87 87 8b 8a 8f 8d 93
69 69 72 73 6c 77 66 61 a8 a7 a9 a2 5f a5 63 92 8c a8 a0 97 9c a2 96 56 9d 9d 8a 9f a7 ac b0 81 80 83 87 89 89 8e 93 8b a0
73 6c 70 6b 78 70 70 61 ef e7 d1 a5 82 8e c5 40 7f 9f 9f a2 a3 a6 a0 9d 9e 8f 74 84 9a ca c7 86 80 85 89 86 87 8e 98 90 96
6e 72 6c 6c 6d 75 68 71 dd e0 c6 c4 c2 b0 b6 8e 90 88 9f 9a 9f a3 99 a0 9d a1 70 8a 95 eb b9 7b 84 8a 82 8a 86 89 91 9b 95
68 6c 77 6e 6d 73 72 fb d0 da 9f 8f be c0 a8 aa a0 9f 93 95 a1 a4 9d 9e a8 81 78 7d 9e ce 8e 81 85 8c 83 88 8d 9a 8f 96 9a
71 6f 77 73 72 68 71 e3 e0 dd da c0 a4 d7 a1 a1 85 95 8f 98 a0 af a5 9e a7 8b 4d 9c 86 a7 a6 81 87 85 87 8e 8e 95 93 95 9d
6d 74 72 73 6c 78 67 e6 cf d8 c8 da ad a1 ca b1 fb a4 5f 92 a2 b4 b5 a3 9b 7f 82 b2 4f dd dd ff 88 8d 84 82 90 8e 92 91 97
65 72 70 6f 69 71 6d f5 af e3 d0 c0 c2 b7 a7 b2 a2 a6 7f 79 c8 a5 a2 a6 a3 9a 88 71 59 b6 d9 99 8a 8a 87 89 90 8e 9a 97 9a
71 6e 6b 6f 71 67 78 fc f5 e0 b1 a3 cc a2 d0 a0 ae a0 93 8f ab be a5 a9 a1 a2 87 7e 8a b1 8f 87 86 8a 8d 8c 93 97 94 92 99
72 6d 6c 79 6f 6d 73 df d4 e2 9b 84 c0 b0 9d a0 ce 9b ac bb 9b ac aa 9a a9 90 86 74 94 c6 83 85 88 8c 88 8c 91 90 91 95 99
6f 78 6d 6f 70 73 6c cf f3 ce bb a2 c0 94 c9 bf 94 9a 9d a0 ad ad b3 a6 ad a3 86 7f b7 92 7a 89 91 8c 8f 8c 91 92 8e 99 9a
71 63 70 67 72 71 78 ec bd cb c5 d8 94 a7 a0 b6 90 74 c8 a7 ab a8 aa bb af bc 8d 94 6e dc 8c 8d 88 8f 8c 91 92 93 90 95 9a
6d 6d 6c 74 68 69 6c ff d9 df fa d2 b8 ad 8b b1 b7 cd c8 ad a8 b6 af b4 bd 88 84 7d 8a b4 8a 8b 8f 8c 96 91 91 9d 94 9b 99
6e 6c 70 74 74 6c 6e ff f4 f0 d8 c5 db 8f a1 b1 c3 f6 d5 95 a8 b5 aa b1 e9 e4 78 7d 8c 87 91 87 93 8f 93 8e 90 96 91 96 9c
6e 73 6a 71 6f 72 5e dd f4 ef d5 ba bb aa ab aa ae ae ae 96 ab c1 f9 90 c2 f3 ed c4 81 7f 89 86 8d 86 8a 89 93 96 93 97 9c
73 76 73 6e 6d 6c ff da e6 dc cc b4 ae b5 ae 9e d5 ae ba af 93 a4 b5 ea b1 d3 e7 ff e9 81 87 89 89 8b 8b 8f 8b 90 92 94 a0
72 70 6a 73 74 6f f7 c1 97 d8 c4 bb ab ac a7 af a3 aa b1 c5 e8 c3 af 94 9e b2 d6 cd ee de 85 81 89 84 87 90 94 9f 94 94 97
6b 68 71 74 70 6d 63 5d e5 d8 cb bb b8 ae aa a9 a3 9b a9 ca e3 ff b5 b6 8f 9b af aa b6 eb c1 80 7e 87 8d 8b 91 95 94 95 97
74 70 6d 72 73 70 60 63 cc d7 c0 b3 bf b6 b3 ad a6 92 a3 c8 de ff a7 bd b5 9b 9c 98 9b d0 a6 e5 83 84 88 8a 8a 91 97 99 9e
6c 6e 6e 76 6c 6f 67 68 ff e9 c5 b9 b4 b3 c8 c8 b0 98 a9 b9 d2 ff 82 80 84 87 92 94 98 af aa d1 d9 7b 80 85 8e 87 8e 96 98
69 70 67 70 72 6e 6d 63 5f e1 c9 ba ba b3 f6 cd c2 b4 a4 b9 d3 e8 ff 7e 7e 7b 8a 8d 96 7f 9b b3 d3 d7 6e 86 82 8b 8d 8e 96
6c 64 69 6a 71 6b 76 6c 61 db bd b6 a6 ee d9 d4 c4 b5 ad ac c7 d9 ea 81 79 82 86 8a b5 8b a5 a8 ad c0 dc 76 80 80 8d 91 93
68 64 67 5d 71 65 63 65 5c c3 a9 a9 99 de e5 d5 bd ae a5 ab bb c8 d7 e3 80 7a 7a d2 b9 94 89 92 9f a0 ad d9 7b 7b 7f 87 92
63 6d 6a 5b 58 5e 5c 59 5a 63 8f 9c 76 db de c0 bd b9 b0 a9 b1 c2 bd c9 d4 7a 81 76 6f 83 87 84 8e 95 95 c2 ee 7a 7f 88 87
67 64 59 66 60 66 62 5f 5f 5b 69 63 5f e9 cf b5 b0 ae ae b6 b0 bd c6 bb c9 cb 86 79 85 8a 87 8e 83 89 8e ad d5 e3 7d 79 85
63 68 69 62 67 63 59 67 67 68 5a 5e d6 bd ba b3 aa af ad b4 b7 bd c3 b9 bf c5 d3 cc 75 8f 9a 95 97 91 8c 9d a9 dc f3 83 89
5a 63 59 69 64 63 5d 5c 68 5d 8d d0 b9 ab a9 a7 ac af b5 c2 c3 bd b2 af b5 c2 c0 d2 d4 b7 b9 97 9a 94 91 89 8f ae ff a8 71
61 67 5f 65 5c 5f 59 4f 6d c5 bd ae ab a0 a9 a3 b1 b7 b9 c2 c8 c0 a9 9c a5 ba c6 d4 dc da c1 c6 c8 8a 83 84 91 b3 c7 ff 7d
68 61 5c 5c 5a cb be bb b8 ba aa a8 a2 92 9c 9a a1 bc ba c5 d5 b9 a0 93 a4 b6 c5 de ee ee de c1 cc d9 b2 b7 ac b2 c3 da bd
5f 5c 54 d0 bd b5 b1 ad a7 9f a3 9a 94 9b 8c 98 a5 b6 c7 d2 e6 c7 9c 8b 94 b4 c2 e2 fe f0 e5 dc e7 de df b4 c1 cc d4 f6 f4
55 74 d0 c1 b8 b2 a6 a4 9e 9b 9d 97 8e 8e 91 95 ac b5 cc e4 ef c8 9e 8c 91 b1 c5 ef ff fd f1 e7 ee f1 dd d4 cf cc dc e5 e2
82 ca c4 ba ab a6 9e 9f 9b 99 9a 8e 88 90 99 98 a1 b4 cc ea ff c8 97 86 97 ae c0 f5 ff f4 f6 e9 e6 e2 df d3 d9 dd e3 ea d5
d0 c3 b3 a6 b1 a2 a4 9b 94 8f 92 88 80 8f 95 95 9e b5 cf ee fb ce 98 8c 90 a7 c4 e9 ff f4 eb e4 d9 d0 d8 d1 d7 e2 d6 e6 d9
ce bd b7 ac a7 a0 a3 9a 92 8e 83 90 8a 91 94 9b 9d ba d7 f2 ff d4 9a 87 96 a7 c6 e3 f9 f9 f9 e4 d9 d7 de de ec ee ea f3 e4
(c)
Figure 1.1: A texture image and the corresponding numerical arrays. (a) A texture image with size 128 × 64. (b) A small portion, with size 40 × 30, of (a) centered at pixel (64, 37), which is on the boundary between the two texture regions. (c) Numerical values of (b). To save space, the values are displayed in hexadecimal format.
matching and motion analysis, image segmentation and perceptual organization, and
pattern recognition. Roughly speaking, the techniques for the first three problems
are primarily data-driven, or called bottom-up processes, and pattern recognition is
model-driven, or called top-down process.
Early techniques with successful applications were classification techniques [25], which map a given input into one of a set of pre-defined classes according to a distance measure. However, the set of all possible classes and their variations that we normally perceive is far too large to be represented effectively in any system. Attention in computer vision then shifted to deriving more generic features for arbitrary images. According to information and coding theory, edges, i.e., discontinuities in images, carry more information and exhibit desirable properties such as invariance to luminance changes. Motivated by neurophysiological findings [53], many edge detection algorithms were proposed and studied. Segmentation techniques address the same problem by partitioning an image into homogeneous regions, from which edges and region contours can be obtained straightforwardly and more robustly. These approaches were claimed to be unified [92] through what is called the Mumford-Shah segmentation energy functional [94] (see Chapters 4 and 5). Common to these approaches, images are assumed to consist of piecewise-smooth regions with additive Gaussian noise, which leads to efficient algorithms. To improve performance on real images, multiple scales are generally needed, and linear and nonlinear scale spaces have thus been proposed and studied. Chapter 2 studies segmentation for range images. Chapter 3 studies a new nonlinear smoothing algorithm and addresses some of the problems in nonlinear scale spaces. Chapter 7 applies a nonlinear smoothing algorithm to hydrographic object extraction from remotely sensed images.
While there are useful applications of edge detection and segmentation algorithms, their underlying assumptions limit their success in dealing with natural images. As shown in Figure 1.1, texture regions are neither piecewise smooth nor well modeled by additive Gaussian noise. Figure 1.2 demonstrates that a purely linear system is not sufficient for natural image modeling [63]: the spatial relationships among pixels are more prominent in characterizing the texture regions than the individual pixel values. Clearly, piecewise-smooth regions with additive Gaussian noise are not sufficient, and more sophisticated models are needed to deal with texture images.
Supported by neurophysiological and psychophysical experiments [12][23], the
early processes in the human visual system can be abstractly modeled by filtering
with a set of frequency- and orientation-tuned filters. However, as demonstrated in
Figure 1.2, purely linear filtering is not sufficient; nonlinearity beyond filtering must
be incorporated [87]. Spectral histograms integrate the responses of a chosen bank
of filters through marginal distributions [148][149][150][147]. As demonstrated
in Figure 1.2, spectral histograms are nonlinear. Chapters 4 and 5 apply spectral
histograms to modeling [147], classification, and segmentation of texture as well as
intensity images.
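The computation just described can be sketched as follows. This is only an illustration: the tiny filter bank, the bin count, and the histogram range below are illustrative choices, not the filter bank actually used in Chapters 4 and 5.

```python
import numpy as np

def conv2d_valid(image, f):
    """Minimal 2-D correlation over the valid region (no padding)."""
    fh, fw = f.shape
    H, W = image.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * f)
    return out

def spectral_histogram(image, filters, bins=8, rng=(-1.0, 1.0)):
    """Concatenate the marginal distributions of the filter responses."""
    hists = []
    for f in filters:
        resp = conv2d_valid(image, f)
        h, _ = np.histogram(resp, bins=bins, range=rng)
        hists.append(h / h.sum())  # normalize each marginal to sum to 1
    return np.concatenate(hists)

# A toy bank: the intensity filter plus two derivative (edge) filters.
bank = [np.array([[1.0]]),
        np.array([[-1.0, 0.0, 1.0]]),      # horizontal gradient
        np.array([[-1.0], [0.0], [1.0]])]  # vertical gradient
img = np.random.default_rng(0).random((32, 32))
sh = spectral_histogram(img, bank)         # feature vector of length 3 * 8
```

Because each component is a distribution of filter responses rather than a raw response, the resulting feature is nonlinear in the image, which is the point made by Figure 1.2.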
While edge detection and segmentation techniques are very fruitful, there are
perceptual phenomena that cannot be explained by purely data-driven processes.
Classical examples include virtual contours, which are widely studied by Gestaltists.
Such long-range grouping is known as perceptual organization. Chapter 6 studies
perceptual organization through temporal dynamics.
Because many meaningful objects, such as a human face, cannot be characterized
well using intensity values or even textures, the relationships among some
Figure 1.2: Demonstration of nonlinearity for texture images. (a) A regular texture image. (b) The image in (a) circularly shifted left and downward by 2 pixels in each direction. (c) The pixel-by-pixel average of (a) and (b). The relative variance defined in (3.20) between (a) and (b) is 137, and between (a) and (c) is 69. The distance between the spectral histograms defined in Chapter 4 between (a) and (b) is 1.288 and between (a) and (c) is 38.5762.
primitives need to be modeled. This leads to the need for top-down processes such as
recognition. Clearly, the four problems studied are sub-problems of the vision process,
and the integration among them is critical for a complete vision system. The interaction
between different modules is briefly discussed in Chapter 8.
1.2 Thesis Overview
As discussed above, we study vision problems at different organizational levels
in this dissertation. In Chapter 2, we study the segmentation problem for range
images. Depth is one of the most important cues for visual perception, and range
image segmentation has a wide range of applications. We propose a feature vector consisting of
surface normal, mean and Gaussian curvatures and a similarity measure for range
images. We implemented a system based on oscillatory correlation using a LEGION
(locally excitatory globally inhibitory oscillator network) network. Experimental re-
sults demonstrate that our system is capable of handling different kinds of surfaces.
With the unique properties of a temporal dynamic system, our approach may lead to
a real-time approach for range image segmentation.
In Chapter 3, we propose a new nonlinear smoothing algorithm by incorporating
contextual information and geometrical constraints. Several nonlinear algorithms are
derived as special cases of the proposed one. We have compared the temporal behavior
and boundary detection results of several widely used algorithms, including the proposed
method. The proposed algorithm gives quantitatively good results and exhibits nice
temporal behaviors such as quick convergence and robustness to noise.
In Chapter 4, we propose spectral histograms as a generic statistical feature for
texture as well as intensity images. We demonstrate the properties of spectral histograms
using image synthesis, image classification, and content-based image retrieval. We
also compare with several widely used statistical features for textures and show that
the distribution of local features is critically important for classification while mean
and variance in general are not sufficient. We also propose a model for texture dis-
crimination, which matches the existing psychophysical data well.
Chapter 5 continues the work of Chapter 4. In Chapter 5, the segmentation
problem is studied extensively using spectral histograms. A new energy functional for
segmentation is proposed by making explicit the homogeneity measures. An approx-
imate algorithm is derived, implemented and studied under different assumptions.
Satisfactory results have been obtained using natural texture images.
Chapter 6 studies the problem of perceptual organization and long-range grouping,
which is one level beyond segmentation. By using a boundary-pair representation,
we propose a figure-ground segregation network. Gestalt-like grouping rules are
incorporated by modulating the connection weights in the network. The network can
explain many perceptual phenomena such as modal and amodal completion, shape
composition and perceptual grouping using a fixed set of parameters.
Chapter 7 presents a computational framework for feature extraction from remote
sensing images for map revision and geographic information extraction purposes. A
multi-layer perceptron is used to learn the features to be extracted from examples. A
locally coupled LEGION network is used to achieve accurate boundary localization.
To increase the robustness of the system, a weight adaptation method is used.
Experimental results using DOQQ images show that our system can handle very large
images efficiently and may have a wide range of applications.
Chapter 8 summarizes the contributions of the work presented in this dissertation
and concludes with a discussion of future work.
CHAPTER 2
RANGE IMAGE SEGMENTATION USING A
RELAXATION OSCILLATOR NETWORK
In this chapter, a locally excitatory globally inhibitory oscillator network (LE-
GION) is constructed and applied to range image segmentation, where each oscillator
has excitatory lateral connections to the oscillators in its local neighborhood as well
as a connection with a global inhibitor. A feature vector, consisting of depth, surface
normal, and mean and Gaussian curvatures, is associated with each oscillator and is
estimated from local windows at its corresponding pixel location. A context-sensitive
method is applied in order to obtain more reliable and accurate estimations. The lat-
eral connection between two oscillators is established based on a similarity measure
of their feature vectors. The emergent behavior of the LEGION network gives rise to
segmentation. Due to the flexible representation through phases, our method needs
no assumption about the underlying structures in image data and no prior knowledge
regarding the number of regions. More importantly, the network is guaranteed to con-
verge rapidly under general conditions. These unique properties lead to a real-time
approach for range image segmentation in machine perception. The results presented
in this chapter appeared in [83].
2.1 Introduction
Image segmentation has long been considered in machine vision as one of the fun-
damental tasks. Range image segmentation is especially important because depth
is one of the most widely used cues in visual perception. Due to its practical
importance, many techniques have been proposed for range image segmentation,
and they can be roughly classified into four categories: 1) edge-based algorithms
[91][6][136]; 2) region-based algorithms [129][4][71][58][51]; 3) classification-based
approaches [56][49][67][51][5]; and 4) global optimization of a function [73].
Edge-based algorithms first identify the edge points that signify surface discon-
tinuity using certain edge detectors, and then try to link the extracted edge points
together to form surface boundaries. For example, Wani and Batchelor [136] intro-
duced specialized edge masks for different types of discontinuity. Because critical
points, such as junctions and corners, could be degraded greatly by edge detectors,
they are extracted in an additional stage. Then surface boundaries are formed by
growing from the critical points. As we can see, many application-specific heuris-
tics must be incorporated in order to design good edge detectors and overcome the
ambiguities inherent in linking.
Region-based algorithms are essentially similar to the region-growing and split-and-merge
techniques for intensity images [152], but with more complicated criteria to
incorporate surface normals and curvatures, which are critical for range image
segmentation. A commonly used method is iterative surface fitting [4][71][51]. Pixels are first
coarsely classified based on the sign of mean and Gaussian surface curvature and seed
regions are formed based on initial classification. Neighboring pixels will be merged
into an existing surface region if they fit into the surface model well. This procedure
is done iteratively. As pointed out by Hoffman and Jain [49], the major disadvantage is
that many parameters are involved. Also, a good surface model that fits the
range image data must be provided in order to obtain good results.
In classification-based approaches, range image segmentation is posed as a vector
quantization problem. Each pixel is associated with an appropriate feature vector.
The center vector for each class can be obtained by applying some clustering algo-
rithms which minimize a certain error criterion [49] or alternatively by training [67][5].
Then each pixel is associated with the closest cluster center. The segmentation re-
sult from the classification can be further refined by a merging procedure similar
to region-growing [49][67][51][5]. One of the limitations of classification-based ap-
proaches is that the number of regions must be given a priori, which, generally, is not
available. Koh et al. [67] and Bhandarkar et al. [5] tried to address this issue by using
a hierarchical self-organizing network for range image segmentation. At each level, a
self-organizing feature map (SOFM) [68] is used to segment range images into a given
number of regions. An abstract tree [67] is constructed to represent the output of the
hierarchical SOFM network. The final segmentation is obtained by searching through
the abstract tree, which is sequential and similar to a split-and-merge method. Thus
the solution suffers from the disadvantages of region-based algorithms. In addition,
the problem of prior specification of the number of regions is not entirely solved because
the number of regions for each level still needs to be specified.
A more fundamental limitation common to all region- and classification-based ap-
proaches is that the representation is too rigid, i.e., different regions are represented
through explicit labels, which forces the approaches to be sequential to a large extent.
Energy minimization techniques [35] can be inherently parallel and distributed and
have been widely used in image classification and segmentation. In this framework,
solutions are found by minimizing energy functions using relaxation algorithms1. Li
[73] constructed a set of energy functions for range image segmentation and recogni-
tion by incorporating surface discontinuity through mean and Gaussian curvatures.
Minimization algorithms were obtained based on regularization techniques [124] and
relaxation labeling algorithms [112][55]. While the approach was quite successful, the
main problem is that the algorithms are too computationally expensive for real-time
applications [88][36].
In this chapter, we use a novel neural network for segmenting range images, which
overcomes some of the above limitations. Locally excitatory globally inhibitory oscil-
lator network (LEGION) [123][134][135] provides a biologically plausible framework
for image segmentation in general. Each oscillator in the LEGION network connects
excitatorily with the oscillators in its neighborhood as well as inhibitorily with a
global inhibitor. For range image segmentation, the feature detector associated with
each oscillator estimates the surface normal and curvatures at its corresponding pixel
location. The lateral connection between two oscillators is set at the beginning based
on a similarity measure between their feature vectors. The segmentation process is
the emergent behavior of the oscillator network. Because the results are encoded flex-
ibly through oscillator phases, segmentation is inherently parallel and no assumption
about the underlying structures in image data is needed.
1 “Relaxation” as used in the relaxation labeling technique [112] refers to an optimization technique in which globally optimal solutions can be obtained by satisfying local constraints. This is very different from the term “relaxation” as used in relaxation oscillators, where it refers to the change of activity on a slow time scale (see Section 2.2).
The rest of this chapter is organized as follows. Section 2.2 gives an overview of the
LEGION network. Section 2.3 presents feature vector estimation for range images.
Section 2.4 provides experimental results and comparisons with other approaches.
Section 2.5 justifies the biological plausibility of this approach and gives a comparison
with pulse-coupled neural networks (PCNN). This chapter appeared in [83].
2.2 Overview of the LEGION Dynamics
A fundamental problem in image segmentation is to group similar input elements
together while segregating different groups. Temporal correlation was hypothesized as
a representational framework for brain functions [90][130], which has received direct
support from neurophysiological findings of stimulus-dependent oscillatory activities
in the visual cortex [26][40]. The LEGION network [123][134], which is based on
oscillatory correlation, was proposed as a computational model to address problems
of image segmentation and figure-ground segregation [135]. In this section, after
briefly introducing the single oscillator model, we summarize the properties of the
network and demonstrate the network’s capability using a synthetic image.
2.2.1 Single Oscillator Model
As the building block of LEGION, a single oscillator i is defined as a feedback loop
between an excitatory unit ui and an inhibitory unit vi [135] with one modification:
dui/dt = 3ui − ui^3 + 2 − vi + IiH(pi − θ) + Si + ρ    (2.1a)
dvi/dt = ε(γ(1 + tanh(ui/β)) − vi)    (2.1b)
Here H stands for the Heaviside step function. Ii represents the external stimulation
for oscillator i, which is assumed to be applied at time 0. ρ denotes a
Gaussian noise term, introduced to test the robustness of the system; it also plays
an active role in desynchronization. Si denotes the coupling from other oscillators in
the network, and pi the potential of oscillator i. θ is a threshold, where 0 < θ < 1. The only
difference between the definition in [135] and the one given here is that the term
H(pi + exp(−at) − θ) in [135] is replaced by H(pi − θ) in (2.1a) so that the Heaviside
term in (2.1) does not depend explicitly on time.
The parameter ε is a small positive number. In this case, equation (2.1), without
any coupling or noise but with constant stimulation, corresponds to a standard relax-
ation oscillator, similar to the van der Pol oscillator [128]. The u-nullcline (du/dt = 0)
of (2.1) is a cubic curve, and the v-nullcline (dv/dt = 0) is a sigmoid function, the
steepness of which is specified by β. If I > 0 and H = 1, which corresponds to a
positive stimulation, (2.1) produces a stable periodic orbit for all sufficiently small
values of ε, alternating between left and right branches of near steady-state behavior.
Figure 2.1 shows a stable limit cycle along with nullclines. The oscillator travels on
left and right branches on a slow time scale because the driving force is mainly from
the inhibitory unit and is weak, while the transition between two branches occurs
on a fast time scale because the driving force is mainly from the excitatory unit and
is strong. The slow and fast time scales result from a small ε and produce highly
nonlinear activities.
The parameter γ is used to control the ratio of the times that an oscillator spends
on the right and left branches. Figure 2.2 shows the temporal activities of the excita-
tory unit for γ = 3 and γ = 40. If γ is chosen to be large, the output will be closer to
Figure 2.1: A stable limit cycle for a single relaxation oscillator. The thick solid line represents the limit cycle and thin solid lines stand for nullclines. Arrows are used to indicate the different traveling speeds resulting from the fast and slow time scales. The following parameter values are used: ε = 0.02, β = 0.1, γ = 3.0, and a constant stimulus I = 1.0.
neural spikes. Based on this observation, the relaxation oscillator can also be viewed
as a neuronal spike generator.
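As a concrete illustration, equation (2.1) can be integrated numerically for a single stimulated oscillator. This is only a sketch under stated assumptions: it uses a simple forward-Euler scheme, drops the coupling (Si = 0) and noise (ρ = 0) terms, and takes the Heaviside term to be 1, i.e., it assumes a suprathreshold stimulus; the step size and initial condition are illustrative choices.

```python
import numpy as np

def simulate_oscillator(I=1.0, eps=0.02, beta=0.1, gamma=3.0,
                        dt=0.01, steps=40000):
    """Forward-Euler integration of the single oscillator (2.1),
    with S = 0, rho = 0, and H(p - theta) = 1 (stimulated case)."""
    u, v = -2.0, 0.0                     # start on the left branch
    us = np.empty(steps)
    for t in range(steps):
        du = 3*u - u**3 + 2 - v + I      # (2.1a) without S and rho
        dv = eps * (gamma * (1 + np.tanh(u / beta)) - v)   # (2.1b)
        u, v = u + dt * du, v + dt * dv
        us[t] = u
    return us

# u alternates between the left (u < -1) and right (u > 1) branches,
# lingering on each branch and jumping quickly between them.
us = simulate_oscillator()
```

Increasing γ (e.g., γ = 40) shrinks the fraction of the cycle spent on the right branch, so the trace approaches the spike-like output of Figure 2.2(b).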
2.2.2 Emergent Behavior of LEGION Networks
The LEGION network consists of identical relaxation oscillators given by equation
(2.1) and a global inhibitor. Each oscillator is excitatorily coupled with oscillators in
its local neighborhood. The coupling term, Si, is defined as:
Si = ∑_{k ∈ Nc(i)} Wik H(uk − θu) − Wz H(Z − θz)    (2.2)
Figure 2.2: The temporal activities of the excitatory unit of a single oscillator for different γ values. Other parameters are the same as for Figure 2.1. (a) γ = 3.0. (b) γ = 40.0.
Here θu and θz are thresholds. Wik is the connection weight from oscillator k to
i, and Nc(i), the coupling neighborhood of i, is the set of neighboring oscillators of
i. An oscillator sends excitation to its neighbors only when its activity is higher
than the threshold θu; in that case, we say it is in the active phase. Otherwise, it
is in the silent phase. This results in Heaviside coupling.
With excitatory coupling, the distance between two oscillators must decrease at
an exponential rate when they travel on the same branch. During jumps between
the left and right branches, the time difference, i.e., the time needed to travel the
distance between them, is compressed even though the Euclidean distance does
not change. These two factors imply that the two oscillators get
synchronized at an exponential rate [123][134][135].
The global inhibitor Z leads to desynchronization among different oscillator groups
[123][134]. In (2.2), Wz is the weight of inhibition from the global inhibitor Z, which
is defined by:
dZ/dt = φ(σ∞ − Z)    (2.3)

where σ∞ = 1 if ui ≥ θz for at least one oscillator i, and σ∞ = 0 otherwise. Under
this condition, only an oscillator in the active phase can trigger the global inhibitor.
When one oscillator group is active, it suppresses the other groups from jumping
up by activating the global inhibition. Architecturally, the global inhibitor imposes
constraints on the entire network and effectively reduces the collisions from local
coupling. This is a main reason why LEGION networks are far more efficient than
purely locally coupled cooperative processes, including relaxation labeling techniques
[112][55][35], which tend to be very slow [88][36].
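The coupling rule (2.2) is easy to state in array form. The sketch below evaluates Si on a 2-D grid with the eight-nearest-neighbor architecture of Figure 2.3 (no wrap-around). The weight layout, with one weight stored per neighbor direction, and the parameter values are illustrative choices, not the ones prescribed by the chapter.

```python
import numpy as np

# Eight-nearest-neighbor offsets (row, column), as in Figure 2.3.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def coupling(u, W, Z, theta_u=-0.5, theta_z=0.1, Wz=2.0):
    """S_i of (2.2): summed excitation from active neighbors minus global
    inhibition. W[i, j, k] is the weight from the k-th neighbor of
    oscillator (i, j); boundary oscillators simply have fewer neighbors."""
    H, Wd = u.shape
    active = (u > theta_u).astype(float)      # H(u_k - theta_u)
    S = np.zeros_like(u)
    for k, (di, dj) in enumerate(OFFSETS):
        shifted = np.zeros_like(u)            # neighbor activity at offset (di, dj)
        shifted[max(0, -di):H - max(0, di), max(0, -dj):Wd - max(0, dj)] = \
            active[max(0, di):H - max(0, -di), max(0, dj):Wd - max(0, -dj)]
        S += W[:, :, k] * shifted
    return S - Wz * float(Z > theta_z)        # -W_z H(Z - theta_z)

# One active oscillator at the center of a 3 x 3 grid, unit weights:
# each of its eight neighbors receives excitation 1, the center none.
u = np.full((3, 3), -1.0)
u[1, 1] = 0.0                                 # above theta_u, hence active
S = coupling(u, np.ones((3, 3, 8)), Z=0.0)
```

When the global inhibitor is active (Z above θz), every Si is reduced by Wz, which is how an active group suppresses the others from jumping up.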
The network we use for image segmentation is two dimensional. Figure 2.3 shows
the architecture employed throughout this chapter, where each oscillator is connected
with its eight immediate neighbors, except those on the boundaries where no wrap-
around is used. To allow the network to distinguish between major regions and noisy
fragments, the potential of the oscillator i, pi, is introduced. The basic idea is that a
major region must contain at least one leader. A leader is an oscillator that receives
large enough lateral excitation from its neighborhood, i.e. larger than θp (called the
potential threshold). Such a leader can develop a high potential and lead the activation
of the oscillator block corresponding to an object. A noisy fragment does not contain
such an oscillator and thus cannot remain active after an initial period. See
[135] for a detailed discussion.
We illustrate the capability of the network for image segmentation using a binary
synthetic image. The original image is shown in Figure 2.4(a). It has 40× 40 pixels
and consists of seven geometric objects. Figure 2.4(b) shows the corrupted version
obtained by adding 10% noise: each pixel has a 10% chance of flipping its value. The
corrupted image is presented to a 40 × 40 LEGION network. Each oscillator in the network starts
with a random phase, shown in Figure 2.4(c). Figures 2.4(d) - (j) show the snapshots
of the network activity when the oscillators corresponding to one object are in the
active phase. All the oscillator groups corresponding to the objects are popped out
alternately after five cycles. Figure 2.5 shows the temporal activities of all the stim-
ulated oscillators, grouped together by objects, and a background, which consists of
those stimulated oscillators that do not correspond to any object. From Figure 2.5,
we can see that desynchronization among seven groups as well as synchronization
within each object is achieved quickly, here after five cycles. Furthermore, only the
Figure 2.3: Architecture of a two-dimensional LEGION network with eight-nearest-neighbor coupling. An oscillator is indicated by an empty ellipse, and the global inhibitor is indicated by a filled circle.
oscillators belonging to an object can stay oscillating and all other noisy fragments
are suppressed into the background after two cycles.
Formally, it has been proved [123][135] that the LEGION network can achieve
synchronization among the oscillators corresponding to the same object and desyn-
chronization between groups corresponding to different objects quickly under general
conditions. In particular, synchronization within an object and desynchronization
among different objects can be achieved in NM + 1 cycles, where NM is the number
of the major regions corresponding to distinct objects.
2.3 Similarity Measure for Range Images
Given the LEGION dynamics, the main task is to establish lateral connections
based on some similarity measure. Intuitively, neighboring pixels belonging to the
same surface should be similar and thus should have strong connections with each
other in the LEGION network. To be effective for range image segmentation, we
consider different types of surfaces. For planar surfaces, the surface normal is generally
homogeneous while the depth value may vary greatly. For cylindrical, conical, and
spherical surfaces, the surface curvature does not change much while both the depth value and
surface normal may. Based on these observations, the similarity measure should de-
pend on surface normal and curvatures in addition to raw depth data. Similar to [67],
we chose (z, nx, ny, nz, C, G) as the feature vector for each oscillator. Here z is the
depth value, (nx, ny, nz) is the surface normal, and C and G are the mean and
Gaussian curvatures at the corresponding pixel location. The depth value is directly available
from image data. Surface normal at each pixel is estimated from the depth values of
a neighboring window, and curvatures are further derived from surface normal.
Figure 2.4: Illustration of LEGION dynamics. (a) An input image consisting of seven geometric objects, with 40 × 40 pixels. (b) The corrupted image of (a) obtained by adding 10% noise, which is presented to a 40 × 40 LEGION network. (c) A snapshot of the network activity at the beginning. (d)-(j) Subsequent snapshots of the network activity. In (c)-(j), the grayness of a pixel is proportional to the corresponding oscillator’s activity and black pixels represent oscillators in the active phase. The parameter values for this simulation are as follows: ε = 0.02, β = 0.1, γ = 20.0, θx = −0.5, θp = 7.0, θz = 0.1, θ = 0.8, and Wz = 2.0.
Figure 2.5: Temporal evolution of the LEGION network. The upper seven plots show the combined temporal activities of the seven oscillator blocks representing the corresponding geometric objects (circle, parallelogram, triangle, rectangle, ellipse, trapezoid, and staircase). The eighth plot shows the temporal activities of all the stimulated oscillators which correspond to the background. The bottom one shows the temporal activity of the global inhibitor. The simulation took 20,000 integration steps using a fourth-order Runge-Kutta method to solve the differential equations.
Formally, each oscillator is associated with a feature detector to estimate the
normal and curvature values at the corresponding pixel location. Based on the work
in [4], (nx, ny, nz) is calculated from the first-order partial derivatives:
(nx, ny, nz) = (∂z/∂x × ∂z/∂y) / ‖∂z/∂x × ∂z/∂y‖    (2.4)
and the partial derivatives are estimated using the following formula:
∂z/∂x ≈ Dx = d0 · d1^T    (2.5a)
∂z/∂y ≈ Dy = d1 · d0^T    (2.5b)
Here, T indicates transpose, d0 is the equally weighted average operator, and d1 is
the least-squares derivative estimator. For a 5 × 5 window, they are given by:

d0 = (1/5)[1, 1, 1, 1, 1]^T    (2.6a)
d1 = (1/10)[−2, −1, 0, 1, 2]^T    (2.6b)
This normal estimation method works well if the estimation window is within one
surface. When the window crosses different surfaces, especially ones with very differ-
ent depth values, the estimation results tend to be inaccurate. In order to improve the
results near surface boundaries, we require that the pixels in the estimation window
be within the same context, following the edge-preserving noise-smoothing quadrant
filter [52] and the context-sensitive normal estimation method [144]. However,
both methods require edge information, which is not available for range images. To
be more applicable, here we define that two pixels are within the same context if their
difference in depth value is smaller than a given threshold. This definition captures
the significant edge information in range images and works well. When estimating the
first-order derivatives, d0 and d1 are applied only to pixels that are within the same
context with respect to the central pixel. These operators are called context-sensitive
operators.
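Equations (2.4)-(2.6) can be transcribed directly. The sketch below implements only the plain (non-context-sensitive) separable estimation, and reads the cross product in (2.4) as that of the tangent vectors (1, 0, ∂z/∂x) and (0, 1, ∂z/∂y), which gives n ∝ (−∂z/∂x, −∂z/∂y, 1); this reading is an assumption, as is the axis orientation.

```python
import numpy as np

# The 5-point operators of (2.6): equally weighted average d0 and
# least-squares derivative estimator d1.
d0 = np.full(5, 1.0 / 5.0)
d1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0]) / 10.0

def separable(img, row_op, col_op):
    """Apply col_op along each row, then row_op along each column
    (valid region only): the outer-product mask row_op * col_op^T."""
    tmp = np.apply_along_axis(lambda r: np.correlate(r, col_op, 'valid'), 1, img)
    return np.apply_along_axis(lambda c: np.correlate(c, row_op, 'valid'), 0, tmp)

def normals(z):
    """Unit surface normals on the valid interior of a range image z."""
    zx = separable(z, d0, d1)   # Dx = d0 * d1^T: derivative across columns
    zy = separable(z, d1, d0)   # Dy = d1 * d0^T: derivative across rows
    n = np.stack([-zx, -zy, np.ones_like(zx)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# On a planar surface z = 0.5x + 0.25y the least-squares estimate is exact:
# every interior normal is (-0.5, -0.25, 1) normalized.
z = np.fromfunction(lambda i, j: 0.25 * i + 0.5 * j, (9, 9))
n = normals(z)
```

The context-sensitive variant described above would additionally mask out, within each window, the pixels whose depth difference from the central pixel exceeds the threshold before applying d0 and d1.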
There are two ways to estimate the surface curvature. First, it can be estimated
by breaking the surface up into curves and measuring the curvature of these curves.
Second, it can be estimated by measuring changes in the surface normal. Following [56], the
surface curvature at point p in the direction of point q is estimated by:
κp,q = ‖np − nq‖ / ‖p − q‖     if ‖p − q‖ ≤ ‖(np + p) − (nq + q)‖
κp,q = −‖np − nq‖ / ‖p − q‖    otherwise    (2.7)
where p and q refer to the 3-D coordinate vectors of the corresponding pixels, and np
and nq are the unit normal vectors at points p and q respectively, which are available
from (2.4). Here the 3-D coordinate vector of a pixel is composed of its 2-D location in
the image and its depth value. The condition in (2.7) is to assign a positive curvature
value for pixels on a convex surface and a negative one for pixels on a concave surface.
Now the mean curvature Ci of oscillator i is estimated as the average of all possible
values with respect to a neighborhood Nk(i) of oscillator i. The Gaussian curvature
Gi is estimated as the product of the minimum and maximum values with respect
to the neighborhood.
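A direct transcription of (2.7) and of the mean and Gaussian estimates is given below; how the directional curvatures are collected over the neighborhood is left to the caller, and the sphere example is only a sanity check.

```python
import numpy as np

def directional_curvature(p, q, n_p, n_q):
    """Curvature at p in the direction of q, per (2.7): magnitude from the
    change in unit normals over distance, sign from the convexity test."""
    mag = np.linalg.norm(n_p - n_q) / np.linalg.norm(p - q)
    convex = np.linalg.norm(p - q) <= np.linalg.norm((n_p + p) - (n_q + q))
    return mag if convex else -mag

def mean_gaussian(kappas):
    """Mean curvature C as the average of the directional estimates and
    Gaussian curvature G as the product of their extremes."""
    kappas = np.asarray(kappas, dtype=float)
    return kappas.mean(), kappas.min() * kappas.max()

# Two points on a sphere of radius 2 with outward unit normals p / 2:
# the directional curvature recovers 1 / R = 0.5 with a convex (positive) sign.
p, q = np.array([2.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])
kappa = directional_curvature(p, q, p / 2.0, q / 2.0)
```

On a concave surface the normals tilt toward each other, the inequality in (2.7) fails, and the same computation returns a negative value, as the text requires.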
In summary, the similarity measure between two oscillators is given by: Wik =
255/(Ψ(i, k) + 1), and the lateral connections are set up accordingly. Here Ψ(i, k) is
a disparity measure between two oscillators, and is given by:
Ψ(i, k) = Kz|zi − zk| + Kn‖ni − nk‖    if |Ci| ≤ Tc
Ψ(i, k) = Kc|Ci − Ck| + KG|Gi − Gk|    otherwise    (2.8)
Here Kz, Kn, Kc, and KG are weights for different disparity measures. Tc is a thresh-
old for planar surface testing.
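With the feature vector (z, nx, ny, nz, C, G) in hand, the weight computation is a thin layer on top of (2.8). The values of Kz, Kn, Kc, KG, and Tc below are placeholders, since the chapter does not list the values actually used; treat them as illustrative.

```python
import numpy as np

def disparity(fi, fk, Kz=1.0, Kn=1.0, Kc=1.0, KG=1.0, Tc=0.1):
    """Psi(i, k) of (2.8) for feature vectors f = (z, nx, ny, nz, C, G).
    All weights and the planarity threshold Tc are illustrative values."""
    zi, ni, Ci, Gi = fi[0], np.asarray(fi[1:4]), fi[4], fi[5]
    zk, nk, Ck, Gk = fk[0], np.asarray(fk[1:4]), fk[4], fk[5]
    if abs(Ci) <= Tc:   # near-planar: compare depth and surface normal
        return Kz * abs(zi - zk) + Kn * np.linalg.norm(ni - nk)
    return Kc * abs(Ci - Ck) + KG * abs(Gi - Gk)   # curved: compare curvatures

def weight(fi, fk, **kw):
    """Lateral connection weight W_ik = 255 / (Psi(i, k) + 1)."""
    return 255.0 / (disparity(fi, fk, **kw) + 1.0)

# Identical features give the maximal weight of 255; any disparity lowers it.
f = (1.0, 0.0, 0.0, 1.0, 0.05, 0.0)
w_same = weight(f, f)
```

Because Wik decreases monotonically with Ψ(i, k), jump, crease, and curvature edges all translate into weak lateral connections, which is exactly how the discontinuities are encoded in the network.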
2.4 Experimental Results
For real images with a large number of pixels, solving the differential equations
(2.1)-(2.3) directly involves very intensive numerical computation. To reduce the
computation on a serial computer, an algorithm was extracted from these equations
[135]. The algorithm follows the major steps of the original network and exhibits
the essential properties of relaxation
oscillator networks. It has been applied to intensity image segmentation [135].
2.4.1 Parameter Selection
Most parameters in (2.1)-(2.3) are intrinsic to LEGION networks and need not
be changed once they are appropriately chosen; only the global inhibition Wz
and the potential threshold θp need to be tuned for applications. Other parameters
are application-specific and relate to how the similarity between feature
vectors is measured. Theoretically, the best parameter values could be obtained
through training using a neural network. Experimentally, each parameter roughly
corresponds to a threshold for a type of discontinuity, and can be set accordingly.
In (2.8), Kz captures abrupt changes in depth, and Kn, Kc, and KG capture
boundaries between different types of surfaces.
There are several local windows involved. The system is not sensitive to the
window sizes, which can be chosen from a wide range. For the experiments in this work,
we use a single system configuration. That is, the surface normal is estimated from
a 5 × 5 neighboring window, the curvatures are estimated from a 3 × 3 neighboring
window, and each oscillator has lateral connections with its eight immediate neighbors
as depicted in Figure 2.3.
2.4.2 Results
From an algorithmic point of view, a system for image segmentation must incor-
porate discontinuity in input data which gives rise to region boundaries. For range
images, there are mainly three types of discontinuity. Jump edges occur when depth
values are discontinuous due to occlusion while crease and curvature edges occur
when different surfaces intersect with each other. Crease edges correspond to sur-
face normal discontinuity while curvature edges are due to curvature discontinuity
where surface normals are continuous. Lateral connection weights implicitly encode
the discontinuity through equation (2.8), which will be demonstrated using real range
images.
Range images shown in Figures 2.6-2.10 were obtained from the Pattern Recog-
nition and Image Processing Laboratory at Michigan State University. These images
were generated using a Technical Arts 100X white scanner, which is a triangulation-
based laser range sensor. Except for the global inhibition Wz and the potential
threshold θp, which define a scale space (see below), all other parameters are in gen-
eral fixed, as is the case for all the range images shown in this chapter. For the range
images shown here, a fixed level of global inhibition and potential threshold works well
except for images with thin regions like those shown in Figure 2.9. Because the image
size is rather small compared to a 5 × 5 normal estimation window, a popped-out
region is extended by one pixel within the same context. This simple post-processing
is done only for reducing the boundary effect in the final results.
Figure 2.6 shows a complete example from our network. Figure 2.6(a) shows a
range image of a column. From its 3-D object model (the original 3-D construction)
shown in Figure 2.6(i), it consists of a cylinder and a rectangular parallelepiped. From
the viewpoint where the image is taken, four planar surfaces and a cylindrical one are
visible. One planar surface is missing from the range image due to sampling artifacts
and shadows. Figures 2.6(b)-2.6(f) show the output of the LEGION network. We
can see that all the visible surface regions in the image, including the background,
are popped up. Oscillators belonging to the same surface region are synchronized
due to the strong lateral connections resulting from similar feature vectors and thus
are grouped together correctly. The segregation of these regions is achieved because
of the weak lateral connections resulting from jump and crease edges. Figure 2.6(g)
shows the overall result by putting the popped up regions together into a gray map
[135], where each region is shown using a single gray level. The boundaries between
different surfaces are shown fairly accurately, demonstrating that surface discontinuity
is effectively captured through lateral connections. Small holes in Figures 2.6(d) and
2.6(g) are due to some noise in the input range image. Figure 2.6(h), the
intensity image, is included for the discussion in Section 2.5.
Figures 2.7 and 2.8 show the segmentation results for several more real range
images, which include different types of edges and junctions. These results were
produced using the same parameter values as in Figure 2.6. In Figure 2.7(a), an
object with only planar surfaces is segmented into four surface regions. Boundaries
between surfaces are precisely marked, showing that crease edges between planar surfaces
are correctly represented. The junction point where three surfaces intersect is
handled correctly without additional modeling, which would be difficult for edge-based
approaches. Figure 2.7(b) shows an object with several planar surfaces and a cylin-
drical one. As in Figure 2.7(a), all the planar surfaces are segmented out precisely.
The cylindrical surface is segmented out correctly even though it is not complete
Figure 2.6: Segmentation result of the LEGION network for the range image of a column. (a) The input range image. (b) The background region. (c)-(f) The four segmented regions. (g) The overall segmentation result represented by a gray map. (h) The corresponding intensity image. (i) The 3-D construction model. As in Figure 2.4, black pixels in (b)-(f) represent oscillators that are in the active phase.
because of the shadowing effect. Both jump and crease edges are marked correctly in
the result. Figure 2.7(c) shows an image which consists of two objects. All the sur-
faces are correctly segmented out. Figure 2.7(d) shows a cylinder. The transition
between the two surfaces is smooth, and it would be difficult to segment even manually.
The system correctly segmented the two surface regions, and the boundary between
them is marked where we would expect. Figure 2.8(a) shows another image with a
smooth transition; because of that, curvatures were used to segregate the surface
regions. Figure 2.8(b) shows a funnel. Even though the neck is a thin surface, it is
segmented out correctly, although the boundary effect is more obvious. Figures 2.8(a) and
2.8(b) demonstrate that the system can handle conic surfaces, and Figure 2.8(c) shows
that the network can handle spherical surfaces.
Figure 2.9 shows two difficult examples with thin regions compared to the size of
the normal estimation windows. When using the same parameters as in Figures 2.7
and 2.8, the results are not very satisfactory. This is because the thin parts do not
contain a leader and the normal and curvature estimates are not very reliable due to
the smooth transitions between surfaces. After tuning the potential threshold and
global inhibition, both images are processed correctly. All the surfaces are segmented
out with boundaries marked correctly, as shown in Figure 2.9.
These examples show clearly that the LEGION network can handle planar, cylin-
drical, conic, and spherical surfaces and different types of surface discontinuity. Only
a few parameters need to be tuned for all these images, which demonstrates that our
network is a robust approach to range image segmentation.
Figure 2.7: Segmentation results of the LEGION network for range images. In each row, the left frame shows the input range image, the middle one shows the segmentation result represented by a gray map, and the right one shows the 3-D construction model for comparison purposes.
Figure 2.8: Segmentation results of the LEGION network for several more range images. See the caption of Figure 2.7 for arrangement.
Figure 2.9: Two examples with thin regions. The global inhibition and potential threshold are tuned to get the results shown here. See the caption of Figure 2.7 for arrangement.
2.4.3 Comparison with Existing Approaches
As demonstrated using real range images, lateral connections in our LEGION net-
work capture different types of discontinuity as well as similarity within each surface
region. Both factors are critically important for correct segmentation. Discontinuity
avoids under-segmentation while similarity avoids over-segmentation. Critical points,
such as junctions between surfaces, are handled correctly due to the context-sensitive
estimation. In these respects, our method is closest to edge-based
approaches [91][6][136]. For region- and classification-based approaches [49][4][71][67][5],
because in most cases many unspecified parameters and pre- and post-processing steps are
involved, a quantitative comparison as suggested by Hoover et al. [51] is not possible. A
qualitative comparison on similar images is used instead, which may still be suggestive.
For the range images used in this chapter, we believe that our results in general are
comparable with the best results from other methods. For the two images whose
similar versions are also used elsewhere, we give a more detailed comparison below.
Cup images similar to Figure 2.9(a) were also used in [49][4][71]. In [49], a cup
image was first classified into two patch images through three different tests. The
two patch images were merged to form another patch image, based on which the
final classification was generated. In the final result, the handle was classified as
background, resulting in wrong topology. In [4], a coffee mug was segmented into six
bivariate polynomial surfaces with complicated coefficients, which were obtained by
surface fitting after curvature-based initial classification. The handle was incorrectly
segmented into two surfaces. In [24], a slightly different cup image was segmented into
five parts through an iterative regression method, where the handle was broken into
three parts. In our case, Figure 2.9(a) shows that the cup was segmented correctly into
two regions, the body and the handle. This qualitative comparison suggests that only
our approach produced a correct segmentation, without under- or over-segmentation.
An image similar to Figure 2.6(a) was also used in [5]. At the lower layers of
the hierarchical network, the image was broken into small patches because a position
term, not relevant to surface properties, is included in the feature vector in order to
achieve spatial connectivity [67][5]. Due to the position term, even planar surfaces
need several layers to be correctly segmented. More complicated surfaces, such as
cylindrical and conic ones, tend to be broken into small regions. As shown in [5],
an image may be correctly classified only when the correct number of regions in a
layer is given. This problem is not solved even using a hierarchical network. To
produce meaningful results, segmentation from different layers must be combined.
Because of that, these neural network based approaches [67][5] in general become
split-and-merge algorithms.
Methodologically, our approach has several major advantages over other methods.
Firstly, the segmentation result is the emergent behavior of the LEGION network; the
flexible representation of temporal labeling promises to provide a generic solution to
image segmentation. Different cues, such as depth and color (see Figures 2.6(a) and
2.6(h)) are highly correlated and should be integrated to improve the performance
of a machine vision system. LEGION provides a unified and effective framework to
implement the integration through lateral connections. On the other hand, the inte-
gration would be difficult for segmentation algorithms [56][49][4][136][67][5]; different
information must be modeled explicitly and encoded accordingly. Secondly, due to
the non-linearity of single oscillators and the effective architecture for modeling local
and global coupling, our network is guaranteed to converge very efficiently. To the
best of our knowledge, this is the only oscillatory network that has been analyzed
rigorously.
Finally, our approach is a dynamical system with parallel and distributed computa-
tion. The unique property of neural plausibility makes it particularly feasible for
analog VLSI implementation, the efficiency of which may pave the way to achieve
real-time range image segmentation.
2.5 Discussions
A LEGION network has been constructed and applied to range image segmenta-
tion. Our network successfully segments real range images, which shows that it may
give rise to a generic range image segmentation method.
2.5.1 Biological Plausibility of the Network
The relaxation oscillator used in LEGION is similar to many oscillators used in
modeling neuronal behaviors, including FitzHugh-Nagumo [31][96] and Morris-Lecar
model [93], and is reducible from the Hodgkin-Huxley equations [48]. The local
excitatory connections are consistent with various lateral connections in the brain
and can be viewed as the horizontal connections in the visual cortex [37]. The global
inhibitor, which receives input from the entire network and feeds back with inhibition,
can be viewed as a network that exerts some form of global control. Oscillatory
correlation, as a special form of temporal correlation [90][130], also conforms with
experiments of stimulus-dependent oscillations, where synchronous oscillations occur
when the stimulus is a coherent object, and no phase locking exists when regions are
stimulated by different stimuli [26][40][116].
As stated earlier, the LEGION network can produce segmentation results within
a few cycles. This analytical result may suggest a striking agreement with human
performance in perceptual organization. It is well-known that human subjects can
perform segmentation (figure-ground segregation) tasks in a small fraction of a second
[7], corresponding to several cycles of oscillations if the frequency is taken to be around
40 Hz as commonly reported. Thus, if our oscillator network is implemented in real-
time hardware, the time it takes the network to perform a segmentation task would
be compatible with human performance.
2.5.2 Comparison with Pulse-Coupled Neural Networks
Both LEGION and PCNN networks were proposed based on the temporal cor-
relation theory [90][130] and experimental data of stimulus-dependent oscillations
[26][40]. Single units in both models approximate nonlinear neuronal activity, and
excitatory local coupling is used to achieve synchronization. These are similarities
between them. Yet there are important differences.
A single neuron in PCNN networks consists of input units, a linking modulation,
and a pulse generator, all of which are mainly implemented using leaky integrators
[27]. The input units consist of two principal branches, called feeding and linking
inputs. Feeding inputs are the primary inputs from the neuron’s receptive field while
linking inputs are mainly the lateral connections with neighboring neurons and may
also include feedback controls from higher layers. The linking modulation modulates
the feeding inputs with linking inputs and produces the total internal activity U ,
which is given by U = F (1 + βL). Here F and L are total feeding and linking
inputs respectively, and β is a parameter that controls the linking strength. The
pulse generator is a Heaviside function of the difference between the total internal
activity and a dynamic threshold implemented through another leaky integrator. If
β is set to zero, the neuron becomes an encoder, which maps the intensity linearly
to firing frequency. If the activities from all neurons are summed up, the resulting
output is a time series, which corresponds to the histogram of the input image [59].
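The single-unit dynamics described above can be condensed into a few lines. The following is a minimal sketch of one PCNN unit driven by a constant stimulus, not the implementation of [27] or [59]; the decay constants, the threshold charge v_theta, and the function name are illustrative assumptions:

```python
def pcnn_unit(stimulus, beta=0.2, decay_f=0.9, decay_l=0.8,
              decay_theta=0.95, v_theta=5.0, steps=50):
    """A single PCNN neuron driven by a constant scalar stimulus.

    F and L are leaky integrators for the feeding and linking branches;
    U = F * (1 + beta * L) is the linking modulation; the pulse generator
    is a Heaviside function of U minus a leaky dynamic threshold theta.
    All parameter values here are illustrative.
    """
    F = L = theta = 0.0
    spikes = []
    for _ in range(steps):
        F = decay_f * F + stimulus       # feeding input (receptive field)
        L = decay_l * L                  # linking input (no neighbors here)
        U = F * (1.0 + beta * L)         # total internal activity
        Y = 1.0 if U > theta else 0.0    # Heaviside pulse generator
        theta = decay_theta * theta + v_theta * Y  # dynamic threshold
        spikes.append(Y)
    return spikes

# With beta = 0 the unit reduces to an encoder that maps intensity to
# firing frequency: stronger stimuli charge F faster and fire more often.
rates = [sum(pcnn_unit(s, beta=0.0)) for s in (1.0, 2.0, 4.0)]
```

Running the unit at increasing stimulus levels with β = 0 illustrates the intensity-to-frequency encoding noted above.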
With linking inputs, the behavior of the network can be very complicated. But
when β is small, corresponding to the weak linking regime [59], synchronization is
achieved only when the differences are small, similar to a smoothing operation. This
type of PCNN can transform the input image into firing frequency representation
with desirable invariant properties [59]. Variations of this type of PCNN have found
successful applications in image factorization [60] and may have a connection with
wavelet and other transformations [103].
Recently, Stoecker et al. [120] also applied PCNN to scene segmentation. The
network successfully separated identical objects in a visual scene. Because differ-
ent intensities map to different firing frequencies, the network would have difficulty
in handling real images. For real images, different objects generally correspond to
different frequencies, and the readout would be problematic, i.e., it would be diffi-
cult to have a network that could detect the segmentation. More fundamentally, the
interactions among different firing frequencies have not been investigated. In [120],
desynchronization is achieved solely due to the spatial distances among input objects.
On the other hand, LEGION networks can successfully segment real range images as
demonstrated here and other real images such as texture images [16].
Another difference is that LEGION has a well-analyzed mathematical foundation
[123][135] whereas there is little analytical investigation on PCNN probably because
the neuron model is complicated. As illustrated in this chapter, to achieve image seg-
mentation, complicated oscillator models may not be necessary. The task performed
in [120] was previously achieved readily by LEGION [123][134].
2.5.3 Further Research Topics
While the LEGION network successfully segmented real range images, there are
several issues that need to be addressed in the future. One direct problem is how
to train the LEGION network so that the optimal parameter values can be chosen
automatically. One solution would be to have a separate network. A better solution
should utilize the temporal dynamics and devise an efficient memory accordingly. A
substantial improvement of the segmentation results is possible by setting the lateral
connections according to the temporal context that is developed through dynamics.
This offers another dimension which is unique to dynamic system approaches.
It is obvious that optimal segmentation is scale-dependent, and global inhibition
in LEGION actually defines a scale space. Figure 2.10 shows a hierarchy by manually
changing global inhibition. At the first level, it corresponds to figure-ground segre-
gation. By increasing global inhibition, the segmented results are refined further. As
shown in Figure 2.10, this scale space is not based on blurring effects from continuously
changing the filter’s spatial extent [142]. Rather, the boundaries are precisely located
through all levels. This scale space may be an alternative to conventional scale spaces
and its properties need to be further studied.
Figure 2.10: A hierarchy obtained from multiscale segmentation. The top is the input range image and each segmented region is further segmented by increasing the level of global inhibition. As in Figure 2.6, black pixels represent active oscillators, corresponding to the popped up region. See Figure 2.6(i) for the corresponding 3-D model.
CHAPTER 3
BOUNDARY DETECTION BY CONTEXTUAL
NONLINEAR SMOOTHING
In this chapter we present a two-step boundary detection algorithm. The first
step is a nonlinear smoothing algorithm which is based on an orientation-sensitive
probability measure. By incorporating geometrical constraints through the coupling
structure, we obtain a robust nonlinear smoothing algorithm, where many nonlin-
ear algorithms can be derived as special cases. Even when noise is substantial, the
proposed smoothing algorithm can still preserve salient boundaries. Compared with
anisotropic diffusion approaches, the proposed nonlinear algorithm not only performs
better in preserving boundaries but also has a non-uniform stable state, whereby re-
liable results are available within a fixed number of iterations independent of images.
The second step is simply a Sobel edge detection algorithm without non-maximum
suppression and hysteresis tracking. Due to the proposed nonlinear smoothing, salient
boundaries are extracted effectively. Experimental results using synthetic and real
images are provided. The results presented in this chapter appeared in [79] [80].
3.1 Introduction
One of the fundamental tasks in low-level machine vision is to locate discontinu-
ities in images corresponding to physical boundaries between a number of regions. A
common practice is to identify local maxima in the local gradients of images, collectively
known as edge detection algorithms. The Sobel edge detector [14] consists of two
3 × 3 convolution kernels, which respond maximally to vertical and horizontal edges
respectively. Local gradients are estimated by convolving the images with the two
kernels, and thresholding is then applied to get rid of noisy responses. The Sobel
edge detector is computationally efficient but sensitive to noise. To make the esti-
mation of gradients more reliable, the image can be convolved with a low-pass filter
before estimation and two influential methods are due to Marr and Hildreth [89] and
Canny [13]. By convolving the image with a Laplacian of Gaussian kernel, the re-
sulting local maxima, which are assumed to correspond to meaningful edge points,
are zero-crossings in the filtered image [89]. Canny [13] derived an optimal step edge
detector using variational techniques starting from some optimal criteria and used the
first derivative of a Gaussian as a good approximation of the derived detector. Edge
points are then identified using a non-maximum suppression and hysteresis thresh-
olding for better continuity of edges. As noticed by Marr and Hildreth [89], edges
detected at a fixed scale are not sufficient and multiple scales are essentially needed
in order to obtain good results. By formalizing the multiple scale approach, Witkin
[142] and Koenderink [66] proposed Gaussian scale space. The original image is
embedded in a family of gradually smoothed images controlled by a single parameter,
which is equivalent to solving a heat equation with input as the initial condition [66].
While Gaussian scale space has nice properties and is widely used in machine vision
[75], a major limitation is that Gaussian smoothing inevitably blurs edges and other
important features due to its low-pass nature. To overcome the limitation, anisotropic
diffusion, which was proposed by Cohen and Grossberg [19] in modeling the primary
visual cortex, was formulated by Perona and Malik [105]:
∂I/∂t = div(g(‖∇I‖) ∇I)    (3.1)
Here div is the divergence operator, and g is a nonlinear monotonically decreasing
function and ∇I denotes the gradient. By making the diffusion conductance depen-
dent explicitly on local gradients, anisotropic diffusion prefers intra-region smoothing
over inter-region smoothing, resulting in immediate localization while noise is reduced
[105]. Because it produces visually impressive results, anisotropic diffusion has
generated much theoretical as well as practical interest (see [138] for a recent
review). While many improvements have been proposed, including spatial regular-
ization [137] and edge-enhancing anisotropic diffusion [15], the general framework
remains the same. As shown by You et al. [145], anisotropic diffusion given by (3.1)
is the steepest gradient descent minimizer of the following energy function:
E(I) = ∫_Ω f(‖∇I‖) dΩ    (3.2)

with

g(‖∇I‖) = f′(‖∇I‖) / ‖∇I‖
Under some general conditions, the energy function given by (3.2) has a unique and
trivial global minimum, where the image is constant everywhere, and thus interest-
ing results exist only within a certain period of diffusion. An immediate problem is
how to determine the termination time, which we refer to as the termination prob-
lem. While there are some heuristic rules on how to choose the stop time [137] [15],
in general it corresponds to the open problem of automatic scale selection. As in
Gaussian scale space, a fixed time would not be sufficient to obtain good results.
Another problem of anisotropic diffusion is that diffusion conductance is a determin-
istic function of local gradients, which, similar to non-maximum suppression in edge
detection algorithms, makes an implicit assumption that larger gradients are due to
true boundaries. When noise is substantial and gradients due to noise and boundaries
cannot be distinguished based on magnitudes, the approach tends to fail to preserve
meaningful region boundaries.
To illustrate the problems, Figure 3.1(a) shows a noise-free image, where the gra-
dient magnitudes along the central square change considerably. Figure 3.1(b) shows a
noisy version of Figure 3.1(a) by adding Gaussian noise with zero mean and σ = 40,
and Figure 3.1(c) shows its local gradient magnitude obtained using Sobel opera-
tors [14]. While the three major regions in Figure 3.1(b) may be perceived, Figure
3.1(c) is very noisy and the strong boundary fragment is barely visible. Figures 3.1
(d)-(f) show the smoothed images by an anisotropic diffusion algorithm [106] with
specified numbers of iterations. Figures 3.1 (g)-(i) show the edge maps of Figure 3.1
(d)-(f), respectively, using the Sobel edge detection algorithm. While at the 50th
iteration the result is still noisy, the result becomes meaningless at the 1000th itera-
tion. Even though the result at the 100th iteration is visually good, the boundaries
are still fragmented and it is not clear how to identify a “good” number of iterations
automatically.
These problems to a large extent are due to the assumption that local maxima in
gradient images are good edge points. In other words, due to noise, responses from
true boundaries and those from noise are not distinguishable based on magnitude.
Figure 3.1: An example with non-uniform boundary gradients and substantial noise. (a) A noise-free synthetic image. Gray values in the image: 98 for the left ‘[’ region, 138 for the square, 128 for the central oval, and 158 for the right ‘]’ region. (b) A noisy version of (a) with Gaussian noise of σ = 40. (c) Local gradient map of (b) using the Sobel operators. (d)-(f) Smoothed images from an anisotropic diffusion algorithm [106] at 50, 100, and 1000 iterations. (g)-(i) Corresponding edge maps of (d)-(f) respectively using the Sobel edge detector.
To overcome these problems, contextual information, i.e., responses from neighboring
pixels, should be incorporated in order to reduce ambiguity as in relaxation labeling
and related methods [112] [55] [72] [42]. In general, relaxation labeling methods use
pair-wise compatibility measures, which are determined based on a priori models
associated with labels, and their convergence is not known and is often very slow in
numerical simulations [88]. In this chapter, by using an orientation-sensitive probability
measure, we incorporate contextual information through the geometrical constraints on
the coupling structure. Numerical simulations show that the resulting nonlinear al-
gorithm has a non-uniform stable state and good results can be obtained within a
fixed number of iterations independent of input images. Also, the oriented probability
measure is defined on input data, and thus no a priori models need to be assumed.
In Section 3.2, we formalize our contextual nonlinear smoothing algorithm and
show that many nonlinear smoothing algorithms can be treated as special cases.
Section 3.3 gives some theoretical results as well as numerical simulations regarding
the stability and convergence of the algorithm. Section 3.4 provides experimental
results using synthetic and real images. Section 3.5 concludes the chapter with further
discussions.
3.2 Contextual Nonlinear Smoothing Algorithm
3.2.1 Design of the Algorithm
To design a statistical algorithm, with no prior knowledge, we assume a Gaussian
distribution within each region. That is, given a pixel (i0, j0) and a window R(i0,j0)
at pixel (i0, j0), consisting of a set of pixel locations, we assume that:
P(I(i0,j0), R) = (1/(√(2π) σR)) exp(−(I(i0,j0) − µR)² / (2σR²))    (3.3)
where I(i,j) is the intensity value at pixel location (i, j). To simplify notation, without
confusion, we use R to stand for R(i0,j0). Intuitively, P (I(i0,j0), R) is a measure of
compatibility between intensity value at pixel (i0, j0) and statistical distribution in
window R. To estimate the unknown parameters of µR and σR, consider the pixels
in R as n realizations of (3.3), where n = |R|. The likelihood function of µR and σR is
[119]:
L(R, µR, σR) = (1/(√(2π) σR))ⁿ exp(−(1/(2σR²)) Σ(i,j)∈R (I(i,j) − µR)²)    (3.4)
By maximizing (3.4), we get the maximum likelihood estimators for µR and σR:
µR = (1/n) Σ(i,j)∈R I(i,j)    (3.5a)

σR = √( (1/n) Σ(i,j)∈R (I(i,j) − µR)² )    (3.5b)
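The estimators (3.5a)/(3.5b) and the compatibility measure (3.3) are straightforward to compute. A minimal sketch in Python; the function names are ours, and the eps guard for constant windows is an added assumption:

```python
import numpy as np

def window_stats(patch):
    """Maximum-likelihood estimators (3.5a)/(3.5b) for a window R."""
    mu = patch.mean()                              # (3.5a)
    sigma = np.sqrt(((patch - mu) ** 2).mean())    # (3.5b), 1/n normalization
    return mu, sigma

def compatibility(value, patch, eps=1e-8):
    """Gaussian compatibility measure (3.3) of a pixel value with window R."""
    mu, sigma = window_stats(patch)
    sigma = max(sigma, eps)  # guard against constant windows (our addition)
    return np.exp(-(value - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
```

A pixel value close to the window mean receives a higher compatibility than an outlier, which is what drives the window selection later in the chapter.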
To perform nonlinear smoothing, similar to selective smoothing filters [125] [95],
suppose that there are M windows R(m), where 1 ≤ m ≤ M , around a central pixel
(i0, j0). Here these R(m)’s can be generated from one or several basis windows through
rotation, which are motivated by the experimental findings of orientation selectivity
in the visual cortex [53]. Simple examples are elongated rectangular windows (refer
to Figure 3.6), which are used throughout this chapter for synthetic and real images.
The probability that pixel (i0, j0) belongs to R(m) can be estimated from equations
(3.3) and (3.5). By assuming that the weight of each R(m) should be proportional to
the probability, as in relaxation labeling [112] [55], we obtain an iterative nonlinear
smoothing filter:
I^(t+1)(i0,j0) = [ Σm P(I^t(i0,j0), R(m)) µ^t(R(m)) ] / [ Σm P(I^t(i0,j0), R(m)) ]    (3.6)
A problem with this filter is that it is not sensitive to weak edges due to the linear
combination. To generate more semantically meaningful results and increase the
sensitivity even to weak edges, we apply a nonlinear function to the weights, which is
essentially the same as in anisotropic diffusion [105]:
I^(t+1)(i0,j0) = [ Σm g(P(I^t(i0,j0), R(m))) µ^t(R(m)) ] / [ Σm g(P(I^t(i0,j0), R(m))) ]    (3.7)
Here g is a nonlinear monotonically increasing function.² A good choice for g is
an exponential function, which is widely used in nonlinear smoothing anisotropic
diffusion approaches:
g(x) = exp(x²/K)    (3.8)
Here the parameter K controls the sensitivity to edges [113]. Equation (3.7) provides a
generic model for a wide range of nonlinear algorithms, the behavior of which largely
depends on the sensitivity parameter K. When K is large, (3.7) reduces to the equally
weighted average smoothing filter. When K is around 0.3, g is close to a linear
function on [0, 1] and (3.7) then reduces to (3.6). When K is a small positive number,
(3.7) will be sensitive to all discontinuities. Theoretically, no matter how small the
weight of a window is, as long as it is nonzero, the system will reach a uniform stable
state as t → ∞. Similar to anisotropic diffusion approaches, the desired
results will be time-dependent and the termination problem becomes a critical issue
for autonomous solutions. To overcome this limitation, we restrict smoothing only
within the window with the highest probability similar to selective smoothing [125]
[95]:
m* = arg max(1≤m≤M) P(I^t(i0,j0), R(m))    (3.9)
²Because the probability measure given by (3.3) is inversely related to the gradient measure used in most nonlinear smoothing algorithms, (3.8) is an increasing function instead of a decreasing function in our method.
The nonlinear smoothing through (3.9) is desirable in regions that are close to edges.
By using appropriate R(m)’s, (3.9) encodes discontinuities implicitly. But in homo-
geneous regions, (3.9) may produce artificial block effects due to intensity variations.
Under the proposed statistical formulation, there is an adaptive method to detect
homogeneity. Based on the assumption that there are M windows R(m) around a
central pixel (i0, j0), each with a Gaussian distribution, consider the
mean in each window as a new random variable:
µ(m) = (1/|R(m)|) Σ(i,j)∈R(m) I(i,j)    (3.10)
Because µ(m) is a linear combination of random variables with a Gaussian distribution,
µ(m) also has a Gaussian distribution with the same mean and a standard deviation
given by:
σµ(m) = (1/√|R(m)|) σR(m)    (3.11)
This provides a probability measure of how likely it is that the M windows are sampled
from one homogeneous region. Given a confidence level α, for each pair of windows
R(m1) and R(m2), we have:
|µ(m1) − µ(m2)| ≤ min( √(log(1/α)/|R(m1)|) σR(m1), √(log(1/α)/|R(m2)|) σR(m2) )    (3.12)
If all the pairs satisfy (3.12), the M windows are likely from one homogeneous region
with confidence α. Intuitively, under the assumption of a Gaussian distribution, when
we have more samples, i.e., the window R(m) is larger, the estimation of the mean is
more precise and so the threshold should be smaller. In a region with a larger standard
deviation, the threshold should be larger because larger variations are allowed.
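The pairwise test (3.12) can be sketched directly; the list-of-arrays window representation and the default α are illustrative assumptions:

```python
import numpy as np

def windows_homogeneous(windows, alpha=0.05):
    """Test (3.12): are the M windows likely from one homogeneous region?

    `windows` is a list of 2-D arrays of pixel values; alpha is the
    confidence level. Returns True if every pair of window means is
    closer than the smaller of the two adaptive thresholds.
    """
    stats = []
    for w in windows:
        mu = w.mean()
        sigma = np.sqrt(((w - mu) ** 2).mean())   # MLE deviation, as in (3.5b)
        stats.append((mu, sigma, w.size))
    for i in range(len(stats)):
        for j in range(i + 1, len(stats)):
            mu1, s1, n1 = stats[i]
            mu2, s2, n2 = stats[j]
            # adaptive threshold: tighter for larger windows, looser for
            # larger in-window deviations
            thresh = min(np.sqrt(np.log(1 / alpha) / n1) * s1,
                         np.sqrt(np.log(1 / alpha) / n2) * s2)
            if abs(mu1 - mu2) > thresh:
                return False
    return True
```

Two copies of the same noisy window pass the test, while windows whose means differ by far more than the in-window variation fail it.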
The nonlinear smoothing algorithm outlined above works well when noise is not
very large. In cases where the signal-to-noise ratio is very low, the probability measure
given in (3.3) would be unreliable because pixel values change considerably. This
problem can be alleviated by using the mean value of pixels sampled from R which
are close to the central pixel (i0, j0), or along a certain direction to make the algorithm
more orientation sensitive.
To summarize, we obtain a nonlinear smoothing algorithm. We define M oriented
windows which can be obtained by rotating one or more basis windows. At each pixel,
we estimate parameters using (3.5). If all the M windows belong to a homogeneous
region according to (3.12), we do the smoothing within all the M windows. Otherwise,
the smoothing is done only within the most compatible window given by (3.9).
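The control flow summarized above can be sketched in a deliberately simplified form. Here four 3-pixel oriented windows (horizontal, vertical, and the two diagonals) stand in for the elongated rectangular windows used in our experiments; the window sizes and parameter values are illustrative assumptions, not the experimental configuration:

```python
import numpy as np

def smooth_step(img, alpha=0.05):
    """One iteration of the contextual nonlinear smoothing (simplified sketch)."""
    out = img.copy()
    # four oriented 3-pixel windows through each pixel
    offsets = [((0, -1), (0, 1)), ((-1, 0), (1, 0)),
               ((-1, -1), (1, 1)), ((-1, 1), (1, -1))]
    h, w = img.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            mus, sigmas, probs = [], [], []
            for (a, b), (c, d) in offsets:
                win = np.array([img[i + a, j + b], img[i, j], img[i + c, j + d]])
                mu, sigma = win.mean(), max(win.std(), 1e-8)
                # compatibility measure (3.3), up to a constant factor
                probs.append(np.exp(-(img[i, j] - mu) ** 2 / (2 * sigma ** 2)) / sigma)
                mus.append(mu)
                sigmas.append(sigma)
            # homogeneity test (3.12) over all pairs of windows
            thr = [np.sqrt(np.log(1 / alpha) / 3) * s for s in sigmas]
            homog = all(abs(mus[a] - mus[b]) <= min(thr[a], thr[b])
                        for a in range(4) for b in range(a + 1, 4))
            if homog:
                out[i, j] = np.mean(mus)                # smooth over all windows
            else:
                out[i, j] = mus[int(np.argmax(probs))]  # best window, as in (3.9)
    return out
```

On a clean step edge the most compatible window lies entirely on one side of the edge, so the edge is preserved exactly, while an isolated outlier pixel inside a flat region is pulled toward its window means.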
3.2.2 A Generic Nonlinear Smoothing Framework
In this section we will show how to derive several widely used nonlinear algorithms
from the statistical nonlinear algorithm outlined above. Several early nonlinear filters
[125] [95] do the smoothing in a window where the standard deviation is the smallest.
These filters can be obtained by simplifying (3.3) to:
P(I(i0,j0), µ, σ) = (1/(√(2π) σ)) C    (3.13)
where C is a constant. Then the solution to (3.9) is the window with the smallest
deviation. Recently, Higgins and Hsu [47] extended the principle of choosing the
window with the smallest deviation for edge detection.
Another nonlinear smoothing filter is the gradient-inverse filter [131]. Suppose
that there is only one window, i.e., M = 1, consisting of the central pixel (i0, j0)
itself. The estimated deviation for a given pixel (i, j) in (3.5b) then becomes:
σ = |I(i,j) − I(i0,j0)| (3.14)
Equation (3.14) is a popular way to estimate local gradients. Using (3.13) as the
probability measure, (3.6) becomes exactly the gradient inverse nonlinear smoothing
filter [131].
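A sketch of one pass of the resulting gradient-inverse filter over an 8-neighborhood; the eps guard against division by zero is an added assumption, not part of [131]:

```python
import numpy as np

def gradient_inverse_step(img, eps=1e-3):
    """One pass of the gradient-inverse smoothing filter (special case of (3.6)).

    Each 8-neighbor is weighted by the inverse of its absolute intensity
    difference from the central pixel; eps avoids division by zero.
    """
    out = img.copy()
    h, w = img.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            num = den = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    wgt = 1.0 / max(abs(img[i + di, j + dj] - img[i, j]), eps)
                    num += wgt * img[i + di, j + dj]
                    den += wgt
            out[i, j] = num / den
    return out
```

Since same-valued neighbors receive weight 1/eps while neighbors across a strong edge receive a tiny weight, a sharp step edge survives the pass essentially intact.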
The SUSAN nonlinear smoothing filter [117] is based on the SUSAN (Smallest
Univalue Segment Assimilating Nucleus) principle. It is formulated as:
I^(t+1)(i0,j0) = [ Σ(δi,δj)≠(0,0) I^t(i0+δi, j0+δj) W(i0, j0, δi, δj) ] / [ Σ(δi,δj)≠(0,0) W(i0, j0, δi, δj) ]    (3.15)

where

W(i0, j0, δi, δj) = exp( −(δi² + δj²)/(2σ²) − (I^t(i0+δi, j0+δj) − I^t(i0,j0))²/T² )
Here (i0, j0) is the central pixel under consideration, and (δi, δj) defines a local neigh-
borhood. Essentially, it integrates Gaussian smoothing in spatial and brightness
domains. The parameter T is a threshold for intensity values. It is easy to see from
(3.15) that the weights are derived based on pair-wise intensity value differences. The
SUSAN filter would therefore be expected to perform well when images consist of
relatively homogeneous regions and the noise within each region is smaller than T. When
noise is substantial, it fails to preserve structures due to the pair-wise difference
calculation, where no geometrical constraints are incorporated. This is consistent with
the experimental results, which will be discussed later. To get the SUSAN filter, we
define one window including the central pixel itself only. For a given pixel (i, j) in its
neighborhood, (3.3) can be simplified to:
P(I(i,j), R) = C exp( −(I(i,j) − µR)² / T² )    (3.16)
where C is a scaling factor. Because now µR is I(i0,j0), (3.6) with the probability
measure given by (3.16) is equivalent to Gaussian smoothing in the brightness domain
in (3.15).
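A direct transcription of (3.15) for a 3 × 3 neighborhood might look as follows; the particular σ and T values are illustrative:

```python
import numpy as np

def susan_step(img, sigma=1.0, T=20.0):
    """One pass of the SUSAN smoothing filter, following (3.15).

    Weights combine a spatial Gaussian with a Gaussian on brightness
    differences; the central pixel itself is excluded from the sums.
    """
    out = img.copy()
    h, w = img.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            num = den = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    diff = img[i + di, j + dj] - img[i, j]
                    wgt = np.exp(-(di * di + dj * dj) / (2 * sigma ** 2)
                                 - diff * diff / T ** 2)
                    num += wgt * img[i + di, j + dj]
                    den += wgt
            out[i, j] = num / den
    return out
```

With a step edge much larger than T, the brightness term suppresses cross-edge neighbors almost completely, so the edge is preserved; it is only when the noise amplitude approaches T that the pair-wise weighting breaks down, as discussed above.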
Now consider anisotropic diffusion given by (3.1). By discretizing (3.1) in the image
domain with four nearest-neighbor coupling [106] and rearranging terms, we have:
I t+1(i,j) = ηt
(i,j)It(i,j) + λ
∑
m
g(P (I t(i,j), R
(m)))µR(m) (3.17)
If we have four singleton regions, (3.17) is essentially a simplified version of (3.7) with
an adaptive learning rate.
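For concreteness, a minimal sketch of the classical four-neighbor discretization of (3.1) is given below; the conductance function g, the constant K, and the step size λ are illustrative choices, not the exact settings used in this chapter:

```python
import numpy as np

def perona_malik_step(img, K=15.0, lam=0.2):
    """One four-neighbor Perona-Malik diffusion step (discretization of (3.1)).

    g(s) = exp(-(s/K)^2) suppresses diffusion across strong gradients.
    Borders are handled by replicate padding. Illustrative sketch only.
    """
    I = img.astype(float)
    p = np.pad(I, 1, mode="edge")
    # Nearest-neighbor differences (north, south, west, east).
    dN = p[:-2, 1:-1] - I
    dS = p[2:, 1:-1] - I
    dW = p[1:-1, :-2] - I
    dE = p[1:-1, 2:] - I
    g = lambda d: np.exp(-(d / K) ** 2)
    return I + lam * (g(dN) * dN + g(dS) * dS + g(dW) * dW + g(dE) * dE)
```

A constant image is a fixed point of this update, while noise within a region is progressively averaged away.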
3.3 Analysis
3.3.1 Theoretical Results
One of the distinctive characteristics of the proposed algorithm is that it imposes
spatial constraints among responses from neighboring locations through a coupling
structure, as opposed to a pair-wise coupling structure. Figure 3.2 illustrates the
concept using a manually constructed example. Figure 3.2(a) shows the oriented windows
in a 3 x 3 neighborhood, and Figure 3.2(c) shows the coupling structure if we apply
the proposed algorithm to a small image patch shown in Figure 3.2(b). The directed
graph is constructed as follows. There is a directed edge from (i1, j1) to (i0, j0) if and
only if (i1, j1) contributes to the smoothing of (i0, j0) according to equations (3.12)
and (3.9). By doing so, the coupling structure is represented as a directed graph as
shown in Figure 3.2(c). Connected components and strongly connected components
[20] of the directed graph can be used to analyze the temporal behavior of the pro-
posed algorithm. A strongly connected component is a set of vertices, or pixels here,
where there is a directed path from any vertex to all the other vertices in the set.
We obtain a connected component if we do not consider the direction of edges along
a path. In the example shown in Figure 3.2(c), all the black pixels form a strongly
connected component and so do all the white pixels. Also, there are obviously two
connected components.
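The graph-theoretic notions above can be made concrete with a small sketch; Kosaraju's algorithm below computes strongly connected components of a directed graph (an illustration of the analysis tool, not part of the proposed smoothing algorithm):

```python
def strongly_connected_components(n, edges):
    """Kosaraju's algorithm: SCCs of a directed graph on vertices 0..n-1.

    Returns a list of sorted vertex lists, one per component.
    """
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    # Pass 1: record vertices by DFS finish time (iterative DFS).
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            advanced = False
            for v in it:
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, iter(adj[v])))
                    advanced = True
                    break
            if not advanced:
                order.append(u)
                stack.pop()

    # Pass 2: DFS on the reversed graph in reverse finish order.
    comp, comps = [-1] * n, []
    for s in reversed(order):
        if comp[s] != -1:
            continue
        group, stack = [], [s]
        comp[s] = len(comps)
        while stack:
            u = stack.pop()
            group.append(u)
            for v in radj[u]:
                if comp[v] == -1:
                    comp[v] = len(comps)
                    stack.append(v)
        comps.append(sorted(group))
    return comps
```

On a graph with two mutually reachable groups joined by a single one-way edge, the two groups come out as separate strongly connected components even though, ignoring edge directions, they form one connected component.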
Essentially, our nonlinear smoothing algorithm can be viewed as a discrete dynamical
system, the behavior of which is complex due to the spatial constraints imposed
by coupling windows and the adaptive coupling structure induced by probabilistic grouping. We
now prove that a constant region satisfying certain geometrical constraints is a stable
state of the smoothing algorithm.
Theorem 1. If a region S of a given image I satisfies:
(i1, j1) ∈ S and (i2, j2) ∈ S ⇒ I(i1,j1) = I(i2,j2)   (3.18a)

∀(i, j) ∈ S, ∃m such that R^(m)(i,j) ⊆ S   (3.18b)
Then S is stable with respect to the proposed algorithm.
Proof. Condition (3.18a) states that S is a constant region and the standard
deviation is zero if R(m) is within S according to equation (3.5b). Consider a pixel
(i0, j0) in S. Inequality (3.12) is satisfied only when all R(m)’s are within S. In
this case, the smoothing algorithm does not change the intensity value at (i0, j0).
Otherwise, the window selected according to equation (3.9) must be within S, because
there exists at least one such window by (3.18b); thus the smoothing algorithm again
does not change the intensity value at (i0, j0). So S is stable. Q.E.D.
A maximal connected component of the constructed graph is stable when its
pixels are constant, and thus the maximal connected components of the constructed
graph constitute a piecewise-constant stable solution of the proposed algorithm. For the
image patch given in Figure 3.2(b), for example, a stable solution is that pixels in
each of the two connected components are constant. The noise-free image in Figure
Figure 3.2: Illustration of the coupling structure of the proposed algorithm. (a) Eight oriented windows and a fully connected window defined on a 3 x 3 neighborhood. (b) A small synthetic image patch of 6 x 8 pixels. (c) The resulting coupling structure for (b). There is a directed edge from (i1, j1) to a neighbor (i0, j0) if and only if (i1, j1) contributes to the smoothing of (i0, j0) according to equations (3.12) and (3.9). Each circle represents a pixel, where the inside color is proportional to the gray value of the corresponding pixel. Ties in (3.9) are broken according to left-right and top-down preference of the oriented windows in (a).
3.1(a) is also a stable solution by itself, as we will demonstrate through numerical
simulations later on. It is easy to see from the proof that any region which satisfies
conditions (3.18a) and (3.18b) during temporal evolution will stay unchanged. In
addition, due to the smoothing nature of the algorithm, a local maximum at iteration
t cannot increase according to the smoothing kernel by equation (3.12) or (3.9), and
similarly, a local minimum cannot decrease. We conjecture that any given image
approaches an image that is almost covered by homogeneous regions. Due to the
spatial constraints given by (3.18b), it is not clear if the entire image converges to a
piece-wise constant stable state. Within each resulting homogeneous region, (3.18b)
is satisfied and thus the region becomes stable. For pixels near boundaries, corners,
and junctions, it is possible that (3.18b) is not uniquely satisfied within one constant
region, and small changes may persist. The whole image in this case attains a quasi-
equilibrium state. This is supported by the following numerical simulations using
synthetic and real images. While there are pixels which do not converge within 1000
iterations, the smoothed image as a whole does not change noticeably at all. The two
maximal strongly connected components in Figure 3.2(c) satisfy condition (3.18b).
Both of them are actually uniform regions and thus are stable. Gray pixels would
be grouped into one of the two stable regions according to pixel value similarity and
spatial constraints.
3.3.2 Numerical Simulations
Because it is difficult to derive the speed of convergence analytically, we use nu-
merical simulations to demonstrate the temporal behavior of the proposed algorithm.
Since smoothing is achieved using equally weighted average within selected windows,
the algorithm should converge rather quickly in homogeneous regions. To obtain
quantitative estimations, we define two measures similar to variance. For synthetic
images, where a noise-free image is available, we define the deviation from the ground
truth image as:
D(I) = √( Σ_i Σ_j (I(i,j) − Igt(i,j))² / |I| )   (3.19)
Here I is the image to be measured and Igt is the ground truth image. The deviation
gives an objective measure of how good the smoothed image is with respect to the
true image. To measure the convergence, we define relative variance for image I at
time t:
V^t(I) = √( Σ_i Σ_j (I^t(i,j) − I^{t−1}(i,j))² / |I| )   (3.20)
We have applied the proposed algorithm on the noise-free image shown in Figure
3.1(a) and six noisy images generated from it by adding zero-mean Gaussian noise
with σ from 5 to 60. Figure 3.3 shows the deviation from the ground truth image
over iterations, and Figure 3.4 shows the relative variance of the noise-free image and
four selected noisy images (a subset, to keep the figure readable). As we can see from
Figure 3.3, the noise-free image is a stable solution by itself, where the deviation is
always zero. For the noisy images, the deviation from the true image stabilizes within
a few iterations, independent of the amount of noise. Figure 3.4 shows that
relative variance is bounded with a small upper limit after 10 iterations. This variance
is due to the pixels close to boundaries, corners and junctions that do not belong to
any resulting constant region. As discussed before, because the spatial constraints
cannot be satisfied within one homogeneous region, these pixels have connections
from pixels belonging to different homogeneous regions, and thus fluctuate. These
pixels are a small fraction of the input image in general, and thus the fluctuations do
[Plot: deviation vs. number of iterations (0–100) for the noise-free image and noisy images with σ = 5, 10, 20, 30, 40, and 60.]
Figure 3.3: Temporal behavior of the proposed algorithm with respect to the amount of noise. Six noisy images are obtained by adding zero-mean Gaussian noise with σ of 5, 10, 20, 30, 40, and 60, respectively, to the noise-free image shown in Figure 3.1(a). The plot shows the deviation from the ground truth image with respect to iterations for the noise-free image and the six noisy images.
not affect the quality of the smoothed images noticeably. As shown in Figure 3.3, the
deviation is stabilized quickly.
Real images are generally more complicated than synthetic images, both statistically and
structurally, and we have also applied our algorithm to the four real images shown in
Figures 3.9–3.12, which include a texture image. Figure 3.5 shows the relative variance
over 100 iterations, where the variance is bounded after 10 iterations independent of
the image. This indicates that the proposed algorithm behaves similarly for synthetic
and real images.
[Plot: relative variance vs. number of iterations (0–100) for the noise-free image and noisy images with σ = 5, 20, 40, and 60.]
Figure 3.4: Relative variance of the proposed algorithm for the noise-free image shown in Figure 3.1(a) and four noisy images with zero-mean Gaussian noise and σ of 5, 20, 40, and 60, respectively.
[Plot: relative variance vs. number of iterations (0–100) for the real images of Figures 3.9–3.12.]
Figure 3.5: Relative variance of the proposed algorithm for the real images shown in Figures 3.9–3.12.
3.4 Experimental Results
3.4.1 Results of the Proposed Algorithm
The nonlinear smoothing algorithm formalized in this chapter integrates discon-
tinuity and homogeneity through the orientation-sensitive probability framework.
Equation (3.9) represents discontinuity implicitly and (3.12) encodes homogeneity
explicitly. Because of the probability measure, initial errors in choosing smoothing
windows due to noise can be overcome by the coupling structure. Essentially, only
when the majority of the pixels in one window make a wrong decision is the final result
affected. As illustrated in Figure 3.2, the coupling structure is robust.
Figure 3.6: The oriented bar-like windows used throughout this chapter for synthetic and real images. The size of each kernel is approximately 3 x 10 pixels.
For optimal performance, the size and shape of the oriented windows are
application dependent. However, due to the underlying coupling structure, the
proposed algorithm gives good results for a wide range of parameter values. For example,
the same oriented windows are used throughout the experiments in this chapter. As
shown in Figure 3.6, these oriented windows are generated by rotating two rectan-
gular basis windows with size of 3 x 10 in pixels. The preferred orientation of each
window is consistent with orientation sensitivity of cell responses in the visual cortex
[53]. Asymmetric window shapes are used so that 2-D features such as corners and
junctions can be preserved.
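A bar-shaped oriented window of the kind shown in Figure 3.6 can be generated by rotating a rectangle; the sketch below is illustrative, with hypothetical size parameters rather than the exact kernels used in the experiments:

```python
import numpy as np

def oriented_bar_mask(size, length, width, theta):
    """Binary mask of a length x width bar rotated by theta (radians),
    centered in a size x size grid. Illustrative of the oriented windows."""
    c = (size - 1) / 2.0
    ys, xs = np.mgrid[0:size, 0:size]
    x, y = xs - c, ys - c
    u = x * np.cos(theta) + y * np.sin(theta)   # along-bar coordinate
    v = -x * np.sin(theta) + y * np.cos(theta)  # across-bar coordinate
    return (np.abs(u) <= length / 2) & (np.abs(v) <= width / 2)
```

Rotating the same rectangle through a set of angles yields the family of oriented windows; a quarter-turn simply transposes the mask.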
As is evident from numerous simulations, the proposed algorithm generates stable
results after about 10 iterations regardless of the input image. Thus, all the boundaries from
the proposed algorithm are generated using smoothed images at the 11th iteration.
As stated above, boundaries are detected using the Sobel edge detector due to its
efficiency.
Figure 3.7 shows the results of applying the proposed algorithm to a set of noisy
images obtained from the noise-free image shown in Figure 3.1(a) by adding Gaussian
noise with σ of 10, 40, and 60, respectively. The same smoothing parameters are used
for the three images. When noise is relatively small, the proposed algorithm preserves
boundaries accurately as well as corners and junctions, as shown in Figure 3.7(a).
When noise is substantial, due to the coupling structure, the proposed algorithm
is robust to noise and salient boundaries are well preserved. Because only local
information is used in the system, it would be expected that the boundaries are less
accurate when noise is larger. This uncertainty is an intrinsic property of the proposed
algorithm because reliable estimation gets more difficult when noise gets larger as
shown in Figure 3.7(b) and (c). The results seem consistent with our perceptual
experience.
Figure 3.8 shows the result for another synthetic image, which was extensively
used by Sarkar and Boyer [114]. As shown in Figure 3.8(b), noise is reduced greatly
and boundaries as well as corners are well preserved. Even using the simple Sobel edge
detector, the result is better than the best result from the optimal infinite impulse
response filters [114] obtained using several parameter combinations with hysteresis
thresholding. This is because their edge detector does not consider the responses from
neighboring pixels, but rather treats local maxima as good edge points.
Figure 3.9 shows an image of a grocery store advertisement which was used
throughout the book by Nitzberg, Mumford, and Shiota [98]. In order to get good
boundaries, they first applied an edge detector and then several heuristic algorithms
to close gaps and delete noise edges. In our system, the details and noise are smoothed
out due to the coupling structure and the salient boundaries, corners and junctions
are preserved. The result shown in Figure 3.9(c) is comparable with the result after
several post-processing steps shown on page 43 of the book.
Figure 3.7: The smoothed images at the 11th iteration and detected boundaries for three synthetic images obtained by adding the specified Gaussian noise to the noise-free image shown in Figure 3.1(a). Top row: input images; middle row: smoothed images at the 11th iteration; bottom row: boundaries detected using the Sobel edge detector. (a) Gaussian noise with σ = 10. (b) Gaussian noise with σ = 40. (c) Gaussian noise with σ = 60.
Figure 3.8: The smoothed image at the 11th iteration and detected boundaries for a synthetic image with corners. (a) Input image. (b) Smoothed image. (c) Boundaries detected.
Figure 3.9: The smoothed image at the 11th iteration and detected boundaries for a grocery store advertisement. Details are smoothed out while major boundaries and junctions are preserved accurately. (a) Input image. (b) Smoothed image. (c) Boundaries detected.
Figure 3.10: The smoothed image at the 11th iteration and detected boundaries for a natural satellite image with several land use patterns. The boundaries between different regions are formed from noisy segments due to the coupling structure. (a) Input image. (b) Smoothed image. (c) Boundaries detected.
Figure 3.10 shows a high resolution satellite image of a natural scene, consisting of
a river, soil land, and a forest. As shown in Figure 3.10(b), the river boundary which
is partially occluded by the forest is delineated. The textured forest is smoothed out
into a homogeneous region. The major boundaries between different types of features
are detected correctly.
Figure 3.11 shows an image of a woman which includes detail features and shading
effects, the color version of which was used by Zhu and Yuille [151]. In their region
competition algorithm, Zhu and Yuille [151] used a mixture of Gaussian model. A
non-convex energy function consisting of several constraint terms was formulated
under Bayesian framework. The algorithm, derived using variational principles, is
guaranteed to converge to only a local minimum. For our nonlinear algorithm, as
shown in Figure 3.11(b), the details are smoothed out while important boundaries
are preserved. The final result in Figure 3.11(c) is comparable with the result from the
region competition algorithm [151] applied on the color version after 130 iterations.
Compared with the region competition algorithm, the main advantage of our approach
Figure 3.11: The smoothed image at the 11th iteration and detected boundaries for a woman image. While the boundaries between large features are preserved and detected, detail features such as facial features are smoothed out. (a) Input image. (b) Smoothed image. (c) Boundaries detected.
is that local statistical properties are extracted and utilized effectively in the oriented
probabilistic framework instead of fitting the image into a global model which, in
general, cannot be guaranteed to fit the given data well.
To further demonstrate the effectiveness of the proposed algorithm, we have also
applied it to a texture image as shown in Figure 3.12(a). As shown in Figure 3.12(b),
the boundaries between different textures are preserved while most of the detail features
are smoothed out. Figure 3.12(c) shows the detected boundaries by the Sobel edge
detector. While there are some noisy responses due to the texture patterns, the main
detected boundaries are connected. A simple region growing algorithm would segment
Figure 3.12: The smoothed image at the 11th iteration and detected boundaries for a texture image. The boundaries between different textured regions are formed while details due to textures are smoothed out. (a) Input image. (b) Smoothed image. (c) Boundaries detected.
the smoothed image into four regions. While this example is not intended to show
that our algorithm can process texture images, it demonstrates that the proposed
algorithm generalizes to non-Gaussian distributions, even though a Gaussian
distribution was assumed when formalizing the algorithm.
3.4.2 Comparison with Nonlinear Smoothing Algorithms
In order to evaluate the performance of the proposed algorithm relative to ex-
isting nonlinear smoothing methods, we have conducted a comparison with three
recent methods. The SUSAN nonlinear filter [117] has been claimed to give the
best results by integrating smoothing in both the spatial and brightness domains. The
original anisotropic diffusion model by Perona and Malik [105] is still widely used and studied. The
edge-enhancing anisotropic diffusion model proposed by Weickert [137] [138] incorpo-
rates true anisotropy using a diffusion tensor calculated from a Gaussian kernel, and
is probably by far the most sophisticated diffusion-based smoothing algorithm.
An objective comparison using real images is difficult because there is no
universally accepted ground truth. Here we use synthetic images, where the ground
truth is known and the deviation calculated by (3.19) gives an objective measure of
the quality of smoothed images. We have also tuned parameters to achieve best
possible results for the methods to be compared. For the SUSAN algorithm, we have
used several different values for the critical parameter T in (3.15). For the Perona
and Malik model, we have tried different nonlinear functions g in (3.1) with different
parameters. For the Weickert model, we have chosen a good set of parameters for
diffusion tensor estimation. In addition, we choose their best results in terms of
deviation from the ground truth, which are then used for boundary detection.
Because the three methods and the proposed algorithm can all be applied iteratively,
we first compare their temporal behavior. We apply each of them to the image shown
in Figure 3.7(b) for 1000 iterations and calculate the deviation
and relative variance with respect to the number of iterations using (3.19) and (3.20).
Figure 3.13 shows the deviation from the ground-truth image. The SUSAN filter
quickly reaches its best state and then converges, also quickly, to a uniform state due
to the Gaussian smoothing term in the filter (see equation (3.15)). The temporal
behavior of the Perona-Malik model and the Weickert model is quite similar, although
the Weickert model converges more rapidly to good results and stays there longer. The
proposed algorithm converges and stabilizes quickly to a non-uniform state, and thus
the smoothing can be terminated after several iterations.
Figure 3.14 shows the relative variance of the four methods along the iterations.
Because the SUSAN algorithm converges to a uniform stable state, the relative vari-
ance goes to zero after a number of iterations. The relative variance of Perona-Malik
[Plot: deviation from the ground truth vs. number of iterations (0–1000) for the SUSAN filter, Perona-Malik model, Weickert model, and proposed filter.]
Figure 3.13: Deviations from the ground truth image for the four nonlinear smoothing methods. Dashed line: the SUSAN filter [117]; dotted line: the Perona-Malik model [105]; dash-dotted line: the Weickert model of edge-enhancing anisotropic diffusion [137]; solid line: the proposed algorithm.
[Plot: relative variance vs. number of iterations (0–1000) for the SUSAN filter, Perona-Malik model, Weickert model, and proposed filter.]
Figure 3.14: Relative variance of the four nonlinear smoothing methods. Dashed line: the SUSAN filter [117]; dotted line: the Perona-Malik diffusion model [105]; dash-dotted line: the Weickert model [137]; solid line: the proposed algorithm.
model is closely related to the g function in (3.1). Due to the spatial regularization
using a Gaussian kernel, the Weickert model changes continuously and the diffusion lasts
much longer, which explains why good results persist for a longer period than with
the Perona-Malik model. As shown in Figures 3.4 and 3.5, the proposed
algorithm generates bounded small ripples in the relative variance measure. Those
ripples do not affect smoothing results noticeably as the deviation from the ground
truth, shown in Figure 3.13, is stabilized quickly.
Now we compare the effectiveness of the four methods in preserving meaningful
boundaries. Following Higgins and Hsu [47], we use two quantitative performance
metrics to compare the edge detection results: P (AE|TE), the probability of a true
edge pixel being correctly detected by a given method; P (TE|AE), the probability
of a detected edge pixel being a true edge pixel. Due to the uncertainty in edge
localization, a detected edge pixel is considered to be correct if it is within two pixels
from ground-truth edge points using the noise-free image. For each method, the
threshold on the gradient magnitude of the Sobel edge detector is adjusted to achieve
a best trade-off between detecting true edge points and rejecting false edge points.
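A sketch of how P(AE|TE) and P(TE|AE) with a two-pixel localization tolerance might be computed; the helper name and the use of Chebyshev distance for the tolerance are our assumptions, not details given by Higgins and Hsu [47]:

```python
import numpy as np

def edge_metrics(detected, true_edges, tol=2):
    """Compute P(AE|TE) and P(TE|AE) with a localization tolerance.

    detected, true_edges: boolean edge maps. A pixel is matched if an edge
    pixel of the other map lies within `tol` pixels (Chebyshev distance).
    """
    det = np.argwhere(detected)
    tru = np.argwhere(true_edges)
    if len(det) == 0 or len(tru) == 0:
        return 0.0, 0.0

    def matched_fraction(points, targets):
        # Fraction of `points` with some pixel of `targets` within tol.
        hits = 0
        for p in points:
            if np.abs(targets - p).max(axis=1).min() <= tol:
                hits += 1
        return hits / len(points)

    p_ae_given_te = matched_fraction(tru, det)  # true edges detected
    p_te_given_ae = matched_fraction(det, tru)  # detections that are true
    return p_ae_given_te, p_te_given_ae
```

With this tolerance, a detected edge shifted by one pixel from the true edge still scores perfectly, reflecting the localization uncertainty discussed above.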
For the proposed algorithm, we use the result at the 11th iteration because the
proposed algorithm converges within several iterations. As mentioned before, for the
other three methods, we tune critical parameters and choose the smoothed images
with the smallest deviation. Figure 3.15 shows the smoothed images along with the
detected boundaries using the Sobel edge detector, for the image shown in Figure
3.7(a), where the added noise is Gaussian with zero mean and σ = 10.
Table 3.1 summarizes the quantitative performance metrics. All of the four methods
perform well and the proposed method gives the best numerical scores. The boundary
of the square is preserved accurately. For the central oval, the proposed algorithm
gives a better-connected boundary while the other three have gaps. Also, the proposed
algorithm generates the sharpest edges, while edges from the Weickert model are blurred
the most, resulting in the worst numerical metrics among the four methods.
Figure 3.16 shows the result for the image in Figure 3.7(b), where noise is sub-
stantial, and Table 3.2 shows the quantitative performance metrics. As shown in
Figure 3.16(a), the SUSAN filter tends to fail to preserve boundaries, resulting in
noisy boundary fragments. The Perona-Malik model produces good but fragmented
boundaries. Because only the local gradient is used, the Perona-Malik model is
noise-sensitive and thus generates more false responses than the other methods in this case.
Figure 3.15: Smoothing results and detected boundaries of the four nonlinear methods for the synthetic image shown in Figure 3.7(a). Here noise is not large and all of the methods perform well in preserving boundaries.
The false responses substantially lower the quantitative metrics of the model, making
it the worst among the four methods. The Weickert model produces good bound-
aries for strong segments but weak segments are blurred considerably. The proposed
algorithm preserves the connected boundary of the square as well as the partially
fragmented boundaries of the central oval, yielding the best numerical metrics among the four
methods. As shown in Figure 3.13, the smoothed image of the Weickert model has a
smaller deviation than the result from our algorithm, but the detected boundaries are
fragmented. This is because our algorithm produces sharp boundaries, which induce
larger penalties according to (3.19) when not accurately marked.
Comparing Tables 3.1 and 3.2, one can see that our proposed method is the most
robust, in that its average performance is degraded by only about 13%. The Perona-Malik
model is the most noise-sensitive, with performance degraded by about 35%. For
Models SUSAN[117] Perona-Malik[105] Weickert[137][138] Our method
P (TE | AE) 0.960 0.963 0.877 0.988
P (AE | TE) 0.956 0.964 0.880 0.979
Average 0.958 0.963 0.878 0.983
Table 3.1: Quantitative comparison of boundary detection results shown in Figure 3.15.
Models SUSAN[117] Perona-Malik[105] Weickert[137][138] Our method
P (TE | AE) 0.720 0.609 0.692 0.853
P (AE | TE) 0.713 0.618 0.688 0.854
Average 0.717 0.613 0.690 0.853
Table 3.2: Quantitative comparison of boundary detection results shown in Figure 3.16.
the SUSAN filter and the Weickert model, the average performance is degraded by
about 24% and 19%, respectively.
We have also applied the four methods on the natural satellite image shown in
Figure 3.10. The result from the proposed algorithm is at the 11th iteration as
already shown in Figure 3.10. The results from the other three methods are chosen
manually for the best possible results. Due to the termination problem, results from most
nonlinear smoothing algorithms have to be chosen manually, making them difficult
to use automatically. As shown in Figure 3.17, the results from the other three methods
are similar, and boundaries between different regions are not formed. In contrast, our
algorithm generates connected boundaries separating the major regions.
Figure 3.16: Smoothing results and detected boundaries of the four nonlinear methods for the synthetic image with substantial noise shown in Figure 3.7(b). The proposed algorithm generates sharper and better-connected boundaries than the other three methods.
Figure 3.17: Smoothing results and detected boundaries for the natural scene satellite image shown in Figure 3.10. The smoothed image of the proposed algorithm is at the 11th iteration, while the smoothed images of the other three methods are chosen manually. While the other three methods generate similar fragmented boundaries, the proposed algorithm forms the boundaries between different regions due to its coupling structure.
3.5 Conclusions
In this chapter we have presented a two-step robust boundary detection algo-
rithm. The first step is a nonlinear smoothing algorithm based on an orientation
sensitive probability measure. This algorithm is motivated by the orientation sen-
sitivity of cells in the visual cortex [53]. By incorporating geometrical constraints
through the coupling structure, the algorithm is robust to noise while preserving
meaningful boundaries. Even though the algorithm was formulated based on a
Gaussian distribution, it performs well for real and even textured images, showing the
generalization capability of the algorithm. It is also easy to see that the formalization
of the proposed algorithm would extend to other known distributions by changing
equations (3.3)-(3.5) accordingly. One such extension would be to use a mixture
of Gaussian distributions [30] so that the model may describe arbitrary
probability distributions.
Compared with recent anisotropic diffusion methods, our algorithm approaches a
non-uniform stable state and reliable results can be obtained after a fixed number of
iterations. In other words, it provides a solution for the termination problem. When
noise is substantial, our algorithm preserves meaningful boundaries better than the
diffusion-based methods, because the coupling structure employed is more robust
than pair-wise coupling structure.
Scale is an intrinsic parameter in machine vision as interesting features may exist
only in a limited range of scales. Scale spaces based on linear and nonlinear smooth-
ing kernels do not represent semantically meaningful structures explicitly [122]. A
solution to the problem could be to use parameter K in equation (3.8) as a control
parameter [113], which is essentially a threshold in gray values. Under this formal-
ization, (3.12) could offer an adaptive parameter selection. With the robust coupling
structure, our algorithm with adaptive parameter selection may be able to provide a
robust multiscale boundary detection method.
Another advantage of the probability measure framework is that there is no need
to assume a priori knowledge about each region, as is necessary in relaxation
labeling [112] [55], and that comparison across windows with different sizes and shapes
is feasible. This could lead to an adaptive window selection that preserves small but
important features which cannot be handled well by the current implementation.
There is one intrinsic limitation common to many smoothing approaches including
our proposed one. After smoothing, the available feature is the average gray value,
resulting in loss of information for further processing. One way to overcome this
problem is to apply the smoothing in feature spaces derived from input images [139].
Another disadvantage of the proposed algorithm is relatively intensive computation
due to the use of oriented windows. Each oriented window takes roughly as long
in one iteration as the edge-enhancing diffusion method [137]. On the other hand,
because our algorithm is entirely local and parallel, computation time would not be
a problem on parallel and distributed hardware. Computation on serial computers
could be reduced dramatically by decomposing the oriented filters hierarchically so
that oriented windows would be used only around discontinuities rather than in ho-
mogeneous regions. The decomposition techniques for steerable and scalable filters
[104] could also help to reduce the number of necessary convolution kernels.
CHAPTER 4
SPECTRAL HISTOGRAM: A GENERIC FEATURE FOR
IMAGES
In this chapter, we propose a generic statistical feature for homogeneous texture
images, which we call spectral histograms. A similarity measure between any given image
patches is then defined as the Kullback-Leibler divergence or other distance measures
between the corresponding spectral histograms. Unlike other similarity measures,
it can discriminate texture as well as intensity images and provides a unified,
nonparametric similarity measure for images. We demonstrate this using examples in
texture image synthesis [147], texture image classification, and content-based image
retrieval. We compare several different distance measures and find that the spectral
histogram is not sensitive to the particular form of the distance measure. We also
compare the spectral histogram with other statistical features and find that the spectral
histogram gives the best result for classification of a texture image database. We find
that distribution of local features is important while the local features themselves do
not appear to be critically important for texture discrimination and classification.
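A minimal sketch of a spectral histogram and one possible distance between histograms follows; the filter bank, bin count, and response range below are hypothetical placeholders, not the choices used in this dissertation:

```python
import numpy as np

def spectral_histogram(img, filters, bins=16, rng=(-255, 255)):
    """Concatenated, normalized histograms of filter responses (illustrative).

    `filters` are small kernels; each response image contributes one
    marginal histogram. Responses are computed by valid cross-correlation
    (kernel flip omitted for simplicity).
    """
    img = img.astype(float)
    hists = []
    for k in filters:
        kh, kw = k.shape
        resp = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
        for i in range(resp.shape[0]):
            for j in range(resp.shape[1]):
                resp[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
        h, _ = np.histogram(resp, bins=bins, range=rng)
        hists.append(h / h.sum())
    return np.concatenate(hists)

def l1_distance(h1, h2):
    """A simple distance between spectral histograms (one of several options)."""
    return np.abs(h1 - h2).sum()
```

Note that two constant images of different gray levels have identical derivative-filter marginals, which is why an intensity filter is typically included in the bank alongside derivative-type filters.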
4.1 Introduction
As discussed in Chapter 1, the ultimate goal of a machine vision system is to
derive a description of input images. To build an efficient and effective machine vision
system, a critical step is to derive meaningful features. Here “meaningful” means that
the high-level modules such as recognition should be able to use the derived features
readily. To illustrate the problem, Figure 1.1(a) shows a texture image, and a small
patch from Figure 1.1(a) is shown in Figure 1.1(b). Figure 1.1(c) shows the numerical values
of (b). It is extremely difficult, if not impossible, to derive a segmentation or useful
representation for the given image purely based on the numerical values. This example
demonstrates that features need to be extracted based on input images for machine
vision systems as well as for biological vision systems.
While many algorithms and systems have been proposed for image segmentation,
classification, and recognition, feature extraction is not well addressed. In most cases,
features are chosen by assumption for mathematical convenience or domain-specific
heuristics. For example, Canny [13] derived the most widely used edge detection
algorithm based on a step edge with additive Gaussian noise. Using variational tech-
niques, he obtained the optimal filter for the assumed model and proposed the first-
order derivative of Gaussian as an efficient and good approximation. Inspired by
the neurophysiological experimental findings of on- and off-center simple cells [54]
[53], Marr and Hildreth [89] proposed the second-order derivative of Gaussian (LoG,
Laplacian of Gaussian) to model the responses of the on- and off-center cells. A similarly simple, piece-wise constant image model has been widely used in image segmentation algorithms. For example, Mumford and Shah [94] proposed an energy functional for
image segmentation
E(f, Γ) = µ ∫∫_R (f − I)² dx dy + ∫∫_{R\Γ} ‖∇f‖² dx dy + ν|Γ|.   (4.1)
Here I is a two-dimensional input image defined on R, and f is the solution to be
found and Γ is the boundary of f . One can see that the underlying assumption of
the solution space is piece-wise smooth images. The energy functional shown in (4.1)
was claimed to be a generic energy functional [92] in that most existing segmentation
algorithms and techniques can be derived from the proposed energy functional.
Another major line of research related to feature extraction is texture discrimina-
tion and classification/segmentation. There are no obvious features that work well
for all texture images. The human visual system, however, can discriminate textures
robustly and effectively. This observation inspired many research activities. Julesz
[62] pioneered the research in searching for feature statistics for human visual per-
ception. He first studied k-gon statistics and conjectured that k = 2 is sufficient for
human visual perception. The conjecture was experimentally disproved by synthesiz-
ing perceptually different textures with identical 2-gon statistics [11] [24]. Other early
features for texture discrimination include co-occurrence matrices [44] [43], first-order
statistics [61], second-order statistics [18], and Markov random fields [65] [21] [16]. Those features have limited expressive power because the analysis of spatial interaction is limited to a relatively small neighborhood [147], and they have been applied successfully only to so-called micro-textures.
In the 1980s, theories on human texture perception were established, largely
based on available psychophysical and neurophysiological data [12] [23] and joint
spatial/frequency representation. These theories state that the human visual system transforms the retinal image into a local spatial/frequency representation, which can be
computationally simulated by convolving the input image with a bank of filters with
tuned frequencies and orientations. The mathematical framework for the local spatial/frequency representation was laid out by Gabor [33] in the context of communication systems. With the Fourier transform, a signal can be represented in the time or the frequency domain. The basis functions in the time domain are impulses with different time delays, and the basis functions in the frequency domain are complex sinusoids with different frequencies. A major problem with the Fourier transform is localization. As shown in Figure 4.1, while an impulse in the time domain precisely localizes the signal component, its Fourier transform does not provide any localization information in the frequency domain. While a sinusoid provides accurate localization in the frequency domain, it cannot localize the signal in time. Essentially, the Fourier transform uses two families of basis functions, each of which provides the best localization in one domain and none in the other. Based on this observation, Gabor proposed a more generic time/frequency representation [33], in which the basis functions of the Fourier transform are just two opposite extremes. By minimizing the localization uncertainty in both the time and frequency domains, Gabor derived basis functions that achieve this optimality simultaneously in the two domains. Gabor basis functions were
extended to two-dimensional images by Daugman [22]. Very recently, this theory has
also been confirmed by deriving similar feature detectors from natural images based
on certain optimization criteria [100] [101] [102] [1].
The human perception theory and the local spatial/frequency representational
framework inspired much research in texture classification and segmentation. Within
this framework, however, statistical features still need to be chosen because filter
Figure 4.1: Basis functions of the Fourier transform in time and frequency domains with their Fourier transforms. (a) An impulse and its Fourier transform. (b) A sinusoid function and its Fourier transform.
responses are not homogeneous within homogeneous textures, and are not sufficient because they are linear. As shown in Figure 4.2, the Gabor filter in Figure 4.2(b) responds to local, oriented structures, as shown in Figure 4.2(c), and the filter response itself does not characterize the texture. Intuitively, texture appearance cannot be characterized by very local pixel values because texture is a regional property. If we want to define a feature that is homogeneous within a texture region, it is necessary to integrate responses from filters with multiple orientations and scales. In other words, features need to be defined as statistical measures of the filter responses.
For example, Unser [127] used variances from different filters to characterize textures.
Ojala et al. [99] compared different features for texture classification based on the distributions of detected local features, using a database of nine images. Puzicha et al. [107] used distributions of responses from a set of Gabor filters as features. However, they posed texture segmentation as energy minimization over a pair-wise discrimination matrix, and the features used were not analyzed in terms of characterizing texture appearance.
Recently, Heeger and Bergen [45] proposed a texture synthesis algorithm that can
match texture appearance. The algorithm transforms a random noise image gradually into an image with an appearance similar to the given target image by independently matching the histograms of the image pyramids constructed from the noise and target images. The experimental results are impressive even though no theoretical justification was given. De Bonet and Viola [8] attempted to match the joint histograms
by utilizing the conditional probability defined on parent vectors. A parent vector is
a vector consisting of the filter responses in the constructed image pyramid up to a
Figure 4.2: A texture image with its Gabor filter response. (a) Input texture image. (b) A Gabor filter, which is truncated to save computation. (c) The filter response obtained through convolution.
given scale. As pointed out by Zhu et al. [147], these methods are not guaranteed to match the proposed statistics closely.
Zhu et al. [148] [149] [150] proposed a theory for learning probability models by matching histograms based on the maximum entropy principle, and the FRAME (Filters, Random field, And Maximum Entropy) model was developed for texture synthesis. To avoid the computational problem of learning the Lagrange multipliers in the FRAME model, the Julesz ensemble is defined as the set of texture images that have the same statistics as the observed images [147]. Experiments demonstrate that feature pursuit and texture synthesis can be done effectively by sampling from the Julesz ensemble using Markov chain Monte Carlo (MCMC) methods. It has been shown that the Julesz ensemble is consistent with the FRAME model [143].
In this chapter, inspired by the FRAME model [148] [149] [150] and especially the texture synthesis model [147], we define a feature which we call the spectral histogram. Given a window in an input image centered around a given pixel, we construct a
pyramid based on the local window using a bank of chosen filters. We calculate the histogram of each local window in the pyramid and obtain a vector consisting of the histograms from all filters, which is defined as the spectral histogram at the given location.
For chosen statistical features to be used by subsequent modules, such as classification, segmentation, and recognition, a similarity/distance measure must be defined. A distance measure between two spectral histograms is defined as the χ²-statistic or another distance measure between the spectral histograms. A distance measure using the χ²-statistic was proposed based on empirical experiments [107] [50]. However, as we will demonstrate using classification, the particular form of the distance measure is not critical for spectral histograms.
In Section 4.2, we formally define spectral histograms and give some properties
of spectral histograms. In Section 4.3 we show how to synthesize texture images
by matching spectral histograms. In Section 4.4, we study spectral histograms for
texture classification. In Section 4.5, we apply spectral histograms to the problem of
content-based image retrieval. In Section 4.6, we compare different texture features
and different similarity measures using classification. In Section 4.7, we apply our
model to synthetic texture pair discrimination. Section 4.8 concludes this chapter
with further discussion.
4.2 Spectral Histograms
Given an input image I, defined on a finite two-dimensional lattice, and a bank of filters F^(α), α = 1, 2, . . . , K, a sub-band image I^(α) is computed for each filter through linear convolution, i.e., I^(α)(v) = F^(α) ∗ I(v) = Σ_u F^(α)(u) I(v − u). I^(α), α =
1, 2, . . . , K can be considered as an image pyramid constructed from the given image I when there exist scaling relationships among the chosen filters. Here we loosely refer to I^(α), α = 1, 2, . . . , K, as an image pyramid for an arbitrarily chosen bank of filters. For
each sub-band image, we define the marginal distribution, or histogram,

H_I^(α)(z) = (1/|I|) Σ_v δ(z − I^(α)(v)).   (4.2)
We then define the spectral histogram with respect to the chosen filters as

H_I = (H_I^(1), H_I^(2), . . . , H_I^(K)).   (4.3)
The spectral histogram of an image or an image patch is essentially a vector consisting of the marginal distributions of the filter responses. The size of the input image or image patch is called the integration scale. We define a similarity measure between two spectral histograms using standard distance measures. The L_p-norm distance is defined as

|H_I1 − H_I2|_p = Σ_{α=1}^K ( Σ_z (H_I1^(α)(z) − H_I2^(α)(z))^p )^{1/p}.   (4.4)
Also, because the marginal distribution of each filter response is a probability distribution, a distance can be defined based on the discrete Kullback-Leibler divergence [70]:

KL(H_I1, H_I2) = Σ_{α=1}^K Σ_z (H_I1^(α)(z) − H_I2^(α)(z)) log( H_I1^(α)(z) / H_I2^(α)(z) ).   (4.5)
Another choice is the χ²-statistic, which is a first-order approximation of the Kullback-Leibler divergence and is widely used to compare histograms:

χ²(H_I1, H_I2) = Σ_{α=1}^K Σ_z (H_I1^(α)(z) − H_I2^(α)(z))² / (H_I1^(α)(z) + H_I2^(α)(z)).   (4.6)
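The computation of Eqs. (4.2)-(4.6) can be sketched as follows. This is an illustrative Python/NumPy sketch, not the implementation used in the experiments; circular convolution via the FFT and a fixed bin range shared by the images being compared are assumptions made here for simplicity:

```python
import numpy as np

def spectral_histogram(image, filters, n_bins=11, value_range=(-1.0, 1.0)):
    """Spectral histogram of Eqs. (4.2)-(4.3): concatenated, normalized
    marginal distributions of the filter responses."""
    hists = []
    for f in filters:
        # Embed the kernel in an image-sized array and convolve circularly.
        pad = np.zeros(image.shape)
        fh, fw = f.shape
        pad[:fh, :fw] = f
        response = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(pad)))
        h, _ = np.histogram(response, bins=n_bins, range=value_range)
        hists.append(h / h.sum())  # marginal distribution H_I^(alpha)
    return np.concatenate(hists)

def l1_distance(h1, h2):
    # Special case (p = 1) of the L_p-norm distance of Eq. (4.4).
    return np.abs(h1 - h2).sum()

def kl_divergence(h1, h2, eps=1e-10):
    # Symmetrized discrete Kullback-Leibler divergence of Eq. (4.5);
    # eps guards empty bins.
    return ((h1 - h2) * np.log((h1 + eps) / (h2 + eps))).sum()

def chi2_statistic(h1, h2, eps=1e-10):
    # Chi-square statistic of Eq. (4.6).
    return ((h1 - h2) ** 2 / (h1 + h2 + eps)).sum()
```

Because each filter's histogram is normalized, patches of different sizes (different integration scales) yield comparable feature vectors under any of the three distances.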
Figure 4.3: A texture image and its spectral histograms. (a) Input image. (b) A Gabor filter. (c) The histogram of the filter. (d) Spectral histograms of the image. There are eight filters, including the intensity filter, gradient filters Dxx and Dyy, four LoG filters with T = √2/2, 1, 2, and 4, and a Gabor filter Gcos(12, 150). There are 8 bins in the histograms of the intensity and gradient filters and 11 bins for the other filters.
4.2.1 Properties of Spectral Histograms
Images sharing the same spectral histograms define an ensemble, called the Julesz ensemble [147]. The equivalence between the Julesz ensemble and the FRAME model [149] [150] has been established [143].
As proven in [150], the true probability model of one type of texture image can be approximated by linear combinations of the marginal distributions given in spectral histograms. In other words, spectral histograms provide a set of “basis functions” for the statistical modeling of texture images.
The spectral histogram and the associated distance measure provide a unified
similarity measure for images. Because the marginal distribution is independent of
image sizes, any two image patches can be compared using the spectral histogram.
Naturally, we can define a scale space using different integration scales, which can be
used to measure the homogeneity. This will be studied in the next chapter.
Because spectral histograms are based on marginal distributions, they provide a statistical measure, and two images need not be aligned in order to be compared.
4.2.2 Choice of Filters
The filter bank consists of filters that are consistent with theories of human texture perception. Following Zhu et al. [149], we use four kinds of filters.
1. The intensity filter, which is the δ() function and captures the intensity value
at a given pixel.
2. Difference or gradient filters. We use four of them:
Dx = C · [0.0  −1.0  1.0],    Dy = C · [0.0  −1.0  1.0]^T,
Dxx = C · [−1.0  2.0  −1.0],    Dyy = C · [−1.0  2.0  −1.0]^T.
Here C is a normalization constant.
3. Laplacian of Gaussian filters:
LoG(x, y | T) = C · (x² + y² − T²) e^{−(x² + y²)/T²},   (4.7)

where C is a normalization constant and T = √2 σ determines the scale of the filter, with σ the standard deviation of the Gaussian function. These filters are referred to as LoG(T).
4. Gabor filters with both sine and cosine components:

Gabor(x, y | T, θ) = C · e^{−(1/(2T²)) (4(x cos θ + y sin θ)² + (−x sin θ + y cos θ)²)} e^{−i (2π/T)(x cos θ + y sin θ)}.   (4.8)

Here C is a normalization constant and T is a scale. These filters are referred to as Gcos(T, θ) and Gsin(T, θ).
While there may exist an optimal set for a given texture, we do not change filters
within a task. In general, we use more filters for texture synthesis, namely 56 filters.
We use around 8 filters for texture classification and content-based image retrieval
to save computation. More importantly, it seems unnecessary to use more filters for texture classification and content-based image retrieval when a relatively small integration scale is used.
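The four kinds of filters can be generated programmatically. The following Python/NumPy sketch is illustrative only; the kernel support sizes and the omission of the normalization constant C are choices made here, not specified by the text:

```python
import numpy as np

# Intensity and gradient filters are small fixed kernels (constant C omitted).
DELTA = np.array([[1.0]])
DX = np.array([[0.0, -1.0, 1.0]])
DY = DX.T
DXX = np.array([[-1.0, 2.0, -1.0]])
DYY = DXX.T

def log_filter(T, size=None):
    """Laplacian of Gaussian of Eq. (4.7); T = sqrt(2)*sigma sets the scale."""
    if size is None:
        size = 2 * int(np.ceil(2 * T)) + 1  # support size is an assumption
    r = size // 2
    x, y = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    f = (x ** 2 + y ** 2 - T ** 2) * np.exp(-(x ** 2 + y ** 2) / T ** 2)
    return f - f.mean()  # zero mean, so flat regions give zero response

def gabor_filters(T, theta_deg, size=None):
    """Cosine and sine Gabor filters of Eq. (4.8) at scale T, orientation theta."""
    if size is None:
        size = 2 * int(np.ceil(T)) + 1
    r = size // 2
    x, y = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    th = np.deg2rad(theta_deg)
    xr = x * np.cos(th) + y * np.sin(th)    # rotated coordinates
    yr = -x * np.sin(th) + y * np.cos(th)
    env = np.exp(-(4 * xr ** 2 + yr ** 2) / (2 * T ** 2))
    gcos = env * np.cos(2 * np.pi * xr / T)
    gsin = env * np.sin(2 * np.pi * xr / T)
    return gcos - gcos.mean(), gsin  # sine part has zero mean by odd symmetry
```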
4.3 Texture Synthesis
In this section we demonstrate the effectiveness of the spectral histogram in characterizing texture appearance through texture synthesis. We define a relation between two texture images using the divergence between their spectral histograms: two images are related if and only if their spectral histograms are sufficiently close. It is easy to check that this defines an equivalence relation.
Given observed feature statistics H_obs^(α), α = 1, . . . , K, which are spectral histograms computed from observed images, we define an energy function [147]

E(I) = Σ_{α=1}^K D(H^(α)(I), H_obs^(α)).   (4.9)

The corresponding Gibbs distribution is then

q(I) = (1/Z_Θ) exp(−E(I)/Θ),   (4.10)

where Θ is the temperature.
The Gibbs distribution can be sampled by a Gibbs sampler or other MCMC
algorithms. Here we use a Gibbs sampler [147] given in Figure 4.4. In Figure 4.4,
q(Isyn(v) | Isyn(−v)) is the conditional probability of pixel values at v given the rest
of the image. D is a distance measure, and the L1-norm is used for texture synthesis.
For texture synthesis, we use 56 filters:
• The intensity filter.
• 4 gradient filters.
Gibbs Sampler
Compute H_obs^(α), α = 1, . . . , K, from the observed texture images.
Initialize Isyn as any image (e.g., white noise). Θ ← Θ0.
Repeat
    Randomly pick a pixel location v in Isyn.
    Calculate q(Isyn(v) | Isyn(−v)).
    Randomly flip the pixel Isyn(v) under q(Isyn(v) | Isyn(−v)).
    Reduce Θ gradually.
Until D(H^(α)(Isyn), H_obs^(α)) ≤ ε for α = 1, 2, . . . , K.
Figure 4.4: Gibbs sampler for texture synthesis.
• 7 LoG filters with T = √2/2, 1, 2, 3, 4, 5, and 6.
• 36 Cosine Gabor filters with T = 2, 4, 6, 8, 10, and 12, and six orientations θ = 0, 30, 60, 90, 120, and 150 at each scale.
• 6 Sine Gabor filters with T = 2, 4, 6, 8, 10, and 12, and one orientation θ = 45 at each scale.
• 2 Sine Gabor filters with T = 2 and 12 and one orientation θ = 60.
Those filters are chosen primarily because they are used by Zhu et al [149]. The
cooling schedule is fixed for all the experiments shown in this section.
Figure 4.5(a) shows a texture image. Figure 4.5(b) shows a white noise image,
which is used as the initial image. After 14 sweeps, the noise image is transformed
gradually to the image shown in Figure 4.5(c) by matching the spectral histograms of the two images. Figure 4.5(d) shows the L1-norm distance between the spectral histograms of the observed and synthesized images with respect to the number of sweeps. The matched error decreases at an exponential rate, demonstrating that the synthesis algorithm is computationally efficient. One can see that the synthesized image shown in Figure 4.5(c) is perceptually similar to the observed image. By matching the spectral histogram, the synthesized image captures the textural elements and their arrangement and gives a similar perceptual appearance.
Figure 4.6 shows the temporal evolution of a selected Gabor filter. Figure 4.6(a) shows the filter, which is truncated to save computation. Figure 4.6(b) shows the histogram of the filter at different sweeps. Figure 4.6(c) shows the matching error of the filter. Figure 4.6(d) shows the difference between the histograms of the observed and synthesized images, which is multiplied by 1000 for display purposes. The largest error among the bins is less than 0.0003.
Figure 4.7 shows three more texture images and the synthesized images from the
algorithm. The synthesized images are perceptually similar to the observed images
and their spectral histograms match closely. In Figure 4.7(b), due to local minima,
there are local regions which are not reproduced well.
Figure 4.8 shows two texture examples with regular patterns. The texture image
in Figure 4.8(a) shows a very regular leather surface. The synthesized image after
20 sweeps shown in Figure 4.8(a) is perceptually similar to the input texture. But
the regularity of patterns is blurred and each element is not as clear as in the input
image. However, the two images give quite similar percepts and the synthesized im-
age captures the essential arrangement of patterns and the prominent edges in the
input. Figure 4.8(b) shows an example where vertical long-range arrangements are prominent. While the synthesized image captures the local vertical arrangements,
Figure 4.5: Texture image synthesis by matching observed statistics. (a) Observed texture image. (b) Initial image. (c) Synthesized image after 14 sweeps. (d) The total matched error with respect to sweeps.
Figure 4.6: Temporal evolution of a selected filter for texture synthesis. (a) A Gabor filter. (b) The histograms of the Gabor filter. Dotted line: observed histogram, which is covered by the histogram after 14 sweeps; dashed line: initial histogram; dash-dotted line: histogram after 2 sweeps; solid line: histogram after 14 sweeps. (c) The error of the chosen filter with respect to the sweeps. (d) The error between the observed histogram and the synthesized one after 14 sweeps. Here the error is multiplied by 1000.
Figure 4.7: More texture synthesis examples. The left column shows the observed images and the right column shows the synthesized images within 15 sweeps. In (b), due to local minima, there are local regions which are not perceptually similar to the observed image.
it does not sufficiently capture the long-range arrangements because the synthesis algorithm is purely local, so long-range couplings are almost impossible to capture.
While the texture images shown above are homogeneous textures, Figure 4.9(a) shows an intensity image consisting of several regions. Figure 4.9(c) shows the synthesized image after 100 sweeps. While the spectral histogram does not capture the spatial relationships between different regions, large regions with similar gray values emerge during the temporal evolution. Due to the inhomogeneity, the Gibbs sampler converges more slowly than for homogeneous texture images, as shown in Figure 4.5(d).
Figure 4.10 shows a synthesized image for a synthetic texton image. In order to
synthesize a similar texton, Zhu et al [149] [150] used a texton filter which is the
template of one texton element. Here we use the same filters as for other images. As
shown in Figure 4.10(b), the texton elements are reproduced well except in two small regions where the MCMC sampler is trapped in a local minimum. This example clearly
demonstrates that spectral histograms provide a generic feature for different types of
textures, eliminating the need for ad hoc features for a particular set of textures.
Figure 4.11 shows a synthetic example where there are two distinctive regions in
the original. As shown in Figure 4.11(b), the synthesized image captures the appear-
ance of both regions using the spectral histogram. Here the boundary between two
regions is not reproduced because spectral histograms do not incorporate geometric
constraints. Using some geometric constraints, the boundary may be reproduced well,
which would give a more powerful feature for images consisting of different regions.
Figure 4.12 shows an interesting result for a face image. While all the “elements” are
Figure 4.8: Real texture images of regular patterns with synthesized images after 20 sweeps. (a) An image of a leather surface. The total matched error after 20 sweeps is 0.082. (b) An image of a pressed calf leather surface. The total matched error after 20 sweeps is 0.064.
Figure 4.9: Texture synthesis for an image with different regions. (a) The observed texture image; it is not a homogeneous texture image and consists mainly of two homogeneous regions. (b) The initial image. (c) Synthesized image after 100 sweeps. Even though the spectral histogram of each filter is matched well, the error remains large compared to other images; for the intensity filter in particular, the error is still about 7.44%. The synthesized image is perceptually similar to the observed image except for the geometrical relationships among the homogeneous regions. (d) The matched error with respect to the sweeps. Because the observed image is not homogeneous, the synthesis algorithm converges more slowly than in Figure 4.5(d).
Figure 4.10: A synthesis example for a synthetic texton image. (a) The original synthetic texton image with size 128 × 128. (b) The synthesized image with size 256 × 256.
captured in the synthesized image, the result is not meaningful unless some geometric
constraints are incorporated. This will be investigated further in the future.
To evaluate the synthesis algorithm more systematically, we have applied it to all
the 40 texture images shown in Figure 4.16. To save space, the reduced images are
shown in Figure 4.13. These examples clearly demonstrate that spectral histograms
capture texture appearance well.
4.3.1 Comparison with Heeger and Bergen’s Algorithm
As pointed out by Zhu et al. [147], Heeger and Bergen’s algorithm does not actually match any statistical features defined on the input image. In contrast, the synthesis algorithm described above characterizes texture appearance explicitly through the spectral histogram of the observed image(s), as demonstrated using real texture images. One critical difference is that our proposed algorithm provides a statistical model of the observed image(s), in that the algorithm only needs to know
Figure 4.11: A synthesis example for an image consisting of two regions. (a) The original synthetic image with size 128 × 128, consisting of two intensity regions. (b) The synthesized image with size 256 × 256.
Figure 4.12: A synthesis example for a face image. (a) Lena image with size 347 × 334. (b) The synthesized image with size 256 × 256.
Fabric-0 Fabric-4 Fabric-7 Fabric-9 Fabric-15 Fabric-17 Fabric-18
Food-0 Food-5 Leaves-3 Leaves-8 Leaves-12 Leaves-13 Metal-0
Metal-2 Misc-0 Misc-2 Stone-5 Water-1 Water-2 Water-6
Water-8 Beachsand-2 Calfleath-1 Calfleath-2 Grass-1 Grass-7 Grave-5
Hexholes-2 Pigskin-1 Pigskin-2 Plasticbubs-13 Raffia-1 Raffia-2 Roughwall-5
Sand-1 Woodgrain-1 Woodgrain-2 Woolencloth-1 Woolencloth-2
Figure 4.13: The synthesized images of the 40 texture images shown in Figure 4.16. The same filters and cooling schedule are used for all the images.
           Observed         Black            Gray             White
Error      L1-norm  RMS     L1-norm  RMS     L1-norm  RMS     L1-norm  RMS
Observed   0.00 %   0.00    0.10 %   54.2    0.10 %   54.2    0.11 %   55.4
Black      0.10 %   54.2    0.00 %   0.00    0.11 %   54.6    0.11 %   60.8
Gray       0.10 %   54.1    0.11 %   54.6    0.00 %   0.00    0.12 %   50.3
White      0.11 %   55.4    0.11 %   60.8    0.12 %   50.3    0.00 %   0.00

Table 4.1: L1-norm distance of the spectral histograms and RMS distance between images.
the spectral histogram of the input image(s) and does not need the input images themselves, while Heeger and Bergen’s algorithm requires the input image.
For comparison, we use the texture image shown in Figure 4.3(a). For the pro-
posed algorithm, we synthesize images starting with different initial images, as shown
in Figure 4.14. One can see that different initial images are transformed into percep-
tually similar images by matching the spectral histogram, where the matching error
is shown in Figure 4.14(d). Table 4.1 shows the L1-norm distance of the histograms of
the observed and synthesized images and the Root-Mean-Square (RMS) distance of
the corresponding images. From Table 4.1, one can see clearly that even though the
rooted mean square distance is large between the observed and synthesized images
from different initial conditions, the corresponding L1-norm distance between their
spectral histograms is quite small. Given that the synthesized images are perceptu-
ally similar to the observed image, we conclude that spectral histograms provide a
statistic feature to characterize texture appearance.
For the same input image, we have also applied the algorithm by Heeger and
Bergen [45]. The implementation used here is by El-Maraghi [28]. Figures 4.15(a)-(c) show the synthesized images after different numbers of iterations. Compared to
Figure 4.14: Synthesized images from different initial images for the texture image shown in Figure 4.3(a). (a)-(c) The left column is the initial image and the right column is the synthesized image after 20 sweeps. (d) The matched error with respect to the number of sweeps.
the synthesized images from our method shown in Figure 4.14(a)-(c), one can easily see that the synthesized images in Figure 4.14(a)-(c) are perceptually similar to the input texture image and capture the texture elements and their distributions, while the synthesized images in Figure 4.15(a)-(c) are not perceptually similar to the input texture. As shown in Figure 4.15(d), Heeger and Bergen’s algorithm does not match the statistical features on which it is based, and the error does not decrease after one iteration. Because Heeger and Bergen’s algorithm uses a Laplacian of Gaussian pyramid, we choose LoG and local filters for the spectral histogram here for a fair comparison.
4.4 Texture Classification
Texture classification is closely related to texture segmentation and content-based image retrieval. Here we demonstrate the discriminative power of spectral histograms. A texture image database is given first, and we extract a spectral histogram for each image in the database. Classification then amounts to classifying all pixels, or selected pixels, of an input image.
Given a database with M texture images, we represent each image m by its average spectral histogram H_obs_m at a given integration scale. We use a minimum-distance classifier, given by

m*(v) = arg min_m D(H_I(v), H_obs_m).   (4.11)

Here D is a similarity measure, and the χ²-statistic is used.
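Given precomputed spectral histograms, the classifier of Eq. (4.11) is only a few lines. The Python/NumPy sketch below is an illustrative assumption, not the experimental code, and uses the χ²-statistic of Eq. (4.6) as D:

```python
import numpy as np

def chi2(h1, h2, eps=1e-10):
    # Chi-square statistic of Eq. (4.6); eps guards empty bins.
    return ((h1 - h2) ** 2 / (h1 + h2 + eps)).sum()

def classify(h_pixel, model_hists):
    """Minimum-distance classifier of Eq. (4.11): return the index m* of the
    model whose average spectral histogram is closest to h_pixel."""
    distances = [chi2(h_pixel, h_m) for h_m in model_hists]
    return int(np.argmin(distances))
```

Each location is classified independently of the others: the spectral histogram extracted around a pixel is simply compared against the M stored average histograms.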
We use a texture image database available on-line at http://www-dbv.cs.uni-bonn.de/image/texture.tar.gz. As shown in Figure 4.16, the database we use consists of 40 texture images from the Brodatz album [10].
Figure 4.15: Synthesized images from Heeger and Bergen’s algorithm and the matched spectral histogram error for the image shown in Figure 4.3(a). (a) Synthesized image at 3 iterations. (b) Synthesized image at 10 iterations. (c) Synthesized image at 100 iterations. (d) The L1-norm error between the observed spectral histogram and the synthesized one.
Fabric-0 Fabric-4 Fabric-7 Fabric-9 Fabric-15 Fabric-17 Fabric-18
Food-0 Food-5 Leaves-3 Leaves-8 Leaves-12 Leaves-13 Metal-0
Metal-2 Misc-0 Misc-2 Stone-5 Water-1 Water-2 Water-6
Water-8 Beachsand-2 Calfleath-1 Calfleath-2 Grass-1 Grass-7 Grave-5
Hexholes-2 Pigskin-1 Pigskin-2 Plasticbubs-13 Raffia-1 Raffia-2 Roughwall-5
Sand-1 Woodgrain-1 Woodgrain-2 Woolencloth-1 Woolencloth-2
Figure 4.16: Forty texture images used in the classification experiments. The input image size is 256 × 256.
4.4.1 Classification at Fixed Scales
First we study the classification performance using the spectral histogram at a
given integration scale. As discussed above, the classification algorithm is a minimum
distance classifier and each location is classified independently from other locations.
For the database shown in Figure 4.16, we use a window of 35×35 pixels. For each
image in the database, we extract an average spectral histogram by averaging spectral
histograms sampled on a coarse grid to save computation. For classification and
content-based image retrieval, we use 8 filters to compute spectral histograms:
• The intensity filter.
• Two gradient filters Dxx and Dyy.
• Two LoG filters with T = 2 and 5.
• Three Cosine Gabor filters with T = 6 and three orientations θ = 30, 90, and
150.
Figure 4.17(a) shows the pairwise divergences between the feature vectors of the images numerically, and Figure 4.17(b) shows the divergences in (a) as an image for easier interpretation.
Each texture is divided into non-overlapping image patches with size 35 × 35.
Each image patch is classified into one of the 40 given classes using the minimum
distance classifier. Figure 4.18 shows the error rate for each image along with variance
within each image and minimum divergence of each image from the other images
in the database. The overall classification error is 4.2347%. As shown in Figure
4.18(a), the error is not evenly distributed. Five images, namely “Fabric-15”, “Leaves-
13”, “Water-1”, “Water-2”, and “Woolencloth-2” account for more than half of the
misclassified cases, with an average error of 19.18%. For the other 35 images, the
classification error is 2.10%.
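The evaluation protocol above (non-overlapping 35 × 35 patches, each assigned to the nearest model under the χ²-statistic) can be sketched as follows. This Python/NumPy sketch is an illustrative assumption; in particular, `feature_fn` stands in for the spectral histogram computation, and the toy test textures are invented for demonstration:

```python
import numpy as np

def patchwise_error(image, label, models, feature_fn, patch=35):
    """Split one texture image into non-overlapping patches, classify each
    patch by the chi-square distance to the model histograms, and return
    the fraction of misclassified patches."""
    H, W = image.shape
    wrong = total = 0
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            h = feature_fn(image[i:i + patch, j:j + patch])
            dist = {c: ((h - m) ** 2 / (h + m + 1e-10)).sum()
                    for c, m in models.items()}
            if min(dist, key=dist.get) != label:
                wrong += 1
            total += 1
    return wrong / total
```

Averaging this error over all images in the database gives the overall classification error reported above.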
There are two major reasons for the large error rates of the five images. First, with respect to the given scale, which is 35 × 35, these images are not homogeneous. Figure
4.18(b) shows the variance of the spectral histogram of images. Second, the image
is similar to some other images in the database. This is measured by the minimum
divergence between the feature vector of an image and the feature vector of the other
images. Figure 4.18(c) shows the minimum divergence. As shown clearly, the five
images with large classification error tend to have large variance and small minimum
divergence. The dotted curve in Figure 4.18(a) shows the ratio between the variance
and the minimum divergence for each image; the peaks in the classification error
curve tend to coincide with the peaks in the ratio curve.
4.4.2 Classification at Different Scales
We also study the classification error with respect to different scales. We use 8
different integration scales in this experiment: 5× 5, 9× 9, 15× 15, 23× 23, 35× 35,
65 × 65, 119 × 119, and 217 × 217 in pixels. The classification algorithm and the
procedure are the same as described for the fixed scale.
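The per-scale evaluation can be sketched as follows; `classify` is a hypothetical callable mapping a patch to a label, standing in for the minimum distance classifier.

```python
import numpy as np

def error_rate_at_scale(image, true_label, scale, classify):
    """Tile the image into non-overlapping scale-by-scale patches and
    report the fraction of patches that `classify` gets wrong.  The
    `classify` callable is a stand-in for the minimum distance classifier."""
    h, w = image.shape
    wrong = total = 0
    for r in range(0, h - scale + 1, scale):
        for c in range(0, w - scale + 1, scale):
            total += 1
            if classify(image[r:r + scale, c:c + scale]) != true_label:
                wrong += 1
    return wrong / total
```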
The overall classification error is shown in Figure 4.19. As expected, the classifi-
cation error decreases when the scale increases. For spectral histograms, the classifi-
cation error decreases approximately at an exponential rate, demonstrating that
spectral histograms are very effective at capturing texture characteristics.
[The 40 × 40 matrix of numerical cross-divergence values, panel (a), and its rendering as an image, panel (b), appear here.]
Figure 4.17: The divergence between the feature vectors of the images in the texture image database shown in Figure 4.16. (a) The cross-divergence matrix shown in numerical values. (b) The numerical values displayed as an image.
Figure 4.18: (a) The classification error for each image in the texture database along with the ratio between the maximum and minimum divergence shown in (b) and (c) respectively. (b) The maximum divergence of the spectral histogram from the feature vector of each image. (c) The minimum divergence between each image and the other ones.
Figure 4.19: The classification error of the texture database with respect to the scale for feature extraction.
As one would expect, the classification error with respect to scale varies consid-
erably across images. For example, image “Hexholes-2”, shown in Figure 4.20(a), is very
homogeneous and visually very different from the other images in the database. For this
image, the classification result is good at all the scales used in the experiment: even
with a window of 5 × 5 pixels, the classification error is only 1.61%, and at all other
scales the classification is 100% correct, as shown in Figure 4.20(b).
Figure 4.21(a) shows image “Woolencloth-2” from the database. This image is
not homogeneous and is perceptually similar to some other images. For this image, the
classification error is large when the scale is small: at scale 35 × 35, the
classification error is 20.41%. When the scale is larger than 35 × 35, the classification
result is perfect, as shown in Figure 4.21(b).
4.4.3 Image Classification
In this section, we classify images using the database. Each pixel in the image
is classified using the minimum distance classifier. A window centered at a given
Figure 4.20: (a) Image “Hexholes-2” from the texture database. (b) The classification error rate for the image. (c) The ratio between maximum divergence and minimum cross divergence with respect to scales.
Figure 4.21: (a) Image “Woolencloth-2” from the texture database. (b) The classification error rate for the image. (c) The ratio between maximum divergence and minimum cross divergence with respect to scales.
pixel is used. For pixels near image borders, a window that lies inside the image and
contains the pixel is used instead. The input texture image, shown in Figure 4.22(a),
consists of five texture regions from the database. Figure 4.22(b) shows the classification
result, and Figure 4.22(c) shows the divergence between the spectral histogram and the
feature vector of the assigned texture image. Figure 4.22(d) shows the ground-truth
segmentation, and Figure 4.22(e) shows the misclassified pixels in black. One can see
that the interior pixels of each homogeneous texture region are classified correctly and
the divergence at these pixels is small. Only pixels near boundaries between texture
regions are misclassified, because the spectral histogram computed there is a mixture
of different texture regions; at those pixels, the divergence is large. This demonstrates
that the proposed spectral histogram is a reliable similarity/dissimilarity measure
between texture images. Furthermore, the divergence also provides a reliable measure
of the goodness of the classification. The classification result can be improved by
incorporating context-sensitive feature detectors, as discussed in Section 5.6.
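The border handling described above, shifting the window so it stays inside the image while still containing the pixel, can be sketched as:

```python
def clamped_window(shape, center, size):
    """Slices of a size-by-size window centered at `center`, shifted so the
    window lies fully inside an image of the given shape while still
    containing the pixel -- the border handling described in the text."""
    h, w = shape
    r, c = center
    half = size // 2
    top = min(max(r - half, 0), h - size)
    left = min(max(c - half, 0), w - size)
    return slice(top, top + size), slice(left, left + size)
```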
4.4.4 Training Samples and Generalization
For the classification results shown in this section, training samples are not clearly
separated from testing samples because of the limited number of samples, especially
at large integration scales. Through experiments, we demonstrate that the number
of training samples is not critical when spectral histograms are used.
First we redo some of the classification experiments, using half of the samples for
training and testing on the remaining half. Figure 4.23(a)
shows the classification error for each image at integration scale 35 × 35 and Figure
Figure 4.22: (a) A texture image consisting of five texture regions from the texture database. (b) Classification result using spectral histograms. (c) Divergence between spectral histograms and the feature vector of the assigned texture image. (d) The ground truth segmentation of the image. (e) Misclassified pixels, shown in black.
Figure 4.23: (a) The classification error for each image in the database at integration scale 35×35. (b) The classification error at different integration scales. In both cases, solid line – training using half of the samples; dashed line – training using all the samples.
4.23(b) shows the overall error at different integration scales along with the results
shown before. While the classification error varies from image to image using different
training samples, the overall classification error does not change much.
We also examine the influence of the ratio of testing samples to training samples.
Figure 4.24 shows the classification error with respect to the ratio of testing samples
to training samples at integration scales 23 × 23 and 35 × 35. It demonstrates that
the spectral histogram captures texture characteristics well from a small number of
training samples.
4.4.5 Comparison with Existing Approaches
Recently, Randen and Husoy conducted an extensive comparative study of texture clas-
sification using different filtering-based methods [108]. We have applied our method
Figure 4.24: The classification error with respect to the ratio of testing samples to training samples. Solid line – integration scale 35×35; dashed line – integration scale 23×23.
to the same images. We use integration scale 35 × 35 and the same 8 filters as in
Section 4.4.1, with one third of the samples for training and the remaining two thirds
for testing. For the other methods, we show the average performance and the best
performance from Tables 3, 6, 8, and 9 in [108]. The results for two groups of texture images are
shown in Table 4.2.
The first group consists of 10 texture images, shown in Figure 4.25. In
this group, each image is visually different from the others. Our method is signifi-
cantly better than the best performance reported in [108]. The second group, shown in
Figure 4.26, is very challenging for filtering methods due to the inhomogeneity within
each texture region and the similarity among different textures. For all the methods
in [108], the performance is close to a random decision. Our method, however, gives
a 17.5% classification error, dramatically improving the classification performance.
Texture group    Existing methods in [108]        Proposed method
                 Average         Best
Figure 4.25      47.9 %          32.3 %           9.7 %
Figure 4.26      89.0 %          84.9 %           17.5 %

Table 4.2: Classification errors of the methods shown in [108] and our method
This comparison clearly suggests that classification based directly on filtering outputs
is not sufficient to characterize texture appearance, and that an integration step after
filtering is needed; Malik and Perona similarly pointed out that a nonlinearity after
filtering is necessary for texture discrimination [87]. This comparison, along with the
results on texture synthesis, strongly indicates that the spectral histogram is necessary
to capture texture appearance.
4.5 Content-based Image Retrieval
Content-based image retrieval is closely related to image classification and segmen-
tation. Given some desired feature statistics, one would like to find all the images in a
given database that are similar to the given feature statistics. For example, one can use
the intensity histogram and other easily computed statistics to find images efficiently.
Because such heuristic features do not model images sufficiently, it is not possible to
provide any theoretical justification for the results. As shown in the previous sections
on texture synthesis and classification, spectral histograms provide a statistically
sufficient model for natural homogeneous textures. In this section, we demonstrate how
the spectral histogram and the associated similarity measure can be used to retrieve
images that contain perceptually similar regions.
For content-based image retrieval, we use a database consisting of 100 texture
images, each of which is composed of five texture regions from the texture image
database used in the classification experiments. For a given image patch, we extract
its spectral histogram as its footprint, using the same filters as for classification.
Then we identify the best match within each texture image in the database. To save
computation, we only compute spectral histograms on a coarsely sampled grid.
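The retrieval step can be sketched as below; the dictionary layout of per-image grid features is an illustrative assumption, not the original implementation.

```python
import numpy as np

def retrieve(query_feat, database_feats, distance):
    """Rank database images by matched error: for each image, take the
    minimum divergence between the query patch feature and the features
    computed on its coarse grid.  The dict-of-feature-lists layout is an
    illustrative assumption."""
    scores = {img_id: min(distance(query_feat, f) for f in grid_feats)
              for img_id, grid_feats in database_feats.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])
```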
Figure 4.27 shows an example. Figure 4.27(a) shows an input image patch of
size 35 × 35 pixels. Figure 4.27(b) shows the minimum divergence between the
given patch and each image in the database; a step edge is clearly visible. The
images whose matched error is smaller than the threshold given by the step edge
all actually contain the input patch. Figure 4.27(c) shows the first nine images
with their matched errors. The matched errors of the first eight images are much smaller
than that of the ninth. This demonstrates that the spectral histogram characterizes
texture appearance very well for homogeneous textures.
Figure 4.28 shows another example. Here the texture is not very homogeneous
and consists of regular texture patterns. Figure 4.28(a) shows the input image patch
of size 53 × 53 pixels. Figure 4.28(b) shows the matched error of the 100
images in the database. While the edge is not a step edge, a ramp edge can be seen
clearly. As in Figure 4.27, this ramp edge defines a threshold for images that actually
contain patches perceptually similar to the given patch. Figure 4.28(c) shows
the first 12 images with the smallest matched errors. All of them are correctly retrieved.
From these two examples, we can see that a fixed threshold might be chosen; in both
cases, the threshold is around 0.25. Because the χ2-statistic obeys a χ2 distribution, a
The matched errors of the nine retrieved images in panel (c): 0.073306, 0.088377, 0.091331, 0.092707, 0.092707, 0.116957, 0.253079, 0.257533, and 2.476219.

Figure 4.27: Image retrieval result from a 100-image database using a given image patch based on spectral histograms. (a) Input image patch with size 35×35. (b) The sorted matched error for the 100 images in the database. (c) The first nine images with smallest errors.
The matched errors of the twelve retrieved images in panel (c):

0.053618 0.068854 0.098365 0.098365
0.131868 0.138115 0.140837 0.141649
0.149421 0.175323 0.175323 0.210802

Figure 4.28: Image retrieval result from a 100-image database using a given image patch based on spectral histograms. (a) Input image patch with size 53×53. (b) The sorted matched error for the 100 images in the database. (c) The first twelve images with smallest errors.
threshold can be computed given a level of confidence. Moreover, the algorithm is
intrinsically parallel and local; on parallel machines, the search could be implemented
very efficiently.
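Under the assumption that the matching statistic follows a χ² distribution, a confidence-based threshold can be obtained from the inverse CDF; the degrees-of-freedom choice below is a placeholder that must be matched to the histogram comparison actually used.

```python
from scipy.stats import chi2

def divergence_threshold(confidence, n_bins):
    """Threshold on the chi-square matching statistic at the given
    confidence level, assuming the statistic follows a chi-square
    distribution with n_bins - 1 degrees of freedom.  The degrees of
    freedom here are an assumption of this sketch."""
    return chi2.ppf(confidence, df=n_bins - 1)
```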
4.6 Comparison of Statistical Features
In this section, we numerically evaluate the classification performance of different
features, different filters, and different distance measures for spectral histograms.
First we compare the spectral histogram with simple statistics such as the mean
and variance. If we assume that the underlying true image is constant, the
mean value is the best choice of feature; if we use a Gaussian model to
characterize the input image, the mean and variance are the features to use.
Here we use a linear combination of the mean and variance, with weights
determined for the best performance at integration scale 35 × 35. We also use
the intensity histogram of image patches as a feature vector. Figure 4.29 shows the
classification error on the database of Section 4.4 with respect to the integration
scale.
From Figure 4.29, we see that as the model gets more sophisticated, the perfor-
mance improves. The mean value of each image patch does not provide a sufficient
feature for texture images and gives the largest classification error at all integration
scales; while its performance improves as the scale increases, the improvement
is not significant. The combination of mean and variance also does not give good
results, suggesting that the Gaussian model, although widely used in the literature, is
not adequate even for homogeneous texture images. The spectral histogram gives
the best performance at all integration scales. Compared with the mean and the Gaussian
Figure 4.29: Classification error in percentage on the texture database for different features. Solid line: spectral histogram of eight filters including intensity, gradients, LoG with two scales, and Gabor with three different orientations. Dotted line: mean value of the image patch. Dashed line: weighted sum of the mean and variance of the image patch, with weights determined to achieve the best result for window size 35×35. Dash-dotted line: intensity histogram of image patches.
model, the intensity histogram gives much better performance. This indicates that
the distribution is more important for texture discrimination and classification.
Ojala et al. [99] compared different statistical features for texture
classification using a database consisting of only nine images. Here we compare the
spectral histogram with gradient detectors, edge detectors (LoG) at different scales,
and Gabor filters at different scales and orientations. Figure 4.30 shows the result.
Compared to Figure 4.29, the differences between the feature detectors are not
large. At large integration scales, Gabor filters give the second best result, suggesting
that oriented structures are more important for large texture patches than symmetric
edge detectors. Because gradient detectors are very local, they give the worst per-
formance. Even so, compared with the intensity histogram and Gaussian model shown
Figure 4.30: Classification error in percentage on the texture database for different filters. Solid line: spectral histogram of eight filters including intensity, gradients, LoG with two scales, and Gabor with three different orientations. Dotted line: gradient filters Dxx and Dyy. Dashed line: Laplacian of Gaussian filters LoG(√2/2), LoG(1), and LoG(2). Dash-dotted line: six Cosine Gabor filters with T = 4 and six orientations θ = 0, 30, 60, 90, 120, and 150.
in Figure 4.29, it still gives a very good performance. This again confirms that the
distribution of local features is important for texture classification and segmentation.
Finally, for spectral histograms, we compare different distance measures. As given
in equations (4.4), (4.5), and (4.6), we use the L1-norm, L2-norm, Kullback-Leibler
divergence, and χ2-statistic. As shown in Figure 4.31, the differences between the
distance measures are very small, suggesting that spectral histograms do not depend
on the particular form of the distance measure. This observation differs from the
conclusion in [107], where the authors claim that the χ2-statistic consistently gives the best
performance. The reason for the difference is that our spectral histograms are
derived by integrating information from the same window at different scales, whereas
different windows are used at different scales in [107].
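The four distance measures compared here can be written compactly as below; the symmetrized form of the Kullback-Leibler divergence and the ε guard for empty bins are assumptions of this sketch, since equations (4.4)-(4.6) are defined elsewhere in the chapter.

```python
import numpy as np

EPS = 1e-10  # guard against empty histogram bins

def l1(p, q):
    return float(np.abs(p - q).sum())

def l2(p, q):
    return float(np.sqrt(((p - q) ** 2).sum()))

def kl_sym(p, q):
    # Symmetrized Kullback-Leibler divergence (the symmetrization is an
    # assumption; the chapter's equation may use a one-sided form).
    return float(np.sum(p * np.log((p + EPS) / (q + EPS))
                        + q * np.log((q + EPS) / (p + EPS))))

def chi2_stat(p, q):
    return float(np.sum((p - q) ** 2 / (p + q + EPS)))
```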
Figure 4.31: Classification error in percentage on the texture database for different distance measures. Solid line: χ2-statistic. Dotted line: L1-norm. Dashed line: L2-norm. Dash-dotted line: Kullback-Leibler divergence.
4.7 A Model for Texture Discrimination
In this section we apply our model to texture discrimination, which has been widely
studied in psychophysical experiments [69] using synthetic texton patterns. These
texton patterns are in general not homogeneous even within one texture region, as,
for example, those shown in Figure 4.32. Our model is intended to characterize texture
regions with homogeneous appearance, so the texton patterns do not fit our
assumptions well. Nevertheless, the results from our model are consistent with existing
psychophysical data and with the data from the model by Malik and Perona [87].
We adopt procedures similar to those used by Malik and Perona [87]. Instead of
their 192 filter pairs, we use two gradient filters Dxx and Dyy and three LoG filters
with T = √2/2, 1, and 2, for a total of five filters. At each pixel, we extract local
spectral histograms at integration scale 29 × 29, and the gradient is the χ2
(+ O) (+ []) (L +) (L M)
(∆ →) (+ T) (+ X) (T L)
(LL ML) (R-mirror-R)
Figure 4.32: Ten synthetic texture pairs scanned from Malik and Perona [87]. The size is 136 × 136.
distance between the spectral histograms of the two adjacent windows. Then the
gradient is averaged along each column.
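The column-averaged texture gradient can be sketched as below, operating on a precomputed array of local spectral histograms; the array layout is an illustrative assumption.

```python
import numpy as np

def column_gradient(features, eps=1e-10):
    """Texture gradient averaged along columns.  `features` is a
    (rows, cols, bins) array of local spectral histograms; the gradient at
    each column boundary is the chi-square distance between horizontally
    adjacent histograms, averaged over rows."""
    left, right = features[:, :-1, :], features[:, 1:, :]
    chi2 = ((left - right) ** 2 / (left + right + eps)).sum(axis=2)
    return chi2.mean(axis=0)  # one value per column boundary
```

On a synthetic two-region feature map, the resulting profile peaks at the column boundary between the regions.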
The images used in our experiment are shown in Figure 4.32, which were scanned
from Malik and Perona [87]. The texture gradients for selected texture pairs (+ O)
and (R-mirror-R) from our method are shown in Figure 4.33 (b) and (d).
There are several observations that can be made from the gradient results shown
in Figure 4.33. First, the texture pattern does not give rise to a homogeneous texture
Figure 4.33: The averaged texture gradient for selected texture pairs. (a) The texture pair (+ O) as shown in Figure 4.32. (b) The texture gradient averaged along each column for (a); the horizontal axis is the column number and the vertical axis is the gradient. (c) The texture pair (R-mirror-R). (d) The averaged texture gradient for (c).
126
region, and variations within each texture region are clearly perceivable. For example,
in the texture pair (+ O), we perceive columnar structures in addition to the major
texture boundary. Second, because of the inhomogeneity, the absolute value of the texture
gradient should not be used directly as a measure of texture discrimination, as
was actually done by Malik and Perona [87]. As shown in Figure 4.33(d), even though
the gradient is much weaker than in Figure 4.33(b), the filters still respond to
the texture variations, which is also evident in Malik and Perona [87]. However, the
texture boundary is not perceived.
Based on the above observations, we propose as a texture discrimination measure
the difference between the central peak and the maximum of the adjacent side peaks. In
the (+ O) case, the central peak is 104.2320, and the left and right side peaks are
45.4100 and 34.6510 respectively, so the discrimination measure is 58.8220. For the
(R-mirror-R) case, the central peak is 7.2310 and the left and right side peaks are
5.3580 and 11.4680 respectively, so the measure is -4.2370, indicating that the
two texture regions are not discriminable at all.
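The discrimination measure can be sketched as below; reading the central peak as the middle profile value and the side peaks as the maxima of the two halves is a simplification of the peak-finding implied in the text.

```python
import numpy as np

def discrimination_measure(profile):
    """Central peak of an averaged texture-gradient profile minus the
    larger of the two side peaks.  Taking the central peak as the middle
    value and the side peaks as the maxima of the two halves is an
    assumption of this sketch."""
    profile = np.asarray(profile, dtype=float)
    mid = len(profile) // 2
    side = max(profile[:mid].max(), profile[mid + 1:].max())
    return float(profile[mid] - side)
```

Applied to the peak values quoted above, it reproduces the measures 58.822 for (+ O) and -4.237 for (R-mirror-R).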
We calculate the proposed discrimination measure for the ten texture pairs. Table
4.3 shows the psychophysical data from [69], the data from Malik and Perona’s model
[87], and the proposed measure. The data from [69] are actually based on the
converted data shown in [87]. Figure 4.34 shows the data, linearly scaled
so that the measures for (+ []) match. Our measure is clearly consistent
with the other two except for the texture pair (L +). It indicates that
(+ O) is much easier to discriminate than the other pairs, that the pair (LL, ML)
is barely discriminable with a measure of 0.2080, and that the pair (R-mirror-R) is not
discriminable with a measure of -4.2370.
Texture discriminability
Texture pair     Data [69]          Data [87]    Our data
(+ O)            100 (saturated)    207          58.822
(+ [])           88.1               225          15.518
(L +)            68.6               203          6.657
(L M)            not available      165          10.008
(∆ →)            52.3               159          7.838
(+ T)            37.6               120          6.670
(+ X)            30.3               104          6.004
(T L)            30.6               90           1.539
(LL, ML)         not available      85           0.208
(R-mirror-R)     not available      50           -4.237
Table 4.3: Comparison of texture discrimination measures
Figure 4.34: Comparison of texture discrimination measures. Dashed line – psychophysical data from Krose [69]; dotted line – prediction of Malik and Perona’s model [87]; solid line – prediction of the proposed model based on spectral histograms.
The proposed texture discrimination measure offers a potential advantage over
Malik and Perona’s model [87]. While their model cannot account for the asymmetry
that exists in human texture perception [140], our model potentially can. In general,
the discrimination of a more variable texture in a field of more regular
texture is stronger than vice versa; for example, the perception of gapped circles in
a field of closed circles is stronger than the perception of closed circles in a field of
gapped circles [140]. Under our discrimination measure, the asymmetry arises because
the side peak of a regular texture is weaker than the side peak of a variable
texture, resulting in stronger discrimination. This needs to be investigated further.
4.8 Conclusions
In this chapter, we propose spectral histograms as a generic feature vector for natural
images and define an associated similarity measure. This provides a generic non-parametric
similarity measure for images. We demonstrate the texture characterization and dis-
crimination capabilities of spectral histograms through image synthesis, classification,
content-based image retrieval, and texture discrimination. As shown in Figure 4.30,
the particular forms of the filters are not important, but the distributions
are critical. We also demonstrate that the performance of spectral histograms
largely does not depend on the particular form of the distance measure, and that the
spectral histogram gives much better performance than simple statistical features such
as the mean and variance.
The classification and other results could be improved, perhaps significantly, if more
sophisticated similarity measures and classification algorithms were used.
The distance measures defined in equations (4.4), (4.5), and (4.6) are simply equally
weighted summations over all the histogram bins. If many samples are available for
one texture type, the homogeneity measure can be refined by using different weights for
different filters and even for individual bins. Those weights can be learned from training
samples, as demonstrated by Zhu et al. [149].
While spectral histograms can capture the inhomogeneity of texture images, as shown
in the texture synthesis examples, they are not able to capture geometric relationships
among texture elements when those relationships span distances longer than the filters used.
Modeling the geometric relationships may be even more important for certain textures,
such as textures consisting of regular patterns of similar elements. A level beyond
this filtering needs to be incorporated, or a different module might be needed. One
possibility is to allow short-range coupling instead of nearest-neighbor coupling to
overcome the inhomogeneity present in the image. These issues need to be
investigated further.
CHAPTER 5
IMAGE SEGMENTATION USING SPECTRAL
HISTOGRAMS
Image segmentation is one of the most fundamental problems in computer vision and
image understanding. In this chapter, we propose a new energy functional for image
segmentation which explicitly requires a feature and a distance measure for each seg-
mented region. Using the spectral histogram proposed in Chapter 4, we derive
an iterative and deterministic approximation algorithm for segmentation, and apply
the algorithm to segmenting images under different assumptions.
5.1 Introduction
Image segmentation is a central problem in computer vision, and the perfor-
mance of high-level modules such as recognition depends critically on segmentation
results. Roughly, segmentation can be defined as a constrained partition problem:
each partition, or region, should be as homogeneous as possible, and neighboring par-
titions should be as different as possible. Two computational issues follow
from this definition. The first is to define the features to be used and the associated
similarity/dissimilarity measure. Given that measure, the sec-
ond is to design a computational procedure that derives a solution for a given input
image.
To address the first issue, one needs to define a feature, which can be a scalar or
a vector, and a distance measure in the feature space. In the literature, this issue is
not well addressed. For intensity images, the mean and variance are most widely
used, reflecting the Gaussian assumption behind many algorithms
[152] [89] [88] [13] [94] [151]. For texture images, many textural features have been
proposed [44] [43] [61] [18] [65] [21] [127] [107], but these features are in general
justified only through experiments on selected texture images. Compared with existing
features, spectral histograms provide a well-justified feature vector; as demonstrated
in the previous chapter, the spectral histogram can effectively capture texture
appearance and has been successfully applied to image modeling and classification.
The second issue has been studied extensively in the literature. Existing methods
can be classified roughly into local algorithms and optimization of a global criterion.
Local algorithms include edge detectors and region growing [152] [89] [88] [13]. While
good experimental results have been obtained, the major difficulty of local algorithms
is that the segmented regions are not guaranteed to be homogeneous; in region growing,
for example, two different regions may be merged due to noise and local variations.
Optimization approaches can be further divided into local and global ones based on
the segmentation criterion. If the criterion consists only of local terms, as in pair-wise
classification, these approaches may suffer from the same problem as local algorithms,
and the results may depend critically on parameter values such as the number of regions.
For global optimization, the central problem is to derive an efficient algorithm
that achieves a good solution despite the high dimensionality of the potential solution
space. In this regard, the Mumford-Shah energy functional for segmentation [94], given
in (4.1), is representative in that most existing segmentation algorithms are special
cases of it [92]. As pointed out earlier, its underlying assumption is a space of
piecewise smooth images. However, to yield a segmentation, the solution f must be
piecewise constant by the definition of segmentation. In this case, the energy
functional becomes [94]:
E(\Gamma) = \sum_i \iint_{R_i} \big(g(x, y) - \mathrm{mean}_{R_i}(g)\big)^2 \, dx\, dy + \nu |\Gamma|, \qquad (5.1)

where

\mathrm{mean}_{R_i}(g) = \frac{1}{|R_i|} \iint_{R_i} g(x, y)\, dx\, dy.
Here |Ri| is the area of region Ri. The energy functional (5.1) has several limitations.
The feature used is limited to the mean value of a region, which is not
sufficient for characterizing regions, as demonstrated in the previous chapter, and
may give undesirable solutions when the mean values of two regions are very close.
Another problem is that the result obtained by minimizing the energy functional is
unpredictable in regions which cannot be described by their mean values. Some of these
problems are resolved by using a Gaussian model in the Region Competition algorithm
of Zhu and Yuille [151].
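To make the special case concrete, the piecewise-constant energy (5.1) can be evaluated directly for a candidate label map. The following sketch is an illustration under our own assumptions, not code from this work; it approximates the boundary length |Γ| by counting label changes between 4-neighbors:

```python
import numpy as np

def mumford_shah_pc_energy(g, labels, nu=1.0):
    """Piecewise-constant Mumford-Shah energy of Eq. (5.1).

    g      : 2-D intensity image.
    labels : 2-D integer label map defining the regions R_i.
    nu     : weight of the boundary-length term |Gamma|.
    """
    energy = 0.0
    for i in np.unique(labels):
        region = g[labels == i]
        # fit term: squared deviation from the region mean
        energy += np.sum((region - region.mean()) ** 2)
    # approximate |Gamma| by counting label changes between 4-neighbors
    boundary = np.sum(labels[:, 1:] != labels[:, :-1]) + \
               np.sum(labels[1:, :] != labels[:-1, :])
    return energy + nu * boundary
```

With this measure, the correct two-region labeling of a step image scores lower than a single-region labeling, as minimization requires.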
In this chapter, we extend the model (5.1) using the spectral histogram and the
associated distance measure. We develop an algorithm which couples the feature de-
tection and segmentation steps together by extracting features based on the currently
available segmentation result. We also develop an algorithm which identifies regional
features in homogeneous texture regions automatically.
In Section 5.2 we give a formulation of our segmentation energy functional. Sec-
tion 5.3 describes our segmentation algorithm. Section 5.4 provides experimental
results when features for each region are given manually. Section 5.5 describes an
automated algorithm for detecting features of homogeneous texture regions. Section
5.6 proposes a method for precise texture boundary localization. Section 5.7 discusses
future research questions along this line. Section 5.8 summarizes the chapter.
5.2 Formulation of Energy Functional for Segmentation
Following the notation used in Mumford and Shah [94], let R be a grid defined
on a planar domain, Ri, i = 1, · · · , n, be disjoint subsets of R, Γi be the piece-wise
smooth boundary of Ri, and Γ be the union of Γi, i = 1, · · · , n. A feature Fi
is associated with each region Ri, i = 1, · · · , n. We also define R0, which is called
the background [135], as

R0 = R − (R1 ∪ · · · ∪ Rn).
Based on the energy functional by Mumford and Shah [94], given an input image
I, we define an energy functional for segmentation as
E(Ri, n) = λR ∑_{i=1}^{n} ∑_{(x,y)∈Ri} D(F_{Ri}(x, y), Fi) + λF ∑_{i=1}^{n} ∑_{j=1}^{n} D(Fi, Fj) + λΓ ∑_{i=1}^{n} |Γi| − ∑_{i=1}^{n} |Ri|. (5.2)
Here D is a distance measure between a feature at a pixel location and the feature
vector of a region; λR, λF , and λΓ are weights that control the relative contributions
of the corresponding terms.
The functional given in (5.2) is motivated by a special case of the functional
by Mumford and Shah [94], as shown in (5.1). In (5.2), the first term encodes the
homogeneity requirement in each region Ri and the second term requires that the
features of the regions should be as different as possible. Here we allow Ri to consist
of several connected regions and we drop the requirement of neighboring regions in
the second term. Alternatively, we can require that each region must be a connected
region and only include neighboring regions in the second term.
The third term requires that boundaries of regions should be as short as possible,
or as smooth as possible. The last term is motivated by the fact that some regions
may not be described well by the selected features. In that case, those regions should
not be labeled as segmented regions. Rather, those regions should be treated as
background, which can be viewed as grouping through inhomogeneity. To illustrate
the motivation, Figure 5.1 shows an intensity image where two regions have similar
mean values but different variances. If we use mean values as features, we argue that
the most reasonable output is one homogeneous region and another region
which cannot be described by the current model. We will discuss this issue later for
segmentation at different scales.
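For intuition, the full functional (5.2) can likewise be evaluated for a candidate labeling. The sketch below keeps the signs exactly as written in (5.2), uses the χ²-statistic as the distance D (a choice made concrete later in this chapter), and treats label 0 as the background R0:

```python
import numpy as np

def chi2(h1, h2, eps=1e-12):
    """Chi-square statistic between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def segmentation_energy(feat, labels, region_feats,
                        lam_R=1.0, lam_F=1.0, lam_G=0.1):
    """Energy of Eq. (5.2) for a label map.

    feat         : (H, W, K) per-pixel feature vectors F_{R_i}(x, y).
    labels       : (H, W) ints; 0 marks the background R_0, 1..n the regions.
    region_feats : dict {i: K-vector F_i} for i = 1..n.
    """
    # homogeneity term over segmented pixels only
    e_region = sum(chi2(feat[y, x], region_feats[l])
                   for (y, x), l in np.ndenumerate(labels) if l > 0)
    # pairwise feature-distance term
    e_feat = sum(chi2(region_feats[i], region_feats[j])
                 for i in region_feats for j in region_feats)
    # boundary length approximated by label changes between 4-neighbors
    boundary = np.sum(labels[:, 1:] != labels[:, :-1]) + \
               np.sum(labels[1:, :] != labels[:-1, :])
    area = np.sum(labels > 0)  # total segmented area, entering with minus sign
    return lam_R * e_region + lam_F * e_feat + lam_G * boundary - area
```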
5.3 Algorithms for Segmentation
Given the energy functional defined in (5.2), the question now is how to compute a good
solution for a given image. Obviously, due to the high dimensionality of the problem, it
is computationally infeasible to achieve a good solution by exhaustive search of the
potential solution space.
Figure 5.1: Gray-level image with two regions with similar means but different variances.
Based on different assumptions, we derive approximate solutions. A simple case
is that the feature vectors Fi are given. In this case, the problem becomes a supervised
classification/segmentation problem. For pixels within a homogeneous region, the
label should be determined by the minimum distance classifier, which minimizes
the first term in (5.2). For pixels near boundaries between different regions, the
boundary term plays an additional role besides the distance from each feature vector.
Pixels that cannot be classified well by any given feature vector should be labeled as
background.
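A minimal sketch of this supervised case follows, with the χ²-statistic standing in for the distance measure (per-pixel features and region features are assumed to be normalized histograms):

```python
import numpy as np

def chi2_map(feat, f, eps=1e-12):
    """Chi-square distance between every pixel's histogram and feature f."""
    return 0.5 * ((feat - f) ** 2 / (feat + f + eps)).sum(axis=-1)

def min_distance_classify(feat, region_feats):
    """Label each pixel with the region whose feature is closest,
    minimizing the first term of Eq. (5.2) pixel by pixel.

    feat         : (H, W, K) per-pixel feature vectors.
    region_feats : dict {label: K-vector}.
    """
    ids = sorted(region_feats)
    d = np.stack([chi2_map(feat, region_feats[i]) for i in ids], axis=-1)
    return np.asarray(ids)[np.argmin(d, axis=-1)]
```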
Another case is that seed points are given. In this case, minimization of the
energy functional can essentially be achieved through a procedure similar to region
growing, as demonstrated in [151]. An iterative algorithm can be used, alternating
feature estimation and region growing [151]. In the feature estimation phase, we fix
the segmentation results Ri and estimate Fi by minimizing the energy functional.
The region growing phase is the same as the procedure for the case where the feature
vectors Fi are given. This case is a generalization of region growing algorithms.
The most difficult case is to automatically identify suitable feature vectors for a
given input image and derive a solution. If we could solve this problem, we could
essentially solve the segmentation problem. In this chapter, we develop an algorithm
to automatically identify features from an input image based on the relationships
between different scales and neighboring regions.
Another problem is how to estimate the feature at a given pixel. A straightforward
way is always to use a window centered at the pixel. Another way is to utilize
the currently available results and use asymmetric windows, which helps to further
reduce the boundary uncertainty [151].
In this chapter, we use an iterative but deterministic algorithm. We assume that
the feature vectors for regions, which may be given manually or detected automatically,
are close to the true region vectors. In other words, we do not re-estimate the
features during the iterations. The first step, prior to iterative segmentation, is a
minimum distance classifier which generates the initial segmentation result. For a given
pixel (x, y) to be updated, we first estimate the spectral histogram using asymmetric
windows around the pixel. Figure 5.2 gives two examples. For simplicity, we use
square windows throughout this chapter, which in general give good results, although
biases due to the square shapes are visible in several cases. Circular windows provide
a better approximation for arbitrarily shaped boundaries and are more biologically plausible.
In addition, there are many possible choices for windows, and the influence of
different choices will be studied in the future.
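One concrete way to realize the asymmetric-window idea, sketched here under our own assumptions (five candidate squares: the centered one plus four diagonal shifts that put the pixel at a window corner), is to pick, for a region label i, the candidate containing that label most often:

```python
import numpy as np

def best_window(labels, x, y, half, i):
    """Among five square windows touching (x, y) -- the centered one and
    four diagonally shifted ones that put (x, y) at a window corner --
    return the slice pair whose pixels carry label i most often."""
    H, W = labels.shape
    offsets = [(0, 0), (-half, -half), (-half, half), (half, -half), (half, half)]
    best, best_count = None, -1
    for dy, dx in offsets:
        y0, x0 = y + dy - half, x + dx - half
        y1, x1 = y0 + 2 * half + 1, x0 + 2 * half + 1
        if y0 < 0 or x0 < 0 or y1 > H or x1 > W:
            continue  # window falls outside the image
        win = (slice(y0, y1), slice(x0, x1))
        count = np.sum(labels[win] == i)
        if count > best_count:
            best, best_count = win, count
    return best
```

The spectral histogram for label i is then estimated over the returned window, so different labels at the same pixel may use different windows.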
Because there are several candidate windows at pixel (x, y), for each Fi we use
the window that contains the largest number of pixels labeled Ri; thus the feature at pixel
(x, y) can differ for different labels. We use spectral histograms as feature
(a) (b)
Figure 5.2: Examples of asymmetric windows. The solid cross is the central pixel. (a) Square windows. (b) Circular windows.
vectors, i.e., F_{Ri}(x, y) = H_{W^(s)(x,y)}, where W^(s)(x, y) is a local neighborhood
whose size and shape are given by W (s), a pre-defined integration scale for
segmentation. We use the χ²-statistic as the distance measure. However, this
distance measure may not provide an accurate measure close to boundaries due to
the inhomogeneity. For example, in the image shown in Figure 5.1, the left region
is homogeneous and the variations allowed should be small. In the right region, the
variations allowed should be relatively large. To overcome this problem and provide
an accurate model, we estimate a probability model of the χ2-statistic for each given
feature vector Fi from the initial classification result. The implementation details
and an example are given in the next section. We approximate the boundary term
by a local term, which is given by the percentage of pixels belonging to the region in
a pre-defined neighborhood. Finally, the local updating rule at pixel (x, y) is given
πi(x, y) = (1 − λΓ) P(χ²(H_{W^(s)(x,y)}, Hi)) + λΓ ∑_{(x1,y1)∈N(x,y)} (L(x1, y1) == i) / |N(x, y)|. (5.3)

Here L(x, y) is the current region label of pixel (x, y), N(x, y) is a user-defined
neighborhood (the eight nearest neighbors are used), and λΓ is a parameter that
controls the relative contributions from the region and boundary terms. The new label
of (x, y) is assigned as the one that gives the maximum πi(x, y). To save computation,
(5.3) only needs to be applied at pixels between different regions, which gives rise to
a procedure similar to region growing. A special case of (5.3) arises for pixels along
boundaries between the background region and a given region, because we do not assume
any model for the background region. For a pixel (x, y) ∈ R0 that is adjacent to
region Ri, i ≠ 0, if

χ²(H_{W^(s)(x,y)}, Hi) < λB · Ti,

we assign label i to (x, y). Here Ti is a threshold for region Ri, which is determined
automatically from the initial classification result, and λB is a parameter which
determines the relative penalty for unsegmented pixels.
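The local updating step can be sketched as below. The callback prob_chi2 is hypothetical; it stands in for the probability model of the χ²-statistic estimated from the initial classification, and only the region/boundary trade-off of (5.3) is implemented:

```python
import numpy as np

def update_label(x, y, labels, prob_chi2, lam_G=0.2):
    """One application of the updating rule (5.3) at pixel (x, y).

    prob_chi2(i, x, y) is a hypothetical callback returning
    P(chi^2(H, H_i)), the probability of the chi-square distance
    between the local spectral histogram and region feature i.
    """
    H, W = labels.shape
    # eight nearest neighbors, clipped at the image border
    nbrs = [(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0) and 0 <= y + dy < H and 0 <= x + dx < W]
    best_i, best_pi = labels[y, x], -np.inf
    for i in set(labels[yy, xx] for yy, xx in nbrs) | {labels[y, x]}:
        if i == 0:
            continue  # background carries no model
        frac = sum(labels[yy, xx] == i for yy, xx in nbrs) / len(nbrs)
        pi = (1 - lam_G) * prob_chi2(i, x, y) + lam_G * frac
        if pi > best_pi:
            best_i, best_pi = i, pi
    labels[y, x] = best_i  # label with maximum pi_i wins
```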
In order to extract spectral histograms, we need to select filters. For segmentation,
unless specified otherwise, we use the same eight filters as for classification: the
intensity filter, two gradient filters, LoG filters at two scales, and three Gabor filters
with different orientations.
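The spectral histogram itself reduces to a short computation: filter the patch, histogram each response, and concatenate the normalized marginals. The three-filter bank below (intensity plus two gradients) is only a stand-in for the eight filters used here, and the bin count and response range are assumptions:

```python
import numpy as np

def spectral_histogram(patch, filters, bins=8, rng=(-255.0, 255.0)):
    """Spectral histogram of a patch: concatenated, normalized marginal
    histograms of the filter responses (valid convolution)."""
    hists = []
    for f in filters:
        fh, fw = f.shape
        H, W = patch.shape
        # 'valid' 2-D correlation, written out for self-containment
        resp = np.array([[np.sum(patch[i:i+fh, j:j+fw] * f)
                          for j in range(W - fw + 1)]
                         for i in range(H - fh + 1)])
        h, _ = np.histogram(resp, bins=bins, range=rng)
        hists.append(h / max(h.sum(), 1))  # normalize each marginal
    return np.concatenate(hists)

# A minimal stand-in bank: intensity filter and two gradient filters.
bank = [np.array([[1.0]]),
        np.array([[-1.0, 1.0]]),
        np.array([[-1.0], [1.0]])]
```

Because each marginal is normalized, spectral histograms of patches with different sizes remain directly comparable, which the seed-selection method later in this chapter relies on.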
5.4 Segmentation with Given Region Features
In this section, we study a special case of image segmentation. Here we assume
that the feature vector Fi for each region is given manually by specifying a seed pixel in
the region. We assume that the given features provide a good approximation of the true
model underlying the regions. The features are defined as spectral histograms and are
estimated at the integration scale W (s).
5.4.1 Segmentation at a Fixed Integration Scale
First, feature vectors Fi are extracted from windows centered at the given pixel
locations, whose size is specified by the integration scale W (s). Then the image is
classified using the feature vectors Fi, and the classification result is used as the initial
segmentation. To save computation, the initial result is generated by sub-sampling
the input image. To obtain the parameters Ti and estimate a probability model, we
compute the histograms of the χ²-statistic between the computed and given spectral
histograms, shown in Figure 5.4(a) for the image shown in Figure 5.3(a).
Parameter Ti is determined by the first trough after the first peak in its histogram.
Based on the assumption that the feature vectors Fi are close to the true feature vectors,
we derive a probability model by assigning zero probability to values larger than Ti.
The derived probability models are shown in Figure 5.4(b).
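The trough-finding rule for Ti can be sketched as a simple local-extremum scan over the histogram bins; boundary bins and ties are handled crudely in this illustration:

```python
def threshold_from_histogram(counts):
    """Index of the first trough (local minimum) after the first peak
    (local maximum) of a chi-square-distance histogram."""
    peak = None
    for k in range(1, len(counts) - 1):
        if peak is None and counts[k] >= counts[k - 1] and counts[k] > counts[k + 1]:
            peak = k  # first peak found
        elif peak is not None and counts[k] <= counts[k - 1] and counts[k] < counts[k + 1]:
            return k  # first trough after the peak
    return len(counts) - 1  # fall back to the last bin
```

The χ² value of the bin at the returned index serves as Ti; all larger distances receive zero probability.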
To illustrate the effectiveness of using the derived probability model and asymmetric
windows, Figure 5.5(a) shows a row from the image shown in Figure 5.3(a).
Figure 5.5(b) shows the probability of the two labels at each pixel using asymmetric
windows. Here we can see that the edge point is localized precisely at the true location.
Figure 5.5(c) shows the probability using windows centered at the pixels. There is an
interval where labels cannot be decided because the spectral histograms computed in
the interval do not belong to either of the regions. This shows that the probability
model is sensitive. For comparison, Figure 5.6 shows the results using the χ²-statistic
directly. In both the asymmetric and centered window cases, the decision boundaries are
not sharp and edge points cannot be localized accurately.
Then the initial segmentation result is refined through an iterative procedure
similar to region growing but with fixed region features. Because spectral histograms
(a) (b) (c)
Figure 5.3: Gray-level image segmentation using spectral histograms. The integration scale W (s) for spectral histograms is a 15 × 15 square window, λΓ = 0.2, and λB = 3. Two features are given at (32, 64) and (96, 64). (a) A synthetic image with size 128 × 128. The image is generated by adding zero-mean Gaussian noise with different σ's at the left and right regions. (b) Initial classification result. (c) Final segmentation result. The segmentation error is 0.00% and all the pixels are segmented correctly.
characterize texture properties well, we obtain good experimental results even with
this simple algorithm. The integration scale for spectral histograms is 15 × 15. Figure
5.3(b) shows the initial classification result. The final result is shown in Figure
5.3(c), where all the pixels are segmented correctly. Because the image consists
of two regions with similar mean values but different variances, applying a nonlinear
smoothing algorithm would give a wrong segmentation, since the only feature left
after smoothing is the mean value.
Figure 5.7 shows another example with similar means but different variances. Here
the boundary is 'S'-shaped to test the algorithm on irregular boundaries. The
boundary is preserved well, but artifacts due to the square windows are evident. The
error is largest near the top and bottom image borders, due to boundary effects.
Figure 5.8 shows an example with two texture regions. Because the two textures
are relatively homogeneous, the segmentation result is very accurate.
(a) (b)
Figure 5.4: The histogram and derived probability model of the χ²-statistic for the given region features. Solid lines stand for the left region and dashed lines for the right region. (a) The histogram of the χ²-statistic between the given feature and the computed ones at a coarser grid. (b) The derived probability model for the left and right regions.
(a) (b) (c)
Figure 5.5: A row from the image shown in Figure 5.3 and the result using the derived probability model. In (b) and (c), solid lines stand for the left region and dashed lines for the right region. (a) The 64th row of the image. (b) The probability of the two given regional features using asymmetric windows to estimate the spectral histogram. The edge point is correctly located between columns 64 and 65. (c) Similar to (b) but using windows centered at the pixel to compute the spectral histogram. Labels between columns 58 and 65 cannot be decided, because the spectral histograms computed within that interval do not belong to either region.
(a) (b)
Figure 5.6: Classification result based on the χ²-statistic for the row shown in Figure 5.5(a). Solid lines stand for the left region and dashed lines for the right region. (a) The χ²-statistic from the two given regional features using asymmetric windows to estimate the spectral histogram. With the minimum distance classifier, the edge point would be located between columns 65 and 66, whereas the true edge point lies between columns 64 and 65. (b) Similar to (a) but using windows centered at the pixel to compute the spectral histogram. The edge point is localized between columns 61 and 62.
(a) (b) (c)
Figure 5.7: Gray-level image segmentation using spectral histograms. W (s) is a 15 × 15 square window, λΓ = 0.2, and λB = 5. Two features are given at (32, 64) and (96, 45). (a) A synthetic image with size 128 × 128. The image is generated by adding zero-mean Gaussian noise with different σ's in the two regions. Here the boundary is 'S'-shaped to test the segmentation algorithm in preserving boundaries. (b) Initial classification result. (c) Final segmentation result.
(a) (b) (c)
Figure 5.8: Texture image segmentation using spectral histograms. W (s) is a 29 × 29 square window, λΓ = 0.2, and λB = 2. Features are given at pixels (32, 32) and (96, 32). (a) A texture image consisting of two texture regions with size 128 × 64. (b) Initial classification result. (c) Final segmentation result.
(a) (b) (c)
Figure 5.9: Texture image segmentation using spectral histograms. W (s) is a 29 × 29 square window, λΓ = 0.2, and λB = 3. (a) A texture image consisting of two texture regions with size 128 × 64. (b) Initial classification result. (c) Final segmentation result.
Figure 5.9 shows an example with two texture regions. Because the right region is
not homogeneous with respect to the integration scale 29 × 29, the boundary between
the two texture regions is displaced by several pixels and is not smooth,
due to the black and white patterns in the right region. Overall, the segmentation
result is good.
Figures 5.10, 5.11, 5.12, and 5.13 show examples of images consisting of four
texture regions. The segmentation results are good given the integration scale used
and the inhomogeneity of the texture regions.
Figures 5.14 and 5.15 show two examples that are challenging for texture segmentation
algorithms. Because the boundaries are not distinctive, it is difficult even for humans
to localize them precisely. We applied the same algorithm to these images.
While the results are not perfect, they are quite satisfactory compared to many
segmentation methods.
Figure 5.16 shows a texton-like image. Because we only use Gabor filters at three
different orientations, the spectral histogram of the top region is not representative,
resulting in an obvious displacement of the boundary toward the upper region. But if we
(a) (b) (c)
Figure 5.10: Texture image segmentation using spectral histograms. W (s) is a 35 × 35 square window, λΓ = 0.4, and λB = 3. Four features are given at (32, 32), (32, 96), (96, 32), and (96, 96). (a) A texture image consisting of four texture regions with size 128 × 128. (b) Initial classification result. (c) Final segmentation result.
(a) (b) (c)
Figure 5.11: Texture image segmentation using spectral histograms. W (s) is a 35 × 35 square window, λΓ = 0.4, and λB = 3. Four features are given at (32, 32), (32, 96), (96, 32), and (96, 96). (a) A texture image consisting of four texture regions with size 128 × 128. (b) Initial classification result. (c) Final segmentation result.
(a) (b) (c)
Figure 5.12: Texture image segmentation using spectral histograms. W (s) is a 29 × 29 square window, λΓ = 0.2, and λB = 3. Four features are given at (32, 32), (32, 96), (96, 32), and (96, 96). (a) A texture image consisting of four texture regions with size 128 × 128. (b) Initial classification result. (c) Final segmentation result.
(a) (b) (c)
Figure 5.13: Texture image segmentation using spectral histograms. W (s) is a 35 × 35 square window, λΓ = 0.4, and λB = 3. Four features are given at (32, 32), (32, 96), (96, 32), and (96, 96). (a) A texture image consisting of four texture regions with size 128 × 128. (b) Initial classification result. (c) Final segmentation result.
(a) (b) (c)
Figure 5.14: A challenging example for texture image segmentation. W (s) is a 35 × 35 square window, λΓ = 0.4, and λB = 20. Two features are given at (160, 160) and (252, 250). (a) Input image consisting of two texture images, where the boundary cannot be localized clearly because of their similarity. The size of the image is 320 × 320 pixels. (b) Initial classification result. (c) Final segmentation result.
(a) (b) (c)
Figure 5.15: Another challenging example for texture segmentation. W (s) is a 35 × 35 square window, λΓ = 0.4, and λB = 20. Two features are given at (160, 160) and (252, 250). (a) Input image consisting of two texture images, where the boundary cannot be localized clearly because of their similarity. The size of the image is 320 × 320 pixels. (b) Initial classification result. (c) Final segmentation result.
(a) (b) (c)
(d) (e)
Figure 5.16: Segmentation for a texton image with oriented short lines. W (s) is a 35 × 35 square window, λΓ = 0.4, and λB = 10. Two features are given at (185, 67) and (180, 224). (a) The input image with size 402 × 302 pixels. (b) The initial classification result. (c) The segmentation result using spectral histograms. (d) The initial classification result using two Gabor filters Gcos(10, 30) and Gcos(10, 60). (e) The segmentation result using two Gabor filters. The result is improved significantly.
allow the filters used to be changed, we obtain a much better result, as shown in Figure 5.16(d)
and (e).
5.4.2 Segmentation with Multiple Scales
In this section, we briefly study the effects of integration scales on the segmentation
results. Because no texture is absolutely homogeneous with respect to finite
integration scales, a minimum integration scale is required in order to capture the
characteristics of a texture image. In the framework of spectral histograms, we
(a) (b)
Figure 5.17: Segmentation results at different integration scales. Parameters λΓ = 0.4 and λB = 4 are fixed. (a) The input image. (b) The percentage of mis-classified pixels.
can estimate the minimum scale of a region using the relationships between different
integration scales, which will be explained and used in Section 5.5.
We use the example shown in Figure 5.17(a), which was used previously. We try
integration scales from 1 × 1 to 35 × 35. For the final segmentation result obtained
at each scale, we calculate the percentage of mis-segmented pixels using the ground
truth image. Figure 5.17(b) shows the result. Except at scale 7 × 7, the segmentation
error decreases until the windows get larger than 23 × 23. While the error varies at larger
windows, it remains very low; the error at scale 23 × 23 is less than 0.2%.
The segmentation results for selected small scales are shown in Figure 5.18. These
results reveal some of the desirable properties of the segmentation algorithm. For example,
at scales smaller than 9 × 9, the right region is not homogeneous; the desirable
result at these scales is that the right region is classified as background. The result
obtained at scale 5 × 5 is produced by randomness. Note that the given pixels
are used only to calculate the region features Fi; they are not used as seed pixels
for region growing.
(a) (b)
(c) (d)
Figure 5.18: Segmentation results using different segmentation scales for the image shown in Figure 5.17(a). In each sub-figure, the left panel shows the initial classification result and the right panel shows the segmentation result. Parameters λΓ = 0.4 and λB = 4 are fixed. (a) W (s) is a 1 × 1 square window. (b) W (s) is a 3 × 3 square window. (c) W (s) is a 5 × 5 square window. (d) W (s) is a 7 × 7 square window.
5.4.3 Region-of-interest Extraction
Here we try to segment natural texture images. Because natural images consist
of many regions that are not homogeneous texture regions, we segment regions that are
specified through region features. For each image, we give only one region feature,
and the algorithm essentially segments a region of interest. As mentioned before,
our algorithm makes no assumption regarding the distributions and properties
of the background regions.
We apply the same algorithm. Figure 5.19 shows the result for a cheetah image. Using
the same filters, the initial result is shown in Figure 5.19(b) and the segmentation
result in Figure 5.19(c). Here the boundaries are not localized well due
to the similarity between the surrounding areas and the white part of the cheetah skin.
The tail is not included because that area is relatively dark. If we drop the
intensity filter and use gradient and LoG filters, the tail is correctly segmented,
as shown in Figure 5.19(e). Given a database for recognition, the cheetah could
be easily recognized by its distinctive skin pattern, which is characterized by its
spectral histograms.
Figure 5.20 shows an indoor image which includes a sofa with a textured surface.
Figure 5.20(c) shows the final result. The lower boundary of the sofa is localized well
because the floor is different from the sofa texture. However, the boundaries
of the top and right parts are not localized well because the intensity values in those
regions are similar to the white part of the sofa texture. If we place another regional
feature in the white area, the top boundary is localized more accurately through
competition, as shown in Figure 5.20(d).
(a)
(b) (c)
(d) (e)
Figure 5.19: A texture image with a cheetah. The feature vector is calculated at pixel (247, 129) at scale 19 × 19, λΓ = 0.2, and λB = 2.5. To demonstrate the accuracy of the results, the classification and segmentation results are embedded into the original image by lowering the intensity values of the background region by a factor of 2. (a) The input image with size 324 × 486. (b) The initial classification result using 8 filters. (c) The final segmentation result using 8 filters. (d) The initial classification result using 6 filters consisting of Dxx, Dyy, LoG(√2/2), LoG(1), LoG(2), and LoG(3). (e) The final segmentation result corresponding to (d).
(a) (b)
(c) (d)
Figure 5.20: An indoor image with a sofa. The feature vector is calculated at pixel (146, 169) at scale 35 × 35, λΓ = 0.2, and λB = 3. (a) Input image with size 512 × 512. (b) Initial classification result. (c) Final segmentation result. (d) Segmentation result if we assume there is another region feature given at (223, 38).
5.5 Automated Seed Selection
In the segmentation experiments presented above, we assume that several representative
pixels are given; those pixels are used to calculate the feature vectors of
the corresponding regions. This assumption is obviously too restrictive for an
autonomous system. The human visual system needs no such assumption and
can deal with a wide range of texture images robustly and reliably. In this section,
we attempt to develop a solution for identifying seed points automatically in an input
image. The proposed method is based on the spectral histogram and is consistent
with the proposed computational framework.
The basic idea behind the proposed method is to identify homogeneous texture
regions within a given image. As discussed in the previous chapter, the spectral
histogram can be defined on image patches of different sizes and shapes, and
spectral histograms defined on different patches can still be compared using the
similarity/dissimilarity measure, as spectral histograms are naturally normalized.
We try to identify homogeneous texture regions based on the divergence between
two integration scales. Let W (a) be an integration scale larger than W (s), the
integration scale for segmentation. We define the distance between the two scales
centered at pixel (x, y) as

ψ^(s,a)(x, y) = D(H_{W^(s)(x,y)}, H_{W^(a)(x,y)}). (5.4)
Within a homogeneous texture region, ψ^(s,a) should be small because
H_{W^(s)(x,y)} and H_{W^(a)(x,y)} should be similar. We also define a distance measure
between different windows at scale W (s) within the window given by W (a),
ψ^(s,s)(x, y) = max_{W^(s)(x1,y1), W^(s)(x2,y2) ⊂ W^(a)(x,y)} D(H_{W^(s)(x1,y1)}, H_{W^(s)(x2,y2)}). (5.5)
In the implementation, Equation (5.5) is approximated using the central and four
corner windows within W (a). Finally, we want to choose features that are as different
as possible from those already chosen. Suppose we have already chosen n features
Fi = H_{W^(s)(x,y)}, i = 1, . . . , n; we define

ψ^(c)(x, y) = max_{1≤i≤n} D(H_{W^(s)(x,y)}, Fi). (5.6)
We have the following saliency measure:

ψ(x, y) = (1 − λC)(λA ψ^(s,a)(x, y) + (1 − λA) ψ^(s,s)(x, y)) − λC ψ^(c)(x, y). (5.7)
Here λA and λC are parameters to determine the relative contribution of each term.
To save computation, we compute ψ(x, y) on a coarser grid. Feature vectors are
chosen according to the value of ψ(x, y) until

λA ψ^(s,a)(x, y) + (1 − λA) ψ^(s,s)(x, y) < TA,

where TA is a threshold.
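The saliency computation of (5.4)-(5.7) can be sketched as follows. The callback hist_at is hypothetical, standing in for spectral-histogram extraction at a given half window size; the corner-window approximation of (5.5) follows the implementation described above:

```python
import numpy as np

def chi2(h1, h2, eps=1e-12):
    """Chi-square statistic between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def saliency(hist_at, x, y, s, a, chosen, lam_A=0.2, lam_C=0.1):
    """Saliency psi(x, y) of Eqs. (5.4)-(5.7).

    hist_at(x, y, half) -> spectral histogram of the square window of
    half-size `half` centered at (x, y)   (hypothetical callback).
    s, a   : half-sizes of W^(s) and W^(a), with a > s.
    chosen : list of already-selected feature vectors F_i.
    """
    h_s, h_a = hist_at(x, y, s), hist_at(x, y, a)
    psi_sa = chi2(h_s, h_a)                                     # Eq. (5.4)
    d = a - s
    # central plus four corner W^(s) windows inside W^(a)
    corners = [hist_at(x + dx, y + dy, s)
               for dx in (-d, d) for dy in (-d, d)] + [h_s]
    psi_ss = max(chi2(p, q) for p in corners for q in corners)  # Eq. (5.5)
    psi_c = max((chi2(h_s, f) for f in chosen), default=0.0)    # Eq. (5.6)
    return (1 - lam_C) * (lam_A * psi_sa + (1 - lam_A) * psi_ss) \
        - lam_C * psi_c                                         # Eq. (5.7)
```

Seeds are then picked greedily at the coarse-grid locations with the smallest homogeneity terms until the threshold TA is exceeded.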
Figures 5.21-5.25 show the segmentation results for texture images. Here the algorithm
identifies the feature vectors automatically instead of relying on manually chosen features.
Due to inhomogeneity, some of the texture boundaries are not as good as those
obtained with manually given feature points; this is because the windows for feature
selection are large compared to the texture regions.
(a) (b) (c)
Figure 5.21: Texture image segmentation with representative pixels identified automatically. W (s) is a 29 × 29 square window, W (a) is a 35 × 35 square window, λC = 0.1, λA = 0.2, λB = 2.0, λΓ = 0.2, and TA = 0.08. (a) Input texture image, which is shown in Figure 5.8. (b) Initial classification result. Here the representative pixels are detected automatically. (c) Final segmentation result.
(a) (b) (c)
Figure 5.22: Texture image segmentation with representative pixels identified automatically. W (s) is a 29 × 29 square window, W (a) is a 43 × 43 square window, λC = 0.4, λA = 0.4, λB = 5.0, λΓ = 0.4, and TA = 0.30. (a) Input texture image, which is shown in Figure 5.10. (b) Initial classification result. Here the representative pixels are detected automatically. (c) Final segmentation result.
(a) (b) (c)
Figure 5.23: Texture image segmentation with representative pixels identified automatically. W (s) is a 29 × 29 square window, W (a) is a 43 × 43 square window, λC = 0.1, λA = 0.2, λB = 5.0, λΓ = 0.4, and TA = 0.20. (a) Input texture image, which is shown in Figure 5.11. (b) Initial classification result. Here the representative pixels are detected automatically. (c) Final segmentation result.
(a) (b) (c)
Figure 5.24: Texture image segmentation with representative pixels identified automatically. W (s) is a 29 × 29 square window, W (a) is a 43 × 43 square window, λC = 0.1, λA = 0.2, λB = 5.0, λΓ = 0.4, and TA = 0.20. (a) Input texture image, which is shown in Figure 5.12. (b) Initial classification result. Here the representative pixels are detected automatically. (c) Final segmentation result.
(a) (b) (c)
Figure 5.25: Texture image segmentation with representative pixels identified automatically. W (s) is a 29 × 29 square window, W (a) is a 43 × 43 square window, λC = 0.1, λA = 0.2, λB = 5.0, λΓ = 0.4, and TA = 0.20. (a) Input texture image, which is shown in Figure 5.13. Here the representative pixels are detected automatically. (b) Initial classification result. (c) Final segmentation result.
5.6 Localization of Texture Boundaries
Because textures need to be characterized by spatial relationships among pixels,
relatively large integration windows are needed in order to extract meaningful
features. This is evident from Figure 4.23(b), which shows that the classification
performance degrades dramatically when the integration scale gets too small. The
large integration scale we use results in large errors along texture boundaries, due to
the uncertainty introduced by large windows [13]. Using asymmetric windows for
feature extraction reduces this uncertainty effect. However, for arbitrary texture
boundaries, the errors along boundaries can be large even when the overall segmentation
performance is good. For example, Figure 5.26(b) shows a segmentation result
using spectral histograms. While the segmentation error is only 6.55%, visually the
(a) (b) (c)
(d) (e)
Figure 5.26: (a) A texture image with size 256 × 256. (b) The segmentation result using spectral histograms. (c) Wrongly segmented pixels of (b), shown in black with respect to the ground truth. The segmentation error is 6.55%. (d) Refined segmentation result. (e) Wrongly segmented pixels of (d), shown in black as in (c). The segmentation error is 0.95%.
segmentation result is intolerable due to large errors along texture boundaries, as
shown in Figure 5.26(c).
In order to reduce the uncertainties along boundaries, we need to use smaller windows
for feature extraction. However, the extracted features should still capture the spatial
relationships among pixels. To overcome this problem, we propose the following measure
to refine the result obtained using spectral histograms. As for segmentation, we
first build a probability model for m given pixels from a texture region. To capture
the spatial relationship, we choose for each texture region a window as a template.
In our case, the template is the same window from which the region feature F is
extracted. For the selected m pixels, we define the distance between those pixels and
a texture region as the minimum mean square distance between those pixels and the
template. Based on the result from spectral histograms, we build a probability model
for each texture region with respect to the proposed distance measure. Intuitively,
if the m pixels belong to a texture region, they should match the spatial relationships
among pixels in that region when aligned with the texture structure.
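The proposed distance can be sketched as follows: slide the m-pixel pattern over every alignment with the region's template window and keep the smallest mean-square difference. This is a minimal illustration assuming gray-level templates; the function name and the exhaustive search over alignments are our own.

```python
import numpy as np

def template_distance(pixels, offsets, template):
    """Minimum mean-square distance between m pixel values and a texture
    template, over all placements of the pixel pattern inside the template.

    pixels  : (m,) gray values of the selected pixels
    offsets : (m, 2) non-negative row/col offsets relative to the first pixel
    template: 2-D array holding the region's template window
    """
    h, w = template.shape
    rows, cols = offsets[:, 0], offsets[:, 1]
    best = np.inf
    # Slide the pixel pattern over every valid alignment with the template.
    for r in range(h - rows.max()):
        for c in range(w - cols.max()):
            diff = template[r + rows, c + cols] - pixels
            best = min(best, np.mean(diff ** 2))
    return best
```

A set of pixels sampled from the texture itself attains distance zero at the correct alignment, which is what makes the measure useful for deciding region membership near a boundary.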
After the probability model is derived, we use the local updating equation given
in (5.3), replacing W(s)(x, y) by the m pixels in a texture region along its boundary
based on the current segmentation result, and χ2(HW(s)(x,y), Hi) by the new distance
measure. Figure 5.26(d) shows the refined segmentation result with m = 11 pixels.
Visually the segmentation result is improved significantly and the segmentation error
is reduced to 0.95%. Figure 5.26(e) shows the wrongly segmented pixels.
Figure 5.27(c) shows the result for an intensity image. It is clear that the bound-
ary between two regions is improved significantly, especially at the top and bottom
borders of the image.
Figure 5.28(c) shows the refined segmentation result for the image which was used
in classification shown in Figure 4.22. Compared to the classification result, the errors
along the texture boundaries are reduced significantly.
Figure 5.29(c) shows another example. Here the resulting texture boundary be-
tween the top and left regions is jagged due to the local ambiguities of a small number
of pixels. This is partly because the boundary smoothness is approximated
Figure 5.27: (a) A synthetic image with size 128 × 128, as shown in Figure 5.7(a). (b) The segmentation result using spectral histograms, as shown in Figure 5.7(c). (c) Refined segmentation result.
Figure 5.28: (a) A texture image with size 256 × 256. (b) The segmentation result using spectral histograms. (c) Refined segmentation result.
Figure 5.29: (a) A texture image with size 256 × 256. (b) The segmentation result using spectral histograms. (c) Refined segmentation result.
using a local term given in (5.3). By using a more global boundary smoothness term
as in [151], the result could be improved, which will be investigated in the future.
5.7 Discussions
Several improvements can be made within the spectral histogram framework. As
is evident from Figure 5.30, one can automatically estimate the minimum scale for a
given texture region. Figure 5.30(b) and (c) show clearly that the left region
needs a smaller scale than the right one. Minimum scale selection has been
studied by Elder and Zucker [29] under the assumption of Gaussian noise. We have done some
preliminary work along this line, which is not included in this dissertation.
As we see from the examples, most texture images are not absolutely homogeneous,
but can be described better by the responses of some filters than by others, as shown in Figures 5.16 and
5.19. A related question is how to discriminate two textures effectively by selecting
filters. A straightforward extension is to use a weighted average of filter responses,
Figure 5.30: Distance between scales for different regions. (a) Input image. (b) The distance between different integration scales for the left region at pixel (32, 64). (c) The distance between different integration scales for the right region at pixel (96, 64).
as in the FRAME model [149] [150]
\[
\chi^2(H_{I_1}, H_{I_2}) = \sum_{\alpha=1}^{K} \sum_{z} \lambda^{(\alpha)}_{z}\,\frac{\bigl(H^{(\alpha)}_{I_1}(z) - H^{(\alpha)}_{I_2}(z)\bigr)^2}{H^{(\alpha)}_{I_1}(z) + H^{(\alpha)}_{I_2}(z)}. \tag{5.8}
\]
There are two issues here. If training samples are available, we can use parameter
estimation methods [149] [150]; we can also use a neural network to learn the
weights. For image segmentation, the main problem is how to obtain training
samples during the segmentation procedure. One way is to use the initial segmentation
result. The other is to alternate between two phases: in the training phase,
we re-estimate the parameters using the current segmentation result, and in the
segmentation phase we use the current parameters for segmentation.
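The weighted χ²-statistic of (5.8) can be sketched as follows, assuming each spectral histogram is stored as one row of bins per filter; the guard against empty bins is our own addition.

```python
import numpy as np

def weighted_chi_square(h1, h2, weights):
    """Weighted chi-square distance of Eq. (5.8).

    h1, h2 : (K, B) spectral histograms, one row of B bins per filter
    weights: (K, B) per-filter, per-bin weights (the lambdas)
    """
    num = (h1 - h2) ** 2
    den = h1 + h2
    # Skip bins that are empty in both histograms to avoid division by zero.
    mask = den > 0
    return np.sum(weights[mask] * num[mask] / den[mask])
```

With uniform weights this reduces to the plain χ²-statistic used elsewhere in the chapter; learned weights would emphasize the filters that discriminate a given texture pair.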
Some textures cannot be captured well using spectral histograms when the filter
responses are not homogeneous; rather, the spatial relationships among texture
elements are more prominent. For example, the zebra stripes shown in Figure
5.31(a) cannot be captured well using spectral histograms, and in particular the boundaries
are not localized well. Given that the surrounding area is close to the white part of
zebra stripes and zebra stripes are not homogeneous, the result is what one would
expect using a homogeneous model. No matter what homogeneity measure is used,
it would not work well in cases like the zebra image because the zebra stripes are
simply not homogeneous. To overcome this problem, one needs to define a model
which can characterize the structural relationships in a generic way. Within the spa-
tial/frequency representational framework, short-range order coupling may provide a
solution. In other words, the filter responses capture the local shape information and
short-range order coupling overcomes the inhomogeneity within the region. This may lead
to a new computational model and needs to be investigated further.
Figure 5.31: A natural image with a zebra. λΓ = 0.2 and λB = 5.5. (a) The input image. (b) The segmentation result with one feature computed at (205, 279). (c) The segmentation result with one feature computed at (308, 298). (d) The combined result from (b) and (c).
5.8 Conclusions
In this chapter, we have formulated an energy functional for image segmentation by
making features and homogeneity measures explicit for segmented regions. We have
developed a segmentation algorithm using spectral histograms as a generic feature for
natural images and the χ2-statistic as a similarity measure. We have investigated the
performance of the algorithm under different assumptions. Satisfactory results have been
obtained on intensity, texture, and natural images. Future work along this line has been
discussed.
CHAPTER 6
PERCEPTUAL ORGANIZATION BASED ON
TEMPORAL DYNAMICS
This chapter presents a computational model for perceptual organization. A
figure-ground segregation network is proposed based on a novel boundary pair rep-
resentation. Nodes in the network are boundary segments obtained through local
grouping. Each node is excitatorily coupled with the neighboring nodes that belong
to the same region, and inhibitorily coupled with the corresponding paired node. The
status of a node represents the probability of the node being figural and is updated
according to a differential equation. The system solves the figure-ground segregation
problem through temporal evolution. Gestalt-like grouping rules are incorporated
by modulating connections, which determines the temporal behavior and thus the
perception of the system. The results are then fed to a surface completion module
based on local diffusion. Different perceptual phenomena, such as modal and amodal
completion, virtual contours, grouping and shape decomposition are explained by the
model with a fixed set of parameters. Computationally, the system eliminates combi-
natorial optimization, which is common to many existing computational approaches.
It also accounts for more examples that are consistent with psychological experiments.
In addition, the boundary-pair representation is consistent with well-known on- and
off-center cell responses and thus biologically more plausible. The results appear in
[81] [82].
6.1 Introduction
Perceptual organization refers to the ability of grouping similar features in sen-
sory data. This, at a minimum, includes the operations of grouping and figure-ground
segregation. Here grouping includes both local grouping, generally known as segmen-
tation, and long-range grouping, referred to as perceptual grouping in this chapter.
Figure-ground segregation refers to the process of determining the relative depth of
adjacent regions in the input.
This problem setting has several computational implications. The central problem
in perceptual organization is figure-ground segregation. When the relative depth
between regions is determined, different types of surface completion phenomena, such
as modal and amodal completion, shape composition and perceptual grouping, can be
solved and explained using a single framework. Perceptual grouping can be inferred
from surface completion. Grouping rules, such as those summarized by Gestaltists,
can be incorporated for figure-ground segregation.
Many computational models have been proposed for perceptual organization.
Many of the existing approaches [86] [41] [98] [141] [34] start by detecting discontinuities,
i.e., edges, in the input; one or several configurations are then selected
according to certain criteria, for example, non-accidentalness [86]. Those approaches
are to a large extent influenced by Marr's paradigm [88], which is supported by the findings that
on- and off-center cells respond to luminance differences, or edges [53], and that the
three-dimensional shapes of parts can be inferred from a two-dimensional line
drawing [7]. While those approaches work well to derive meaningful two-dimensional
regions and their boundaries, there are several disadvantages for perceptual organi-
zation. Theoretically speaking, edges should be localized between regions and do
not belong to any region. By detecting and using edges from the input, an additional
ambiguity, the ownership of a boundary segment, is introduced. The ownership
problem is equivalent to figure-ground segregation [97]. As a result, regional attributes
cannot be associated with boundary segments. Furthermore, because each
boundary segment can belong to different regions, the potential search space is combinatorial;
constraints among different segments, such as topological constraints, must
be incorporated explicitly [141]. In addition, obtaining the optimal configuration(s)
is computationally expensive.
To overcome some of the problems, we propose a laterally-coupled network based
on a boundary-pair representation. An occluding boundary is represented by a pair
of boundaries of the two involved regions, and initiates a competition between the
regions. Each node in the network represents a boundary segment. A closed region
boundary is represented as a ring structure with laterally coupled nodes. A region
consists of one or more rings. Regions compete to be figural through boundary-pair
competition and the figure-ground segregation is solved through temporal evolution.
Gestalt grouping rules are incorporated by modulating the coupling strength be-
tween different nodes within a region, which influences the temporal dynamics and
determines the perception of the system. Shape decomposition and grouping are im-
plemented through local diffusion using the results from figure-ground segregation.
This approach offers several advantages over edge-based approaches:
• Boundary-pair representation makes explicit the ownership of boundary seg-
ments and eliminates the combinatorial optimization necessary for many exist-
ing approaches.
• The model can explain more perceptual phenomena than existing approaches
using a fixed set of parameters.
• It can incorporate top-down influence naturally.
In Section 6.2 we introduce the figure-ground segregation network and demonstrate
the temporal properties of the network. Section 6.3 shows how surface completion
and decomposition are achieved. Section 6.4 provides experimental results. Section
6.5 concludes the chapter with further discussions.
6.2 Figure-Ground Segregation Network
The central problem in perceptual organization is to determine the relative depth
among regions. As figural reversal occurs in certain circumstances, figure-ground
segregation cannot be resolved only based on local attributes. By using a boundary-
pair representation, the solution to figure-ground segregation is given by temporal
evolution.
6.2.1 Boundary-Pair Representation
The boundary-pair representation is motivated by on- and off-center cell responses.
Figure 6.1(a) shows an input image. Figure 6.1(b) and (c) show the on-center and
off-center responses. Without computing zero-crossings, we naturally obtain a double response
for each occluding boundary, as shown in Figure 6.1(d).
Figure 6.1: On- and off-center cell responses. (a) Input image. (b) On-center cell responses. (c) Off-center cell responses. (d) Binarized on- and off-center cell responses. White regions represent on-center response regions and black ones off-center regions.
More precisely, closed region boundaries are obtained from segmentation and then
divided into segments at corners and junctions, which are detected through local
corner and junction detectors. A node i in the figure-ground segregation network
represents a boundary segment, and its status Pi represents the probability of the
corresponding segment being figural, which is set to 0.5 initially. Each node is laterally
coupled with neighboring nodes on the closed boundary. The connection weight
from node i to j, wij, is 1 and can be modified by T-junctions and local shape
information. Each occluding boundary is represented by a pair of boundary segments
of the involved regions. A node in a pair competes with the other to be figural
temporally. This competition determines the figure-ground segregation. Here the
critical point is that each occluding boundary has to be represented using a pair
before we solve the figure-ground segregation problem; otherwise, a combinatorial
search would be required to cover all the possible configurations. Figure 6.2
shows an example. In the example, nodes 1 and 5 form a boundary pair, where node
Figure 6.2: The figure-ground segregation network architecture for Figure 6.1(a). Nodes 1, 2, 3 and 4 belong to the white region; nodes 5, 6, 7, and 8 belong to the black region; nodes 9 and 10, and 11 and 12 belong to the left and right gray regions respectively. Solid lines represent excitatory coupling while dashed lines represent inhibitory connections.
1 belongs to the white region, or the background region and node 5 belongs to the
black region, or the figural region.
Node i updates its status by:
\[
\tau \frac{dP_i}{dt} = \mu_L \sum_{k \in N(i)} w_{ki}(P_k - P_i) + \mu_J (1 - P_i) \sum_{l \in J(i)} H(Q_{li}) + \mu_B (1 - P_i) \exp(-B_i/K_B) \tag{6.1}
\]
Here N(i) is the set of neighboring nodes of i, and µL, µJ , and µB are parameters to
determine the influences from lateral connections, junctions, and bias. J(i) is the set
of junctions that are associated with i and Qli is the junction strength of node i of
junction l. H(x) is given by:
\[
H(x) = \tanh(\beta(x - \theta_J))
\]
Here β controls the steepness and θJ is a threshold.
In (6.1), the first term on the right reflects the lateral influences. When nodes
are strongly coupled, they are more likely to be in the same status, either figure or
background. The second term incorporates junction information: at a
T-junction, segments that continue more smoothly are more likely to be figural. The third
term is a bias term, where Bi is the bias introduced to simulate human perception.
After all the nodes are updated, the competition between paired nodes is carried out through
normalization, based on the assumption that only one of the paired nodes should be
figural at a given time. Supposing that j is the paired node of i, we have:
\[
P_i^{(t+1)} = P_i^{t}/(P_i^{t} + P_j^{t}) \tag{6.2a}
\]
\[
P_j^{(t+1)} = P_j^{t}/(P_i^{t} + P_j^{t}) \tag{6.2b}
\]
As a dynamic system, this shares some similarities with relaxation labeling tech-
niques [55]. Because the status of a node is only influenced by the nodes in a local
neighborhood in the network, as shown in Figure 6.2, the figure-ground segregation
network defines a Markov random field. This shares some similarities with the Markov
random fields proposed by Zhu [146] for perceptual organization. As will be demon-
strated later, our model can simulate many perceptual phenomena while the model
by Zhu [146] is a generic and theoretical model for shape modeling and perceptual
organization.
6.2.2 Incorporation of Gestalt Rules
Without introducing grouping cues such as T-junctions and preferences, the solu-
tion of the network is not well defined. To generate behavior that is consistent with
human perception, we incorporate grouping cues and some Gestalt grouping princi-
ples. As the network provides a generic model, many other rules can be incorporated
in a similar manner.
T-junctions T-junctions provide important cues for determining relative depth
[97] [141]. In Williams and Hanson’s model [141], T-junctions are imposed as topolog-
ical constraints. Given a T-junction l, the initial strength for node i that is associated
with l is:
\[
Q_{li} = \frac{\exp(-\alpha_{(i,c(i))}/K_T)}{\tfrac{1}{2}\sum_{k \in N_J(l)} \exp(-\alpha_{(k,c(k))}/K_T)}
\]
where KT is a parameter, NJ(l) is a set of all the nodes associated with junction l,
c(i) is the other node in NJ(l) that belongs to the same region as node i, and α(ij) is
the angle between segments i and j.
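The initial junction strengths follow directly from the angles at the junction; this small sketch assumes the angles α(k,c(k)) have already been measured, and the value of KT is our own choice.

```python
import numpy as np

def junction_strength(angles, k_t=1.0):
    """Initial strengths Q_li for the nodes at a T-junction l.

    angles[k] is alpha_(k, c(k)): the angle between segment k and the other
    segment of the same region at the junction. Smaller angles (smoother
    continuations) yield larger strengths.
    """
    e = np.exp(-np.asarray(angles, dtype=float) / k_t)
    return e / (0.5 * e.sum())
```

Note that the 1/2 in the denominator makes the strengths sum to 2, so a pair of equally smooth continuations each receives strength 1.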
Non-accidentalness Non-accidentalness tries to capture the intrinsic relation-
ships among segments [86]. In our system, an additional connection is introduced to
node i if it is aligned well with a node j from the same region and j ∉ N(i) initially.
The connection weight wij is a function of distance and angle between the involved
ending points. This can also be viewed as introducing virtual junctions, resulting in virtual
contours and the conversion of a corner into a T-junction if the involved nodes become figural.
This corresponds to the organization criterion proposed by Geiger et al. [34].
Shape information Shape information plays a central role in Gestalt princi-
ples. For example, that virtual contours are vivid in Figure 6.8(a) but not in Figure
6.8(b) is due to the figural properties [64]. Shape information is incorporated through
enhancing lateral connections. In this chapter, we consider local symmetry. Let j
and k be the two neighboring nodes of i.
\[
w_{ij} = 1 + C \exp(-|\alpha_{ij} - \alpha_{ki}|/K_\alpha)\,\exp\bigl(-(L_j/L_k + L_k/L_j - 2)/K_L\bigr) \tag{6.3}
\]
Essentially (6.3) strengthens the lateral connections when the two neighboring seg-
ments of i are symmetric. Those nodes are then strongly grouped together according
to (6.1), resulting in different perceptions for Figure 6.8 (a) and (b).
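Equation (6.3) translates directly into code; the parameter values below are illustrative, not the ones used in the experiments.

```python
import numpy as np

def symmetry_weight(alpha_ij, alpha_ki, l_j, l_k, c=1.0, k_alpha=0.5, k_l=0.5):
    """Lateral weight w_ij of Eq. (6.3): strengthened when the two
    neighboring segments j and k of node i have similar orientations
    (alpha_ij close to alpha_ki) and similar lengths (l_j close to l_k)."""
    angle_term = np.exp(-abs(alpha_ij - alpha_ki) / k_alpha)
    length_term = np.exp(-(l_j / l_k + l_k / l_j - 2.0) / k_l)
    return 1.0 + c * angle_term * length_term
```

A perfectly symmetric neighborhood yields the maximum weight 1 + C, while any asymmetry in angle or length decays the bonus toward the baseline weight of 1.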
Preferences Human perceptual systems often prefer some organizations over
others. In this model, we incorporate a well-known figure-ground segregation
principle, called closedness: the system prefers regions over holes. In
the current implementation, we set Bi = 1.0 if node i is part of a hole and
Bi = 0 otherwise.
6.2.3 Temporal Properties of the Network
After we construct the figure-ground segregation network, there are two funda-
mental questions to be addressed. First we need to demonstrate that the equilibrium
state of the system gives a desired solution. Second, we need to show that the system
converges to the desired state. Here we demonstrate both using the example shown
in Figure 6.2. Figure 6.3 shows the temporal behavior of the network. First, the
system approaches a stable solution. For figure-ground segregation, we can binarize
the status of each node using a threshold of 0.5. In this case, the system converges
very quickly; in other words, it outputs the solution in a few iterations.
Second, the system generates the correct perception: the black region is occluding
the other regions, while the gray regions are occluding the white region. For example, P5 is
close to 1 and thus segment 5 is figural, and P1 is close to 0 and thus segment 1 is in the
background.
Figure 6.3: Temporal behavior of each node in the network shown in Figure 6.2. Each plot shows the status of the node with respect to time. The dashed line is 0.5.
6.3 Surface Completion
After the figure-ground segregation is solved, surface completion and shape de-
composition can be implemented in a straightforward manner. Currently this stage is
implemented through diffusion. Because the ownership of each boundary segment is
known, fixed heat sources are generated along occluding boundaries, and the occlud-
ing boundaries naturally block diffusion. This method is similar to the one used by
Geiger et al. [34] for generating salient surfaces. However, in their approach, because
the hypotheses are defined only at junction points, fixed heat sources for diffusion
have to be given. On the other hand, in our model, fixed heat sources are generated
automatically along the occluding boundaries. In other words, the hypotheses in our
system are defined along boundaries, not at junction points.
To be more precise, regions from local segmentation are now grouped into diffusion
groups based on their average gray values and on whether they are occluded by common
regions. Segments that belong to one diffusion group are diffused simultaneously. For
a figural segment, a buffer with a given radius is generated. Within the buffer, the
values are set to 1 for pixels belonging to the region and 0 otherwise. If there is no
figural segment in the diffusion group, it is the background, which is always the entire
image. Because the figure-ground segregation has been solved, with respect to the
diffusion group, only the parts that are being occluded need to be completed. Now
the problem becomes a well-defined mathematical problem. We need to solve the
heat equation with given boundary conditions. Currently, diffusion is implemented
through local updating. The results from diffusion are then binarized using a threshold
of 0.5.
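The completion stage can be sketched as clamped diffusion on a grid: heat sources along owned boundaries are held at 1, while occluding boundaries owned by other figures are held at 0 so they block heat, and the steady state is thresholded at 0.5. This is a schematic version of the idea (treating blocking boundaries as cold walls for simplicity), not the exact implementation.

```python
import numpy as np

def complete_surface(source, blocked, n_iter=500):
    """Surface completion by local diffusion.

    source : boolean mask of fixed heat sources (pixels owned by the figure)
    blocked: boolean mask of occluding boundaries that block diffusion
    Returns the diffused field binarized at 0.5.
    """
    u = source.astype(float).copy()
    for _ in range(n_iter):
        # Average of the 4-neighbors, replicating values at the image edge.
        up    = np.vstack([u[:1], u[:-1]])
        down  = np.vstack([u[1:], u[-1:]])
        left  = np.hstack([u[:, :1], u[:, :-1]])
        right = np.hstack([u[:, 1:], u[:, -1:]])
        u = (up + down + left + right) / 4.0
        u[blocked] = 0.0          # occluding boundaries block diffusion
        u[source] = 1.0           # fixed heat sources stay clamped
    return u > 0.5
```

In this toy form, heat from a source column cannot cross a blocking column, so completion spreads only within the occluded parts of the figure's own surface, which is the behavior described above.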
Figure 6.4 shows the results of Figure 6.1 after surface completion. Here the
two gray regions are grouped together through surface completion because occluded
boundaries allow diffusion. Figure 6.5(a) shows the result using a layered representa-
tion to show the relative depth between the surfaces. While the order in this example
is well defined, in general the system can handle surfaces that are overlapped with
each other, making the order ill-defined.
6.4 Experimental Results
For all the experiments shown in this chapter, we use a fixed set of parameters
for the figure-ground segregation network. Given an input image, the system auto-
matically constructs the network and establishes the connections based on the rules
discussed in Section 6.2.2.
Figure 6.4: Surface completion results for Figure 6.1(a). (a) White region. (b) Gray region. (c) Black region.
Figure 6.5: Layered representation of surface completion for results shown in Figure 6.4.
Figure 6.6: Images with virtual contours. (a) Kanizsa triangle. (b) Woven square. (c) Double Kanizsa.
We first demonstrate that the system can simulate virtual contours and modal
completion. Figure 6.6 shows the input images and Figure 6.7 shows the results. The
system correctly solves the figure-ground segregation problem and generates the most
probable percept. In Figure 6.6(b), the rectangular-like frame is tilted, making the
order between the frame and virtual square not well-defined. Our system handles that
in the temporal domain. At any given time, the system outputs one of the completed
surfaces. Due to this, the system can also handle the case in Figure 6.6(c), where the
perception is bistable, as the order between the two virtual squares is not defined.
Figure 6.8 shows three images whose optimal percepts are difficult to simulate
with a single existing model. Our system, even with fixed parameters, generates
the outputs shown in Figure 6.9 because it allows interactions between
shape information and non-accidental alignment. In Figure 6.8(a), the pacman pattern
is not very stable and gives rise to virtual contours. However, in Figure 6.8(b), the
symmetric crosses are more stable and the lateral connections are much stronger, and
the perception of four crosses generated from the system is consistent with that in the
psychological literature [64]. In Figure 6.8(c), the crosses are not symmetric any more
and are perceived as overlapping rectangular bars, which is shown in Figure 6.9(c).
Figure 6.7: Surface completion results for the corresponding image in Figure 6.6.
Neither the model by Williams and Hanson [141] nor that of Geiger et al. [34] correctly
handles the case shown in Figure 6.8(b).
Figure 6.10 shows three variations of pacman images. The results from our system
are shown in Figure 6.11. While our system handles all of
them in a similar way and generates correct results, edge-based approaches tend to have
problems, as pointed out in [141], because the edges have different contrast
signs. These examples are strong evidence for the boundary-pair representation.
Figure 6.12 (a) and (b) show well-known examples by Bregman [9]. While the
edge elements in both cases are similar, the perception is quite different. In Figure
6.12(a), there is no perceptual grouping and parts of B’s remain fragmented. However,
when occlusion is introduced as in Figure 6.12(b), perceptual grouping is evident and
fragments of B’s are grouped together. These perceptions are consistent with our
Figure 6.8: Images with virtual contours. (a) Kanizsa triangle. (b) Four crosses. (c) Overlapping rectangular bars.
Figure 6.9: Surface completion results for the corresponding image in Figure 6.8.
Figure 6.10: Images with virtual contours. (a) Original pacman image. (b) Mixed pacman image. (c) Alternate pacman image.
Figure 6.11: Layered representation of surface completion for the corresponding images shown in Figure 6.10.
results shown in Figure 6.13 (a) and (b). This is also strong evidence for boundary-
pair representation and against edge-based approaches. It shows clearly that grouping
plays a very important role in recognition. Figure 6.12(c) shows an image of a grocery
store used in [98]. Even though the T-junction at the bottom is locally confusing,
our system gives the most plausible result through the lateral influence of the other
two strong T-junctions. Without search and parameter tuning, our system gives the
optimal solution shown in Figure 6.13(c).
6.5 Conclusions
One of the critical advantages of our model is that it allows interactions among
different modules dynamically and thus accounts for more context-sensitive behav-
iors. It is not clear to us whether there exists an energy function for the model.
Nodes belonging to one region can be viewed as a Markov random field because the
Figure 6.12: Bregman and real images. (a) and (b) Examples by Bregman [9]. (c) A grocery store image.
(a) (b) (c)
Figure 6.13: Surface completion results for images shown in Figure 6.12.
Figure 6.14: Bistable perception. (a) Face-vase input image. (b) Faces as figures. (c) Vase as figure.
influence is defined locally in the network. However, the inhibition between paired
nodes introduced in (6.2) complicates the system analysis.
Multiple solutions can also be generated by our model. A simple way is through
self-inhibition; here we demonstrate it through habituation [132]. It is well known
that the strength of responses decreases when a stimulus is presented repeatedly.
Figure 6.14(a) shows an image where either two faces or a vase can be perceived, but
not both at the same time. Figure 6.14(b) and (c) show the two possible results
using the layered representation. In Figure 6.14(b), two faces are perceived and the
vase is suppressed into the background; Figure 6.14(c) shows the other case. Here
the differences can be seen from the middle layer. By introducing habituation, our
system offers a computational explanation. As shown in Figure 6.15, two faces and
vase alternate to be figural, resulting in bistable percept. This example demonstrates
that top-down influence from memory and recognition can be naturally incorporated
in the network.
Figure 6.15: Temporal behavior of the system for Figure 6.14(a). Dotted lines are 0.5.
CHAPTER 7
EXTRACTION OF HYDROGRAPHIC REGIONS FROM
REMOTE SENSING IMAGES USING AN OSCILLATOR
NETWORK WITH WEIGHT ADAPTATION
We study the extraction of objects with accurate boundaries from remote sensing
images. We use a locally excitatory globally inhibitory oscillator network (LEGION)
as a representational framework to combine the advantages of classification-based
methods and locally coupled networks. A multi-layer perceptron is used to select
seed points within regions to be extracted. The boundaries of the extracted regions
are accurately located through a topology preserving LEGION network. A novel
weight-adaptation method, which preserves significant boundaries between regions
and smoothes details due to variations and noise, is employed to increase the robust-
ness of the system. Together, these provide a generic framework for feature extraction.
A functional system has been developed and applied to hydrographic region extrac-
tion from Digital Orthophoto Quarter-Quadrangle (DOQQ) images. Experimental
results show that the extracted regions are comparable with existing topographic
maps and can be used for map revision. Preliminary versions appear in [78] [85] [84]
[76] [77]. Weight adaptation was first proposed by Chen et al. [17].
7.1 Introduction
With the availability of remotely sensed high-resolution imagery and advances
in computing technologies, cost-effective and efficient ways to generate accurate ge-
ographic information are possible. Because geographic information is implicitly en-
coded in images, the critical step is how to extract geographic information from im-
ages and make it explicit. While humans can efficiently and robustly extract desired
features from images, image understanding is regarded as one of the most difficult
problems in machine intelligence. For remote sensing applications, classification is
one of the most commonly used techniques to extract quantitative information from
images. Because multilayer perceptrons can potentially approximate complex deci-
sion functions, they have been widely used for solving many practical problems and
specifically in remote sensing applications [38] [126] [2] [121]. The major advantages
of neural network approaches are that no prior information is required and parameters
can be obtained automatically through training [115]. A major disadvantage is that
neural networks classify each pixel individually and do not incorporate contextual
information, i.e., the relationship among neighboring pixels, resulting in relatively
poor performance when the variations within a class are large. To illustrate the prob-
lem, Figure 7.1(a) shows a noisy synthetic image and Figure 7.1(b) shows the ground
truth image. A three-layer perceptron is trained using a standard back-propagation
algorithm [46] with 12 positive and 12 negative examples as shown in Figure 7.1(c).
Figure 7.1(d) shows the classification result. While the central regions are classified
correctly, the boundaries are not located accurately, resulting in a large classification
error as shown in Table 7.1.
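The kind of pixel-wise classifier described here can be sketched with a small three-layer perceptron trained by batch back-propagation. This is a minimal illustration with cross-entropy gradients; the layer sizes, learning rate, and the toy one-dimensional feature below are our own choices, not those of the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mlp(X, y, hidden=8, lr=1.0, epochs=5000, seed=0):
    """Train a three-layer perceptron (input-hidden-output) by batch
    back-propagation on per-pixel feature vectors.
    X: (n, d) feature vectors; y: (n,) labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(0.0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    t = y.reshape(-1, 1).astype(float)
    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)              # hidden activations
        o = sigmoid(h @ W2 + b2)              # output probability
        do = o - t                            # cross-entropy output error
        dh = (do @ W2.T) * h * (1.0 - h)      # back-propagated hidden error
        W2 -= lr * h.T @ do / len(X); b2 -= lr * do.mean(axis=0)
        W1 -= lr * X.T @ dh / len(X); b1 -= lr * dh.mean(axis=0)
    return lambda Z: (sigmoid(sigmoid(Z @ W1 + b1) @ W2 + b2) > 0.5).ravel()
```

Because such a classifier sees each pixel's feature vector in isolation, it ignores the relationships among neighboring pixels entirely, which is exactly why the boundaries in Figure 7.1(d) are poorly located.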
Figure 7.1: Classification result of a noisy synthetic image using a three-layer perceptron. (a) The input image with size of 230 × 240. (b) The ground truth image. (c) Positive and negative training samples. Positive examples are shown as white and negative ones as black. (d) Classification result from a three-layer perceptron.
Classification can be posed as a statistical mapping from observations to specified
classes. In general, the design of a statistical pattern classification algorithm requires
some prior statistical information. With assumptions of specific distribution and er-
ror functions, many statistical classifiers have been proposed [25] [109] and have been
widely used in remote sensing applications [3] [109] [39]. The statistical formulation
provides a unified way to incorporate prior knowledge and contextual information.
Contextual classifiers explicitly incorporate contextual information in classification
[32] [57], resulting in significant performance improvement. Markov Random Fields,
as a special case in specifying contextual information, have been successful in image
restoration, modeling, and segmentation [35]. By specifying joint distributions through Gibbs distributions, Markov Random Fields reduce the central computational task to an optimization problem for non-convex, high-dimensional energy functions. Because exact optimization is computationally prohibitive, in practice only a local optimum can
be obtained. With approximate but more efficient optimization algorithms, Markov
Random Fields have been applied widely in remote sensing applications [110] [118].
As we can see from the development of classification algorithms, one of the criti-
cal issues in improving classification performance is how to incorporate contextual
information.
We attempt to develop a framework for automated feature extraction that can
derive accurate geographic information for map revision and be applied to very large
images, such as DOQQ images. In this chapter, we pose the automated feature ex-
traction problem as a binding problem. Pixels that belong to desired regions should
be bound together to form desired objects. We use a LEGION network [123] [134]
[135], which is a generic framework for binding and image segmentation. As shown
analytically, LEGION networks can achieve both synchronization within an oscillator
group representing a region and desynchronization among different oscillator groups
rapidly. This offers a theoretical advantage over pure local networks such as Markov
Random Fields [35], where efficient convergence has not been established in general.
To improve the performance, we incorporate contextual information through a weight
adaptation method proposed by Chen et al. [17]. As multiple scales in edge detection
algorithms [74] can be viewed as a way of incorporating contextual information by
relating detected structures at different scales, weight adaptation can be viewed as a
concrete multiple scale approach with two scales. Instead of applying the same opera-
tors at different scales, in weight adaptation, statistical information from a larger scale
is derived and mainly used to govern a locally coupled adaptation process, resulting in
accurate boundary localization and robustness to noise. We assume that features to
be extracted are specified through examples and use a multilayer perceptron trained
through back-propagation [46] for seed selection. To reduce the number of necessary
training samples and increase the generalization of the classification method, instead
of classifying the entire image, we use a multilayer perceptron to identify seed points
only. LEGION is then used to provide a framework for integration. We have de-
veloped a functioning system using the proposed method for hydrographic feature
extraction from DOQQ images and have obtained satisfactory results.
This chapter is organized as follows. Section 7.2 introduces weight adaptation.
Section 7.3 describes a multilayer perceptron for automated seed selection. Section
7.4 provides experimental results using DOQQ images. Section 7.5 concludes the
chapter with further discussions.
7.2 Weight Adaptation
Given LEGION dynamics, as presented in Section 2.2, to extract desired features,
we need to form local connections based on the input image. To simplify the notation for weight adaptation, we re-write the coupling term S_ij in (2.1a) as [17]:

S_ij = Σ_{(k,l)∈N(i,j)} [H(x_kl) / (1 + |W_ij;kl|)] · log( Σ_{(k,l)∈N(i,j)} H(x_kl) + 1 ) − W_z H(z − θ_z).   (7.1)
As in (2.2), the first term here is the total excitatory coupling that oscillator (i, j)
receives from the oscillators in a local neighborhood N(i, j), and Wij;kl is the dynamic
connection from oscillator (k, l) to (i, j). Here Wij;kl encodes dissimilarity to simplify
the equations for weight adaptation, hence the reciprocal³. Note that, unlike in (2.2),
the first term implements a logarithmic grouping rule [17], which generates better
segmentation results in general.
As illustrated in Figure 2.3, effective couplings in a local neighborhood are used.
Without introducing assumptions about the desired features, Wij;kl in general can
be formed based on the intensity values at the corresponding pixels (i, j) and (k, l)
in the input image. However, due to variations and noise in real images, individual
pixel values are not reliable, and the resulting connections would be noisy and give
undesirable results. Figure 7.2(a) shows a one-dimensional signal which is a row from
the image shown in Figure 7.1(a). As shown in Figure 7.2(b), the connections formed
based on input intensity values are noisy, which lead to undesired region boundaries.
³This interpretation is different from a previous one used in (2.2) [123] [135]. After algebraic manipulations, the equations can be re-written in terms of the previous, more conventional interpretation.
Figure 7.2: Lateral connection evolution through weight adaptation illustrated using the 170th row from the image shown in Figure 7.1(a). (a) The original signal. (b) Initial connection weights. (c) Connection weights after 40 iterations. (d) Corresponding smoothed signal.
To overcome this problem, we use a weight adaptation method for noise removal
and feature preservation [17]. For each oscillator in the network, two kinds of con-
nections, namely, fixed and dynamic connections, are introduced. For oscillator (i, j),
the fixed connectivity specifies a group of neighboring oscillators which affect the os-
cillator, and the associated neighborhood is called lateral neighborhood Nl(i, j). On
the other hand, the dynamic connectivity encodes the transient relationship between
two oscillators in a local neighborhood during weight adaptation, and the associated
neighborhood is called the local neighborhood N(i, j). To achieve accurate boundary localization, in this chapter N(i, j) is defined as the eight nearest neighbors of (i, j), as depicted in Figure 2.3. Fixed connection weights are established based on the
input image, while dynamic connection weights adapt themselves for noise removal and feature preservation, resulting in interactions between two scales. Intuitively, the dynamic weight between two oscillators should be adapted so that its absolute value becomes small if the corresponding pixels are in a homogeneous region, while
the weight should remain relatively large if the corresponding pixels cross a boundary
between different homogeneous regions. Based on the observation that most of the
discontinuities in the lateral neighborhood Nl(i, j) correspond to significant features,
such discontinuities should remain unchanged and be used to control the speed of
weight adaptation for preserving features. Such discontinuities in the lateral neigh-
borhood are called lateral discontinuities [17]. Furthermore, because proximity is a
major grouping principle [111], we use another measure that reflects local discontinu-
ities sensitive to the changes of local attributes among local oscillators. The lateral
neighborhood provides a more reliable statistical context, by which the weight adap-
tation algorithm is governed. The local neighborhood utilizes the statistical context
and local geometrical constraints to adaptively change the local connections. These
two discontinuity measures are jointly incorporated in weight adaptation.
Mathematically, the weight adaptation method is formulated as follows. For oscillator (i, j), the weight of its fixed connection from oscillator (k, l) in N_l(i, j), T_ij;kl, is defined as the difference between the external stimuli received by (i, j) and (k, l), i.e.,

T_ij;kl = I_kl − I_ij.   (7.2)

Here I_ij and I_kl are the intensities of pixels (i, j) and (k, l), respectively. For oscillator (i, j), the fixed connections exist only in N_l(i, j) and T_ij;kl = −T_kl;ij. On the other hand, to achieve accurate boundary localization, a dynamic connection weight from oscillator (k, l) to oscillator (i, j), W_ij;kl, is defined only within a local neighborhood N(i, j) and initialized to the corresponding fixed weight, i.e., W^(0)_ij;kl = T_ij;kl. The dynamic weight |W^(t)_ij;kl| encodes the dissimilarity between oscillators (i, j) and (k, l) at time t.
First, the variance of all the fixed weights associated with an oscillator is used to measure its lateral discontinuities. For oscillator (i, j), the mean of its fixed weights on N_l(i, j), µ_ij, is calculated using

µ_ij = ( Σ_{(k,l)∈N_l(i,j)} T_ij;kl ) / |N_l(i, j)|.   (7.3)

Accordingly, we compute the variance of its fixed weights, σ²_ij, by

σ²_ij = ( Σ_{(k,l)∈N_l(i,j)} (T_ij;kl − µ_ij)² ) / |N_l(i, j)|
      = ( Σ_{(k,l)∈N_l(i,j)} T²_ij;kl ) / |N_l(i, j)| − ( ( Σ_{(k,l)∈N_l(i,j)} T_ij;kl ) / |N_l(i, j)| )².   (7.4)

The variance, σ²_ij, is normalized through

σ²_ij = (σ²_ij − σ²_min) / (σ²_max − σ²_min).   (7.5)
Here σ²_max is the maximal variance across the entire image and σ²_min the minimal. Intuitively, σ²_ij encodes the lateral discontinuities for oscillator (i, j). Oscillators corresponding to significant features tend to have large σ²_ij and vice versa. Based on this observation, the local discontinuity of an oscillator with a high lateral discontinuity should be preserved; the local attributes of an oscillator with a low lateral discontinuity should adapt towards homogeneity.
To preserve accurate region boundaries during weight adaptation, local discon-
tinuities are detected along four orientations, namely vertical (V), horizontal (H),
diagonal (D), and counter-diagonal (C), respectively. Accordingly, we define four
detectors for oscillator (i, j) as

D^H_ij = |W_ij;i−1,j − W_ij;i+1,j|,   (7.6a)
D^V_ij = |W_ij;i,j−1 − W_ij;i,j+1|,   (7.6b)
D^C_ij = |W_ij;i−1,j−1 − W_ij;i+1,j+1|,   (7.6c)
D^D_ij = |W_ij;i−1,j+1 − W_ij;i+1,j−1|.   (7.6d)

If there is an edge through (i, j) in one of these four orientations, the corresponding detector will respond strongly. Based on the responses from the four detectors, a measure of local discontinuity for oscillator (i, j) is defined as

D_ij = (D^H_ij + D^V_ij + D^C_ij + D^D_ij) / 4.   (7.7)

Here D_ij is sensitive to local discontinuity along all four orientations.
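For a single oscillator, the detectors (7.6) and the measure (7.7) can be sketched as follows (the offset-keyed dictionary for the eight dynamic weights is a hypothetical representation chosen for this sketch):

```python
def local_discontinuity(W):
    """Orientation detectors of Eq. (7.6) and their average, Eq. (7.7),
    at one oscillator (i, j). `W` maps a neighbor offset (dk, dl) to the
    dynamic weight W_ij;kl of that neighbor."""
    DH = abs(W[(-1, 0)] - W[(1, 0)])     # detector along rows, Eq. (7.6a)
    DV = abs(W[(0, -1)] - W[(0, 1)])     # detector along columns, Eq. (7.6b)
    DC = abs(W[(-1, -1)] - W[(1, 1)])    # counter-diagonal pair, Eq. (7.6c)
    DD = abs(W[(-1, 1)] - W[(1, -1)])    # diagonal pair, Eq. (7.6d)
    return (DH + DV + DC + DD) / 4.0     # Eq. (7.7)
```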
To integrate the lateral and local attributes of oscillator (i, j) and realize noise removal and feature preservation, V^(t)_ij is introduced based on σ²_ij and D_ij:

V^(t)_ij = ( Σ_{(k,l)∈N(i,j)} exp[−(κΦ(σ²_kl, θ_σ) + D^(t)_kl / s)] W^(t)_ij;kl ) / ( Σ_{(k,l)∈N(i,j)} exp[−(κΦ(σ²_kl, θ_σ) + D^(t)_kl / s)] ),   (7.8)
where s (s > 0) is a parameter that determines to what extent local discontinuities
should be preserved, and κ (κ > 0) is to control to what extent features should be
preserved in terms of lateral discontinuities during weight adaptation. The function
Φ(ν, θ) is a rectification function, defined as Φ(ν, θ) = ν if ν ≥ θ and Φ(ν, θ) = 0 oth-
erwise. To deal with variations due to noisy details, θσ (0 ≤ θσ ≤ 1) is introduced to
alleviate the influence of noise in the estimation of lateral discontinuities. The degree
of lateral discontinuities in an image gives a measure of the significance of the corre-
sponding discontinuities. In (7.8), if the lateral discontinuities of all the oscillators in
N(i, j) are similar, their local discontinuities should play a dominant role in updat-
ing V(t)ij . In this case, the contribution of dynamic weight W
(t)ij;kl in updating V
(t)ij is
determined by the corresponding local discontinuity D(t)kl of oscillator (k, l) in N(i, j).
If D(t)kl is small, W
(t)ij;kl has a large contribution, and vice versa. The local attributes of
oscillator (i, j) are changed through updating V(t)ij so that the dissimilarity between
(i, j) and its neighboring oscillators is reduced with respect to s. Noise removal along
with feature preservation is achieved through the reduction of dissimilarity in terms
of local discontinuities. On the other hand, when neighboring oscillators of (i, j)
have different lateral discontinuities, both lateral and local discontinuities must be
employed in determining the contribution from W(t)ij;kl to V
(t)ij . In this case, when
the overall discontinuities associated with oscillator (k, l) are relatively small, W(t)ij;kl
makes a large contribution. Lateral and local discontinuities jointly provide a robust
way to realize feature preservation and noise removal for those oscillators associated
with large lateral discontinuities, i.e. σ2kl ≥ θσ. The local attributes of oscillator (i, j)
are adapted towards reduction of the dissimilarity between (i, j) and the neighbor-
ing oscillators with relatively small overall discontinuities. The dissimilarity between
198
(i, j) and those with relatively large overall discontinuities tends to remain relatively
large.
Based on (7.8), we define the weight adaptation for W_ij;kl as

W^(t+1)_ij;kl = W^(t)_ij;kl + [ exp(−κΦ(σ²_kl, θ_σ)) V^(t)_kl − exp(−κΦ(σ²_ij, θ_σ)) V^(t)_ij ].   (7.9)

Here, W^(t+1)_ij;kl is updated based on the local attributes associated with (i, j) and (k, l). The lateral discontinuity further determines gain control during weight adaptation.
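One step of (7.8) and (7.9) for a single oscillator can be sketched as follows (a minimal sketch: the packaging of per-neighbor triples and the function names are ours, and the default parameter values are those reported in Section 7.4.1):

```python
import math

def rect(v, theta):
    """Rectification function: Phi(v, theta) = v if v >= theta, else 0."""
    return v if v >= theta else 0.0

def V(neighbors, kappa=60.0, s=10.0, theta_sigma=0.02):
    """Eq. (7.8): gain-weighted average of the dynamic weights W_ij;kl.
    `neighbors` is a list of (W_ijkl, sigma2_kl, D_kl) triples, one per
    oscillator (k, l) in N(i, j)."""
    num = den = 0.0
    for w, sig2, d in neighbors:
        # neighbors with small lateral (sigma2) and local (D) discontinuities
        # contribute with larger gain
        g = math.exp(-(kappa * rect(sig2, theta_sigma) + d / s))
        num += g * w
        den += g
    return num / den

def adapt_weight(W_ijkl, V_ij, V_kl, sig2_ij, sig2_kl,
                 kappa=60.0, theta_sigma=0.02):
    """Eq. (7.9): one adaptation step for W_ij;kl, with the lateral
    discontinuities acting as gain control."""
    return W_ijkl + (math.exp(-kappa * rect(sig2_kl, theta_sigma)) * V_kl
                     - math.exp(-kappa * rect(sig2_ij, theta_sigma)) * V_ij)
```

In a homogeneous region (lateral variances below θ_σ, local discontinuities near zero), V reduces to a plain average of the neighboring dynamic weights, so repeated application of (7.9) drives the weights toward each other, i.e., toward small dissimilarity.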
Figure 7.2(c) and (d) show the adapted connection weights and the smoothed
signal after 40 iterations. From Figure 7.2(c), one can see that pixels belonging to
the same region are strongly coupled, while couplings between pixels belonging to
different regions are weak (as pointed out earlier, in this chapter the coupling is the reciprocal of the connection weight). This greatly improves LEGION segmentation
performance when noise is large.
This method can be viewed as a simplified and efficient way of utilizing and
integrating information from multiple scales. However, it is different from general
multiple scale approaches [74]. Instead of applying the same operators at different scales, in the lateral neighborhood, statistical information is derived and mainly used to guide the local weight adaptation; in local windows, geometrical constraints are enforced through local coupling, preserving significant boundaries precisely. The weight
adaptation scheme is closely related to nonlinear smoothing algorithms [105]. It
preserves significant discontinuities while adaptively smoothing variations caused by
noise. Compared to existing nonlinear smoothing methods, the weight adaptation
method offers several distinctive advantages [17]. First, it is insensitive to termination
conditions while many existing nonlinear methods critically depend on the number of
iterations. Second, it is computationally fast. Third, by integrating information from
different scales, this method can generate better results. Quantitative comparisons
with other methods, including various smoothing algorithms, are provided in [17].
7.3 Automated Seed Selection
Both LEGION networks and weight adaptation methods are generic approaches,
where no assumptions about the features being extracted are made. To extract de-
sired features from remote sensing images, we need to specify relevant features. One
way to do this is to use certain parametric forms based on assumptions about the
statistical properties of features and noise. However, for map revision and other re-
mote sensing applications, images may be acquired under different conditions and
even through different sources, such as DOQQ images. These make it very difficult
to model the features using parametric forms. A more feasible way is to specify the
desired features through positive and negative examples. There are essentially two
broad categories of approaches for solving the problem. Statistics-based approaches require knowledge of the statistical distributions [3] [115], which may not be available or may be difficult to obtain due to different data sources and acquisition conditions.
Artificial neural networks provide an alternative approach. Because they can approximate arbitrarily complex functions, artificial neural networks are suitable for describing
features being extracted in remote sensing applications, where many factors are in-
volved in generating the features in input images. The major advantages of using
neural networks include that no prior knowledge of statistical distribution is needed
and parameters can be obtained automatically through training [3] [115]. Artificial
neural networks, especially multilayer perceptrons, have been widely used in remote
sensing applications [38] [126] [2] [121].
To apply a multilayer perceptron, a number of design choices must be made.
These choices may greatly affect the convergence of the network and learning results.
For our task, we use a three-layer perceptron, with four input units, three hidden
units, and one output unit, as shown in Figure 7.3. It is trained using a standard
back-propagation algorithm [46]. If we present the pixels in the training examples
directly to the network, we observe that many training examples are necessary to
achieve good results. Due to potentially conflicting training conditions, the network often
does not converge. To achieve rotational invariance and reduce the necessary number
of training samples, instead of presenting the training windows directly to the network
we extract several local attributes from training windows as input to the network.
More specifically, we use the average value, minimum value, maximum value, and
variance from training samples. These values are normalized to improve training and
classification results.
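The four local attributes above can be computed per training window; a minimal sketch (the normalization to fixed per-attribute ranges is an assumption, since the text does not specify the normalization scheme):

```python
import numpy as np

def window_features(window):
    """Local attributes fed to the seed-selection perceptron:
    minimum, maximum, average, and variance of a training window.
    Using order-free statistics makes the input rotation invariant."""
    w = np.asarray(window, dtype=float)
    return np.array([w.min(), w.max(), w.mean(), w.var()])

def normalize(feats, lo, hi):
    """Scale each attribute into [0, 1] given per-attribute ranges
    observed over the training set (a hypothetical choice)."""
    return (feats - lo) / (hi - lo)
```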
To further reduce the number of necessary training samples and deal with varia-
tions within features being extracted, we use a three-layer perceptron for seed selection
only. In other words, instead of classifying all the pixels directly, we train the network
to find pixels that are within a large region to be extracted. The accurate boundaries
are derived using an oscillator network with weight adaptation. As demonstrated
in the next section, a small number of training samples is sufficient for very large images
with significant variations. Our training methodology offers a distinctive advantage
in reducing necessary training examples. In contrast, many experiments often divide
the entire data into training and test sets with approximately the same size to achieve
good performance [118], ending up using many more training samples.
Figure 7.3: Architecture and local features for the seed selection neural network.
7.4 Experimental Results
We have developed a fully functioning system with a user-friendly graphical in-
terface using the proposed method. As shown in Figure 2.3, we construct a two-
dimensional LEGION network, where each pixel in the input image corresponds to
an oscillator. Seed points are extracted using a three-layer perceptron, the output
of which is used to determine the leaders in the LEGION network. Oscillators in a
major seed region develop high potentials and thus are oscillating. In this chapter,
a seed region is considered to be a major one if its size is larger than a threshold
θp. The dynamic connection Wij;kl between oscillator (i, j) and (k, l) is established
based on the weight adaptation method presented in Section 7.2. We have developed a computationally faster LEGION algorithm for grouping, compared to the one given in [135]. The result of region extraction corresponds to all the oscillating oscillators.
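Seed-region selection by size can be sketched as follows (a minimal sketch; the text does not specify the connectivity used for seed regions, so 4-connectivity is assumed here):

```python
from collections import deque

def major_seed_regions(seed_mask, theta_p):
    """Keep only connected seed regions larger than theta_p; oscillators in
    such a major region become leaders (develop high potential).
    seed_mask is a 2-D list of 0/1 values; 4-connectivity is assumed."""
    H, W = len(seed_mask), len(seed_mask[0])
    seen = [[False] * W for _ in range(H)]
    out = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            if seed_mask[i][j] and not seen[i][j]:
                # flood-fill one connected seed region
                comp, q = [], deque([(i, j)])
                seen[i][j] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < H and 0 <= nx < W
                                and seed_mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            q.append((ny, nx))
                if len(comp) > theta_p:   # major seed region
                    for y, x in comp:
                        out[y][x] = 1
    return out
```

Isolated misclassified seed pixels form regions far smaller than θ_p and are therefore discarded, which is why noisy seed points cannot generate spurious extracted regions.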
7.4.1 Parameter Selection
Our method consists of three relatively independent systems. For LEGION net-
works, except Wz and θp, all other parameters are system parameters and application
independent. For weight adaptation, three parameters are involved. In order to
achieve optimal results, these parameters are application dependent and subject to
tuning. However, as demonstrated in [17], these parameters can be fixed for images
of a similar type. In this chapter, they are fixed for all the experiments: s = 10.0,
κ = 60.0, and θσ = 0.02. The lateral neighborhood Nl(i, j) consists of 7×7 oscillators.
Wij;kl is updated for 40 iterations using (7.9).
The three-layer perceptron for seed selection is trained using a standard back-
propagation with momentum. The learning rate is set to 0.025 and the momentum
parameter is set to 0.9. The activation function used is a sigmoid function, the
steepness parameter of which is 0.5. The training is terminated when the error from
all the samples is less than 0.05.
7.4.2 Synthetic Image
To demonstrate the capability of the proposed method, we have applied the sys-
tem to a noisy synthetic image shown in Figure 7.1(a). A three-layer perceptron is
trained using 12 positive and 12 negative samples, as shown in Figure 7.1(c). Each
sample is a window with 23× 23 pixels from the original image. Local attributes are
first calculated from the training samples and then fed to a three-layer perceptron as
input vectors. The classification result is shown in Figure 7.1(d). While the region
boundaries are not localized precisely, seed points within central regions are found
Figure 7.4: Segmentation result using the proposed method for a synthetic image. (a) The synthetic image shown in Figure 7.1(a). (b) The segmentation result from the proposed method. Here Wz = 0.25 and θp = 100.
correctly. Leaders in the LEGION network are then determined based on the classi-
fication result from the three-layer perceptron. With weight adaptation, the regions
are extracted with quite precise boundaries, as shown in Figure 7.4(b). As shown in
Table 7.1, the classification error rate is reduced significantly: the total error rate of the proposed method is 5.12%, compared with 45.07% when multilayer perceptron classification alone is applied. The obtained result is also comparable
with the best result obtained by an edge-based approach [114] with carefully tuned
parameters.
7.4.3 Hydrographic Region Extraction from DOQQ Images
We have applied the proposed method to extract hydrographic regions from sev-
eral DOQQ images. High resolution DOQQ images are readily available from many
commercial sources and a national coverage for the United States will be available in
the near future. Map revision and other remote sensing applications using DOQQ will
                   False nontarget rate           False target rate
Dataset            Classification   Proposed      Classification   Proposed
                   only             method        only             method
Synthetic image    45.07%           4.89%         0.00%            0.23%
Washington East    51.95%           11.70%        1.97%            0.75%
Damascus           32.56%           8.42%         1.79%            0.18%

Table 7.1: Comparison of error rates using neural network classification and the proposed method.
have significant values since geographical structures undergo constant alterations due
to seasonal and other natural and man-made changes. Because processing an entire DOQQ image at once requires very large memory, DOQQ images are partitioned
into tiles with user specified sizes. This partitioning may cause a boundary problem
where some tiles may contain only small parts of hydrographic regions and produce
no seeds by themselves using our seed selection criterion. This problem is resolved
by running the system for several additional iterations. At each iteration, the input
image is re-partitioned so that tiles sufficiently overlap with those from the previous
partition and seeds are generated using the extracted hydrographic regions so far.
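The overlapping re-partitioning can be sketched along one dimension as follows (tile size and overlap are user-specified; the function and parameter names here are hypothetical):

```python
def tile_spans(length, tile, overlap):
    """Partition the interval [0, length) into tiles of size `tile` whose
    consecutive spans overlap by `overlap` pixels, so that a region cut by
    one tile boundary lies in the interior of a neighboring tile."""
    assert 0 <= overlap < tile
    spans, start = [], 0
    while True:
        end = min(start + tile, length)
        spans.append((start, end))
        if end == length:
            return spans
        start = end - overlap  # step back to create the overlap
```

Applying this to both image axes yields the 2-D tiling; on each additional iteration the offsets can be shifted so that new tiles straddle the previous tile boundaries.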
The first DOQQ image, shown in Figure 7.5, is from the Washington East, D.C.-
Maryland area and the size is 6204× 7676 pixels. A three-layer perceptron is trained
using 19 positive and 28 negative representational examples where each sample is a
31 × 31 window. The trained network is then applied to classify the entire DOQQ
image, and seed points from classification are shown in Figure 7.6. While the pixels
in central and major river regions are correctly classified, the river boundaries are
rough. Also there are pixels that are misclassified as hydrographic seed points even
though they do not belong to any hydrographic regions. As shown in Table 7.1, the
false target rate is 1.97%. Here the false target rate is the ratio of the number of non-
hydrographic pixels which are misclassified to the total number of true hydrographic
pixels. Similarly, the false nontarget rate is the ratio of the number of hydrographic
pixels that are misclassified to the total number of true hydrographic pixels. The
ground truth, as shown in Figure 7.8 is generated by manual seed selection based
on a 1:24,000 topographic map from the United States Geological Survey (see Figure
7.10(c) for example) and DOQQ image. Figure 7.9(b) is generated from Figure 7.9(a)
using the same procedure. Leaders in LEGION are determined with θp = 4000.
Because noisy seed points cannot develop high potentials, no hydrographic regions are
extracted around those pixels. We apply a LEGION network with weight adaptation
where there are leaders in the tile being processed. Figure 7.7 shows the result
from our system. As shown in Table 7.1, both the false target and false nontarget
rates are reduced dramatically. The false target rate is reduced to 0.75%. Also the
hydrographic region boundaries are localized much more accurately, and thus the
false nontarget rate is reduced. Mainly because the Kenilworth Aquatic Gardens area, magnified in Figure 7.9(a), is not extracted, the false nontarget rate of the proposed method still stands at 11.70%. Because the aquatic area is statistically very similar to soil land, it is not possible to classify it correctly using the perceptron alone.
If we assume that seed points are correctly detected in the area, the proposed
method can correctly extract the aquatic region with high accuracy. The result,
shown in Figure 7.9(b), is generated by manually selecting seed points in the area.
This would reduce the error rate of the proposed method even further.
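The false target and false nontarget rates reported in Table 7.1 can be computed directly from binary result and ground-truth masks; a minimal sketch (the function and variable names are ours):

```python
def error_rates(result, truth):
    """False target rate and false nontarget rate, both expressed relative
    to the total number of true (target) pixels, as defined in the text.
    `result` and `truth` are flat 0/1 sequences of equal length."""
    true_total = sum(truth)
    false_target = sum(r and not t for r, t in zip(result, truth))
    false_nontarget = sum(t and not r for r, t in zip(result, truth))
    return false_target / true_total, false_nontarget / true_total
```

Note that both rates share the same denominator (the number of true target pixels), so the false target rate can in principle exceed 100% when non-target pixels greatly outnumber target pixels.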
More importantly, major hydrographic regions are extracted with accurate bound-
aries and cartographic features, such as bridges and islands, are preserved well. These
Figure 7.5: A DOQQ image with size of 6204 × 7676 pixels of the Washington East, D.C.-Maryland area.
Figure 7.6: Seed pixels obtained by applying a trained three-layer perceptron to the DOQQ image shown in Figure 7.5. Seed pixels are marked as white and superimposed on the original image. The network is trained using 19 positive and 28 negative samples, where each sample is a 31 × 31 window.
Figure 7.7: Extracted hydrographic regions from the DOQQ image shown in Figure 7.5. Hydrographic regions are marked as white and superimposed on the original image to show the accuracy of the extracted result. Here Wz = 0.15 and θp = 4000.
Figure 7.8: A ground truth generated by manually placing seeds based on the corresponding 1:24,000 USGS topographic map and DOQQ image. The result was manually edited.
(a) Input image. (b) Extracted regions.

Figure 7.9: Hydrographic region extraction result for an aquatic garden area with manually placed seed pixels. Because no reliable seed region is detected, this aquatic region, which is very similar to soil regions, is not extracted from the DOQQ image, as shown in Figure 7.7. Extracted regions are marked as white and superimposed on the original image.
are critically important for deriving accurate spatial information. The major river,
Anacostia River, is extracted correctly. Several roads crossing the river are preserved.
To demonstrate the effectiveness of our method in preserving important cartographic
features, Figure 7.10(a) shows a magnified area around the Kingman Lake. Figure
7.10(b) shows the classification result, Figure 7.10(c) shows the corresponding part
of the USGS 1:24,000 topographic map, and Figure 7.10(d) shows our result. Within
this image, intensity values and local attributes change considerably as shown in Fig-
ure 7.10(a). The boundaries of small islands are localized accurately even though
they are covered by forests. In weight adaptation, the information from the lateral
and local windows is jointly used when variances in a local neighborhood are large,
resulting in robust feature preservation and noise removal. Similarly, the forests along
the river banks are preserved well. A bridge that connects the lake and the river is also
preserved. As shown in Figure 7.10(a), the bridge is spatially small and it would be
very difficult for nonlinear smoothing algorithms to preserve this cartographic feature.
By comparing Figure 7.10(c) and (a), one can see that hydrographic regions have
changed from the map. Note, for example, the lower part of the left branch. This
geographical change illustrates our previous point about the constant nature of such
changes and the need for frequent map revision. With precise region boundaries from
our method, we believe that our system is suited for frequent map revision purposes.
While the major features are the same, the lake has shrunk in size, and such shrinkage
is captured by our algorithm (see Figure 7.10(d)). This suggests that our method can
be used for monitoring changes of hydrographic features.
We have also applied our system to the DOQQ image of a rural area around
Damascus, Pennsylvania-New York, with 5802 × 7560 pixels, shown in Figure 7.11.
Figure 7.10: Extraction result for an image patch from Figure 7.5. (a) The input image. (b) The seed points from the neural network. (c) A topographic map of the area. Here the map is scanned from the paper version and not warped with respect to the image. (d) Extracted result from the proposed method. Extracted regions are represented by white and superimposed on the original image.
Because this dataset is dramatically different from the Washington East dataset, the
multilayer perceptron is retrained using 25 positive and 17 negative training exam-
ples. Figure 7.12 shows the result from our method. Table 7.1 gives the quantitative
results from classification using the three-layer perceptron only and the proposed
method, compared against a ground truth. Again, the ground truth, shown in Figure 7.13, is generated based on a 1:24,000 topographic map of the area. Generally speaking, our extraction results for the Damascus image are comparable with those for the
Washington East image (in fact, a little better, as revealed in Table 7.1). The major river, the Delaware River, is extracted along with its branches with great accuracy, and small island features are preserved. However, a small river, the East Branch Callicoon Creek, located at the upper right in the DOQQ image, is missing. A careful check of the DOQQ image shows that the river is similar to its surrounding areas, suggesting that the creek is not prominent.
To summarize, the results generated for both DOQQ images are comparable with
the hydrographic regions shown in the United States Geological Survey 1:24,000 to-
pographic maps. In certain cases, our results reflect better the current state of the
geographical areas.
By using a multilayer perceptron, characteristics of hydrographic regions are captured with only a small number of training samples, even though there are considerable variations within the regions. This greatly alleviates the problem of parameter
selection. For the DOQQ images shown here, only two parameters, namely Wz and θp,
need to be changed. This offers a distinctive advantage over many existing methods,
where extensive parameter tuning is needed.
Figure 7.11: A DOQQ image with size of 5802 × 7560 pixels of the Damascus, Pennsylvania-New York area.
Figure 7.12: Extracted hydrographic regions from the DOQQ image shown in Figure 7.11. The extracted regions are represented by white pixels and superimposed on the original image.
7.5 Discussions
In this chapter, we present a novel computational framework for extracting geographic
features in general from remote sensing images. We demonstrate the feasibility
of the method on hydrographic region extraction, a task important for remote
sensing applications. By using LEGION as a segmentation framework, we combine
the advantages of different methods. Using a multilayer perceptron, parameters can be
selected much more easily and the system can be readily adapted for other types of
features. Because multilayer perceptrons do not incorporate geometrical constraints
among neighboring pixels, we use a trained perceptron only for seed selection, which
determines the leaders in a LEGION network. We have used a weight adaptation
method to adaptively change local weights based on the statistical context provided
by a large window. This preserves major region boundaries while smoothing out
details due to variations and noise. It also preserves cartographic features such as
bridges and islands, which carry important spatial information. Because geometric
constraints are incorporated in LEGION and weight adaptation, we achieve accurate
region boundaries. As shown by the numerical comparison, the proposed method
significantly reduces the classification error of the multilayer perceptron.
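The weight adaptation idea can be sketched concretely: the coupling between neighboring pixels is weakened only when their intensity difference is large relative to the variation measured in a much larger window. The exp(-diff/sigma) form below is one plausible reading of the scheme, not the system's exact formula.

```python
import numpy as np

def adapt_weights(image, win=9):
    """Adapt the coupling weight between each pixel and its right
    neighbor using statistics from a larger win x win window.

    Intensity differences are penalized relative to the local
    variation (std) over the big window, so small texture
    fluctuations keep strong couplings while true region boundaries
    get weak ones.  The exponential form is an illustrative choice.
    """
    h, w = image.shape
    r = win // 2
    weights = np.zeros((h, w - 1))
    for i in range(h):
        for j in range(w - 1):
            y0, y1 = max(0, i - r), min(h, i + r + 1)
            x0, x1 = max(0, j - r), min(w, j + r + 1)
            sigma = image[y0:y1, x0:x1].std() + 1e-6   # local context
            diff = abs(float(image[i, j]) - float(image[i, j + 1]))
            weights[i, j] = np.exp(-diff / sigma)
    return weights
```

On a noiseless step edge, couplings inside each flat region stay near 1 while the coupling across the step collapses, which is the boundary-preserving behavior described above.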
Compared with existing classification approaches in remote sensing applications,
our method offers several advantages. By using the multilayer perceptron for seed se-
lection only, our method greatly improves generalization of the classification method
and reduces the number of necessary training samples. As shown in the experimental
results, the network is trained using only about 50 training samples and is successfully
applied to classify images of 6,000 × 8,000 pixels. It would be very difficult, if not
impossible, to train a network to achieve results comparable to ours due to inconsis-
tency among different samples. Through weight adaptation, contextual information is
incorporated more efficiently and effectively. More importantly, our method extracts
boundaries that are comparable with features shown in topographic maps and thus
can be used for map revision purposes. To our knowledge, this is the only system
that has been demonstrated for map revision using DOQQ images.
Practically, we demonstrate through a prototype system that hydrographic features
can be extracted from DOQQ images with a high degree of automation. We have applied the
system to several DOQQ images and have obtained good results. With advances in
computing technology, very affordable systems can be built for map revision and other
feature extraction tasks. Compared with traditional map-making methods based on
aerial photogrammetry, our method is computationally efficient. We believe that this
kind of technology would be very useful for improving map revision and other remote
sensing applications. Furthermore, because remotely sensed images can be captured
more readily at high resolutions, efficient methods like the one proposed here are
necessary for generating up-to-date and accurate geographic information.
There are a number of improvements that can be made for our prototype system.
In the current version, features are extracted based on a single input image. While
good results have been obtained, they can be further improved by using data from
multiple sources. Feature extraction from multiple data sources has been studied
extensively [3] [118] [2]. Our method can potentially be applied to feature extraction
from multiple data sources. One possible way is to extend weight adaptation to vector
data by viewing the data from different sources in a vector form. A similar extension
has been done for nonlinear smoothing techniques [139]. Another constraint that is
not utilized is the relationship among different features; for example, when a road
crosses a river, there should be a bridge. In general, knowledge concerning different
features can provide contextual information at a higher level than that currently
incorporated in our system. With these improvements, a complete feature extraction
system could be feasible.
CHAPTER 8
CONCLUSIONS AND FUTURE WORK
8.1 Contributions of Dissertation
This dissertation has investigated computational issues at different levels of image
organization. The major contributions are:
• We propose a new similarity measure for range images and implement a range image
segmentation system using a LEGION network.
• We propose a contextual nonlinear smoothing algorithm and show that several
widely used nonlinear smoothing algorithms are special cases of it. The proposed
algorithm generates quantitatively better results and exhibits desirable properties
such as quick convergence.
• We propose the spectral histogram as a generic statistical feature for texture as
well as intensity images.
• We study image classification using the spectral histogram. We show that mean
and variance alone are not sufficient statistical features and that the distribution of
features is critically important for classification and segmentation.
• We propose a new energy function for image segmentation which expresses
explicitly the homogeneity criteria for segmentation. We implement an approx-
imate deterministic algorithm for image segmentation.
• We propose a method which can detect homogeneous texture regions in an input
image using the relationships between different scales.
• We propose a novel method for precise texture boundary localization utilizing
the structures of textures.
• We propose a boundary-pair representation and figure-ground segregation net-
work using temporal dynamics for perceptual organization.
• We propose a new computational framework for extracting features from re-
mote sensing images by combining the advantages of the learning-by-example
methods and locally coupled oscillator networks for better boundary accuracy.
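Several of the contributions above center on the spectral histogram. The sketch below shows how one is computed: each filter in a bank is applied to the image, the marginal distribution (normalized histogram) of each response is taken, and the distributions are concatenated. The tiny difference-filter bank here is a stand-in for the Gabor and Laplacian-of-Gaussian filters used in the dissertation.

```python
import numpy as np

def spectral_histogram(image, n_bins=11):
    """Compute a spectral histogram: the concatenation of marginal
    distributions (normalized histograms) of filter responses.

    The filter bank -- raw intensity plus vertical and horizontal
    differences -- is a minimal placeholder for the full filter bank
    of the dissertation.
    """
    responses = [
        image,                     # intensity (delta filter)
        np.diff(image, axis=0),    # vertical gradient Dy
        np.diff(image, axis=1),    # horizontal gradient Dx
    ]
    hists = []
    for resp in responses:
        h, _ = np.histogram(resp, bins=n_bins)
        hists.append(h / h.sum())  # marginal distribution of this filter
    return np.concatenate(hists)
```

The resulting vector is translation-invariant within the window and captures the statistics, rather than the exact layout, of local structure, which is why it serves as a generic feature.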
8.2 Future Work
8.2.1 Correspondence Through Spectral Histograms
Two major areas in computer vision that are not addressed in this dissertation are
stereo matching and motion analysis. The central issue underlying both problems is
how to establish correspondence between input images, known as the correspondence
problem. We argue that the spectral histogram with the associated similarity measure
would potentially provide a solution to the correspondence problem. Because the
spectral histogram implicitly encodes the structures through marginal distributions,
it reduces matching ambiguities significantly compared to cross-correlation and other
feature-based matching techniques. For example, Figure 8.1 (a) and (b) show a stereo
image pair of a bridge. Without any assumptions of camera positions and matching
models, we extract the spectral histogram at a given pixel and find the matches in
the paired image through search. Figure 8.1 (c)-(e) show three examples, where the
middle image shows the probability of pixels in the paired image being a match of
the given pixel. In all three cases, the matched regions are identified uniquely
and correctly. In Figure 8.1(e), the matched region is not localized because the
pixels in the surrounding area are structurally similar. If we modify the algorithm
for automatic homogeneous texture region extraction proposed in Chapter 5, we can
essentially identify good features in one image and then find the matching region(s)
in the paired image. Parameters for transformation between the images can then
be estimated. From this example, one can see that the correspondence problem can
potentially be solved more effectively.
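The matching procedure described above can be sketched as follows. A plain intensity histogram over a window stands in for the full spectral histogram, and the conversion of L1 distances into a probability map via an exponential is an illustrative assumption, not the dissertation's exact measure.

```python
import numpy as np

def window_hist(image, y, x, w, n_bins=8):
    """Normalized intensity histogram of a w x w window -- a stand-in
    for the full spectral histogram feature."""
    patch = image[y:y + w, x:x + w]
    h, _ = np.histogram(patch, bins=n_bins, range=(0.0, 1.0))
    return h / h.sum()

def match_probability(left, right, y, x, w=6, temp=0.05):
    """Score every window position in the right image against the
    feature at (y, x) in the left image and return a normalized
    probability map (smaller distance -> higher probability)."""
    target = window_hist(left, y, x, w)
    H, W = right.shape
    dist = np.empty((H - w + 1, W - w + 1))
    for i in range(H - w + 1):
        for j in range(W - w + 1):
            dist[i, j] = np.abs(window_hist(right, i, j, w) - target).sum()
    score = np.exp(-dist / temp)    # distances -> affinities
    return score / score.sum()      # normalize to a probability map
```

Because the feature summarizes a whole neighborhood rather than a single pixel, the probability map peaks sharply at structurally matching regions, which is the reduced-ambiguity behavior argued for above.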
8.2.2 Integration of Bottom-up and Top-down Approaches
The purpose of a vision system is to localize and recognize important objects. To
achieve that, different cues need to be integrated. For example, contours
have long been recognized as an important feature in characterizing objects. However,
obtaining reliable contours from natural images remains difficult; edge detection al-
gorithms often give “meaningless” edges. While the spectral histogram incorporates
only photometric properties of surfaces and objects, i.e., intensity values, meaningful
contours can be extracted from the segmentation results using the algorithms pro-
posed in Chapter 5. Figure 8.2(a) shows a natural image of a giraffe. Figure 8.2(b)
shows a typical output from a Canny edge detector [13]. It is evident that deriving
Figure 8.1: A stereo image pair and correspondence using the spectral histogram. (a) The left image. (b) The right image. (c)-(e) The matching results of marked pixels in the left image. In each row, the left shows the marked pixel, the middle shows the probability of being a match in the paired image, and the right shows the high probability area in the paired image.
a reliable contour from the generated edges is not feasible due to the local ambigu-
ities. Figure 8.2(c) shows an initial segmentation result using the method proposed
in Chapter 5. One can see that a contour of the giraffe can be obtained easily. This
demonstrates that features from bottom-up processes must be generic.
Of course, no one can expect a perfect segmentation and recognition result from
purely bottom-up algorithms. The top-down influence from recognition plays an
important role in achieving the purpose of a vision system. For example, an iterative
procedure can be initiated based on the result shown in Figure 8.2(c). In this case,
both the photometric properties and the contour may suggest a giraffe with high
probabilities. With the top-down knowledge, the segmentation result and contour
can be improved by recovering the missing parts of the giraffe.
There are important computational issues that need to be addressed in order to
model the interactions between bottom-up and top-down processes. In this regard,
temporal correlation with LEGION as a concrete implementation provides an elegant
representational framework. By utilizing the temporal domain, LEGION provides
distinctive advantages that are unique to dynamic systems, and it is biologically plausible.
The current model of LEGION [123] [134] [135] employs only very local couplings,
which significantly limits its potential. By incorporating longer-range and top-down
couplings, a complete vision system is conceivable. We have obtained very promising
results by integrating bottom-up and top-down processes [133], which is not included
in this dissertation.
Figure 8.2: Comparison between an edge detector and the spectral histogram using a natural image of a giraffe. (a) The input image with size 300 × 240. (b) The edge map from a Canny edge detector [13]. (c) The initial classification result using the method presented in Chapter 5. A spectral histogram is extracted at pixel (209, 291) and the segmentation scale is 29 × 29. (d) The initial classification is embedded in the input image to show the boundaries.
8.2.3 Psychophysical Experiments
While we claim that spectral histograms provide a generic feature for images and
have demonstrated that by synthesizing a wide range of texture images, including reg-
ular patterns, texton images, and many other natural images, rigorous psychophysical
experiments are needed to solidify the hypothesis. The result for texture discrimination
discussed in Section 4.7 is very promising. However, the images used in the
experiment are synthetic and thus not representative of natural images. One
straightforward experiment is to test the correspondence between the observed
and synthesized images, as shown in Figure 4.16 and Figure 4.13. Another is
to have humans, as well as an algorithm based on spectral histograms, order a
set of textures. Other experiments can utilize our texture synthesis tool to
control the sharpness of texture boundaries and test boundary accuracy and
asymmetry in texture perception.
Given all the results we have achieved using spectral histograms, some of which
match human performance well, we would like to investigate whether spectral histograms
are biologically plausible. It is well known that neurons encode information through
temporal spikes, and spectral histograms can naturally and effectively be represented
in that way. Moreover, the distance measure we used between two spectral histograms
can be approximated through cross-correlations between them.
As hypothesized by von der Malsburg [130],
temporal correlation provides a mechanism for solving fundamental problems in per-
ception such as feature binding. Spectral histograms would then extend significantly
the functionalities that can be achieved through temporal correlation.
8.3 Concluding Remarks
While developing a generic computational system for seeing remains the dream of
many vision researchers, significant progress can certainly be made by pursuing the
fundamental problems in a natural environment. Certainly there are many plausible
approaches to computer vision, and the criterion for comparing them is how efficiently
vision tasks can be solved. Given the successful methods we have for relatively
independent problems such as segmentation and pattern recognition, the next steps are
to model the interactions among different modules and to integrate them effectively
into a complete vision system. It is my sincere hope that this work will provide
some useful insights for solving computer vision problems.
BIBLIOGRAPHY
[1] A. J. Bell and T. J. Sejnowski. “The ‘independent components’ of natural scenes are edge filters”. Vision Research, 37(23):3327–3338, 1997.
[2] J. A. Benediktsson and J. R. Sveinsson. “Feature extraction for multisource data classification with artificial neural networks”. International Journal of Remote Sensing, 18(4):727–740, March 1997.
[3] J. A. Benediktsson, P. H. Swain, and O. K. Ersoy. “Neural network approaches versus statistical methods in classification of multisource remote sensing data”. IEEE Transactions on Geoscience and Remote Sensing, 28(4):540–552, July 1990.
[4] P. J. Besl and R. C. Jain. “Segmentation through variable-order surface fitting”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(2):167–192, 1988.
[5] S. M. Bhandarkar, J. Koh, and M. Suk. “Multiscale image segmentation using a hierarchical self-organizing map”. Neurocomputing, 14:241–272, 1997.
[6] B. Bhanu, S. Lee, C. C. Ho, and T. Henderson. “Range data processing: Representation of surfaces by edges”. In Proceedings of the IEEE International Pattern Recognition Conference, pages 236–238, 1986.
[7] I. Biederman. “Recognition-by-components: A theory of human image understanding”. Psychological Review, 94(2):115–147, 1987.
[8] J. S. De Bonet and P. Viola. “A non-parametric multi-scale statistical model for natural images”. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing, volume 10, 1997.
[9] A. S. Bregman. Asking the ‘What for’ question in auditory perception. In M. Kubovy and J. R. Pomerantz, editors, Perceptual Organization, pages 99–118. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1981.
[10] P. Brodatz. Textures: A Photographic Album for Artists and Designers. Dover Publications, New York, 1966.
[11] T. Caelli, B. Julesz, and E. Gilbert. “On perceptual analyzers underlying visual texture discrimination: Part II”. Biological Cybernetics, 29(4):201–214, 1978.
[12] F. W. Campbell and J. G. Robson. “Application of Fourier analysis to the visibility of gratings”. Journal of Physiology (London), 197:551–566, 1968.
[13] J. Canny. “A computational approach to edge detection”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.
[14] K. R. Castleman. Digital Image Processing. Prentice Hall, Englewood Cliffs, NJ, 1996.
[15] F. Catte, P.-L. Lions, J.-M. Morel, and T. Coll. “Image selective smoothing and edge detection by nonlinear diffusion”. SIAM Journal on Numerical Analysis, 29:182–193, 1992.
[16] E. Cesmeli and D. L. Wang. “Texture segmentation using Gaussian Markov random fields and LEGION”. In Proceedings of the 1997 IEEE International Conference on Neural Networks, pages 1529–1534, 1997.
[17] K. Chen, D. L. Wang, and X. Liu. Weight adaptation and oscillatory correlation for image segmentation. Technical Report OSU-CISRC-8/98-TR37, Department of Computer and Information Science, The Ohio State University, 1998.
[18] P. C. Chen and T. Pavlidis. “Segmentation by texture using correlation”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:64–69, 1983.
[19] M. A. Cohen and S. Grossberg. “Neural dynamics of brightness perception: features, boundaries, diffusion and resonance”. Perception and Psychophysics, 36:428–456, 1984.
[20] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1997.
[21] G. R. Cross and A. K. Jain. “Markov random field texture models”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:25–39, 1983.
[22] J. Daugman. “Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters”. Journal of the Optical Society of America A, 2(7):1160–1169, July 1985.
[23] R. L. De Valois and K. K. De Valois. Spatial Vision. Oxford University Press, New York, 1988.
[24] P. Diaconis and D. Freedman. “On the statistics of vision: the Julesz conjecture”. Journal of Mathematical Psychology, 24(2):112–138, 1981.
[25] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, New York, 1973.
[26] R. Eckhorn, R. Bauer, W. Jordan, M. Brosch, W. Kruse, M. Munk, and H. J. Reitboeck. “Coherent oscillations: A mechanism of feature linking in the visual cortex?”. Biological Cybernetics, 60:121–130, 1988.
[27] R. Eckhorn, H. J. Reitboeck, M. Arndt, and P. Dicke. “Feature linking via synchronization among distributed assemblies: Simulations of results from cat visual cortex”. Neural Computation, 2:293–307, 1990.
[28] T. F. El-Maraghi. An implementation of Heeger and Bergen’s texture analysis/synthesis algorithm. Technical report, Department of Computer Science, University of Toronto, Toronto, Ontario, 1998. (available at http://www.cs.toronto.edu/∼tem/2522/texture.html).
[29] J. Elder and S. W. Zucker. Local scale control for edge detection and blur estimation. In Proceedings of the 4th European Conference on Computer Vision, volume II, pages 57–69. Springer Verlag, 1996.
[30] B. S. Everitt and D. J. Hand. Finite Mixture Distributions. Chapman and Hall, London, 1981.
[31] R. FitzHugh. “Impulses and physiological states in models of nerve membrane”. Biophysical Journal, 1:445–466, 1961.
[32] K. S. Fu and T. S. Yu. Statistical Pattern Classification using Contextual Information. Research Studies Press, Chichester, England, 1980.
[33] D. Gabor. “Theory of communication”. Journal of IEE (London), 93:429–457, 1946.
[34] D. Geiger, H. Pao, and N. Rubin. Salient and multiple illusory surfaces. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 118–124, 1998.
[35] S. Geman and D. Geman. “Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
[36] Z. Gigus and J. Malik. Detecting curvilinear structure in images. Technical Report UCB/CSD 91/619, Computer Science Division, University of California at Berkeley, 1991.
[37] C. D. Gilbert. “Horizontal integration and cortical dynamics”. Neuron, 9:1–13, 1992.
[38] S. Gopal and C. Woodcock. “Remote sensing of forest change using artificial neural networks”. IEEE Transactions on Geoscience and Remote Sensing, 34(2):398–404, March 1996.
[39] B. Gorte and A. Stein. “Bayesian classification and class area estimation of satellite images using stratification”. IEEE Transactions on Geoscience and Remote Sensing, 36(3):803–812, May 1998.
[40] C. M. Gray, P. Konig, A. K. Engel, and W. Singer. “Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties”. Nature, 338:334–337, 1989.
[41] S. Grossberg and E. Mingolla. “Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations”. Perception & Psychophysics, 38(2):141–171, 1985.
[42] M. W. Hansen and W. E. Higgins. “Relaxation methods for supervised image segmentation”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:949–962, 1997.
[43] R. M. Haralick. “Statistical and structural approach to texture”. Proceedings of the IEEE, 67:786–804, 1979.
[44] R. M. Haralick, K. Shanmugam, and I. Dinstein. “Texture features for image classification”. IEEE Transactions on Systems, Man, and Cybernetics, 3(6):610–621, 1973.
[45] D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of SIGGRAPH, pages 229–238, 1995.
[46] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA, 1991.
[47] W. E. Higgins and C. Hsu. “Edge detection using two-dimensional local structure information”. Pattern Recognition, 27:277–294, 1994.
[48] A. L. Hodgkin and A. F. Huxley. “A quantitative description of membrane current and its application to conduction and excitation in nerve”. Journal of Physiology (London), 117:500–544, 1952.
[49] R. Hoffman and A. K. Jain. “Segmentation and classification of range images”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):608–620, 1987.
[50] T. Hofmann, J. Puzicha, and J. M. Buhmann. “Unsupervised texture segmentation in a deterministic annealing framework”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):803–818, 1998.
[51] A. Hoover, G. Jean-Baptiste, X. Jiang, P. J. Flynn, H. Bunke, D. B. Goldgof, K. Bowyer, D. W. Eggert, A. Fitzgibbon, and R. B. Fisher. “An experimental comparison of range image segmentation algorithms”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):673–689, 1996.
[52] J. Y. Hsiao and A. A. Sawchuk. “Unsupervised textured image segmentation using feature smoothing and probabilistic relaxation techniques”. Computer Vision, Graphics, and Image Processing, 48(1):1–21, 1989.
[53] D. H. Hubel. Eye, Brain, and Vision. W. H. Freeman and Company, New York, 1988.
[54] D. H. Hubel and T. N. Wiesel. “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex”. Journal of Physiology (London), 160:106–154, 1962.
[55] R. H. Hummel and S. W. Zucker. “On the foundations of relaxation labeling processes”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(3):267–286, 1983.
[56] D. J. Ittner and A. K. Jain. 3-D surface discrimination from local curvature measures. In Proceedings of Computer Vision and Pattern Recognition Conference, pages 119–123, 1985.
[57] Y. Jhung and P. H. Swain. “Bayesian contextual classification based on modified M-estimates and Markov Random Fields”. IEEE Transactions on Geoscience and Remote Sensing, 34(1):67–75, January 1996.
[58] X. Y. Jiang and H. Bunke. “Fast segmentation of range images into planar regions by scan line grouping”. Machine Vision and Applications, 7(2):115–122, 1994.
[59] J. L. Johnson. “Pulse-coupled neural nets: translation, rotation, scale, distortion, and intensity signal invariance for images”. Applied Optics, 33(26):6239–6253, 1994.
[60] J. L. Johnson, M. L. Padgett, and W. A. Friday. Multiscale image factorization. In Proceedings of the IEEE International Conference on Neural Networks, volume 3, pages 1465–1468, 1997.
[61] B. Julesz. “A theory of preattentive texture discrimination based on first-order statistics of textons”. Biological Cybernetics, 41:131–138, 1981.
[62] B. Julesz. “Visual pattern discrimination”. IRE Transactions on Information Theory, 8:84–92, 1962.
[63] B. Julesz. Dialogues on Perception. MIT Press, Cambridge, MA, 1995.
[64] G. Kanizsa. Quasi-perceptual margins in homogeneously stimulated fields. In S. Petry and G. E. Meyer, editors, The Perception of Illusory Contours, pages 40–49. Springer-Verlag, New York, 1987.
[65] R. L. Kashyap, R. Chellappa, and A. Khotanzad. “Texture classification using features derived from random field models”. Pattern Recognition Letters, 1:43–50, 1982.
[66] J. Koenderink. “The structure of images”. Biological Cybernetics, 50:363–370, 1984.
[67] J. Koh, M. Suk, and S. M. Bhandarkar. “A multilayer self-organizing feature map for range image segmentation”. Neural Networks, 8(1):67–86, 1995.
[68] T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995.
[69] B. J. Krose. A description of visual structure. Ph.D. dissertation, Delft University of Technology, Delft, The Netherlands, 1986.
[70] S. Kullback and R. A. Leibler. “On information and sufficiency”. Annals of Mathematical Statistics, 22:79–86, 1951.
[71] A. Leonardis, A. Gupta, and R. Bajcsy. Segmentation as the search for the best description of the image in terms of primitives. In Proceedings of the International Conference on Computer Vision, pages 121–125, 1990.
[72] C. H. Li and C. K. Lee. “Image smoothing using parametric relaxation”. Graphical Models and Image Processing, 57:949–962, 1997.
[73] S. Z. Li. “Toward 3D vision from range images: An optimization framework and parallel networks”. Computer Vision, Graphics, and Image Processing: Image Understanding, 55(3):231–260, 1992.
[74] T. Lindeberg and B. M. ter Haar Romeny. Linear scale-space. In B. M. ter Haar Romeny, editor, Geometry-Driven Diffusion in Computer Vision, pages 1–41. Kluwer Academic Publishers, Dordrecht, Netherlands, 1994.
[75] T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, Dordrecht, Netherlands, 1994.
[76] X. Liu. A prototype system for extracting hydrographic regions from Digital Orthophoto Quadrangle images. In Proceedings of GIS/LIS’1998, pages 382–393, 1998.
[77] X. Liu, K. Chen, and D. L. Wang. “Extraction of hydrographic regions from remote sensing images using an oscillator network with weight adaptation”. IEEE Transactions on Geoscience and Remote Sensing, under review.
[78] X. Liu and J. R. Ramirez. Automatic extraction of hydrographic features in digital orthophoto images. In GIS/LIS’1997, pages 365–373, 1997.
[79] X. Liu and D. Wang. Oriented statistical nonlinear smoothing filter. In Proceedings of the 1998 International Conference on Image Processing, volume 2, pages 848–852, 1998.
[80] X. Liu, D. Wang, and J. R. Ramirez. “Boundary detection by contextual nonlinear smoothing”. Pattern Recognition, in press.
[81] X. Liu and D. L. Wang. A boundary-pair representation for perception modeling. In Proceedings of the 1999 International Joint Conference on Neural Networks, 1999.
[82] X. Liu and D. L. Wang. Modeling perceptual organization using temporal dynamics. In Proceedings of the 1999 International Joint Conference on Neural Networks, 1999.
[83] X. Liu and D. L. Wang. “Range image segmentation using an oscillatory network”. IEEE Transactions on Neural Networks, 10(3):564–573, May 1999.
[84] X. Liu, D. L. Wang, and J. R. Ramirez. A two-layer neural network for robust image segmentation and its application in revising hydrographic features. In International Archives of Photogrammetry and Remote Sensing, volume 32, pages 464–472, 1998.
[85] X. Liu, D. L. Wang, and J. R. Ramirez. Extracting hydrographic objects from satellite images using a two-layer neural network. In Proceedings of the 1998 International Joint Conference on Neural Networks, volume 2, pages 897–902, 1998.
[86] D. G. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, Boston, 1985.
[87] J. Malik and P. Perona. “Preattentive texture discrimination with early vision mechanisms”. Journal of the Optical Society of America A, 7(5):923–932, May 1990.
[88] D. Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman and Company, New York, 1982.
[89] D. Marr and E. Hildreth. “Theory of edge detection”. Proceedings of the Royal Society of London, Series B, 207:187–217, 1980.
[90] P. M. Milner. “A model for visual shape recognition”. Psychological Review, 81(6):521–535, 1974.
[91] A. Mitiche and J. K. Aggarwal. “Detection of edges using range information”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):174–178, 1983.
[92] J. M. Morel and S. Solimini. Variational Methods for Image Segmentation. Birkhauser, Boston, 1995.
[93] C. Morris and H. Lecar. “Voltage oscillations in the barnacle giant muscle fiber”. Biophysical Journal, 35:193–213, 1981.
[94] D. Mumford and J. Shah. “Optimal approximations of piecewise smooth functions and associated variational problems”. Communications on Pure and Applied Mathematics, XLII(4):577–685, 1989.
[95] M. Nagao and T. Matsuyama. “Edge preserving smoothing”. Computer Graphics and Image Processing, 9:394–407, 1979.
[96] J. Nagumo, S. Arimoto, and S. Yoshizawa. “An active pulse transmission line simulating nerve axon”. Proceedings of the Institute of Radio Engineers, 50:2061–2070, 1962.
[97] K. Nakayama, Z. J. He, and S. Shimojo. Visual surface representation: a critical link between lower-level and higher-level vision. In S. M. Kosslyn and D. N. Osherson, editors, Visual Cognition, pages 1–70. The MIT Press, Cambridge, Massachusetts, 1995.
[98] M. Nitzberg, D. Mumford, and T. Shiota. Filtering, Segmentation and Depth. Springer-Verlag, 1994.
[99] T. Ojala, M. Pietikainen, and D. Harwood. “A comparative study of texture measures with classification based on feature distributions”. Pattern Recognition, 29(1):51–59, 1996.
[100] B. A. Olshausen and D. J. Field. “Emergence of simple-cell receptive field properties by learning a sparse code for natural images”. Nature, 381:607–609, 1996.
[101] B. A. Olshausen and D. J. Field. “Natural image statistics and efficient coding”. Network, 7(2):333–340, 1996.
[102] B. A. Olshausen and D. J. Field. “Sparse coding with an overcomplete basis set: A strategy employed by V1?”. Vision Research, 37(23):3311–3325, 1997.
[103] M. L. Padgett and J. L. Johnson. Pulse coupled neural networks (PCNN) and wavelets: Biosensor applications. In Proceedings of the IEEE International Conference on Neural Networks, volume 4, pages 2507–2512, 1997.
[104] P. Perona. “Deformable kernels for early vision”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:488–499, 1995.
[105] P. Perona and J. Malik. “Scale space and edge detection using anisotropic diffusion”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:16–27, 1990.
[106] P. Perona, T. Shiota, and J. Malik. Anisotropic diffusion. In B. M. ter Haar Romeny, editor, Geometry-Driven Diffusion in Computer Vision, pages 73–92. Kluwer Academic Publishers, Dordrecht, Netherlands, 1994.
[107] J. Puzicha, T. Hofmann, and J. M. Buhmann. Non-parametric similarity measures for unsupervised texture segmentation and image retrieval. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pages 267–272, 1997.
[108] T. Randen and J. H. Husoy. “Filtering for texture classification: A comparative study”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4):291–310, April 1999.
[109] J. A. Richards. Remote Sensing Digital Image Analysis. Springer-Verlag, Berlin, 1993.
[110] E. Rignot and R. Chellappa. “Segmentation of polarimetric synthetic aperture radar data”. IEEE Transactions on Image Processing, 1:281–300, July 1992.
[111] I. Rock and S. Palmer. “The legacy of Gestalt psychology”. Scientific American, 263:84–90, 1990.
[112] A. Rosenfeld, R. A. Hummel, and S. W. Zucker. “Scene labeling by relaxation operations”. IEEE Transactions on Systems, Man, and Cybernetics, 6(6):420–433, 1976.
[113] P. Saint-Marc, J.-S. Chen, and G. Medioni. “Adaptive smoothing: a general tool for early vision”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:514–529, 1991.
[114] S. Sarkar and K. L. Boyer. “On optimal infinite impulse response edge detection filters”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:1154–1171, 1991.
[115] J. Schurmann. Pattern Classification: A Unified View of Statistical and Neural Approaches. John Wiley and Sons, New York, 1996.
[116] W. Singer and C. M. Gray. “Visual feature integration and the temporal correlation hypothesis”. Annual Review of Neuroscience, 18:555–586, 1995.
[117] S. M. Smith and J. M. Brady. “SUSAN - a new approach to low level image processing”. International Journal of Computer Vision, 23:45–78, 1997.
[118] A. H. S. Solberg, T. Taxt, and A. K. Jain. “A Markov Random Field model for classification of multisource satellite imagery”. IEEE Transactions on Geoscience and Remote Sensing, 34(1):100–113, January 1996.
[119] H. Stark and J. W. Woods. Probability, Random Processes and Estimation Theory for Engineers. Prentice-Hall, Englewood Cliffs, NJ, 1994.
[120] M. Stoecker, H. J. Reitboeck, and R. Eckhorn. “A neural network for scene segmentation by temporal coding”. Neurocomputing, 11:123–134, 1996.
[121] C. Sun, C. M. U. Neale, J. J. McDonnell, and H. D. Cheng. “Monitoring land-surface snow conditions from SSM/I data using an artificial neural network classifier”. IEEE Transactions on Geoscience and Remote Sensing, 35(4):801–809, July 1997.
[122] M. Tabb and N. Ahuja. “Multiscale image segmentation by integrated edge and region detection”. IEEE Transactions on Image Processing, 6:642–655, 1997.
[123] D. Terman and D. L. Wang. “Global competition and local cooperation in a network of neural oscillators”. Physica D, 81(1-2):148–176, 1995.
[124] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. Winston, Washington, D.C., 1977.
[125] F. Tomita and S. Tsuji. “Extraction of multiple regions by smoothing in selected neighborhoods”. IEEE Transactions on Systems, Man, and Cybernetics, 7:107–109, 1977.
[126] D. Tsintikidis, J. L. Haferman, E. N. Anagnostou, W. F. Krajewski, and T. F. Smith. “A neural network approach to estimating rainfall from spaceborne microwave data”. IEEE Transactions on Geoscience and Remote Sensing, 35(5):1079–1093, September 1997.
[127] M. Unser. “Texture classification and segmentation using wavelet frames”. IEEE Transactions on Image Processing, 4(11):1549–1560, 1995.
[128] B. van der Pol. “On ‘relaxation oscillations’”. Philosophical Magazine, 2(11):978–992, 1926.
[129] B. C. Vemuri, A. Mitiche, and J. K. Aggarwal. “Curvature-based representation of objects from range data”. Image and Vision Computing, 4(2):107–114, 1986.
[130] C. von der Malsburg. The Correlation Theory of Brain Function. Internal Report 81-2, Max-Planck-Institute for Biophysical Chemistry, 1981.
[131] D. C. C. Wang, A. H. Vagnucci, and C. C. Li. “Gradient inverse weighted smoothing scheme and the evaluation of its performance”. Computer Graphics and Image Processing, 15:167–181, 1981.
[132] D. L. Wang. Habituation. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 441–444. The MIT Press, Cambridge, Massachusetts, 1995.
[133] D. L. Wang and X. Liu. “Scene analysis by integrating primitive segmentation and associative memory”. In preparation, 1999.
[134] D. L. Wang and D. Terman. “Locally excitatory globally inhibitory oscillator networks”. IEEE Transactions on Neural Networks, 6(1):283–286, 1995.
[135] D. L. Wang and D. Terman. “Image segmentation based on oscillatory correlation”. Neural Computation, 9:805–836, 1997.
[136] M. A. Wani and B. G. Batchelor. “Edge-region-based segmentation of range images”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3):314–319, 1994.
[137] J. Weickert. Theoretical foundations of anisotropic diffusion in image processing. In W. Kropatsch, R. Klette, and F. Solina, editors, Theoretical Foundations of Computer Vision, pages 231–236. Springer-Verlag, Wien, Austria, 1996.
[138] J. Weickert. A review of nonlinear diffusion filtering. In Proceedings of the First International Conference on Scale-Space, pages 3–28, 1997.
[139] R. Whitaker and G. Gerig. Vector-valued diffusion. In B. M. ter Haar Romeny, editor, Geometry-Driven Diffusion in Computer Vision, pages 93–134. Kluwer Academic Publishers, Dordrecht, Netherlands, 1994.
[140] D. Williams and B. Julesz. “Perceptual asymmetry in texture perception”. Proceedings of the National Academy of Sciences, 89:6531–6534, July 1992.
[141] L. R. Williams and A. R. Hanson. “Perceptual completion of occluded surfaces”. Computer Vision and Image Understanding, 64:1–20, 1996.
[142] A. P. Witkin. Scale space filtering. In Proceedings of the Eighth International Joint Conference on Artificial Intelligence, pages 1019–1021, 1983.
[143] Y. N. Wu and S. C. Zhu. “Equivalence of image ensembles and fundamental bounds for texture discrimination”. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999.
[144] R. Yagel, D. Cohen, and A. Kaufman. Context sensitive normal estimation for volume imaging. In N. M. Patrikalakis, editor, Scientific Visualization of Physical Phenomena, pages 211–234. Springer-Verlag, New York, 1991.
[145] Y. L. You, W. Xu, A. Tannenbaum, and M. Kaveh. “Behavioral analysis of anisotropic diffusion in image processing”. IEEE Transactions on Image Processing, 5:1539–1553, 1996.
[146] S. C. Zhu. Embedding Gestalt laws in the Markov random fields. In IEEE Computer Society Workshop on Perceptual Organization in Computer Vision, 1998.
[147] S. C. Zhu, X. Liu, and Y. N. Wu. “Statistics matching and model pursuit by efficient MCMC”. IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, 1999.
[148] S. C. Zhu, Y. N. Wu, and D. Mumford. FRAME: Filters, random field and maximum entropy. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 686–693, 1996.
[149] S. C. Zhu, Y. N. Wu, and D. Mumford. “Minimax entropy principle and its application to texture modeling”. Neural Computation, 9(8):1627–1660, November 1997.
[150] S. C. Zhu, Y. N. Wu, and D. Mumford. “FRAME: Filters, random field and maximum entropy - Towards a unified theory for texture modeling”. International Journal of Computer Vision, 27(2):1–20, 1998.
[151] S. C. Zhu and A. Yuille. “Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:884–900, 1996.
[152] S. W. Zucker. “Region growing: Childhood and adolescence”. Computer Graphics and Image Processing, 5:382–399, 1976.