LIGHT-FIELD FEATURES FOR ROBOTIC
VISION IN THE PRESENCE OF
REFRACTIVE OBJECTS
Dorian Yu Peng Tsai
MSc (Technology)
BASc (Engineering Science with Honours)
Submitted in fulfilment of the requirements for the degree of
Doctor of Philosophy
2020
School of Electrical Engineering and Computer Science
Science and Engineering Faculty
Queensland University of Technology
Abstract
Robotic vision is an integral aspect of robot navigation and human-robot interaction, as well as
object recognition, grasping and manipulation. Visual servoing is the use of computer vision
for closed-loop control of a robot’s motion and has been shown to increase the accuracy and
performance of robotic grasping, manipulation and control tasks. However, many robotic vision
algorithms (including those focused on solving the problem of visual servoing) find refractive
objects particularly challenging. This is because these types of objects are difficult to perceive.
They are transparent and their appearance is essentially a distorted view of the background,
which can change significantly with small changes in viewpoint. What is often overlooked is
that most robotic vision algorithms implicitly assume that the world is Lambertian—that the
appearance of a point on an object does not change significantly with respect to small changes
in viewpoint¹. Refractive objects violate the Lambertian assumption, and this can lead to image
matching inconsistencies, pose errors and even failures of modern robotic vision systems.
This thesis investigates the use of light-field cameras for robotic vision to enable vision-based
motion control in the presence of refractive objects. Light-field cameras are a novel camera
technology that uses multi-aperture optics to capture a set of dense and uniformly-sampled views
of the scene from multiple viewpoints. Light-field cameras capture the light field, which simultaneously encodes texture, depth and multiple viewpoints. Light-field cameras are a promising
alternative to conventional robotic vision sensors, because of their unique ability to capture
view-dependent effects, such as occlusion, specular reflection and, in particular, refraction.
First, we investigate using input from the light-field camera to directly control robot motion,
a process known as image-based visual servoing, in Lambertian scenes. We propose a novel
light-field feature for Lambertian scenes and develop the relationships between feature motion
and camera motion for the purposes of visual servoing. We also demonstrate, both in simulation
and with a custom mirror-based light-field camera, that our method of light-field image-based
visual servoing is more tolerant of small and distant targets and of partially-occluded scenes than
monocular and stereo-based methods.
Second, we propose a method to detect refractive objects using a single light field. Specifically, we define refracted image features as those image features whose appearance has been
distorted by a refractive object. We discriminate between refracted image features and the
surrounding Lambertian image features. We also show that using our method to ignore the refracted image features enables monocular structure from motion in scenes containing refractive
objects, where traditional methods fail.
Third, we combine and extend our two previous contributions to develop a light-field feature capable
of enabling visual servoing towards refractive objects without needing a 3D geometric model of
the object. We show in experiments that this feature can be reliably detected and extracted from
the light field. The feature appears to be continuous with respect to viewpoint, and is therefore
suitable for visual servoing towards refractive objects.
¹ This Lambertian assumption is also known as the photo-consistency or brightness-constancy assumption.
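As a toy illustration of the brightness-constancy assumption (a hypothetical sketch for the reader, not a method from this thesis; the function name and tolerance are invented), a scene point is photo-consistent if its observed intensity is roughly constant across viewpoints:

```python
def photo_consistent(intensities, tol=0.05):
    """Toy brightness-constancy check: a scene point satisfies the
    Lambertian assumption if its observed intensity (in [0, 1]) is
    roughly constant across all viewpoints that see it."""
    return max(intensities) - min(intensities) <= tol

# A matte (Lambertian) point looks nearly identical from neighbouring viewpoints.
print(photo_consistent([0.61, 0.60, 0.62]))  # True
# A point seen through a refractive object can change sharply with viewpoint.
print(photo_consistent([0.61, 0.35, 0.80]))  # False
```

It is this view-dependent intensity change that breaks image matching for refractive objects.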
This thesis represents a unique contribution toward our understanding of refractive objects in
the light field for robotic vision. Application areas that may benefit from this research include
manipulation and grasping of household objects, medical equipment, and in-orbit satellite servicing equipment. It could also benefit quality assurance and manufacturing pick-and-place
robots. These advances constitute a critical step toward enabling robots to work more safely and
reliably with everyday refractive objects.
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements for an
award at this or any other higher education institution. To the best of my knowledge and belief,
the thesis contains no material previously published or written by another person except where
due reference is made.
Dorian Tsai
March 2, 2020
QUT Verified Signature
Acknowledgements
To my academic advisors, Professor Peter Ian Corke, Dr. Donald Gilbert Dansereau and Associate Professor Thierry Peynot, I would like to offer my most heartfelt gratitude. They shared
with me amazing knowledge, insight, creativity and enthusiasm. I am grateful for the resources and opportunities they provided, as well as their guidance, support and patience.
In addition, I wish to convey my appreciation to Douglas Palmer and Thomas Coppin, my
fellow plenopticists, for many helpful and stimulating discussions. Thanks to Dr. Steven
Martin, who helped with many of the technical engineering aspects of building and mounting
light-field cameras to various robots over the years. Thanks to Dominic Jack and Ming Xu for
being excellent desk buddies. Thanks to Prof. Tristan Perez, Associate Professor Jason Ford
and Dr. Timothy Molloy for helping to get me started on my PhD journey in inverse differential
game theory applied to the birds and the bees, until I changed topics to light fields and robotic
vision six months later.
Thanks to Kate Aldridge, Sarah Allen and all of the other administrative staff in the Australian
Centre for Robotic Vision (ACRV) for organising so many conferences and workshops, and
keeping things running smoothly.
This research was funded in part by the Queensland University of Technology (QUT) Postgraduate Research Award, the QUT Higher Degree Tuition Fee Sponsorship, the QUT Excellent Top-Up Scholarship, and the ACRV Top-Up Scholarship, as well as financial support in
the form of employment as a course mentor and research assistant. The ACRV scholarship was
supported in part by the Australian Research Council Centre of Excellence for Robotic Vision.
Lastly, a very special thanks goes to the many faithful friends, family and colleagues whose
backing and constant encouragement sustained me through this academic marathon. I am
especially indebted to Robin Tunley and Miranda Cherie Fittock for their camaraderie and
steady moral support. Thank you all very much.
Contents
Abstract
List of Tables vii
List of Figures ix
List of Acronyms xiii
List of Symbols xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Limitations of Robotic Vision for Refractive Objects . . . . . . . . . . 3
1.1.2 Seeing and Servoing Towards Refractive Objects . . . . . . . . . . . . 5
1.2 Statement of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background on Light Transport & Capture 15
2.1 Light Transport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Specular Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 Diffuse Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 Lambertian Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.4 Non-Lambertian Reflections . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.5 Refraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Monocular Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 Central Projection Model . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Thin Lenses and Depth of Field . . . . . . . . . . . . . . . . . . . . . 23
2.3 Stereo Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Multiple Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Light-Field Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Plenoptic Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.2 4D Light Field Definition . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.3 Light Field Parameterisation . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.4 Light-Field Camera Architectures . . . . . . . . . . . . . . . . . . . . 36
2.6 4D Light-Field Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.7 4D Light-Field Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.7.1 Geometric Primitive Definitions . . . . . . . . . . . . . . . . . . . . . 44
2.7.2 From 2D to 4D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.7.3 Point-Plane Correspondence . . . . . . . . . . . . . . . . . . . . . . . 56
2.7.4 Light-Field Slope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3 Literature Review 61
3.1 Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.1.1 2D Geometric Image Features . . . . . . . . . . . . . . . . . . . . . . 62
3.1.2 3D Geometric Image Features . . . . . . . . . . . . . . . . . . . . . . 65
3.1.3 4D Geometric Image Features . . . . . . . . . . . . . . . . . . . . . . 66
3.1.4 Direct Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.5 Image Feature Correspondence . . . . . . . . . . . . . . . . . . . . . . 70
3.2 Visual Servoing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.2.1 Position-based Visual Servoing . . . . . . . . . . . . . . . . . . . . . 73
3.2.2 Image-based Visual Servoing . . . . . . . . . . . . . . . . . . . . . . 75
3.3 Refractive Objects in Robotic Vision . . . . . . . . . . . . . . . . . . . . . . . 81
3.3.1 Detection & Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.2 Shape Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4 Light-Field Image-Based Visual Servoing 95
4.1 Light-Field Cameras for Visual Servoing . . . . . . . . . . . . . . . . . . . . . 95
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3 Lambertian Light-Field Feature . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 Light-Field Image-Based Visual Servoing . . . . . . . . . . . . . . . . . . . . 100
4.4.1 Continuous-domain Image Jacobian . . . . . . . . . . . . . . . . . . . 100
4.4.2 Discrete-domain Image Jacobian . . . . . . . . . . . . . . . . . . . . . 102
4.5 Implementation & Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 104
4.5.1 Light-Field Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.5.2 Mirror-Based Light-Field Camera Adapter . . . . . . . . . . . . . . . 105
4.5.3 Control Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.6.1 Camera Array Simulation . . . . . . . . . . . . . . . . . . . . . . . . 108
4.6.2 Arm-Mounted MirrorCam Experiments . . . . . . . . . . . . . . . . . 110
4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5 Distinguishing Refracted Image Features 119
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.2 Lambertian Points in the Light Field . . . . . . . . . . . . . . . . . . . . . . . 126
5.3 Distinguishing Refracted Image Features . . . . . . . . . . . . . . . . . . . . . 128
5.3.1 Extracting Image Feature Curves . . . . . . . . . . . . . . . . . . . . . 130
5.3.2 Fitting 4D Planarity to Image Feature Curves . . . . . . . . . . . . . . 132
5.3.3 Measuring Planar Consistency . . . . . . . . . . . . . . . . . . . . . . 137
5.3.4 Measuring Slope Consistency . . . . . . . . . . . . . . . . . . . . . . 138
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.4.2 Refracted Image Feature Discrimination with Different LF Cameras . . 141
5.4.3 Rejecting Refracted Image Features for Structure from Motion . . . . . 148
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6 Light-Field Features for Refractive Objects 157
6.1 Refracted LF Features for Vision-based Control . . . . . . . . . . . . . . . . . 158
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.3 Optics of a Lens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.3.1 Spherical Lens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.3.2 Cylindrical Lens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.3.3 Toric Lens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.4.1 Refracted Light-Field Feature Definition . . . . . . . . . . . . . . . . . 166
6.4.2 Refracted Light-Field Feature Extraction . . . . . . . . . . . . . . . . 170
6.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.5.1 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.5.2 Feature Continuity in Single-Point Ray Simulation . . . . . . . . . . . 177
6.5.3 Feature Continuity in Ray Tracing Simulation . . . . . . . . . . . . . . 179
6.6 Visual Servoing Towards Refractive Objects . . . . . . . . . . . . . . . . . . . 186
6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7 Conclusions and Future Work 191
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Bibliography 197
A Mirrored Light-Field Video Camera Adapter I
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II
A.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV
A.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V
A.3.1 Design & Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . V
A.3.2 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI
A.3.3 Decoding & Calibration . . . . . . . . . . . . . . . . . . . . . . . . . VIII
A.4 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . X
List of Tables
2.1 Minimum Number of Parameters to Describe Geometric Primitives from 2D to
4D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Comparison of camera systems’ capabilities and tolerances for VS . . . . . . . 98
5.1 Comparison of our method and the state of the art using two LF camera arrays
and a lenslet-based camera for discriminating refracted image features . . . . . 145
5.2 Comparison of mean relative instantaneous pose error for unfiltered and filtered
SfM-reconstructed trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . 151
A.1 Comparison of Accessibility for Different LF Camera Systems . . . . . . . . . VI
List of Figures
1.1 Robot applications with refractive objects . . . . . . . . . . . . . . . . . . . . 3
1.2 An example of unreliable RGB-D camera output for a refractive object . . . . . 5
1.3 Light-field camera as an array of cameras . . . . . . . . . . . . . . . . . . . . 6
1.4 Gradual changes in a refractive object’s appearance in an image can be programmatically detected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Surface reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Lambertian and non-Lambertian reflections . . . . . . . . . . . . . . . . . . . 18
2.3 Non-Lambertian reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Snell’s law of refraction at the interface of two media. . . . . . . . . . . . . . 19
2.5 The central projection model . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Image formation for a thin lens . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.7 Depth of field and focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.8 Epipolar geometry for a stereo camera system . . . . . . . . . . . . . . . . . . 27
2.9 The plenoptic function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 The two-plane parameterisation of the 4D LF . . . . . . . . . . . . . . . . . . 34
2.11 Example 4D LF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.12 Light-field camera architectures . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.13 Monocular versus plenoptic camera . . . . . . . . . . . . . . . . . . . . . . . 40
2.14 Raw plenoptic imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.15 Visualization of the light field . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.16 A line in 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.17 4D point example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.18 4D line example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.19 4D hyperplane example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.20 4D plane example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.21 Illustrating the depth of a point in the 2PP . . . . . . . . . . . . . . . . . . . . 58
2.22 Light-field slope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.1 Architectures for visual servoing . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2 Light path through a refractive object . . . . . . . . . . . . . . . . . . . . . . . 87
4.1 MirrorCam setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.2 Visual servoing control loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.3 Simulation results for LF-IBVS . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.4 Simulation of views for LF-IBVS . . . . . . . . . . . . . . . . . . . . . . . . 110
4.5 Experimental results of LF-IBVS trajectories . . . . . . . . . . . . . . . . . . 112
4.6 Experimental results of stereo-IBVS . . . . . . . . . . . . . . . . . . . . . . . 113
4.7 Setup for occlusion experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.8 Example views from occlusion experiments . . . . . . . . . . . . . . . . . . . 116
4.9 Experimental results from occlusion experiments . . . . . . . . . . . . . . . . 118
5.1 LF camera mounted on a robot arm . . . . . . . . . . . . . . . . . . . . . . . . 121
5.2 Lambertian versus non-Lambertian feature in the . . . . . . . . . . . . . . . . 130
5.3 Example epipolar planar images . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.4 Extraction of the image feature curve from the correlation EPI using simulated
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.5 Example Lambertian and refracted image feature curves . . . . . . . . . . . . 143
5.6 Example Lambertian and refracted feature curves from a small-baseline LF
camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.7 Discrimination of refracted image features . . . . . . . . . . . . . . . . . . . . 144
5.8 Refracted image features detected in sample images . . . . . . . . . . . . . . . 147
5.9 Rejecting refracted image features for SfM . . . . . . . . . . . . . . . . . . . . 150
5.10 Sample images where monocular SfM failed by not rejecting refracted features 151
5.11 Comparison of camera trajectories for monocular structure from motion . . . . 152
5.12 Point cloud reconstructions of scenes with refracted objects . . . . . . . . . . . 154
6.1 Toric lens cut from a torus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.2 The visual effect of the toric lens on a background circle . . . . . . . . . . . . 165
6.3 Light-field geometry depth and projections of a lens into a light field . . . . . . 167
6.4 Orientation for the toric lens . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.5 Illustration of 3D line segment projected by a toric lens . . . . . . . . . . . . . 170
6.6 Ray tracing of a refractive object using Blender . . . . . . . . . . . . . . . . . 176
6.7 Single point ray trace simulation . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.8 Slope estimates for changing z-translation of the LF camera . . . . . . . . . . 179
6.9 Orientation estimates for changing z-rotation of the LF camera . . . . . . . . . 179
6.10 Refracted LF feature approach towards a toric lens . . . . . . . . . . . . . . . 180
6.11 Refracted light-field feature slopes during approach towards a toric lens . . . . 181
6.12 Orientation estimate from a Blender simulation of an ellipsoid that was rotated
about the principal axis of the LF camera. . . . . . . . . . . . . . . . . . . . . 182
6.13 Refracted light-field features for a toric lens . . . . . . . . . . . . . . . . . . . 183
6.14 Refracted light-field features for different objects . . . . . . . . . . . . . . . . 185
6.15 Concept for visual servoing towards a refractive object . . . . . . . . . . . . . 187
A.1 MirrorCam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III
A.2 MirrorCam field of view overlap . . . . . . . . . . . . . . . . . . . . . . . . . VII
A.3 Rendering of MirrorCam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIII
A.4 MirrorCam v0.4c Kinova arm mount diagrams . . . . . . . . . . . . . . . . . . IX
A.5 MirrorCam v0.4c mirror holder diagrams . . . . . . . . . . . . . . . . . . . . XII
A.6 MirrorCam v0.4c camera clip diagrams . . . . . . . . . . . . . . . . . . . . . XIII
Acronyms
2PP two-plane parameterisation.
BRIEF binary robust independent elementary features.
CNN convolutional neural networks.
DOF degree-of-freedom.
DSP-SIFT domain-size pooled SIFT.
FAST features from accelerated segment test.
FOV field of view.
GPS global positioning system.
HoG histogram of gradients.
IBVS image-based visual servoing.
IOR index of refraction.
LF light-field.
LF-IBVS light-field image-based visual servoing.
LIDAR light detection and ranging.
M-IBVS monocular image-based visual servoing.
MLESAC maximum likelihood estimator sampling and consensus.
ORB oriented FAST and rotated BRIEF.
PBVS position-based visual servoing.
RANSAC random sample consensus.
RMS root mean square.
S-IBVS stereo image-based visual servoing.
SfM structure from motion.
SIFT scale invariant feature transform.
SLAM simultaneous localisation and mapping.
SURF speeded-up robust feature.
SVD singular value decomposition.
VS visual servoing.
List of Symbols
θi angle of incidence
θr angle of reflection
N surface normal
n index of refraction
zi distance to image along the camera’s z-axis
zo distance to object along the camera’s z-axis
d disparity
b camera baseline
P 3D world point
Px 3D world point’s x-coordinate
Py 3D world point’s y-coordinate
Pz 3D world point’s z-coordinate
CP world point with respect to the camera frame of reference
p 2D image coordinate
p∗ initial/observed image coordinates
p# desired/goal image coordinates
p homogeneous image plane point
f focal length
R radius of curvature
K camera matrix
T translation vector
R rotation matrix
F fundamental matrix
J Jacobian
J+ left Moore-Penrose pseudo-inverse of the Jacobian
ν camera spatial velocity
v translational velocity
ω rotational velocity
NMIN minimum number of sub-images in which feature matches must be found
KP proportional control gain
KI integral control gain
KD derivative control gain
s light-field horizontal viewpoint coordinate
t light-field vertical viewpoint coordinate
u light-field horizontal image coordinate
v light-field vertical image coordinate
D light-field plane separation distance
w light-field slope from the continuous domain (unit-less)
m light-field slope from the discrete domain (views/pixels)
σ light-field slope as an angle
s0 light-field central view horizontal viewpoint coordinate
t0 light-field central view vertical viewpoint coordinate
u0 light-field central view horizontal image coordinate
v0 light-field central view vertical image coordinate
L(s, t, u, v) 4D light field
I(u, v) 2D image
()∗ indicates a variable is fixed while others may vary
W Lambertian light-field feature
H intrinsic matrix for an LF camera
Rn real coordinate space of n dimensions
Π1 a plane
φ a ray
n normal of a 4D hyperplane
∆u pixel differences between u and u0
ξ singular vector from SVD
λ singular value from SVD
c slope consistency
tplanar threshold for planar consistency
tslope threshold for slope consistency
ei relative instantaneous pose error
etr instantaneous translation pose error
erot instantaneous rotation pose error
C the focal point or focal line
A the projection of point P through a toric lens
ΣA the scaling matrix, containing the singular values of A
Chapter 1
Introduction
In this chapter, we introduce the motivation for this research and outline the research goals
and questions this thesis seeks to address. Then we provide our list of contributions, their
significance and an overview of this thesis.
1.1 Motivation
Robots are changing the world. Their use for automating the dull, dirty and dangerous tasks
in the modern world has increased economic growth, improved quality of life and empowered
people. For example, robots assist in manipulating heavy car components in the automotive
manufacturing industry. Robots are being used to survey underwater ruins, sewage pipes, col-
lapsed buildings, and other planets in space. At home, robots are also starting to be used for de-
livery services, home cleaning, and assisting people with reduced mobility [Christensen, 2016].
Traditionally, many robots have operated in isolation from humans, but through the gradual
availability of inexpensive computing, user interfaces, integrated sensors, and improved algo-
rithms, robots are quickly improving in function and capability. The confluence of technologies
is enabling a robot revolution that will lead to the adoption of robotic technologies for all
aspects of daily life. As such, robots are gradually venturing into less constrained environments to
work with humans and an entirely new set of challenging objects to interact with. These more
complex and unstructured working environments require richer perceptual information for safer
interaction.
Historically, roboticists have had success with a variety of sensing modalities, from light detec-
tion and ranging (LIDAR), global positioning system (GPS), radar, acoustic imaging, infrared
range-finding sensors, to time-of-flight and structured-light depth sensors, as well as cameras.
In particular, cameras uniquely measure both dense colour and textural detail that other sensors
do not normally provide, which enables robots to use vision to perceive the world. Vision as a
robotic sensor is particularly useful because it mimics human vision and allows for non-contact
measurement of the environment. Much of the human world has been engineered around our
sense of sight, and a significant amount of our communication and interaction relies on vision.
Robotic vision has proven effective in terms of object detection, localization, and scene un-
derstanding for robotic grasping and manipulation [Kemp et al., 2007]. Furthermore, directly
using visual feedback information extracted from the camera to control robot motion, a tech-
nique known as visual servoing (VS), has proven useful for real-time, high-precision robotic
manipulation tasks [Kragic and Christensen, 2002]. However, refractive objects, which are
common throughout the human environment, are one of the areas where modern robotic vision
algorithms and conventional camera systems still encounter difficulties [Ihrke et al., 2010a].
One novel camera technology that may enable robots to better perceive refractive objects is the
light-field (LF) camera, which uses multiple-aperture optics to implicitly encode both texture
and depth. In this thesis, we look at exploring LF cameras as a means of seeing and servoing
towards refractive objects. By seeing, we refer to detecting refractive objects using only the LF.
By servoing, we refer to visual servoing using LF camera measurements to directly control the
camera’s relative pose. Combining the two, this research may enable a robotic manipulator to
detect and move towards, grasp, and manipulate refractive objects—for example a glass of beer
or wine. The principal motivation for this topic lies in improving our understanding of how
refractive objects behave in the LF and how to exploit this knowledge to enable more reliable
motion towards refractive objects.
1.1.1 Limitations of Robotic Vision for Refractive Objects
Robots for the real world will inevitably interact with refractive objects, as in Fig. 1.1. Future
robots will contend with wine glasses and clear water bottles in domestic applications [Kemp
et al., 2007]; glass objects and clear plastic packaging for quality assessment and packing in
manufacturing [Ihrke et al., 2010a]; glass windows throughout the urban environment, as well
as water and ice for outdoor operations [Dansereau, 2014]. For example, a household robot
must be able to pick up, wash and place glassware; a bartender robot must serve drinks from
bottles of wine and spirits; an outdoor maintenance robot may want to avoid falling into the
swimming pool or nearby fountains. Other examples of robots interacting with refractive ob-
jects include medical robots performing ophthalmic (eye) surgery, or satellite-servicing robots
working with telescopic lenses or shiny and transparent surface coverings. Automating these
applications typically requires knowledge of object structure, robot motion, or both. Yet objects
such as those just described are particularly challenging for robots, largely because they are
transparent.
Figure 1.1: Robots will have to interact with refractive objects. (a) In domestic applications,
such as cleaning and putting away dishes. (b) In manufacturing, assessing the quality of glass
objects, or picking and placing such objects in warehouses.
Refractive objects are particularly challenging for robots primarily because these types of ob-
jects do not typically have texture of their own. Instead their appearance depends on the object’s
shape and the surrounding background and lighting conditions. Robotic methods for localiza-
tion, manipulation and control exist to deal with refractive objects when accurate 3D geometric
models of the refractive objects themselves are available [Choi and Christensen, 2012, Luo
et al., 2015, Walter et al., 2015, Zhou et al., 2018]. However, these models are often difficult,
time-consuming and expensive to obtain, or simply not available [Ihrke et al., 2010a]. When
3D geometric models of the refractive objects are not available, localization, manipulation and
control around refractive objects become much harder problems.
In robotic vision, a common approach when no models are available, regardless of whether the
scene contains refractive objects, is to use features. Features are distinct aspects of interest in the
scene that can be repeatedly and reliably identified from different viewpoints. Image features
are those features recorded in the image as a set of pixels by the camera. Image features can then
be automatically detected and extracted as a vector of numbers, which we refer to as the image
feature vector. Features are often chosen because their appearances do not change significantly
with small changes in viewpoint. The same features are matched from image to image as the
robot moves, which enables the robot to establish a consistent geometric relationship between
its observed image pixels and the 3D world.
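The matching step described above can be sketched in a few lines. The following is an illustrative example only: the toy descriptors, the nearest-neighbour search and the ratio threshold are generic choices commonly used in feature matching, not the specific detectors or matchers used in this thesis.

```python
import numpy as np

def match_features(desc_a, desc_b, ratio=0.8):
    """Match feature descriptors between two images by nearest neighbour,
    keeping only matches whose best distance is clearly smaller than the
    second-best (a ratio test, to discard ambiguous matches)."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distance to every candidate
        j, k = np.argsort(dists)[:2]                # two closest candidates
        if dists[j] < ratio * dists[k]:             # keep unambiguous matches only
            matches.append((i, j))
    return matches
```

In practice the descriptors would come from a feature detector run on each image; the same matched features, tracked as the robot moves, anchor the geometric relationship between pixels and the 3D world.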
Feature-based matching strategies form the basis for many robotic vision algorithms, such as
object recognition, image segmentation, structure from motion (SfM), VS, and simultaneous
localisation and mapping (SLAM). However, many of these algorithms implicitly assume that
the scene or object is Lambertian—that the object’s (or feature’s) appearance remains the same
despite moderate changes in viewpoint. Instead, refractive objects are non-Lambertian because
their appearance often varies significantly with viewpoint. The violation of the Lambertian as-
sumption can cause inconsistencies, errors and even failures for modern robotic vision systems.
1.1.2 Seeing and Servoing Towards Refractive Objects
Humans are able to discern refractive objects visually by looking at the objects from different
perspectives and observing that the appearance of refractive objects changes differently from the
rest of the scene. The robotic analogue to the human eye is the camera, which has proven
extremely useful as a low-cost and compact sensor. Monocular vision systems are by far the most
common amongst robots today, but suffer from the ambiguity that small and close objects appear
the same size as distant and large objects. Moreover, a single view from a monocular camera
does not provide sufficient information to detect the presence of refractive objects. Stereo cam-
eras, which provide two views of a scene, do not work well with refractive objects without prior
knowledge of the scene, because triangulation relies heavily on appearance matching. RGB-D
cameras and LIDAR sensors do not work reliably on refractive objects because the emitted light
is either partially reflected or travels through these objects. Fig. 1.2 shows an example of unre-
liable depth measurements from an RGB-D camera for a refractive sphere. Robots can move to
gain better understanding of a refractive object over time; however, physically moving a robot
can be time-consuming, expensive and potentially hazardous. A more efficient approach would
be to instantaneously capture multiple views of the refractive object.
Figure 1.2: An example of unreliable RGB-D camera (Intel Realsense D415) output for a re-
fractive sphere. This RGB-D camera uses the structured-light approach to measure depth, which
works well for Lambertian surfaces, but not for refractive objects. (a) The colour image, which
a monocular camera would also provide. (b) Incorrect and missing depth information around
the refractive sphere.
The light field describes all the light flowing in every direction through every point in free
space at a certain instance in time [Levoy and Hanrahan, 1996]. LF cameras are a novel camera
technology that use multi-aperture optics to measure the LF by capturing a dense and uniformly-
sampled set of views of the same scene from multiple viewpoints in a single capture from a
single sensor position. We refer to a view as what one would see from a particular viewing pose
or viewpoint. Conceptually, unlike looking through a peephole, an LF “image” is similar to
an instantaneous window that one can look through to see how a refractive object’s appearance
changes smoothly with viewpoint. As illustrated in Fig. 1.3, compared to a monocular camera,
which uses a single aperture to capture a single view of the scene from a single viewpoint, an
LF camera is analogous to having an array of cameras all tightly packed together which provide
multiple views of the scene from multiple viewpoints. Within the LF camera array, a single
view can be selected, and switching from one view to another can be described as virtual
motion within the single shot of the LF.
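Using the notation for the 4D light field L(s, t, u, v) and a 2D image I(u, v), the idea of selecting views within the single shot can be sketched as follows. The array layout [s, t, u, v] and the grid dimensions are illustrative assumptions, not a prescribed storage format.

```python
import numpy as np

# A toy 4D light field L(s, t, u, v): a 5x5 grid of viewpoints,
# each recording a 32x32 single-channel image.
S, T, U, V = 5, 5, 32, 32
L = np.random.rand(S, T, U, V)

def view(L, s, t):
    """Extract the 2D image I(u, v) seen from viewpoint (s, t)."""
    return L[s, t]

central = view(L, S // 2, T // 2)      # the central view (s0, t0)
shifted = view(L, S // 2 + 1, T // 2)  # "virtual motion": one viewpoint to the right
```

Stepping the (s, t) index plays the role of physically translating a conventional camera, except that all views were captured simultaneously from a single sensor position.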
Figure 1.3: (a) A monocular camera acts as a peephole with a single aperture to capture a
single view of the scene. Light from the scene (yellow) passes through the aperture (red) and is
recorded on the image sensor (green). (b) An LF camera can be thought of as a window, or an
equivalent camera array that uses multi-aperture optics to capture multiple views of the scene.
As a result, the LF camera can capture much more information of the scene from a single sensor
capture than a monocular camera. (c) An example LF camera array by Wilburn [Wilburn et al.,
2004].
For example, compared to Fig. 1.4a, Fig. 1.4b shows the gradual change in appearance of the
refractive sphere from a much denser and more regular or uniform sampling of views from
an LF. Perhaps one of the reasons why humans can somewhat reliably perceive refractive
objects is that we may unconsciously move a little from side to side using our continuous
stream of vision—which is a very dense sampling of the scene. Humans may be able to detect
the inconsistent motions of the background caused by the refractive object with respect to their
viewpoints.
Likewise, the dense sampling of the LF camera captures the behaviour of the refractive object
with a high level of redundancy that is needed to differentiate refractive objects from normal
scene content. The uniform sampling pattern of the LF camera induces patterns and algorithmic
simplifications that would be unavailable to a set of non-uniformly-sampled views1. Addition-
ally, while the same set of images could be obtained with a single moving camera, LF cameras
can capture this information from a single sensor position, reducing the amount of motion re-
quired by the robot to perceive a refractive object. Therefore, LF cameras could allow robots to
more reliably and efficiently capture the behaviour of refractive objects.
Figure 1.4: In this scene, a refractive sphere has been placed amongst cards. A camera has
captured images of the scene along a horizontal rail at (a) 3 cm intervals, and (b) 1 cm
intervals. The end images (blue border) are taken from the same positions. In (a), the change
in appearance of the refractive sphere is significant and perhaps very challenging to recognize
without the prior knowledge that there is a refractive sphere in the middle of the scene. In (b), a
more frequent sampling of the scene reveals the gradual change in appearance of the refractive
sphere, which may be programmatically detected. Images from the New Stanford Light Field
Archive.
1Consider a conventional monocular camera and its dense and uniformly-sampled array of pixels that produce a
detailed 2D image. Often, more pixels yield more detail in a single image. Additionally, if the pixels were oriented
in different directions and at a variety of positions, interpreting the scene would be a much more complex task.
Returning to the original theme of this section, robots must not just perceive refractive objects,
they must be able to precisely control their relative pose around these objects as well. In the
traditional open-loop “look then move” approach, the accuracy of the operation depends directly
on the accuracy of the visual sensor and robot end-effector. VS is a robot control technique that
uses the camera output to directly control the robot motion in a feedback loop, which is referred
to as a closed-loop approach. VS has proven to be reliable at controlling robot motion with
respect to visible objects without requiring a geometric model of the object, or an accurate robot.
While refractive objects are challenging because the objects are not always directly visible, they
leave fingerprints based on how the background is distorted. LF cameras capture some of this
distortion, which we show can be exploited to visual servo towards refractive objects for further
grasping and manipulation.
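The closed-loop idea behind VS can be sketched with the classic image-based control law, nu = -gain * J+ (p - p#), using the Jacobian J, its pseudo-inverse J+, and the image coordinates p and goal coordinates p# from the nomenclature. This is a generic IBVS sketch with a toy Jacobian, not the light-field formulation developed later in this thesis.

```python
import numpy as np

def ibvs_step(J, p, p_goal, gain=0.5):
    """One iteration of image-based visual servoing: compute the camera
    spatial velocity that drives the feature error towards zero,
    nu = -gain * pinv(J) @ (p - p_goal)."""
    error = p - p_goal                      # feature error in the image
    nu = -gain * np.linalg.pinv(J) @ error  # commanded camera spatial velocity
    return nu, error

# Toy example: with an identity Jacobian, the velocity simply opposes the error.
J = np.eye(2)
p = np.array([1.0, -2.0])       # observed feature coordinates
p_goal = np.zeros(2)            # desired feature coordinates
nu, err = ibvs_step(J, p, p_goal)
```

Repeating this step as new images arrive closes the loop: errors in the camera model or robot kinematics are corrected continuously, which is why VS does not require an accurate robot or a geometric model of the target.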
1.2 Statement of Research
Based on the previous section, there exists a clear opportunity to advance robotic vision in the
area of visual control with respect to refractive objects using LF cameras. The main research
question of this thesis is thus:
How can we enable robotic vision systems to visually control their motion around refractive
objects in real-world environments, using a light-field camera and without prior models of the
objects?
The primary research question can be decomposed into sub-questions:
1. How can we visually servo using a light-field camera?
2. How can we detect refractive objects using a light-field camera?
3. How can we servo towards a refractive object?
Our hypothesis is that we can develop a novel light-field feature based on the dense and uniform
observations of a Lambertian point captured by an LF camera, which we refer to as a Lamber-
tian light-field feature. We can use this Lambertian light-field feature to perform visual servoing
in Lambertian scenes. We can observe how this light-field feature becomes distorted by a re-
fractive object and use these changes to distinguish refracted image features from Lambertian
image features. Using insight from visual servoing with the Lambertian light-field feature and
distinguishing refracted image features in the LF, we can propose a novel refracted light-field
feature to directly control the robot pose with respect to a refractive object, without needing a
prior model of the object. We define a refracted light-field feature as the projection of a feature
in the LF that has been distorted by a refractive object. The key challenges in showing this will
be in understanding how the Lambertian light-field feature changes with respect to camera pose,
how to characterise the changes in our light-field feature caused by a refractive object, and how
the LF changes as the robot moves towards a refractive object.
1.3 Contributions
The broad topics addressed in this thesis are (1) image-based visual servoing using a light-field
camera, (2) detecting refracted features, and (3) visual servoing towards refractive objects. The
specific contributions are as follows:
Light-field image-based visual servoing – partially published as [Tsai et al., 2017]
1. We propose the first derivation, implementation and experimental validation of light-field
image-based visual servoing (LF-IBVS). In particular, we define an appropriate compact
representation of an LF feature that is close to the form measured directly by LF cameras
for Lambertian scenes. We derive continuous- and discrete-domain image Jacobians for
the light field. Our LF feature enforces LF geometry in feature detection and correspon-
dence. We experimentally validate LF-IBVS in simulation and on a custom LF camera
adapter, called the MirrorCam, mounted on a robot arm.
2. We show that our method of LF-IBVS outperforms conventional monocular and stereo
image-based visual servoing in the presence of occlusions.
Distinguishing refracted image features – partially published as [Tsai et al., 2019]
1. We develop an LF feature discriminator for refractive objects. In particular, we develop
a method to distinguish a Lambertian image feature from a feature whose rays have been
distorted by a refractive object, which we refer to as a refracted image feature. Our
discriminator can distinguish refractive objects more reliably than previous work. We also
extend refracted image feature discrimination capabilities to lenslet-based LF cameras
which typically have much smaller baselines than conventional LF camera arrays.
2. We show that using our method to reject most of the refracted image feature content
enables monocular SfM in scenes containing refractive objects, where traditional methods
otherwise fail.
Light-field features for refractive objects
1. We define a representation for a refracted LF feature that approximates the local surface
of the refractive object by two orthogonal surface curvatures. We can then model the
local part of the refractive object as a toric lens. The properties of the local projections
can then be observed by and extracted from the light field.
2. We evaluate the feature’s continuity with respect to LF camera pose for a variety of dif-
ferent refractive objects to demonstrate the potential for our refracted LF feature’s use in
vision-based control tasks, such as visual servoing.
1.4 Significance
This research is significant because it will provide robots with hand-eye coordination skills for
objects that are difficult to perceive. It is a critical step towards enabling robots to see and
interact with refractive objects. Specifically, with an improved understanding of how refractive
objects behave in a single light field, robots can now distinguish refractive objects and reject
the refracted feature content. Robots can then move in scenes containing refractive objects
without having their pose estimates corrupted by the refracted scene content. Being able to
describe refractive objects in the light field and then servo towards them enables more advanced
grasping and manipulation tasks for robots.
Furthermore, the applications of understanding how refractive objects behave in the light field
as a robot moves are not limited to structure from motion and visual servoing. This theory could
help improve visual navigation and even SLAM applications for domestic and manufacturing
robots. Ultimately, this research will enable manufacturing robots to quickly manipulate objects
encased in clear plastic packaging. Domestic robots will be able to more reliably clean glasses
and serve drinks. Medical robots will more safely operate on transparent objects, such as human
eyes. Overall, this research is a significant step towards opening up an entirely new class of
objects for manipulation that have been largely ignored by the robotics community until now.
1.5 Structure of the Thesis
This thesis in robotic vision draws on theory from both computer vision and robotics research
communities. Chapter 2 provides the necessary background relevant to the remainder of this
thesis, including a description of light transport and light capture. Specifically, we explain the
difference between specular and diffuse reflections, as well as Lambertian and non-Lambertian
reflections and refraction. We discuss image formation with respect to monocular, stereo, mul-
tiple camera and LF camera systems. We then discuss visualization of 4D LFs and 4D LF
geometry, which are built on in the following chapters.
In Chapter 3, we provide a review of the relevant literature surrounding three topics: image
features, VS and refractive objects. Because VS systems typically rely on tracking image
features in a sequence of images, we first include a review of image features, how they have been
used in VS systems and how image features have been used in LF cameras. Second, we discuss
the major classes of VS systems, position-based and image-based systems, in the context of LF
cameras and refractive objects. Third, we review a variety of methods that have been explored
to automatically detect and perceive refractive objects in both computer and robotic vision.
Altogether, this chapter explains that traditional image features are insufficient for dealing with
refractive objects, that LF cameras have not yet been considered for VS systems, and that other
methods for perceiving refractive objects are impractical for most mobile robotic platforms or
rely on assumptions that significantly narrow their scope of application. Thus, there is a gap for
methods that do not rely on 3D geometric models of the refractive objects and that apply to a
wide variety of object shapes. Using LF cameras for VS towards refractive objects therefore
carves out a niche in the research community that leaves room for scientific exploration.
As mentioned previously, LF cameras are of interest for VS because they can capture the be-
haviour of view-dependent light transport effects, such as occlusions, specular reflections and
refraction within a single shot. However, VS with an LF camera for basic Lambertian scenes
has not yet been explored. As an initial investigation, we first focus on using an LF camera to
servo in Lambertian scenes. Chapter 4 develops a light-field feature for Lambertian scenes,
which we later refer to as a Lambertian light-field feature. This feature exploits the fact that a
Lambertian point in the world induces a plane in the 4D LF. Afterwards, we derive the relations
between differential feature changes and resultant robot motion. Using this feature, we then
present the first development of light-field image-based VS for Lambertian scenes and compare
its performance to traditional monocular and stereo VS systems.
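The plane-inducing property can be illustrated in a 2D slice of the LF (fixing t and v). Under one common absolute two-plane parameterization, with the viewpoint plane and image plane separated by D, a Lambertian point at depth Pz projects along a straight line in (s, u) whose slope depends only on depth. Sign and slope conventions vary between parameterizations, so the sketch below is indicative of the geometry rather than the exact convention adopted in this thesis.

```python
import numpy as np

# One convention: viewpoint plane at z = 0, image plane at z = D. A ray leaving
# viewpoint s towards the Lambertian point P = (Px, Pz) crosses the image plane
# at u = s + (Px - s) * D / Pz, which is linear in s with slope 1 - D / Pz.
D, Px, Pz = 1.0, 0.5, 4.0

s = np.linspace(-1.0, 1.0, 9)    # a horizontal row of viewpoints
u = s + (Px - s) * D / Pz        # projection of P in each view

slope = np.polyfit(s, u, 1)[0]   # fitted slope of the (s, u) samples
# For a Lambertian point the samples are exactly linear and the slope encodes
# depth; extending over (s, t, u, v) yields the plane in the 4D LF.
```

This linear structure is what the Lambertian light-field feature enforces during detection and correspondence, and its violation is what signals a refracted feature in the following chapter.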
Next, Chapter 5 presents our method to distinguish a Lambertian image feature from a feature
whose rays have been distorted by a refractive object, which we refer to as a refracted feature.
We do this by characterising the apparent motion of an image feature in the light field and
compare it to how well this apparent motion matches the model of an ideal Lambertian image
feature (which is based upon the plane in the 4D LF). We apply this method to the problem of
SfM, allowing us to reject most of the refracted feature content, which enables monocular SfM
using the Lambertian parts of the scene, in scenes containing refractive objects where traditional
methods would normally fail.
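A minimal sketch of this discrimination idea, in one LF slice: fit the ideal linear (planar in 4D) Lambertian model to a feature's apparent motion across views and threshold the fitting residual. The threshold here plays the role of the slope-consistency threshold tslope in the nomenclature, but its value and the synthetic feature tracks are illustrative assumptions only.

```python
import numpy as np

def is_lambertian(s, u, t_slope=1e-2):
    """Fit the ideal linear Lambertian model to a feature's apparent motion
    (u as a function of viewpoint s) and flag the feature as refracted when
    the worst-case residual exceeds a threshold (value illustrative only)."""
    A = np.vstack([s, np.ones_like(s)]).T     # design matrix for a line fit
    coeffs, *_ = np.linalg.lstsq(A, u, rcond=None)
    residual = np.max(np.abs(A @ coeffs - u)) # worst-case model mismatch
    return bool(residual < t_slope)

s = np.linspace(-1.0, 1.0, 9)
lam = 0.75 * s + 0.1            # Lambertian: a perfectly linear track
refr = 0.75 * s + 0.3 * s**2    # refracted: a curved, non-planar track
```

Rejecting features that fail this test removes most refracted content before pose estimation, which is what allows monocular SfM to proceed on the Lambertian parts of the scene.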
In Chapter 6, we combine the Lambertian light-field feature definition and LF-IBVS frame-
work from Chapter 4, with the concept of the refracted image feature from Chapter 5, to explore
the concept of a refracted light-field feature. This chapter is largely focused on investigating
how the 4D planar structure of a light-field feature can be extended to refractive objects, ex-
tracted from a single light field, and how this structure changes with respect to viewing pose.
We demonstrate this feature’s suitability for VS with respect to pose change, and lay the ground-
work for a system to visual servo towards refractive objects.
The unifying theme underlying the contributions of this thesis is exploring and exploiting the
properties of the light field for robotic vision. In Chapter 4, we develop a Lambertian light-
field feature for visual servoing, and in Chapter 5, we propose a method to detect refractive
objects. Both of these investigations exploit the fact that a Lambertian point in the world in-
duces a plane in the 4D LF. In Chapter 6, we use the induced plane to propose a method that en-
ables visual servoing towards refractive objects. Throughout this thesis, the dense and uniform
sampling of the light field induces patterns that we exploit to improve robotic vision algorithms.
Finally, conclusions and suggestions for further work are presented in Chapter 7.
Chapter 2
Background on Light Transport &
Capture
This chapter begins with a background on how light is transported through scenes, including
reflection and refraction. We then discuss single image formation with conventional monocu-
lar cameras and extend the discussion to LF cameras. Finally, we illustrate how we typically
visualize LFs and discuss the theory of 4D LF geometry.
2.1 Light Transport
In order to understand refractive objects and LF cameras, it is important to first understand light
transport, the nature of light and how it interacts with matter. Light is an electromagnetic wave,
but when the wavelength of light is small relative to the size of the structures it interacts with,
we can neglect the more complex wave-like behaviours of light and focus on the particle-like
behaviours of light, where light can be described as rays that move in straight lines within a
constant medium [Pedrotti, 2008]. This approximation describes most phenomena measured
by human eyes, most cameras and most robotic vision systems.
16 2.1. LIGHT TRANSPORT
2.1.1 Specular Reflections
When light rays hit a surface, light is reflected. The law of reflection states that for a smooth,
flat and mirror-like surface, the reflected light ray is on a plane formed by the incident light ray
and the surface normal. Additionally, the angle of reflection θr is equal to the angle of incidence
θi [Lee, 2005], as shown in Fig. 2.1a. If we know the surface geometry and the incident light
ray, then we can recover the direction of the reflected ray. Alternatively, if we know the incident
and reflected light rays, then we can determine the geometry (normals) of the reflective surface.
For surfaces that are not perfect mirrors, specular reflections can still occur, taking the form of a
narrow angular distribution of the reflected light. The ratio of reflected light to incident light is
known as the reflectance and values of more than 99% can be achieved through a combination of
surface polishing and advanced coatings [Freeman and Fincham, 1990]. Examples of specular
reflective materials are metal, mirrors, glossy plastics and shiny surfaces of transparent objects.
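The mirror-reflection rule described above can be written in vector form as r = d − 2(d·n)n for an incident direction d and unit surface normal n. A minimal numerical sketch (the vectors chosen are illustrative, not from the text):

```python
import math

def reflect(d, n):
    """Mirror-reflect incident direction d about unit surface normal n."""
    dot = sum(di * ni for di, ni in zip(d, n))
    return tuple(di - 2.0 * dot * ni for di, ni in zip(d, n))

# Ray arriving at 45 degrees in the x-z plane onto a surface with normal +z:
d = (math.sqrt(0.5), 0.0, -math.sqrt(0.5))
n = (0.0, 0.0, 1.0)
r = reflect(d, n)
# r = (sqrt(0.5), 0, +sqrt(0.5)): the angle of reflection equals the angle
# of incidence, and r lies on the plane spanned by d and n
```

The reflected ray stays in the plane of incidence, as the law of reflection requires.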
2.1.2 Diffuse Reflections
Most real surfaces are not perfect mirrors. Instead, they are often rough and produce diffuse
reflections. Light interacts with rough surfaces via penetration, scattering, absorption and re-emission from the surface. These surfaces are commonly modelled using a distribution of
Figure 2.1: (a) The law of reflection for a smooth surface. The angle of the incidence θi is
equal to the angle of reflection θr about the surface normal N on the plane of incidence. The
reflection off a smooth surface illustrates a specular reflection. (b) The reflections from a rough
surface of micro-facets illustrate a diffuse reflection.
micro-facets. Each facet acts like a small smooth surface that has its own, single surface normal,
which varies from facet to facet, as in Fig. 2.1b. The extent to which the micro-facet normals
differ from the smooth surface normal is a measure of surface roughness. The distribution of
the micro-facet normals creates a broad angular distribution of reflected light, which is known as
a diffuse reflection. Some examples of diffuse materials include wood and felt.
2.1.3 Lambertian Reflections
The Lambertian surface model is often referred to as the isotropic radiance constraint, the
brightness constancy assumption, or the photo consistency assumption in computer graphics
and robotic vision. Each point on a Lambertian surface reflects light with a cosine angular
distribution, as shown in Fig. 2.2a, where θ is the viewing angle relative to the surface normal.
However, when a surface is viewed with a finite field of view (FOV), the surface area seen by the
observer is proportional to 1/ cos θ. As θ approaches 90◦, more surface points become visible
to the observer. The observed radiance (amount of reflected light) comes from the reflected intensity from each surface point (∝ cos θ) multiplied by the number of points seen (∝ 1/ cos θ),
which cancels out and is thus independent of θ. This results in the observed radiance being
roughly equal in all directions [Lee, 2005], as shown in Fig. 2.2b. The Lambertian model is
very common in computer graphics, and often implicitly assumed in many robotic vision algorithms. However, this assumption is invalid for specular reflections and refractive objects,
which motivates us to consider non-Lambertian reflections and refraction.
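The cancellation described above can be checked numerically; a small sketch (the sampled viewing angles are arbitrary):

```python
import math

# Observed radiance = (reflected intensity per point) x (surface points seen).
# For a Lambertian surface: intensity ∝ cos(theta), visible area ∝ 1/cos(theta).
radiances = [math.cos(math.radians(a)) * (1.0 / math.cos(math.radians(a)))
             for a in (0, 30, 60, 80)]
# Every entry is 1.0: the observed radiance is independent of viewing angle.
```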
2.1.4 Non-Lambertian Reflections
Although the Lambertian assumption has been shown to work quite well in practice for most
scenes and surfaces [Levin and Durand, 2010], there remains a variety of surfaces, such as those
polished or shiny, that reflect light in a manner that does not follow the Lambertian model. These
Figure 2.2: (a) The cosine distribution of a Lambertian reflection at a point with an observer at
viewing angle θ. (b) A Lambertian reflection has an observed radiance approximately equal in
all directions. The appearance of the ray stays the same regardless of viewing angle.
non-Lambertian reflections occur when at least part of the reflected light depends on viewing angle, as shown in Fig. 2.3. Such surfaces are not perfectly smooth because of the molecular
structure of materials; however, when the irregularities are less than the wavelength of incident
light, the reflected light becomes increasingly specular. This means that even rough surfaces can
exhibit some degree of non-Lambertian reflections when viewed at a sufficiently sharp angle.
Shiny surfaces involve both specular and diffuse surface reflections. A common approach to
dealing with these non-Lambertian surfaces is to use the dichromatic reflection model [Shafer,
1985], which separates the reflections into specular and diffuse components. The diffuse component is modelled as Lambertian and the rest is attributed to a non-Lambertian reflection. The
relative amount of these two components depends on material properties, geometry of light
source, observer viewing pose and surface normal [Corke, 2017]. The model is valid for materials such as wood, paint, paper and plastic, but excludes materials such as metals. In the graphics community, there are more advanced models; Schlick [Schlick, 1994], Lee [Lee, 2005] and Kurt and Edwards [Kurt and Edwards, 2009] provide good surveys of modelling non-Lambertian light reflection.
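The additive split of the dichromatic model can be sketched numerically. The lobe shape and the weights kd, ks and shininess below are invented for the example; the dichromatic model itself only prescribes the separation into a viewpoint-independent diffuse term and a viewpoint-dependent specular term:

```python
import math

def dichromatic_intensity(theta_i, theta_vr, kd=0.7, ks=0.3, shininess=20):
    """Toy dichromatic split: a Lambertian diffuse term plus a narrow
    specular lobe around the mirror direction.
    theta_i:  angle between light direction and surface normal (radians)
    theta_vr: angle between viewing direction and mirror direction (radians)"""
    diffuse = kd * max(math.cos(theta_i), 0.0)                 # viewpoint-independent
    specular = ks * max(math.cos(theta_vr), 0.0) ** shininess  # viewpoint-dependent
    return diffuse + specular

# Viewed along the mirror direction there is a highlight; a few tens of
# degrees away, only the diffuse component remains significant.
on_spec = dichromatic_intensity(math.radians(30), 0.0)
off_spec = dichromatic_intensity(math.radians(30), math.radians(30))
```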
Figure 2.3: A non-Lambertian reflection has an uneven reflected light distribution. The reflected
intensity, and thus appearance of the ray changes with viewing angle.
2.1.5 Refraction
Refractive objects pose a major challenge for robotic vision because they typically do not have
an appearance of their own. Rather, they allow light to pass through them and in the process
distort or change the direction of the light. When light passes through an interface—a boundary
that separates one medium from another—light is partially reflected and transmitted. Refraction
occurs when light rays are bent at the interface. Assuming the media are isotropic, the amount
of bending is determined by the media’s index of refraction (IOR) n and Snell’s Law.
Snell’s law of refraction, illustrated in Fig. 2.4, relates the sines of the angles of incidence θi
and refraction θr at an interface between two optical media based on their IOR,
ni sin θi = nr sin θr, (2.1)
Figure 2.4: Snell’s law of refraction at the interface of two media.
where θi and θr are measured with respect to the surface normal, and ni and nr are the IOR of
the incident and refracting medium, respectively. The IOR of a medium is defined as the ratio
of the speed of light in a vacuum c over the speed of light in the medium v, given by n = c/v.
For air and most gases, n is taken as 1.0, while for solid materials such as glass, n = 1.52.
As light passes from a lower to higher n, the light ray is bent towards the normal, while light is
bent away from the normal when it passes from higher to lower n.
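Snell's law of Eq. (2.1) can be evaluated directly; a minimal sketch (the angles and refractive indices are illustrative):

```python
import math

def refract_angle(theta_i, n_i, n_r):
    """Angle of refraction from Snell's law, n_i sin(theta_i) = n_r sin(theta_r).
    Angles are in radians, measured from the surface normal.
    Returns None when there is no transmitted ray (total internal reflection)."""
    s = n_i * math.sin(theta_i) / n_r
    if abs(s) > 1.0:
        return None
    return math.asin(s)

# Air (n = 1.0) into glass (n = 1.52): the ray bends towards the normal.
theta_r = refract_angle(math.radians(45.0), 1.0, 1.52)
# theta_r is about 27.7 degrees, smaller than the 45-degree angle of incidence
```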
Because the bending of light depends on the surface normal, the shape of an object as well as
its IOR play an important role in the appearance of visual features on the surface of a refractive
object. The larger the angle between the incident light and the object’s surface normal, the
larger the change in direction of the refracted light. And the thicker an object is, the longer the
light can travel through the refractive medium, resulting in a larger change in appearance.
To complicate matters more, the surfaces of transparent objects are often both reflective and
refractive. This means that a portion of the light is reflected at the surface, while another portion
is refracted through the surface. Fresnel’s equations describe the reflection and transmission of
light at the boundary of two different optical media [Hecht, 2002]. The amount of reflected
light depends on the media’s n and angle of incidence.
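At normal incidence the Fresnel equations reduce to a single reflectance, R = ((n1 − n2)/(n1 + n2))²; a sketch under that simplification (the full angle-dependent equations are not reproduced here):

```python
def fresnel_normal_reflectance(n1, n2):
    """Fraction of light reflected at an interface at normal incidence,
    where the Fresnel equations reduce to ((n1 - n2) / (n1 + n2))^2."""
    r = (n1 - n2) / (n1 + n2)
    return r * r

# Air/glass interface: about 4% of the light is reflected and the
# remaining ~96% is transmitted into the glass.
R = fresnel_normal_reflectance(1.0, 1.52)
```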
Furthermore, when light travels from a medium of higher to lower n, internal reflections can occur within refractive objects. Light travelling along the interface's surface normal (at normal incidence) does not change direction. Light moving at an angle large enough to cause the refracted ray
to bend 90◦ from the normal travels along the interface itself. Such an angle is referred to as
the critical angle θc. Any incident light that has an angle greater than θc is totally reflected
back into the original medium, as per the law of reflection. This phenomenon is known as total
internal reflection, which is typically exploited in propagating light through fibre optics. Internal
reflection can cause light sources to appear within transparent objects from unexpected angles
and even disappear entirely. This further adds to the viewpoint-dependent nature of refractive
objects.
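The critical angle follows from Snell's law by setting the refracted angle to 90°, giving θc = arcsin(nr/ni); a minimal sketch (glass-to-air values are illustrative):

```python
import math

def critical_angle(n_i, n_r):
    """Critical angle for total internal reflection, from Snell's law with
    theta_r = 90 degrees. Requires travel from a denser to a rarer medium."""
    if n_i <= n_r:
        raise ValueError("total internal reflection requires n_i > n_r")
    return math.asin(n_r / n_i)

# Glass to air: light incident beyond ~41.1 degrees from the normal is
# totally internally reflected back into the glass.
theta_c = critical_angle(1.52, 1.0)
```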
In all, refractive objects are texture-less on their own. They can refract, magnify (scale), flip
and distort the background, and even cause it to vanish at certain angles. All of these effects
depend heavily on the refractive object’s surface normals, thickness, material properties, as well
as the object’s background, so it is no surprise that refractive objects easily confuse most robotic
vision techniques that do not account for more than simple Lambertian reflections.
2.2 Monocular Cameras
Cameras are excellent sensors for robotic vision; they are compact, affordable sensors that have
low power consumption and provide a wealth of visual information. Camera systems are very
flexible in application, owing to the variety of computer vision and image processing algorithms
available. In this section, we look at monocular cameras, the central projection model and the
loss of depth information. We do this to better understand LF cameras, which can sometimes
be considered as an array of monocular cameras.
2.2.1 Central Projection Model
Image formation using a conventional monocular camera projects a 3D world onto a 2D surface.
The central projection model is often used to perform this transformation and is also referred
to as the central perspective or pinhole camera model. An illustration of how the model works
is shown in Fig. 2.5. It assumes an infinitely small aperture for light to pass through to the
image plane and sensor. The camera’s optical axis is defined as the centre of the field of view.
The geometry of similar triangles describes the projective relationships for world coordinates
P = (Px, Py, Pz) onto the image plane p = (x, y) as
x = f Px/Pz,    y = f Py/Pz.    (2.2)
The image plane point can be written in homogeneous form p = (x′, y′, z′) as
x′ = f Px,    y′ = f Py,    z′ = Pz.    (2.3)
If we consider the homogeneous world coordinates P ′ = (Px, Py, Pz, 1), then the central pro-
jection model can be written linearly in matrix form as
p = | f 0 0 0 | | Px |
    | 0 f 0 0 | | Py |  =  K P′,    (2.4)
    | 0 0 1 0 | | Pz |
                | 1  |
where K is a 3×4 matrix known as the camera matrix and p is the coordinate of the point with
respect to the camera frame [Hartley and Zisserman, 2003].
Figure 2.5: The central projection model. The image plane is a focal length f in front of the camera's origin. A non-inverted image of the scene is formed as world point P(Px, Py, Pz) is captured at image point p(x, y) on the image plane.
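As a concrete sketch of Eqs. (2.2)–(2.4), the following projects a camera-frame point through the 3×4 camera matrix and dehomogenises the result (the focal length and point are illustrative):

```python
def project(P, f):
    """Project camera-frame world point P = (Px, Py, Pz), Pz > 0, onto the
    image plane using the 3x4 camera matrix K of Eq. (2.4), then convert the
    homogeneous result (x', y', z') back to inhomogeneous coordinates."""
    K = [[f, 0.0, 0.0, 0.0],
         [0.0, f, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
    Ph = list(P) + [1.0]  # homogeneous world point P'
    xh, yh, zh = (sum(k * p for k, p in zip(row, Ph)) for row in K)
    return (xh / zh, yh / zh)

# A point 2 m in front of a camera with an 8 mm focal length:
p = project((0.1, 0.05, 2.0), 0.008)
# p ≈ (0.0004, 0.0002): image-plane coordinates in metres
```

Doubling Pz halves both image coordinates, which is the projective scale ambiguity discussed later in this chapter.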
The central projection model is relatively simple, requires no lenses to focus the light, and is thus commonly used throughout robotic vision [Siciliano and Khatib, 2016]. However,
the infinitely small aperture from the central projection model has the practical problem of not
letting much light from the scene onto the image sensor. This may result in dark images that
may not be useful, or impractically long exposure times for many robotic applications.
In practice, most modern cameras use optical lenses to achieve reasonable image exposure.
However, the central projection model does not include geometric distortions or blurring effects caused by lenses and finite-sized apertures. Thus, the central projection model is often augmented with additional terms to account for the image distortion caused by lenses. See Sturm et
al. for a survey on other camera models, including models for catadioptric and omnidirectional
cameras [Sturm et al., 2011].
2.2.2 Thin Lenses and Depth of Field
The infinitely small aperture of the central projection model is a mathematical approximation
only and does not physically exist. In fact, as the aperture size shrinks towards a limit related to the wavelength of the observed light and the shape of the aperture, diffraction increasingly blurs the image, which prevents the use of arbitrarily small apertures. Thus all
apertures have a nonzero diameter. And in practice, optical lenses are used to allow for a much
larger aperture so that more light from the scene can reach the image sensor.
It is typical to assume that the axial thickness of the lens is small relative to the radius of
curvature of the lens, which means that the lens is “thin”. It is also common to assume that
the angles the light rays make with the optical axis of the lens are small, which is known as
the paraxial ray approximation. Thus, assuming thin lenses and paraxial rays, the mathematics
describing the behaviour of lenses can be significantly simplified. As shown in Fig. 2.6, the
light rays emitting from a point P in the scene pass through the lens and converge to a point
behind the lens based on the thin lens formula,
1/zi + 1/zo = 1/f,    (2.5)
where zo is the distance to the subject, zi is the distance to the image, and f is the focal length
of the lens. Therefore, we can determine the distance along the z-axis of P , given the lens’
focal length and the distance of the image formed by the lens.
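Solving the thin lens formula of Eq. (2.5) for the image distance gives a one-line function; a sketch (the lens and subject distances are illustrative):

```python
def image_distance(z_o, f):
    """Thin-lens formula 1/z_i + 1/z_o = 1/f, solved for image distance z_i."""
    return 1.0 / (1.0 / f - 1.0 / z_o)

# A subject 1 m from a 50 mm lens focuses about 52.6 mm behind the lens:
z_i = image_distance(1000.0, 50.0)  # all distances in millimetres
```

As z_o grows towards infinity, z_i approaches the focal length f, which is why distant scenes focus at one focal length behind the lens.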
The trade-off with the larger aperture from the lens is that the incoming light rays can only be
focused at a certain depth. Objects at different depths produce rays that converge at different
points behind the lens. A cone of light rays that converges to a point on the image plane is considered to be in focus, or at the point of convergence, as shown in Fig. 2.7. When the point
of convergence does not lie on the focal plane, the rays occupy an area on the image plane and
appear blurred. This area is known as the circle of confusion c and is useful for describing how
“sharp” or in focus a world point appears in an image.
Real lenses are not able to focus all rays to perfect points; the smallest circle of confusion that a lens can produce is often referred to as the circle of least confusion. Whether a blur circle is distinguishable from a point on the image plane also depends on pixel size. If c is smaller than a pixel, it
is usually indistinguishable from a point on the image plane and thus considered to be in focus,
even if the focused light does not converge to a point that strictly lies on the image plane. This
leads cameras to have a nonzero depth of field, the range of distances at which objects in the
scene appear in focus on a discrete, digital sensor.
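By similar triangles through the aperture, a point whose rays converge at distance z_i behind the lens, imaged on a sensor placed at z_s, spreads over a blur circle of diameter c = d·|z_s − z_i|/z_i. A sketch combining this with the thin-lens formula (the lens parameters are illustrative):

```python
def image_distance(z_o, f):
    """Thin-lens formula solved for the image distance."""
    return 1.0 / (1.0 / f - 1.0 / z_o)

def circle_of_confusion(z_o, z_focus, f, d):
    """Blur-circle diameter for a point at depth z_o when the sensor is
    placed to focus depth z_focus, for a lens of aperture diameter d."""
    z_s = image_distance(z_focus, f)  # sensor position for the focused depth
    z_i = image_distance(z_o, f)      # where this point actually converges
    return d * abs(z_s - z_i) / z_i   # similar triangles through the aperture

# 50 mm lens with a 25 mm aperture (f/2), focused at 1 m (units: mm):
c_foc = circle_of_confusion(1000.0, 1000.0, 50.0, 25.0)  # 0: in focus
c_far = circle_of_confusion(2000.0, 1000.0, 50.0, 25.0)  # ~0.66 mm blur circle
```

Halving the aperture d halves c for every depth, which is the mechanism behind the small-aperture/large-depth-of-field trade-off discussed next.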
A small aperture provides a large depth of field, which is desirable to keep more of the scene
in focus. However, the small aperture admits less light. Less light can lead to issues with noise
and these issues cannot always be compensated for by simply increasing the exposure time,
due to motion blur. There is therefore an integral link between depth of field, motion blur and
signal-to-noise ratio, with relationships determined by exposure duration and aperture diameter.
Figure 2.6: Image formation for a thin lens, shown as a 2D cross section. By convention, the
camera’s optical axis is the z-axis with the origin at the centre of the thin lens.
2.2.2.1 Monocular Depth Estimation
As the 3D world is projected onto a 2D surface, the mapping is not one-to-one and depth
information is lost. A unique inverse of the central projection model does not exist. Given an
image point p(x, y), we cannot uniquely determine its corresponding world point P (Px, Py, Pz).
In fact, P can lie at any distance along the projecting ray CP in Fig. 2.5. This is known as the
scale ambiguity and is a significant challenge for robots striving to interact in a 3D world using
only 2D images.
A variety of strategies can be applied to compensate for this loss, such as active vision [Krotkov
and Bajcsy, 1993], depth from focus [Grossmann, 1987,Krotkov and Bajcsy, 1993], monocular
SfM [Hartley and Zisserman, 2003,Schoenberger and Frahm, 2016], monocular SLAM [Civera
et al., 2008], and learnt monocular depth estimation [Saxena et al., 2006, Godard et al., 2017].
However, without prior geometric models, very few of these methods apply to refractive objects
as we will later discuss in Chapter 3. It is therefore worth considering stereo and other camera
systems to exploit more views for depth information.
Figure 2.7: Diagram illustrating the circle of confusion c for a point source passing through a
lens of diameter d. The point source is focused behind the lens (top), in focus (middle) and
focused in front of the lens (bottom).
2.3 Stereo Cameras
Stereo camera systems use two cameras and the known geometry between them to obtain depth
through triangulation in a single sensor capture. Given the corresponding image points p1, p2
and both camera poses, the 3D location of world point P can be determined. Epipolar geometry defines the geometric relationship between the two images captured by the stereo camera system, as illustrated in Fig. 2.8, and can be used to simplify the stereo matching process required for stereo triangulation.
As in Fig. 2.8, the centre of projection for each camera is given as {1} and {2}. The 3 points,
P , {1} and {2} define a plane known as the epipolar plane. The intersection of the epipolar
plane and the image plane for cameras 1 and 2 define the epipolar lines, l1 and l2, respectively.
These lines constrain where P is projected into each image at p1 and p2. Given p1, we seek p2
in I2. Rather than searching the entire image, we need only search along l2. Conversely, given
p2, we can find p1 on l1.
Figure 2.8: Epipolar geometry used for stereo camera systems. The epipolar plane is defined by
point P and camera centres of {1} and {2}. Note that {1} and {2} define the reference frames
of cameras 1 and 2, respectively. The intersection of the epipolar plane with the two image
planes I1 and I2 define the epipolar lines l1 and l2, respectively. Given 1p, knowledge of l1 and l2 can reduce the search for the corresponding image point in the second image plane from
a 2D to a 1D problem. Seven pairs of corresponding image points p are required to estimate
the fundamental matrix, in order to recover the translation 1T2 and rotation 1R2 of {2} in the
reference frame of {1}.
This important geometrical relationship can be encapsulated algebraically in a single matrix
known as the fundamental matrix F [Corke, 2013],
F = K⁻ᵀ T× R K⁻¹,    (2.6)
where K is the camera matrix, T× is the skew symmetric matrix of the translation vector T ,
and R is the rotation between the two camera poses. For any pair of corresponding image points
1x and 2x, F satisfies
2xᵀ F 1x = 0.    (2.7)
The fundamental matrix is a 3 × 3 matrix with 7 degrees of freedom (DOF). Thus a minimum of 7 unique pairs of points is required to compute F.
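The epipolar constraint of Eq. (2.7) can be verified numerically. For simplicity this sketch works in normalised image coordinates (K taken as the identity), in which F reduces to the essential matrix T×R; the scene point, baseline and rotation are illustrative:

```python
def skew(t):
    """Skew-symmetric matrix T_x such that T_x v = t cross v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Camera 2 translated 0.1 along x relative to camera 1, with no rotation.
# In normalised image coordinates (K = identity), F reduces to E = T_x R.
T = (0.1, 0.0, 0.0)
E = skew(T)  # R is the identity here

# Project a world point into both (pinhole) views and test the constraint:
P = (0.3, 0.2, 2.0)
x1 = (P[0] / P[2], P[1] / P[2], 1.0)                    # view 1
x2 = ((P[0] - T[0]) / P[2], (P[1] - T[1]) / P[2], 1.0)  # view 2
residual = dot(x2, matvec(E, x1))  # x2^T E x1, zero up to float error
```

The residual vanishes for any world point P, which is exactly the constraint that restricts the correspondence search to an epipolar line.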
A typical stereo camera arrangement is for two cameras with parallel optical axes, both orthogonal to their baseline. This yields horizontally-aligned epipolar lines and further simplifies
the correspondence search from image lines to image rows. For this setup, it is assumed that
both cameras have the same focal length f and a baseline separation b. In the case of a typical
horizontally-aligned stereo camera system, for image points p1(u1, v1) and p2(u2, v2), the disparity d is given as d = u2 − u1. The disparity is a measure of motion parallax. The depth Z can then be computed using
Z = fb/d,    (2.8)
which shows that d is inversely proportional to depth.
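Eq. (2.8) in code, with the zero-disparity (point at infinity) case handled explicitly; the focal length, baseline and pixel coordinates are illustrative:

```python
def stereo_depth(u1, u2, f, b):
    """Depth Z = f*b/d from Eq. (2.8), for a horizontally-aligned stereo pair
    with focal length f (pixels), baseline b, and disparity d = u2 - u1."""
    d = u2 - u1
    if d == 0:
        return float("inf")  # zero disparity: the point is at infinity
    return f * b / d

# f = 800 pixels, baseline b = 0.1 m:
Z_far = stereo_depth(100, 120, 800, 0.1)   # 20 px disparity -> 4.0 m
Z_near = stereo_depth(100, 140, 800, 0.1)  # 40 px disparity -> 2.0 m
```

Doubling the disparity halves the depth, illustrating the inverse relationship between d and Z.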
However, stereo methods are limited to a single fixed baseline. In this configuration, edges
parallel to this baseline do not yield easily observable disparity (especially when the edge spans the width of the image), and thus their depths are not easily determined. Additionally, having only two views of the same scene means that feature correspondence must be performed on just two
images. In Lambertian scenes, this is sufficient; however, stereo vision can fail under significant
appearance changes, for example in the presence of occlusions and non-Lambertian objects.
2.4 Multiple Cameras
Additional cameras yield more views and with them more redundancy; however, the configuration of their relative poses is important. Introducing a third camera creates a trinocular camera
system. While stereo uses the 3×3 fundamental matrix, tri-camera methods use a 3×3×3 tensor, known as the trifocal tensor. These tensors can be determined from a set of corresponding
image points from 2 and 3 views, respectively. These tensors can then be decomposed into the
cameras’ projection matrices, after which triangulation can be used to recover the 3D positions
of the points. According to Hartley and Zisserman, the quadrifocal tensor exists for 4 views,
but it is difficult to compute, and the tensor method does not extend to n views [Hartley and
Zisserman, 2003].
Furthermore, multi-camera vision systems are not necessarily limited to regularly-sampled grids
aimed at the same scene. For example, a common commercial multi-camera configuration is
to have six 90◦ FOV cameras mounted together to provide a 360◦ panoramic view. In this
configuration, there is very little scene overlap between the cameras, which provides very little
redundancy. As we will later show in Ch. 5, redundancy of views from different perspectives at
regular intervals is extremely important for characterising the appearance of refractive objects
as a function of viewpoint.
2.5 Light-Field Cameras
LF cameras are based on the idea of computational photography, in which a large part of the image capture process is performed by software rather than hardware. LF cameras belong to
the greater class of generalised cameras [Li et al., 2008, Comport et al., 2011]. In this section,
we begin by introducing the plenoptic function as a means of modelling light from all possible
views in space. Under certain assumptions and restrictions, we explain how we can reduce
the plenoptic function to a 4D LF that captures a more manageable representation of multiple
views. We then discuss the most common LF parameterisation and the architectures of cameras that capture LFs, how captured LFs are decoded from raw sensor measurements into this parameterisation, and why we expect LF cameras to be suitable for dealing with refractive objects.
2.5.1 Plenoptic Function
Light is more than a “2D image plus depth” for a single perspective. Light is a much higher-
dimensional phenomenon. Adelson and Bergen [Adelson and Bergen, 1991] introduced the
plenoptic function as a means of representing light and encapsulating all possible views in
space. The term “plenoptic” was coined from the root words for “all” and “seeing”, so the
plenoptic function conceptualizes all the properties of light in a scene. Light is modelled as rays, each of which can be described using seven parameters: three spatial coordinates (px, py, pz) that define the position of the ray's source, two orientation coordinates (θ, φ) that define the ray's elevation and azimuth, the wavelength λ that accounts for the colour of the light, and time t. Together, these 7 parameters yield the plenoptic function,
P (px, py, pz, θ, φ, λ, t), (2.9)
which is the intensity of the ray as a function of space, time and colour, also illustrated in
Fig. 2.9. Thus the plenoptic function represents all the light flowing through every point in
a scene through all space and time. The significance of the plenoptic function is best put in
Adelson and Bergen’s own words [Adelson and Bergen, 1991]:
The world is made of three-dimensional objects, but these objects do not commu-
nicate their properties directly to an observer. Rather, the objects fill the space
around them with the pattern of light rays that constitutes the plenoptic function,
and the observer takes samples from this function. The plenoptic function serves
as the sole communication link between physical objects and their corresponding
retinal images. It is the intermediary between the world and the eye.
To explain how cameras typically sample the plenoptic function, we consider a monocular,
monochrome camera. First, time is sampled by setting a small shutter time on the camera. The
camera’s photosensor integrates over a small amount of time as the photosites are exposed to
Figure 2.9: The plenoptic function models all the light flowing through a scene in 7 dimensions: 3 for position, 2 for direction, 1 for time and 1 for wavelength.
the scene, and incoming photons are counted by the sensor. Exposure time and aperture size
directly affect the exposure of an image by establishing a trade-off between depth of field, image
brightness and motion blur.
Second, the wavelength is sampled from the plenoptic function by integrating the incoming
light over a small band of wavelengths. Each photosite uses a filter to select a specific range of
wavelengths (typically red, green, or blue), although in practice, the luminosity curves overlap,
especially for red and green.
Third, the position is sampled in a camera by setting an aperture. The aperture determines
the positions of the rays seen by the camera. This is typically idealized as an infinitely small
pinhole, whereby all the light in the scene passes through to project an inverted image of the
scene at one focal length from the pinhole. The location of this pinhole is the camera origin, a
3D point known as the nodal point.
Finally, direction is sampled from the plenoptic function in a conventional camera. Each pixel
integrates the scene luminance over a range of both direction angles. The range of directions
that the camera can capture is called the FOV. The parameters that determine the FOV are the
focal length and the pixel size and number in both the x- and y-directions. As we integrate
over the directions, we also project the scene onto the camera sensor, which is where scale
information is lost. Additionally, only unoccluded objects are projected onto the sensor in a
Lambertian scene. Occluded objects are therefore not measured by a conventional camera, with
the exception of transparency and translucency.
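The FOV determined by focal length, pixel size and pixel count can be computed as 2·atan(sensor extent / 2f); a sketch (the sensor parameters are illustrative):

```python
import math

def field_of_view(f, pixel_size, n_pixels):
    """Angular field of view (degrees) along one sensor axis:
    FOV = 2 * atan(sensor_extent / (2 * f))."""
    sensor_extent = pixel_size * n_pixels
    return math.degrees(2.0 * math.atan(sensor_extent / (2.0 * f)))

# 8 mm lens with 1280 pixels of 5 micrometres (a 6.4 mm wide sensor):
fov_x = field_of_view(8e-3, 5e-6, 1280)
# roughly 43.6 degrees of horizontal FOV
```

A shorter focal length or a larger sensor widens the FOV, i.e. the camera integrates over a larger range of ray directions.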
Although not part of the original 7D plenoptic function, the 8D plenoptic function includes
polarisation and is worth mentioning [Georgiev et al., 2011]. Linear polarisation is when light
waves have vibrations only in a single plane, while unpolarised light has vibrations that occur
in many planes, or equivalently in many directions that can also change rapidly. Unpolarised
light may be better thought of as a mixture of randomly polarised light. In terms of sampling
polarisation, a camera can measure different polarisations by employing polarisers that sample
only certain polarisations. Since humans do not have any natural polarisation filters built into
their eyes, regular cameras are not typically built with polarisers. Thus most cameras measure
unpolarised light by integrating over all the different polarisations.
Therefore, with a monocular camera we are integrating over small intervals of position, time and
wavelength and over a larger range of direction. This means we are sampling only 2 dimensions
of the 7D plenoptic function, which results in a 2D image. We note that RGB cameras provide
colour images, which may seem like a sampling over wavelength. Wavelength is measured at
3 (relatively) small intervals that are unevenly spaced and overlap to some degree; however,
wavelength is not sampled in the signal processing sense of multiple, regularly-spaced mea-
surements along the spectrum of wavelengths. Thus we consider images from RGB cameras
as 2D images. In order to overcome the aforementioned limitations of 2D images, we must
consider how to capture multiple views within a single sensor capture. This can be achieved by
capturing the light field.
2.5.2 4D Light Field Definition
The light field was first defined by Gershun in 1936 as the amount of light travelling in every
direction through every point in space [Gershun, 1936], but was only reduced from the 7D
CHAPTER 2. BACKGROUND ON LIGHT TRANSPORT & CAPTURE 33
plenoptic function to the more tractable 4D LF as a function of both position and direction
in free space by Levoy and Hanrahan [Levoy and Hanrahan, 1996] and Gortler et al. [Gortler
et al., 1996] in 1996. Interestingly, the 4D LF was initially developed by the computer graphics
community to render new views of a scene given several views of the existing scene, without
involving the complexities associated with geometric, lighting and surface models [Levoy and
Hanrahan, 1996]. However, the 4D LF has recently proved useful for computer vision and
robotics for solving the inverse problem: extracting scene structure given several images of the
scene.
In order to reduce the 7D plenoptic function to a 4D light field, we first integrate over small
intervals of time and wavelength. This reduces the plenoptic function to 5D. Since the radiance
along rays in free space is constant in non-attenuating media, we can further reduce the plenoptic
function from 5D to 4D. This means that the light rays do not change their value as they pass
through the scene, implying that rays do not pass through objects¹ and do not change in their
intensity as they pass through the air.
An alternate geometric way of understanding the dimensional reduction from 5D to 4D is to
consider a ray defined by a point in 3D space, and a normalized direction. The ray is thus
defined by 5 parameters. In free space, the value of the ray does not change as we move the
point along the ray’s axis and thus the value of the plenoptic function is the same for many
combinations of these 5 parameters. If we fix our point to be on the xy-plane, i.e. set z = 0,
then we have 4 independent parameters that describe the ray; thus we have reduced the 7D
plenoptic function to a 4D LF representation.
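This dimensional reduction can be illustrated with a short sketch: two different 5-parameter descriptions of the same ray reduce to the same 4 parameters once the point is slid onto the z = 0 plane. The function `reduce_ray` and all values below are our own illustrative constructions.

```python
import numpy as np

def reduce_ray(point, direction):
    """Map a 5-parameter ray (3D point + unit direction) to 4 parameters
    by sliding the point along the ray until it lies on the z = 0 plane.

    Returns (x, y, dx, dy): the intersection with the plane and the first
    two components of the unit direction (dz is recoverable up to sign).
    """
    point = np.asarray(point, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    if np.isclose(direction[2], 0.0):
        raise ValueError("ray parallel to the z = 0 plane: not representable")
    k = -point[2] / direction[2]          # parameter that brings z to 0
    on_plane = point + k * direction
    return np.array([on_plane[0], on_plane[1], direction[0], direction[1]])

# Two different points on the same ray reduce to the same 4 parameters.
d = np.array([0.1, 0.2, 1.0])
p1 = np.array([1.0, 2.0, 3.0])
p2 = p1 + 5.0 * d                         # slide the point along the ray
```

Since many 5-parameter combinations describe the same ray, the reduced 4-parameter form is the one that matters for the LF.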
The 4D LF is the smallest sample of the plenoptic function needed to encode multiple views
of the scene. Multiple views are of interest because they contain much more information about
the scene. As illustrated in Fig. 1.3, the classic pinhole camera model can be considered a tiny
peephole through a wall that grants only a single view of the scene from a single viewpoint,
while an LF can be thought of as a window through the wall that grants multiple views of the
scene as we move behind the window. In relation to conventional cameras, an LF image can be
thought of as a set of 2D images of the same scene, taken from a range of 3D positions in space.
Typically these 3D positions are constrained to a planar array for simplicity. The LF is valid in
non-attenuating media. Novel views can be rendered from the LF. Occlusions are reproduced
correctly in the LF, but we cannot render views behind occluding objects.

¹Seemingly, this implies that the 4D LF cannot capture the behaviour of refractive objects; however, in the
subsequent chapters of this thesis, we will show that this is not the case. We can look at the relative changes
between views in the LF to infer the distortion caused by refractive objects.
2.5.3 Light Field Parameterisation
There are many different parameterisations of the LF, but the simplest and most common is
the two-plane parameterisation (2PP) [Levoy and Hanrahan, 1996]. With this parameterisation,
a ray of light is described by a set of coordinates φ = [s, t, u, v]ᵀ, where ᵀ denotes the vector
transpose; these are the ray's points of intersection with two parallel reference planes separated
by an arbitrary distance D. The two reference planes are denoted by (s, t) and (u, v). By
convention, the (s, t) plane is closest to the camera and the (u, v) plane is closer to the scene [Gu
et al., 1997], shown in Fig. 2.10.
Figure 2.10: The two-plane parameterisation (2PP) of the 4D LF. Shown here is the relative
parameterisation, where u, v are defined relative to s, t between the two planes, separated by
distance D. From a Lambertian point P , a light ray φ passes through both planes, and can be
represented by the four coordinates from the two planes, s, t, u and v.
In the relative parameterisation, u and v are expressed relative to s and t, respectively. In the
absolute parameterisation, u and v are expressed in absolute coordinates. We note that all four
dimensions are required to define position and direction. It is a matter of convention to discuss
which plane defines position or direction. For the purposes of this work, we choose (s, t) as
spatial (position) and (u, v) as angular (direction) dimensions, respectively. In this sense, s, t
fix a ray’s position and u, v fix its direction. A convenient way to interpret the 2PP is as an array
of cameras with parallel optical axes and orthogonal baselines, as illustrated in Fig. 1.3. The
camera apertures are on the s, t plane facing the u, v plane. The s, t plane can be thought of
as a collection of all the viewpoints available within the LF camera. If the separation distance
D is chosen to be the focal length f of the cameras, then (u, v) correspond to the image plane
coordinates of the physical camera sensor.
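Under these conventions, the mapping from a ray to its 2PP coordinates can be sketched as follows. The helper `ray_to_2pp` is illustrative, assuming the (s, t) plane at z = 0 and the (u, v) plane at z = D.

```python
import numpy as np

def ray_to_2pp(point, direction, D=1.0):
    """Intersect a ray with the (s, t) plane at z = 0 and the (u, v)
    plane at z = D, returning phi = [s, t, u, v] in the *relative*
    parameterisation, where (u, v) are offsets from (s, t)."""
    point = np.asarray(point, dtype=float)
    direction = np.asarray(direction, dtype=float)
    # Intersection with the (s, t) plane (z = 0).
    k0 = -point[2] / direction[2]
    s, t = (point + k0 * direction)[:2]
    # Intersection with the (u, v) plane (z = D).
    k1 = (D - point[2]) / direction[2]
    u_abs, v_abs = (point + k1 * direction)[:2]
    # Relative parameterisation: offsets from (s, t).
    return np.array([s, t, u_abs - s, v_abs - t])

# A ray through the (s, t) origin with direction [0.2, 0.1, 1]:
phi = ray_to_2pp([0.0, 0.0, 0.0], [0.2, 0.1, 1.0], D=1.0)
# phi = [0, 0, 0.2, 0.1]: in the relative form, u and v encode the
# ray's direction scaled by the plane separation D.
```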
Therefore, we can consider the 4D LF as a 2D array of 2D images, as shown in Fig. 2.11. In the
literature, these 2D images are sometimes referred to as sub-views, sub-images, or sub-aperture
images in the LF. Each view looks at the same scene, but from a slightly shifted viewpoint. The
key intuition is that, in comparison to other robotic vision sensors such as monocular, stereo
and RGB-D cameras, LFs capture the behaviour of refractive objects across these multiple
views far more efficiently, which we can then exploit.
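The array-of-images structure can be made concrete with a small sketch; the 4D array below holds random data with illustrative dimensions, standing in for a decoded LF.

```python
import numpy as np

# Illustrative 4D LF: 17 x 17 views of 64 x 64 pixel greyscale images,
# indexed L[s, t, u, v] as in the text.
n_views, height, width = 17, 64, 64
L = np.random.rand(n_views, n_views, height, width)

# The central view (s = t = 8 for a 17 x 17 array) is one conventional
# 2D image; neighbouring views see the same scene slightly shifted.
central = L[n_views // 2, n_views // 2]        # shape (64, 64)
neighbour = L[n_views // 2, n_views // 2 + 1]  # one viewpoint to the side
```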
There are alternate parameterisations to characterise the LF, such as the spherical-Cartesian
parameterisation [Neumann and Fermuller, 2003]. This method describes the ray position based
on its point of intersection with a plane and its direction with two angles, which also yields a
4D LF. The advantage of this parameterisation is that it can describe all rays passing through
the plane in all directions, and may be well-suited for the design of wide-FOV cameras. Although the
2PP cannot describe rays that pass parallel to the reference planes, the 2PP is most common
because of its simplicity and the parameterisation is easily transferable to traditional camera
design and robotic vision [Chan, 2014]. A possible solution to this limitation is to use multiple
2PPs, oriented perpendicular to each other.
Figure 2.11: Example 4D LF as a 2D array of 2D images for a refractive sphere amongst a pile
of cards. Here, only 3 × 3 images are shown, while the actual light-field is a 17 × 17 array of
2D images. The views are indexed by s and t. Typically, we refer to the view for s0 = 0 and
t0 = 0 as the central view of the LF camera. The pixels within each view are indexed by their
u, v image coordinates. Therefore, a single light ray emanating from the scene can be indexed by
four numbers, s, t, u and v. Light field courtesy of the New Stanford Light Field Archive.
2.5.4 Light-Field Camera Architectures
Light-field cameras capture multiple views of the same scene from slightly different viewpoints
in a dense and regularly-sampled manner. The most common LF camera architectures are the
light-field gantry, the camera array and the plenoptic camera, shown in Fig. 2.12.
2.5.4.1 Light-Field Camera Gantry
The camera gantry captures the LF using a single camera, moving it to different positions over
time. Thus the positions of the camera map to s, t and the image coordinates of each 2D image
map to u, v. Yamamoto was one of the earliest to consider camera gantries for 3D
reconstruction [Yamamoto, 1986], while Levoy and Hanrahan were among the first to consider
computer-assisted camera gantries for recording light fields [Levoy and Hanrahan, 1996]. The
camera gantry used to help digitize ten statues by Michelangelo is shown in Fig. 2.12a [Levoy
et al., 2000]. The camera gantry can offer much finer angular resolution in the LF than camera
arrays, because camera positioning is only limited by the mechanical precision of its actuators,
while the spatial sampling interval in a camera array is limited by the physical size of the
cameras. Additionally, there is only one camera to calibrate. However, there are high precision
requirements for camera placement and in particular, the LF is not captured within a single shot.
This usually limits the camera gantry to static scenes.
Figure 2.12: Different light-field camera architectures, (a) a camera gantry [Levoy et al., 2000],
(b) a camera array [Wilburn et al., 2005], and (c) a lenslet-based camera [Ng et al., 2005]. These
architectures all capture 4D LFs.
2.5.4.2 Light-Field Camera Array
The camera array is probably the most easily understood architecture for light-field cameras.
The array uses multiple cameras arranged in a grid to capture the LF. 2D images are collected
in an array, which maps straightforwardly to a 4D LF, with camera position giving s, t and pixel
position in each camera image giving u, v. A typical configuration is to arrange the cameras on a plane
with regular spacing. This architecture was first developed by Wilburn et al. [Wilburn et al.,
2005], shown in Fig. 2.12b. Camera arrays do not require special optics like plenoptic cameras;
however, there are synchronization, bandwidth, calibration and image correction challenges to
contend with. The discrete nature of the image capture can also cause aliasing artefacts in the
rendered images. Camera arrays have been historically physically large, requiring several dis-
crete sensors, although this also allows for relatively large baselines in comparison to plenoptic
cameras.
Additionally, arrays of cameras can be created virtually. A single monocular camera pointed at
an array of mirrors has been used to capture LFs [Fuchs et al., 2013, Song et al., 2015]. This
LF camera design trades off mass, bandwidth and synchronization issues for a different set
of calibration issues and a limited FOV, depending on the design. In Chapter 4, we use an array
of planar mirrors to create a virtual array of cameras to collect LFs for visual servoing. The
use of an additional array of small lenses to create virtual camera arrays leads the discussion to
plenoptic cameras.
2.5.4.3 Lenslet-based LF Camera
The lenslet-based LF camera, which is sometimes referred to as a plenoptic camera, is a type
of light-field camera that has an array of micro-lenses, often referred to as lenslets, mounted
between the main lens and the image sensor, which split the image from the main aperture into
smaller components, based on the incoming direction of the light rays, as shown in Fig. 2.13.
Lippman first proposed to use microlenses to create crude integral photographs in 1908 [Lipp-
mann, 1908]. It was not until 1992 that Adelson and Wang placed the microlenses at the focal
plane of the camera’s main lens [Adelson and Wang, 1992]. Ng et al. [Ng et al., 2005] designed
and commercialized the “standard plenoptic camera” design of the lenslet-based LF camera,
making it hand-held and accessible to a large user base.
In the standard plenoptic camera, the main lens focuses the scene onto the lenslet array and the
lenslet array is focused at infinity, one lenslet focal length from the sensor. Fig. 2.13b shows how the angular components of the
incoming light rays are divided by the lenslets. Each pixel underneath each lenslet corresponds
to part of the image from a particular direction. This arrangement results in a virtual camera
array in front of the main lens. For a camera with N ×N pixels underneath each lenslet, there
are N ×N virtual cameras. This yields a series of lenslet images, as in Fig. 2.14, which can be
decoded into discrete sub-views to obtain the 4D LF structure previously discussed [Dansereau
et al., 2013].
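Under idealized assumptions (a square lenslet grid perfectly aligned with the pixel grid, with no lenslet rotation or hexagonal packing, unlike the real cameras whose decoding is treated in [Dansereau et al., 2013]), this decoding step reduces to an array reshape, sketched below.

```python
import numpy as np

def decode_lenslet(raw, n):
    """Decode an idealized raw lenslet image into a 4D LF.

    Assumes an n x n block of pixels under each lenslet, perfectly
    aligned with the pixel grid. The within-lenslet pixel index selects
    the view (ray direction); the lenslet index gives the pixel within
    that view. Returns an array of shape (n, n, H // n, W // n), where
    the first two indices select the view.
    """
    H, W = raw.shape
    assert H % n == 0 and W % n == 0
    # Split into lenslet blocks, then gather the same within-lenslet
    # pixel across all lenslets to form each sub-view.
    blocks = raw.reshape(H // n, n, W // n, n)
    return blocks.transpose(1, 3, 0, 2)

# Illustrative sizes: 13 x 13 pixels under each of 100 x 100 lenslets.
raw = np.random.rand(13 * 100, 13 * 100)
lf = decode_lenslet(raw, 13)              # shape (13, 13, 100, 100)
central_view = lf[6, 6]                   # 100 x 100 central sub-image
```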
One of the drawbacks of the standard plenoptic camera is that the final resolution of each de-
coded sub-view is limited by the number of lenslets. Georgiev and Lumsdaine developed the fo-
cused plenoptic camera, also known as the plenoptic camera 2.0, which also places the lenslets
behind the main lens, but the main lens focuses the scene inside the camera before the light
reaches the lenslets [Lumsdaine and Georgiev, 2008]. The focused plenoptic camera displays a
focused sub-image on the sensor, allowing for higher spatial resolution at the cost of angular
resolution; equivalently, there are fewer s, t samples in exchange for more u, v samples. Although
the lower angular resolution can produce undesirable aliasing artefacts, the key contribution of this camera design was
to decouple the trade-off between the number of lenslets and the achievable resolution [Lums-
daine and Georgiev, 2009]. Commercial cameras that utilize the plenoptic camera 2.0 design
include the Raytrix [Perwass and Wietzke, 2012].
Figure 2.13: (a) For a conventional monocular camera, light rays from a point source are all
integrated over all the directions that pass through the main aperture into a single pixel value,
such that the pixel’s value depends only on pixel position. (b) For a lenslet-based LF camera, a
microlens array is placed in front of the sensor, such that pixel values depend on pixel position
as well as incoming ray angle. Decoded sub-images equivalent to that of an LF camera array can
be obtained by combining pixels from similar ray directions behind each microlens (or lenslet).
2.5.4.4 Light-Field Cameras vs Stereo & Multi-Camera Systems
The difference between LF cameras and general multi-camera and stereo systems is not
apparent at first glance. Stereo systems, multi-camera systems and LF camera arrays all typically
use multiple cameras to capture multiple views of the scene. The main difference is the level of
sampling of the plenoptic function. Stereo only samples the plenoptic function twice along one
direction with a fixed baseline. This means stereo can only measure motion parallax (and thus
depth) along the direction of its fixed baseline. Multi-camera systems and LF cameras sample
the plenoptic function from multiple viewpoints, and so have both small and long baselines,
as well as baselines in multiple directions (typically vertically and horizontally). This yields
depth measurements with more redundancy and thus more reliability. However, the density and
uniformity of sampling the plenoptic function matters. Multi-camera systems are not limited to
physical camera configurations where each camera is aimed at the same scene in a regular and
tightly-spaced manner.
On the other hand, LF cameras sample densely and uniformly. This simplifies the processing in
the same way that uniformly sampled signals are easier to process than non-uniformly sampled
signals. For example, consider 2D imaging devices: non-uniform 2D imaging devices are
extremely rare. A few designs have been proposed, such as the foveated vision sensor, where the
Figure 2.14: (a) A raw plenoptic image of a climber’s helmet captured using a Lytro Illum. This
cropped section consists roughly of 100×100 lenslets. (b) Zoomed in on the raw plenoptic im-
age, each lenslet is visible. Each pixel in the lenslet image (roughly 13×13 pixels) corresponds
to the directional component of a measured light ray. (c) A decoded 100×100 pixel sub-image
from the light-field: the central view of the 4D light-field, comprised roughly of the
central pixel from each lenslet image across the entire raw image. There are 13×13 decoded
sub-images in this 4D light-field.
pixel density is varied similar to the non-uniform distribution of cones in the human eye [Yeasin
and Sharma, 2005]. However, such designs are not common in industrial applications or the
consumer marketplace. The dominant 2D imaging devices use a rectangular, uniform distribu-
tion of pixels, which are much simpler to manufacture and process algorithmically. Therefore,
LF cameras can be considered a specific class of multi-camera systems that exploit the camera
geometry to simplify the image processing.
In particular, the dense and regular sampling of LF cameras motivates their use for visual ser-
voing and dealing with refractive objects. As we will show in Chapter 4, LF cameras can be used
for visual servoing towards small and distant targets in Lambertian scenes and enable better
performance in occluded scenes. Later in Chapter 5, we show that capturing these slightly differ-
ent views is sufficient to differentiate changes in texture due to camera motion from distortion
caused by refractive objects. Finally, in Chapter 6, we show that LF cameras can be used to servo towards
refractive objects.
2.6 4D Light-Field Visualization
Visualizing the data is an important part of understanding the problem. While visualizing 2D
and 3D data has become common in modern robotics research, visualizing 4D data is signif-
icantly less intuitive. In order to examine the characteristics of the 4D LF, the conventional
approach is to slice the LF into 2D images. For example, a u, v slice of the LF fixes s and t to
depict the LF as u varies with respect to v. Recalling the 2PP in Fig. 2.10, it is clear that this
2D slice is analogous to viewpoint selection and corresponds to what is captured by a single
camera in a camera array. Nine different examples of u, v slices depicting the 4D LF as a 3× 3
grid of 2D images are shown in Fig. 2.15 for different values of s and t, although the actual LF
is comprised of 17× 17 2D images.
Further insight can be gained from the LF by considering different pairings of dimensions from
the LF. Consider the horizontal s, u slice, shown at the top of Fig. 2.15. This 2D image is taken
by stacking the rows of image pixels (all the u) from the highlighted yellow, red and green lines
(all the s), while holding t and v constant. Similarly, the vertical t, v slice is taken by stacking
all of the columns of pixels (all the v) from the highlighted turquoise, blue and purple lines (all
the t), while holding s and u constant, shown on the right side of Fig. 2.15.
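Because the 4D LF is stored as a regular array, these slices are simply fixed-index array slices; the sketch below uses illustrative dimensions.

```python
import numpy as np

# Illustrative LF indexed L[s, t, u, v]: 17 x 17 views of 64 x 48 pixels.
L = np.random.rand(17, 17, 64, 48)

t0, v0 = 8, 24             # hold t and v fixed at the central row
epi_su = L[:, t0, :, v0]   # horizontal s, u EPI: shape (17, 64)

s0, u0 = 8, 32             # hold s and u fixed at the central column
epi_tv = L[s0, :, u0, :]   # vertical t, v EPI: shape (17, 48)
```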
Visualizing slices using this stacking approach is only meaningful due to the uniform and dense
sampling of the LF. This method was employed by Bolles et al. [Bolles et al., 1987] for a single
monocular camera undergoing linear translation, capturing images with a uniform and dense
sampling. Their volume of light was 3D and they referred to the 2D slices of light as epipolar-
plane images (EPIs). They were able to simplify the image feature correspondence problem from
performing multiple normalized cross-correlation searches across each image, to simply finding
lines in the EPIs. Furthermore, for Lambertian scenes, these lines are characteristic straight
lines with slopes, which, as discussed in Section 2.7, reflect depth in the scene. However, as we
will show in Ch. 5, these lines can be distorted into nonlinear curves by refractive objects, which
can be exploited for refractive object detection.
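The link between EPI slope and depth can be previewed with a toy example. The pinhole parallax model and all values below are illustrative; the precise slope-depth relationship is developed in Section 2.7.

```python
import numpy as np

# Synthetic EPI trace of one Lambertian point: as the view index s
# advances by a baseline of b metres per view, the point's u coordinate
# shifts by the parallax f * b / Z (simple pinhole model).
f = 500.0        # focal length in pixels (illustrative)
b = 0.01         # baseline between adjacent views, metres (illustrative)
Z = 2.0          # depth of the point, metres (illustrative)

s = np.arange(17)                      # view indices
u = 100.0 + (f * b / Z) * s            # u position of the point per view

# Fitting a line to the (s, u) trace recovers the slope, and hence depth.
slope = np.polyfit(s, u, 1)[0]
depth = f * b / slope
```

A nearer point produces a steeper trace; a refractive object would bend this trace into a curve that no single depth can explain.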
Figure 2.15: Visualizing subsets of the 4D LF as 2D slices. The u, v slice can be seen as a
conventional image from a camera positioned at the s, t plane. In this figure, there are 3 × 3
u, v slices for different s, t depicting the 4D LF; the full LF is a 17 × 17 grid of images. The
s, u slice, illustrated by stacking the yellow/red/pink rows of image pixels of u with respect
to s, and the t, v slice, depicted by stacking the turquoise/blue/purple columns of image pixels
of v with respect to t, are sometimes referred to as EPIs. For Lambertian scenes, EPIs show
characteristic straight lines with slopes that reflect depth. However, these lines can be distorted
into nonlinear curves by refractive objects, such as the refractive sphere in the centre of these
images. LF courtesy of the New Stanford Light-Field Archive.
2.7 4D Light-Field Geometry
In this section, we discuss the geometry of the LF. We start by defining geometric primitives in
2D and follow their extensions to 3D and 4D. We then go into detail with the 2PP of the LF
and discuss the point-plane correspondence and the concept of slope and depth in the LF. We
show that a ray in 3D intersects the two planes of parameterisation twice, defined by two pairs
of image coordinates in s, t, u and v, and subsequently that a Lambertian point in 3D induces a
plane in the 4D LF. This theory serves as the basis for understanding the properties of the 4D LF,
which we exploit throughout this thesis for the purposes of visual servoing and discriminating
against refractive objects.
2.7.1 Geometric Primitive Definitions
First, we provide the definitions of several typical geometric primitives, including a point, a
line, a plane and a hyperplane. The definitions of dimensions and manifolds are also included
for clarity.
• Dimension: The definition of dimension, or dimensionality, varies somewhat across
mathematics. The dimension of an object is often thought of as the minimum number
of coordinates needed to specify any point within the object. More formally, dimension is
defined in linear algebra as the cardinal number of a maximal linearly independent subset
for a vector space over a field, i.e. the number of vectors in its basis.
• Degree(s) of Freedom: (DOF) The number of degrees of freedom in a problem is the
number of parameters which may be independently varied. Informally, degrees of free-
dom are independent ways of moving, while dimensions are independent extents of space.
Thus, a rigid three-dimensional object can have zero DOF if it is not allowed to change
its pose, six DOF if it is allowed to translate and rotate freely, or any combination of
translation and rotation in between.
• Point: A point is a 0-DOF object that can be specified in n-dimensions as an n-tuple of
coordinates. For example, a 2D point is defined as (x, y), a 3D point as (x, y, z), and
a 4D point as (x, y, z, w), which can be described by a minimum of 2, 3 and 4 param-
eters, respectively. Points are synonymous with coordinate vectors. Basic structures of
geometry (e.g. lines, planes) are built from an infinite number of points in a particular
arrangement. One might go as far as to say life without geometry is pointless.
• Manifold: A manifold is a topological space that is locally Euclidean: around every
point there is a neighbourhood that is topologically equivalent to an open region of Euclidean space.
• Line: A line is a 1-DOF object that has no thickness and extends uniformly and infinitely
in both directions. A line is a specific case of a 1D manifold. Informally, a line extends
in both directions with no wiggles.
• Plane: A plane is a 2-DOF object that is spanned by two linearly independent vectors. A
plane is a specific case of a 2D manifold.
• Hyperplane: In an n-dimensional space, a hyperplane is any vector subspace that has
n− 1 dimensions [Weisstein, 2017]. For example, in 1D, a hyperplane is a point. In 2D,
a hyperplane is a line. In 3D, a hyperplane is a plane. In 4D, the hyperplane has 3 DOF
and the standard form of a hyperplane is given as
ax + by + cz + dw + e = 0. (2.10)
In n dimensions, for a space X = [x1, x2, · · · , xn], xi ∈ R, let a1, a2, . . . , an be scalars
not all equal to 0. Then the hyperplane in Rⁿ is given as

a1x1 + a2x2 + . . . + anxn = c, (2.11)

where c is a constant. There are n + 1 parameters, but we can divide through by one of
the nonzero coefficients for a minimum of n parameters to describe the hyperplane in nD.
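As a numerical sanity check on this parameter count, a single linear constraint in 4D leaves a three-dimensional solution set; the coefficients below are arbitrary illustrative values.

```python
import numpy as np

# The 4D hyperplane ax + by + cz + dw + e = 0 is one linear constraint
# on (x, y, z, w). Its solution set is a translate of the null space of
# the 1 x 4 coefficient matrix, which has dimension 4 - 1 = 3: 3 DOF.
A = np.array([[1.0, 2.0, -3.0, 0.5]])   # illustrative (a, b, c, d)

rank = np.linalg.matrix_rank(A)
dof = A.shape[1] - rank                  # dimension of the null space
# dof == 3: a hyperplane in 4D has 3 degrees of freedom.
```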
2.7.2 From 2D to 4D
In this section, we describe the geometry of primitives in increasing dimension and discuss the
minimum number of parameters each primitive can be described by in the different dimensions.
The parameters and their primitives are categorized in Table 2.1. In the rest of the section, we
explain why each primitive requires a certain number of parameters to be described. Note that
the minimum number of parameters to fully describe a primitive is different from its DOF. For
example, a point has 0 DOF. In 2D, a point requires a minimum of two parameters, but in 4D, a
point requires four parameters to describe. We also discuss the equations used to describe these
geometric primitives.
Table 2.1: Minimum Number of Parameters to
Describe Geometric Primitives from 2D to 4D

Primitive     2D   3D   4D
Point          2    3    4
Line           2    4    6
Plane          —    3    6
Hyperplane     2    3    4
2.7.2.1 2 Dimensions
A Point in 2D In 2D, the space is defined by x and y. A 2D point is defined with two
equations
x = a, y = b, (2.12)
where a, b ∈ R. Thus a point in 2D requires a minimum of two parameters to be fully described.
A Line in 2D A line in 2D has the standard form of
ax+ by + c = 0, (2.13)
where a, b, c ∈ R are three parameters. We can re-write (2.13) as

(a/c)x + (b/c)y = −1, (2.14)

which has two free parameters if we consider a/c and b/c to be two parameters. Thus a line in
2D requires a minimum of two parameters.
Intersection of 2D Hyperplanes We note that a line in 2D is a hyperplane. Consider the
two lines,
ax+ by + c = 0, (2.15)
and
dx+ ey + f = 0, (2.16)
where a, b, c, d, e, f ∈ R. Thus, we can describe a 2D point by the intersection of two lines:

[ a  b ] [ x ]   [ −c ]
[ d  e ] [ y ] = [ −f ] . (2.17)
This is a 2 × 2 system of equations for the intersection of two 2D hyperplanes. Provided
that these two lines are neither coincident nor parallel, we can solve this system of equations to
yield a 2D point. We will refer back to this observation as we journey through the intersection
of three 3D hyperplanes, and four 4D hyperplanes.
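This 2 × 2 solve can be sketched in code. The following minimal Python sketch (the function name is ours, not from the thesis) applies Cramer's rule to (2.17):

```python
def intersect_2d_lines(a, b, c, d, e, f):
    """Solve ax + by + c = 0 and dx + ey + f = 0 via Cramer's rule.

    Returns the 2D intersection point (x, y), or None if the lines
    are parallel or coincident (singular system).
    """
    det = a * e - b * d
    if abs(det) < 1e-12:
        return None  # no unique intersection
    x = (-c * e + b * f) / det
    y = (-a * f + c * d) / det
    return x, y
```

For example, the lines x + y − 2 = 0 and x − y = 0 intersect at the 2D point (1, 1), while two parallel lines return no unique solution.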
48 2.7. 4D LIGHT-FIELD GEOMETRY
2.7.2.2 3 Dimensions
A point in 3D A point in 3D is defined by three equations as
x = a, y = b, z = c, (2.18)
where a, b, c ∈ R. Thus a point in 3D requires a minimum of three parameters to be completely
described.
A Line in 3D As illustrated in Fig. 2.16, a line in 3D can be described by two points p1 and
p2
x = p1 + (p2 − p1)k, (2.19)
where x = [x, y, z] ∈ R3, p1, p2 ∈ R3 and k ∈ R. With p1 and p2, we have six parameters
to describe the line; however, these parameters are not independent. Since the line is one
dimensional, we can imagine sliding either p1 or p2 along the line and retaining the line’s def-
inition. There are infinitely many pairs of 3D points along the line that can describe the line.
Thus for each point, we can hold one of its three coordinates constant and still describe the
same line without any loss of generality. Therefore, a line in 3D can be described by a minimum
of four parameters.
We can also describe a line in 3D as a point p1 and a direction (vector) r. In this case, a similar
argument holds: both the point and direction can be reduced to two parameters each, yielding a
total of four parameters.
Plücker coordinates have also been used to specify a line in 3D. The line's direction (vector)
d can be computed from two points on the line, and p is a vector from the origin to a point
on the line. The cross product p × d is independent of the chosen point, and uniquely
defines the line. Plücker coordinates are defined as the line's
direction vector d together with the cross product, given by
(d;p× d), (2.20)
where d is normalised to unit length and p × d is the cross product, often known as the
‘moment’, computed from an arbitrary point p on the line. The unit-length constraint on d and
the orthogonality of d and the moment remove two of the six degrees of freedom. Thus, even
with Plücker coordinates, four parameters are required to describe a line in 3D.
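The independence of the moment from the chosen point can be checked numerically. A small Python sketch, with illustrative function names of our own:

```python
def cross(u, v):
    """Cross product of two 3-vectors."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def plucker_from_points(p1, p2):
    """Plücker coordinates (d, m) of the 3D line through p1 and p2:
    d is the unit direction and m = p x d is the moment, which is the
    same for any choice of point p on the line."""
    d = [b - a for a, b in zip(p1, p2)]
    n = sum(c * c for c in d) ** 0.5
    d = [c / n for c in d]
    return d, cross(p1, d)
```

Any two distinct point pairs on the same line yield the same (d, m), illustrating that the six Plücker numbers over-parameterise the four-DOF line.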
Figure 2.16: Describing a line in 3D with (a) two points, (b) a point and a vector, and (c) the
intersection of two planes.
A Plane in 3D The standard form for a plane in 3D is
ax+ by + cz + d = 0, (2.21)
where a, b, c, d ∈ R. Similar to a line in 2D (2.14), we can describe the plane in 3D with
a minimum of three parameters: the direction of the plane's normal can be described by two
parameters, and a third parameter gives the plane's distance from the origin.
Intersection of 3D Hyperplanes We also note that a hyperplane in 3D is a 2D plane. As
with the 2D case, consider two hyperplanes in 3D,
ax+ by + cz + d = 0, (2.22)
and
ex+ fy + gz + h = 0, (2.23)
where a, b, c, d, e, f, g, h ∈ R. We can then describe the intersection of these two hyperplanes
in 3D as a 2 × 3 system of equations,
\begin{bmatrix} a & b & c \\ e & f & g \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} -d \\ -h \end{bmatrix}. (2.24)
We can row-reduce this system of equations to
\begin{bmatrix} 1 & 0 & \frac{c}{a} - \frac{b}{a}\left(\frac{ga - ce}{fa - be}\right) \\ 0 & 1 & \frac{ga - ce}{fa - be} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} -\frac{d}{a} + \frac{b}{a}\left(\frac{ha - de}{fa - be}\right) \\ -\frac{ha - de}{fa - be} \end{bmatrix}, (2.25)

assuming a ≠ 0 and fa − be ≠ 0.
If we let
\alpha = \frac{c}{a} - \frac{b}{a}\left(\frac{ga - ce}{fa - be}\right), (2.26)

\beta = -\frac{d}{a} + \frac{b}{a}\left(\frac{ha - de}{fa - be}\right), (2.27)

\gamma = \frac{ga - ce}{fa - be}, (2.28)

\eta = -\frac{ha - de}{fa - be}, (2.29)
then we can rewrite (2.25) as
\begin{bmatrix} 1 & 0 & \alpha \\ 0 & 1 & \gamma \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} \beta \\ \eta \end{bmatrix}. (2.30)
Clearly, the intersection of two non-parallel, non-coincident planes in 3D describes a line in 3D,
which depends on a minimum of four parameters. Fig. 2.16c shows the intersection of two such
planes in R3, Π1 and Π2, forming a line in 3D.
Another way to consider the minimum number of parameters for a line in 3D is a counting
argument: from (2.21), each plane is described by three parameters, so Π1 and Π2 together have
six. However, the same line is produced by a two-parameter family of plane pairs, since either
plane may be rotated about the line without changing the intersection. Thus, six minus two
yields the four parameters required to describe a line in 3D.
Additionally, similar to the 2D case in Section 2.7.2.1, we can describe the intersection of
three 3D hyperplanes as a 3 × 3 system of equations, whose solution is a 3D point,
\begin{bmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} -d_1 \\ -d_2 \\ -d_3 \end{bmatrix}, (2.31)
where the subscripts index the three hyperplanes. In other words, three planes whose normals
are linearly independent intersect at a point in 3D. In Section 2.7.2.3, we will show that the intersection of two
hyperplanes in 4D describes a plane in 4D, and that the intersection of four hyperplanes in 4D
forms a 4D point.
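The 3 × 3 solve in (2.31) can be illustrated with a short pure-Python sketch using Cramer's rule; writing each plane as ax + by + cz + d = 0 puts −d on the right-hand side. The function names below are ours, not from the thesis:

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def intersect_3_planes(planes):
    """Each plane is (a, b, c, d) with ax + by + cz + d = 0.
    Returns the common 3D point, or None when the normals are not
    linearly independent (no unique intersection)."""
    A = [list(p[:3]) for p in planes]
    rhs = [-p[3] for p in planes]
    D = det3(A)
    if abs(D) < 1e-12:
        return None
    point = []
    for col in range(3):  # Cramer's rule, one coordinate per column
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][col] = rhs[r]
        point.append(det3(Ai) / D)
    return point
```

For instance, the planes x = 1, y = 2 and z = 3 intersect at the 3D point (1, 2, 3).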
2.7.2.3 4 Dimensions
The journey to the fourth dimension can be intimidating and shrouded in mystery due to the
limitations of our perception and understanding. If we can only perceive the world in 3 spatial
dimensions, how can we understand a fourth spatial dimension? In physics and mathematics,
4D geometry is often discussed in terms of a fourth spatial dimension. The axes x, y and z form
a basis for R3; a fourth spatial axis w is orthogonal to the other three. Such discussions lead to
speculation about the limits of human perception and
the 4D equivalent of a cube, known as a tesseract [Hinton, 1884]. Fortunately in this thesis,
we are not concerned with a 4th spatial dimension, but rather 4 dimensions with respect to the
sampling of light, as per the plenoptic function in Section 2.5.1. Our four LF dimensions
s, t, u and v differ slightly from the four spatial dimensions x, y, z and w, in that s, t, u and v
are constrained via the 2PP. However, much of the geometry of four spatial dimensions carries
over to dealing with the 4D LF. In this section, we illustrate the geometric primitives in 4D with
respect to the 2PP for light fields. We illustrate these primitives as projections on a grid of 2D
images, similar to how they would appear in the LF.
A Point in 4D A point in 4D can be described by four equations as
x = a, y = b, z = c, w = d, (2.32)
where a, b, c, d ∈ R. A point in 4D requires a minimum of four parameters to be completely
described. Examples of two 4D points are shown in Fig. 2.17. In the 2PP, a point in 4D
describes a ray in 3D. However, not all rays can be represented by the 2PP, because the 2PP
cannot describe rays that are parallel to the two planes.
A Line in 4D Similar to the 3D case, a line in 4D can be written as a function of two 4D
points, which require eight parameters in total. There are an infinite number of pairs of 4D
Figure 2.17: The projection of two different points in 4D using the 2PP, shown in red. The 2PP
is illustrated as a grid of squares. Each square is considered to be a view. Each view has its own
set of coordinates that describe a location within the view. Both (a) and (b) show 4D points,
defined for specific values of s, t, u and v. Note that s and t values correspond to which view,
while u and v correspond to a specific view’s coordinates (similar to image coordinates). In our
case, a single 4D point must be defined by its view and its coordinates within the view; hence,
a minimum of 4 parameters to describe a 4D point.
points that can describe the 4D line. Each of these points can be “fixed” in the same manner as
a line in 3D (Section 2.7.2.2), reducing the minimum number of parameters to six to describe a
line in 4D. Several examples of 4D lines are shown in Fig. 2.18.
Figure 2.18: The projection of four different lines in 4D using the 2PP. A 4D line still has one
DOF. (a) t, u and v are held constant, while s is allowed to vary. (b) s, u, and v are held constant
while t is allowed to vary. (c) s, t and u are held constant, while v is allowed to vary. (d) s and
t are held constant, while u and v vary linearly.
A Hyperplane in 4D A hyperplane in 4D is given as
ax+ by + cz + dw + e = 0, (2.33)
where a, b, c, d, e ∈ R. In the 2PP, we can write as + bt + cu + dv + e = 0. Similar to
the 3D case, this equation can be divided by e, yielding a minimum of four parameters to
describe the hyperplane in 4D. Alternatively, the hyperplane in 4D can be described by its
normal n = [a, b, c, d] and a distance to the origin. Four examples of 4D hyperplanes are shown
in Fig. 2.19.
Figure 2.19: The projection of four different hyperplanes in 4D using the 2PP. A hyperplane
is only constrained along 1 dimension. (a) The hyperplane is constrained along u, such that
cu+e = 0. (b) The hyperplane is constrained along v, such that dv+e = 0. (c) The hyperplane
is constrained along u and v through a linear relation, such that cu + dv + e = 0. (d) The
hyperplane is constrained along s, such that as+ e = 0.
A Plane in 4D & Intersection of 4D Hyperplanes Similar to a line in 3D, which can be
represented by the intersection of two 3D planes, a plane in 4D can be represented by the
intersection of two hyperplanes in 4D that have unique normals. With unique normals, the
hyperplanes are not parallel and neither hyperplane is entirely contained within the other.
Mathematically, let us assume we have two 4D hyperplanes, given as
ax+ by + cz + dw + e = 0, (2.34)
and
fx+ gy + hz + iw + j = 0, (2.35)
where a, b, c, d, e, f, g, h, i, j ∈ R. From (2.34), assuming d ≠ 0, we can isolate w as

w = -\frac{1}{d}(e + ax + by + cz), (2.36)

and substitute this expression into (2.35) as

fx + gy + hz + i\left(-\frac{1}{d}(e + ax + by + cz)\right) + j = 0, (2.37)
which can be simplified to
\left(f - \frac{ia}{d}\right)x + \left(g - \frac{ib}{d}\right)y + \left(h - \frac{ic}{d}\right)z + \left(j - \frac{ie}{d}\right) = 0. (2.38)
This equation matches the standard form of a plane in 3D, given in (2.21). From this, it is clear
that the intersection of two hyperplanes in 4D forms a plane in 4D. Each hyperplane can be
described using four parameters, to a total of eight. In Section 5.2, we show equivalently that
the intersection of two 4D hyperplanes can be described with two equations in (5.5).
We note that (2.38) appears to imply that we can describe a plane in 4D with just four numbers;
however, each coefficient in (2.38) is itself an expression of the hyperplane parameters, subject
to two constraints: d ≠ 0, and at least two of the coefficients in front of x, y and z must be
non-zero. We can further illustrate this relation by drawing two hyperplanes in the 2PP, as in
Fig. 2.20. Two different hyperplanes are pictured in green and purple. Their intersections,
highlighted in red, represent the plane in the 4D LF.
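The elimination leading to (2.38) is easy to verify numerically. In this illustrative Python sketch (the function name and the example coefficients are ours), we reduce two 4D hyperplanes and check that a point lying on both satisfies the resulting plane equation:

```python
def reduce_two_hyperplanes(h1, h2):
    """Eliminate w from two 4D hyperplanes ax + by + cz + dw + e = 0 and
    fx + gy + hz + iw + j = 0 (requires d != 0), returning the coefficients
    (A, B, C, E) of the plane A x + B y + C z + E = 0 as in (2.38)."""
    a, b, c, d, e = h1
    f, g, h, i, j = h2
    return (f - i * a / d, g - i * b / d, h - i * c / d, j - i * e / d)
```

For example, for x + w − 2 = 0 and y + w − 3 = 0, the point (x, y, z, w) = (1, 2, 5, 1) lies on both hyperplanes, and the reduced plane −x + y − 1 = 0 is satisfied at (1, 2, 5) for any z.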
Additionally, we can show that four hyperplanes in 4D intersect at a point. In 2D, the
intersection of two 2D hyperplanes resulted in a 2 × 2 system of equations, which could be
solved for a 2D point. In 3D, the intersection of three 3D hyperplanes resulted in a 3 × 3
system of equations, which could be solved for a 3D point. In 4D, the intersection of four 4D
Figure 2.20: The projection of two different planes in 4D using the 2PP. In both (a) and (b), the
two hyperplanes are shown in green and purple. Their intersection in red represents the plane
in the 4D LF.
hyperplanes results in a 4 × 4 system of equations, which can be solved for a 4D point,

\begin{bmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \\ a_3 & b_3 & c_3 & d_3 \\ a_4 & b_4 & c_4 & d_4 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix} = \begin{bmatrix} -e_1 \\ -e_2 \\ -e_3 \\ -e_4 \end{bmatrix}, (2.39)

where the subscripts index the four hyperplanes.
2.7.3 Point-Plane Correspondence
A particularly relevant question for robots striving to interact with a 3D world is how observations
in the LF translate to the 3D world. In this section, we further discuss the intersections
of hyperplanes in 4D to show that a point in 3D manifests itself as a plane in the 4D LF. This
manifestation was coined the point-plane correspondence in [Dansereau and Bruton, 2007],
although a similar relationship was determined for translating monocular cameras in [Bolles
et al., 1987].
Recall the relative two-plane parameterisation (2PP) [Levoy and Hanrahan, 1996]. A ray with
coordinates φ = [s, t, u, v] is described by its two points of intersection with two parallel reference
planes. The s, t plane is conventionally closest to the camera, and the u, v plane is conventionally
closer to the scene, separated by an arbitrary distance D. The rays emitted from a Lambertian
point in 3D space, P = [Px, Py, Pz]T, can be illustrated in the xz-plane, as shown in Fig. 2.21.
The same geometry can be shown in the su-plane in Fig. 2.22.
For the xz-plane, if we define θ as the angle between the ray and the z-axis, then by similar
triangles, we have

\tan\theta = \frac{P_x - u - s}{P_z - D} = \frac{P_x - s}{P_z}. (2.40)
Solving for u,

u = P_x - \left(\frac{P_x - s}{P_z}\right)(P_z - D) - s = \frac{D}{P_z}(P_x - s). (2.41)
We can also plot (2.41) to yield projections in the 2PP similar to Fig. 2.19a. Plotting (2.42)
yields projections similar to Fig. 2.19b.
We can follow a similar procedure for the yz-plane, resulting in
v = \frac{D}{P_z}(P_y - t). (2.42)
We can combine (2.41) and (2.42) into a single equation as
\begin{bmatrix} u \\ v \end{bmatrix} = \frac{D}{P_z} \begin{bmatrix} P_x - s \\ P_y - t \end{bmatrix}. (2.43)
We can recognize (2.43) as two hyperplanes in 4D whose intersection describes a plane in 4D, as
well as a point in 3D. Therefore, the light rays from a Lambertian point in 3D manifest as a
plane in the 4D LF.
Figure 2.21: Light-field geometry for a point in space for a single view (black), and other views
(grey), whereby u is defined relative to s and varies linearly with s for all rays originating from
P (Px, Pz).
We can re-write (2.43) into the form,
\begin{bmatrix} \frac{D}{P_z} & 0 & 1 & 0 \\ 0 & \frac{D}{P_z} & 0 & 1 \end{bmatrix} \begin{bmatrix} s \\ t \\ u \\ v \end{bmatrix} = \begin{bmatrix} \frac{D P_x}{P_z} \\ \frac{D P_y}{P_z} \end{bmatrix}. (2.44)
From (2.44), we note that the hyperplane normals depend only on Pz, and not on Px or Py. For
a Lambertian point, the two hyperplane normals are similar in that their elements take the
same values, but in different columns (such that the two normals remain linearly independent),
in su and tv respectively. Equation (2.43), and thus (2.44), map out the ray space (all rays)
emitted from point P.
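The point-plane correspondence of (2.43) and (2.44) can be checked numerically. In this illustrative Python sketch (the function name is ours), every sampled ray from a Lambertian point satisfies both hyperplane equations of (2.44):

```python
def lf_sample(P, D, s, t):
    """Ray coordinates (s, t, u, v) in the relative 2PP for the ray from a
    Lambertian point P = (Px, Py, Pz), Pz != 0, observed from view position
    (s, t) with reference-plane separation D, following (2.43)."""
    Px, Py, Pz = P
    u = (D / Pz) * (Px - s)
    v = (D / Pz) * (Py - t)
    return s, t, u, v
```

Sampling several view positions (s, t) and checking (D/Pz)s + u = DPx/Pz and (D/Pz)t + v = DPy/Pz confirms that all rays from P lie on the 4D plane.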
2.7.4 Light-Field Slope
In 2D, a line’s direction and steepness, i.e. its rate of change of one coordinate with respect
to the other coordinate, is referred to as the slope. In the 4D LF, if we consider two different
measurements from a Lambertian point P (Px, Py, Pz) as (s1, t1, u1, v1) and (s2, t2, u2, v2),
the difference between these two measurements for the xz-plane can be written as

u_2 - u_1 = \frac{D}{P_z}(P_x - s_2) - \frac{D}{P_z}(P_x - s_1), (2.45)

which simplifies to

u_2 - u_1 = -\frac{D}{P_z}(s_2 - s_1). (2.46)
We then refer to the rate of change of u with respect to s as the slope w, which is often
visualized in a 2D EPI slice of the LF, as in Fig. 2.15, and is given as

w = \frac{u_2 - u_1}{s_2 - s_1} = -\frac{D}{P_z}. (2.47)
We note that a similar procedure for the yz-plane yields an identical expression,

w = \frac{v_2 - v_1}{t_2 - t_1} = -\frac{D}{P_z}. (2.48)
The slope w relates the image plane coordinates for all rays emitting from a particular 3D point
in the scene. Fig. 2.21 shows the geometry of the LF for a single view of P . As the viewpoint
changes, that is, s and t change, the image plane coordinates vary linearly according to (2.43).
In Fig. 2.22, we show how u varies as a function of s, noting that v varies as a similar function
of t. The slope of this line, w, comes directly from (2.43), and is given by

w = -\frac{D}{P_z}. (2.49)
By working with slope, akin to disparity from stereo algorithms, we deal more closely with the
structure of the light field.
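The slope-depth relation of (2.47) and (2.49) can be sketched as follows; the function names are ours, not from the thesis:

```python
def slope_from_samples(s1, u1, s2, u2):
    """Light-field slope w = (u2 - u1) / (s2 - s1) between two EPI samples
    of the same Lambertian point, per (2.47)."""
    return (u2 - u1) / (s2 - s1)

def depth_from_slope(w, D):
    """Invert w = -D / Pz from (2.49) to recover the point's depth Pz."""
    return -D / w
```

For example, a point at Px = 3, Pz = 4 with D = 1 gives u = (3 − s)/4, so the samples (s, u) = (0, 0.75) and (2, 0.25) yield a slope of −0.25, from which Pz = 4 is recovered.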
Figure 2.22: For the situation illustrated in Fig. 2.21, the corresponding line in the s, u plane
has a slope w.

In this section, we explored geometric primitives such as points, lines, planes and hyperplanes
from 2D to 4D. We explained the number of parameters required to describe each primitive.
By describing the underlying mathematics behind these geometric primitives, we gain insight
into how light rays emitted from a Lambertian point are represented in the 2PP of the LF.
We showed that a single light ray can be described as a 4D point
in the LF. A Lambertian point induces a plane in the 4D LF, and a plane in the 4D LF can be
described by the intersection of two 4D hyperplanes. In future chapters, we will use these re-
lations to propose light-field features for visual servoing and detecting refracted image features
using an LF camera, and servoing towards refractive objects.
Chapter 3
Literature Review
In this chapter, we provide a review of the literature relevant to this thesis. First, we introduce
image features, from 2D to 4D. Second, we review visual servoing in the context of LF cameras
and refractive objects. Third, we investigate the state of the art for how refractive objects are
handled in robotics. Finally, we summarize the review by identifying the research gaps that this
thesis seeks to address.
3.1 Image Features
Features are distinct aspects of the scene that can be reliably and repeatedly identified from dif-
ferent viewpoints and/or across different viewing conditions. Image features are those features
recorded in the image as a set of pixels by the camera that can then be automatically detected
and extracted as a vector of numbers, which is referred to as an image feature vector. Image fea-
ture vectors abstract raw and dense image information into a simpler, smaller and more compact
representation of the relevant data. Much of the literature does not make a significant distinction
between these three concepts: features, image features and image feature vectors. Good image
features to track are those that can be repeatedly detected and matched across multiple images
[Shi and Tomasi, 1993]. There are typically
two aspects to finding an image feature vector: an image feature detector and an image feature
descriptor. For brevity, we refer to these as a detector and a descriptor, respectively. The detector
is a method of determining whether there is a suitable image feature at a given image location.
The detected feature is usually represented by a pair of image coordinates, a set of curves, a
connected region, or an area [Corke, 2013]. The descriptor is a method of describing the image
feature's neighbourhood. The descriptor typically takes the form of a vector used for correspondence. In
this section, we review geometric image features, as well as photometric image features in the
context of refractive objects and light fields from 2D to 4D. We then briefly discuss image fea-
ture correspondence and why refractive objects are particularly challenging for image feature
correspondence.
3.1.1 2D Geometric Image Features
Traditionally, the most common image features are geometric image features that represent a 2D
or 3D geometric shape in a 2D image. Most robotic vision methods use 2D geometric image
features, such as regions and lines [Andreff et al., 2002], line segments [Bista et al., 2016],
moments (such as the image area, the coordinates of the centre of mass and the orientation of an
image feature) [Mahony et al., 2002,Tahri and Chaumette, 2003,Chaumette, 2004], and interest
points (sometimes referred to as keypoints) [Chaumette and Hutchinson, 2006,McFadyen et al.,
2017]. For image points, Cartesian coordinates are normally used, though polar and cylindrical
coordinates have also been developed [Iwatsuki and Okiyama, 2005]. Interest points are better
suited to handle large changes in appearance, which may be caused by refractive objects. One
of the earliest and most popular interest point detectors is the Harris corner detector [Harris and
Stephens, 1988]; however, Harris corners do not distinguish interest points of different scale—
they operate at a single scale, determined by the internal parameters of the detector. In the
context of wide baseline matching and object recognition, there is an interest in features that
can cope with scale and viewpoint changes. Harris corners are computationally-cheap, but do
not provide accurate feature matches across different scales and viewpoints [Tuytelaars et al.,
2008, Le et al., 2011].
To achieve scale invariance, a straightforward approach is to extract points over a range of scales
and use all of these points together to represent the image, giving rise to multi-scaled features.
Of particular note, Lowe developed the scale invariant feature transform (SIFT) feature detector
based on finding the extrema of a multi-scaled pyramid of the Difference of Gaussian (DoG)
responses [Lowe, 2004]. Bay et al. further reduced the computational cost of SIFT features by
considering the Hessian of Gaussians and other numerically-efficient approximations to create
speeded-up robust feature (SURF) feature detectors [Bay et al., 2008].
SIFT and SURF features also include descriptors that are based on using histograms. These
histograms describe the distribution of gradients and orientations of the feature’s support re-
gion in the image for illumination and rotational invariance. Dalal et al. developed the more
advanced histogram of gradients (HoG) feature descriptor, which uses normalized weights
based on nearby image gradients for each sub-region, making HoG descriptors less sensitive
to changes in contrast than SIFT and SURF, and better at matching in cluttered scenes [Dalal
and Triggs, 2005]. While Lowe’s SIFT descriptor was limited to a single scale, Dong et al.
recently improved the SIFT descriptors by pooling (combining) the gradient histograms over
all the sampled scales, calling the new descriptor domain-size pooled SIFT (DSP-SIFT) [Dong
and Soatto, 2015], which represents the state of the art in terms of point feature descriptors for
SfM tasks [Schoenberger et al., 2017].
Features from accelerated segment test (FAST) features were developed by Rosten et al. [Rosten
et al., 2009] as an exceptionally cheap binary feature detector that directly exploits the relative
relationship of nearby pixel values. Binary robust independent elementary features
(BRIEF) descriptors select random pixels within the neighbourhood of the feature and make
binary comparisons in sequence to form a binary descriptor, which is computationally cheap and
reliable except under in-plane rotation [Calonder et al., 2010]. Oriented FAST and rotated BRIEF
(ORB) features were developed as computationally cheaper alternatives to SIFT and SURF fea-
ture detectors for real-time robotics. The ORB features build on the FAST detector by using
Harris corner strength for ranking, and SIFT’s multi-scale pyramids for scale invariance [Rublee
et al., 2011]. ORB descriptors augment the BRIEF descriptor with an intensity-weighted
centroid: the small offset between a corner's intensity centroid and its geometric centre defines
a measure of orientation, providing rotational invariance. Overall, ORB is a much more
computationally efficient detector and descriptor
with comparable performance to SURF, which has proven to be very successful in the robotics
literature.
Recent work in machine learning and convolutional neural networks (CNNs) has also given
rise to learned 2D features. Verdie et al. [Verdie et al., 2015] developed a temporally invariant
learned detector (TILDE) that detects keypoints in outdoor scenes despite lighting and
seasonal changes. They demonstrated better repeatability over three different datasets than
hand-crafted feature detectors, such as SIFT. Unfortunately, Verdie's approach operated only
at a single scale and without any viewpoint changes.
Yi et al. trained a deep network to learn the thresholds of the entire SIFT feature detection
and description pipeline in a unified manner [Yi et al., 2016]. They called this method the
Learned Invariant Feature Transform (LIFT). LIFT out-performed all other hand-crafted fea-
tures in terms of repeatability and the nearest neighbour mean average precision, a metric that
captures how discriminating the descriptor is by evaluating it at multiple descriptor distance
thresholds.
An extensive experimental evaluation comparing hand-crafted with learned local feature
descriptors showed that learned descriptors often surpassed basic SIFT and other hand-crafted
descriptors on all evaluation metrics in SfM tasks [Schoenberger et al., 2017]. However, more
advanced hand-crafted descriptors such as DSP-SIFT performed on par with, or better than, the
state-of-the-art learned feature descriptors, including LIFT, for SfM tasks; unlike the
hand-crafted features, the learned descriptors showed a high variance across different datasets
and applications.
Many robotic vision systems have used hand-crafted and learned features [Kragic and Chris-
tensen, 2002,Bourquardez et al., 2009,Low et al., 2007,Tsai et al., 2017,Lee et al., 2017,Pages
et al., 2006]. The majority of them assume a similar appearance for the support region during
correspondence. For non-Lambertian scenes where the support regions change significantly in
appearance with respect to viewing pose, incorrect matches can occur. Moreover, refractive ob-
jects can cause features to distort, rotate, scale and flip. Feature descriptors that only account for
scale and rotation will not reliably match refracted content because the additional distortion and
flips caused by refractive objects change the very neighbourhood that the descriptors attempt to
describe. Thus, these 2D image features may not perform well for scenes containing refractive
objects.
3.1.2 3D Geometric Image Features
The fundamental limitation with using 2D features to describe the 3D world is that significant
information is lost during the image formation process of conventional cameras. The perspec-
tive transformation is an irreversible process that projects the 3D world into a 2D image. Full
3D information can greatly improve robot vision algorithms to more reliably handle changes
due to viewing position and lighting conditions. We refer to incorporating 3D information into
image features as 3D geometric image features.
Measurements of 3D data can come from a variety of sensors, including stereo, RGB-D or
LIDAR. Sensor measurements are then turned into one of many different 3D feature represen-
tations. Most conventional 3D feature descriptors are based on histograms, similar to SIFT’s
2D gradient-based histogram descriptors. The most common is Johnson’s spin image [Johnson
and Hebert, 1999]. For a given point, a cylindrical support volume is divided into volumetric
ring slices. The number of points in each slice are counted and summed about the longitudinal
axis of the volume. This makes the spin image rotationally invariant about this axis. Finally,
the spin image is binned into a 2D histogram.
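The construction above can be sketched in a few lines of NumPy (an illustrative sketch with a hypothetical `spin_image` helper, not code from [Johnson and Hebert, 1999]; bin counts and support radius are assumed values):

```python
import numpy as np

def spin_image(cloud, p, n, bins=8, radius=1.0):
    """Minimal spin image for an oriented point (p, n).

    Each neighbour x maps to cylindrical coordinates about the normal axis:
      beta  = n . (x - p)               (signed height along the normal)
      alpha = sqrt(|x - p|^2 - beta^2)  (radial distance from the axis)
    Binning (alpha, beta) discards the azimuth about the axis, which is
    what makes the descriptor rotationally invariant about the normal.
    """
    d = cloud - p
    beta = d @ n
    alpha = np.sqrt(np.maximum(np.einsum('ij,ij->i', d, d) - beta**2, 0.0))
    # Keep only points inside the cylindrical support volume.
    keep = (alpha <= radius) & (np.abs(beta) <= radius)
    hist, _, _ = np.histogram2d(alpha[keep], beta[keep], bins=bins,
                                range=[[0.0, radius], [-radius, radius]])
    return hist

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.uniform(-1, 1, size=(500, 3))
    si = spin_image(cloud, p=np.zeros(3), n=np.array([0.0, 0.0, 1.0]))
    print(si.shape)  # (8, 8)
```

The resulting 2D histogram is the descriptor vector that is compared between points.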
Tombari et al. built on spin images by using a spherical support volume and examining the
surface normals of all the points within the support, referred to as the Signature of Histograms
of OrienTations (SHOT) feature descriptor [Tombari et al., 2010]. All of these approaches use a
similar strategy: geometric measurements are taken over a support volume and binned into
a histogram, and the shape of the histogram is used to compare the similarity of points. Salti et al.
extended the SHOT descriptors to include both surface geometry as well as colour texture [Salti
et al., 2014]. They demonstrated improved repeatability by including texture; however, their
method remains untested for refractive objects and we anticipate reduced performance since
colour texture may change with viewpoint for refractive objects.
Quadros recently developed 3D features from LIDAR, defined by ray-tracing a set of 3D line
segments in space [Quadros, 2014]. If these lines reach behind a surface or encounter a large
gap in the data, unobserved space is registered by the method. Unobserved space is assumed to
be occlusions in 3D point clouds. The authors report that accounting for occlusions facilitates
more robust object recognition, although their method does not consider refractive objects.
Recently, Gupta et al. learned 3D features from RGB-D images for object detection and segmentation [Gupta et al., 2014], and Gao et al. for SLAM [Gao and Zhang, 2015]. However, none
of these methods has been applied to visual servoing, and all rely on 3D data
from RGB-D and LIDAR sensors, which return erroneous measurements for refractive objects
and other view-dependent effects.
3.1.3 4D Geometric Image Features
All of the previous features have been developed for 2D images or 3D representations. LFs
are parameterised in 4D, which requires a re-evaluation of feature detectors and descriptors.
Most previous work using LFs has only used 2D image features [Johannsen et al., 2015, Smith
et al., 2009], or has simply used the LF camera as an alternative 3D depth sensor in Structure from
Motion and SLAM-based applications [Dong et al., 2013, Marto et al., 2017]. These works do
not take advantage of all the information contained within the full 4D LF, which can capture
not only shape and texture, but also elements of occlusion, specular reflection and in particular,
refraction.
Ghasemi et al. proposed a global feature using a modified Hough transform to detect changes
in the lines of slope within an EPI [Ghasemi and Vetterli, 2014]. However, their method is
a global feature used to describe the entire scene, which is inappropriate for most SfM and
IBVS methods that require local features. More recently, Tosic et al. focused on developing
a SIFT-like feature detector for LFs by incorporating both scale-invariance and depth into a
combined feature space, called LISAD space [Tosic and Berkner, 2014]. Extrema of the first
derivative of the LISAD space were taken as 3D feature points, yielding a feature described
by image position (u, v), scale and slope (equivalently, depth). However, we note that Tosic's
work assumes no occlusions or specular reflections and does not discuss feature description to
facilitate correspondence over multiple light fields. Furthermore, Tosic’s choice of using an
edge-detector in the epipolar plane images (EPIs) amounts to a 3D edge detector in Cartesian
space, which is a poor choice when unique points are required by SfM and IBVS. Edge points
are not unique and are easily confused with their neighbours. Additionally, we anticipate these
LF features may not perform well for refractive objects, because the depth analysis assumes
Lambertian scenes.
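The slope-depth relationship that these EPI-based detectors rely on can be sketched as follows (an illustrative gradient-based orientation estimate under the Lambertian assumption; this is an assumed simplification, not code from [Tosic and Berkner, 2014]):

```python
import numpy as np

def epi_slope(epi):
    """Estimate the dominant line slope du/ds in an EPI patch (a sketch).

    A Lambertian scene point traces a straight line in the EPI; its slope
    is proportional to disparity and hence encodes depth. For a single
    orientation, Is = -m * Iu along the pattern, so averaging the gradient
    products over the patch recovers the slope m in a least-squares sense.
    Refracted points violate this single-line model, which is why EPI-based
    depth analysis assumes Lambertian scenes.
    """
    Is, Iu = np.gradient(epi.astype(float))   # derivatives along s and u
    Juu = float((Iu * Iu).sum())
    Jsu = float((Is * Iu).sum())
    if Juu == 0.0:                            # textureless patch: undefined
        return 0.0
    return -Jsu / Juu

# A synthetic EPI whose isophotes u - 2s = const have slope du/ds = 2.
s, u = np.meshgrid(np.arange(16), np.arange(64), indexing='ij')
epi = u - 2.0 * s
print(epi_slope(epi))  # 2.0
```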
Also pursuing more reliable LF features, Teixeira et al. found SIFT features in all sub-views of
the LF and projected them into their corresponding EPIs [Teixeira et al., 2017]. These projec-
tions were filtered and grouped into straight lines in their respective EPIs, and then counted.
Features with higher counts were observed in more views and thus considered more reliable.
In other words, Teixeira imposed 2D epipolar constraints on 2D SIFT features, which does not
take full advantage of the geometry of the 4D LF.
Similarly, Johannsen et al. considered 3D line features based on Plücker coordinates and im-
posed 4D light-field constraints in relation to LF-based SfM [Johannsen et al., 2015]. Zhang et
al. considered the geometry of 3D points and lines transforming under light field pose changes
[Zhang et al., 2017]. They derived line and plane-based correspondence methods between sub-
views of the LF and imposed these correspondences in LF-based SfM. Doing so resulted in im-
proved accuracy and reliability over conventional SfM, especially in challenging scenes where
image feature points were sparse, but lines and planes were still visible. These previous LF-
based works largely focused on matching between large differences in viewpoint. However,
incremental pose changes, such as those found in visual servoing and video applications, also
warrant consideration. How the LF changes with respect to small pose changes is similar
in concept to the image Jacobian used in IBVS, but has not yet been well explored.
In considering LF cameras with respect to refractive objects, Maeno et al. proposed to model
an object’s refraction pattern as image distortion and developed the light-field distortion (LFD)
feature based on the differences in corresponding points in the 4D LF [Maeno et al., 2013]. The
authors used the LFD for transparent object recognition. However, their method did not impose
any LF geometry constraints, leading to poor performance with respect to changes in camera
position. Xu et al. built on Maeno’s LFD to develop a method for refractive object image
segmentation [Xu et al., 2015]. Each pixel was matched across the sub-views of the light
field and fitted to a single 4D hyperplane, whose normal is characteristic of a Lambertian
point. A threshold was then applied to the fitting error to identify refracted pixels. However, we will
show in Chapter 5 that a 3D point is not described by a single hyperplane in 4D. Rather, a 3D
point manifests as a plane in 4D, which can be described as the intersection of two hyperplanes.
Both hyperplanes must be considered when assessing whether a feature has passed through a
refractive object.
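The distinction can be illustrated with a small sketch: a least-squares fit of the two-hyperplane (Lambertian) model to sub-view correspondences, whose residual could be thresholded in the spirit of [Xu et al., 2015]. This is an assumed simplification for illustration, not their exact formulation:

```python
import numpy as np

def lambertian_residual(s, t, u, v):
    """Fit the two-hyperplane Lambertian model to sub-view correspondences.

    Illustrative sketch (an assumed model, not Xu et al.'s formulation):
    a Lambertian 3D point appears at u = u0 + w*s, v = v0 + w*t across the
    sub-views (s, t), with a single slope w given by its disparity -- i.e.
    a plane in the 4D LF formed by the intersection of two hyperplanes.
    The least-squares residual measures how badly an observed track
    violates this model; thresholding it would flag refracted pixels.
    """
    n = len(s)
    # Unknowns [u0, v0, w]; rows for the u-equations, then the v-equations.
    A = np.zeros((2 * n, 3))
    A[:n, 0] = 1.0; A[:n, 2] = s
    A[n:, 1] = 1.0; A[n:, 2] = t
    b = np.concatenate([u, v])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.linalg.norm(A @ x - b) / np.sqrt(2 * n)

s = np.array([-1., 0., 1., -1., 1.])
t = np.array([0., 0., 0., 1., 1.])
u = 3.0 + 0.5 * s          # a consistent Lambertian track
v = 7.0 + 0.5 * t
print(lambertian_residual(s, t, u, v) < 1e-8)  # True
```

A refracted point breaks the shared-slope constraint, so its residual is large.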
3.1.4 Direct Methods
In contrast to geometric image features that represent a geometric primitive from the image(s),
direct methods establish some geometrical relationship between two images using pixel intensi-
ties directly. For this reason, they are also known as featureless, intensity-based, or photometric
methods [Collewet and Marchand, 2011]. These methods avoid image feature detection, extraction
and correspondence entirely: the image intensities are used directly, minimising
the error between the current and desired images in order to servo towards the goal pose. A common
measure of photometric image error is the sum of squared differences. Although this operation
involves many calculations over the image as a whole, it involves very few calculations per
pixel, each of which is relatively simple and easily computed in parallel. This allows many
direct methods to potentially run faster than feature-based VS methods.
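The whole-image photometric cost described above can be sketched as follows (illustrative only; NumPy assumed):

```python
import numpy as np

def ssd_error(current, desired):
    """Sum-of-squared-differences photometric error between two images.

    The whole-image cost used by direct (photometric) VS: no features are
    detected or matched; the servo loop simply drives this scalar towards
    zero. Each pixel contributes independently, which is why the cost is
    cheap per pixel and easy to parallelise.
    """
    diff = current.astype(float) - desired.astype(float)
    return float(np.sum(diff * diff))

goal = np.zeros((4, 4))
near = np.full((4, 4), 0.1)    # close to the goal image
far = np.full((4, 4), 1.0)     # far from the goal image
print(ssd_error(goal, goal))   # 0.0
assert ssd_error(near, goal) < ssd_error(far, goal)
```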
Despite these benefits, VS methods using photometric image features typically suffer from
small convergence domains compared to geometric feature-based methods [Collewet and Marchand, 2011]. Recently, to improve the convergence domain, Bateux et al. projected the current
image to several poses, which were tracked by a particle filter. The error between the projected
and current images drove the robot towards the convergence area, whereupon the method
switched to conventional photometric VS [Bateux and Marchand, 2015]. Although the error
was minimised between the current and next images, the poses projected by the particle fil-
ter were random, resulting in a path towards the goal pose that was not necessarily smooth or
optimal with respect to the amount of physical motion required to reach the goal.
Furthermore, photometric image features typically assume that the scene’s appearance does not
change significantly with respect to viewpoint. Thus, they do not perform well for changes in
pose which result in large changes in scene appearance [Irani and Anandan, 1999]. Refractive
objects tend to have large changes in appearance with respect to viewing pose and therefore
photometric VS methods are ill-suited for scenes with refractive objects.
Collewet et al. recently extended photometric IBVS to scenes with specular reflections [Collewet
and Marchand, 2009]. This was accomplished by considering the Phong light reflection model,
which provides image intensity as a function of a diffuse, specular and ambient component,
given a point light source [Phong, 1975]. Collewet’s approach compared the derivative of
the light reflection model to the image Jacobian from photometric VS [Collewet and Marc-
hand, 2011] to arrive at an analytical description of the image Jacobian relating pixel values to
the light reflection from the Phong model. However, their approach requires a light reflection
model, which in turn requires complete knowledge of all the light sources and their
relative geometry. A similar strategy would likely only be viable for refractive objects if a 3D
geometric model of the object was available.
3.1.5 Image Feature Correspondence
The classical approach in many robotic vision algorithms involves detecting, extracting and then
matching image features to compare the current and goal image feature sets. Often, the success of
the algorithm depends significantly on accurate feature correspondence. Correspondence is a
data association problem: finding the same set of image features in a pair of images. Image
feature correspondence is typically divided into two categories: large-baseline and small-baseline
matching. Large-baseline matching aims to correspond features between two images taken from
relatively different viewpoints or viewing conditions; small-baseline matching, between two
images taken from relatively similar viewpoints. While both aim to match image features
between two images, their underlying assumptions and approaches differ. Large-baseline
matching can also apply to small-baseline situations, and in the context of VS, the image
feature error that VS seeks to minimise relies on corresponding image features between the
current and goal images, where the goal image may have been captured from a relatively
different viewpoint. Thus, in this thesis we focus on large-baseline image feature matching.
For matching, the nearest-neighbour distance between feature descriptor vectors is commonly
used to form putative matches; however, exhaustive search is inefficient for large feature
databases. Advanced features, like SIFT, use search data structures such as k-d trees to find
matches more efficiently [Lowe, 2004]. Muja et al. proposed multiple randomized k-d trees to
approximately find the nearest neighbour with much faster speeds than linear search, with only
a minor loss in accuracy [Muja and Lowe, 2009]. However, traditional image feature correspondence
methods ultimately compare some abstraction of the image feature's appearance.
This inherently assumes that the appearance of the image feature does not
change significantly between views. Refractive objects can significantly change the appearance
of a feature, which makes matching based on appearance particularly challenging.
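The nearest-neighbour matching pipeline described above can be sketched as follows (an illustrative sketch using SciPy's `cKDTree` in place of the randomized k-d trees of [Muja and Lowe, 2009]; the ratio-test threshold is an assumed value):

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Putative matches via k-d tree nearest neighbours with a ratio test.

    For each descriptor in A, find its two nearest neighbours in B and
    accept the match only if the nearest is sufficiently closer than the
    second nearest (Lowe's ratio test), rejecting ambiguous matches.
    Note this compares descriptor vectors only, i.e. appearance -- the
    very assumption that refractive objects break.
    """
    tree = cKDTree(desc_b)
    dist, idx = tree.query(desc_a, k=2)          # two nearest neighbours each
    good = dist[:, 0] < ratio * dist[:, 1]
    return [(int(i), int(idx[i, 0])) for i in np.flatnonzero(good)]
```

For example, two descriptor sets with one near-duplicate per feature would yield one accepted match per unambiguous pair.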
To reduce the possibility of mismatches and remove outliers, putative matches are refined ac-
cording to some consistency measure (or model). For example, in two-view geometry, the
image reprojection error from the fundamental matrix is used. The standard approach is random
sample consensus (RANSAC) [Bolles and Fischler, 1981], where candidate points
are randomly chosen to form a hypothesis, which is tested for consistency against the remain-
ing data. The hypothesis process is iteratively repeated until a thresholded number of inliers is
reached. Building on RANSAC, Torr et al. proposed maximum likelihood estimator sampling
and consensus (MLESAC) that maximises the likelihood that the data was generated from the
hypothesis [Torr and Zisserman, 2000].
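The hypothesise-and-verify loop of RANSAC can be sketched with a deliberately simple model (a 2D line rather than the fundamental matrix; iteration count and tolerance are assumed values for illustration):

```python
import numpy as np

def ransac_line(points, iters=200, tol=0.05, rng=None):
    """Minimal RANSAC sketch: robustly fit a 2D line to points with outliers.

    The same loop used to reject feature mismatches in two-view geometry
    (there the model is the fundamental matrix; here, for brevity, a line):
    repeatedly sample a minimal set, fit a candidate model, count inliers
    within a tolerance, and keep the hypothesis with the largest consensus.
    """
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        d = q - p
        n = np.array([-d[1], d[0]])              # line normal
        norm = np.linalg.norm(n)
        if norm == 0.0:
            continue                             # degenerate sample
        resid = np.abs((points - p) @ (n / norm))  # point-to-line distances
        inliers = resid < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```

Note how the loop presumes inliers dominate: if most points are outliers (as when a refractive object fills the view), no sampled hypothesis gathers a trustworthy consensus.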
Most outlier rejection methods, such as RANSAC and MLESAC, are based on two assump-
tions: first, that there are sufficient data to describe the model and second, that the data are
mostly inliers—there are few outliers. Most robotic vision algorithms do not account for re-
fraction and thus rely on these outlier rejection methods to remove these inconsistent features
(such as refracted features) from the inlier set. In a scene that has mostly Lambertian features
with only a small number of refracted features, outlier rejection methods work well. However,
for scenes that are mostly covered by a refractive object, such as when a robot or camera di-
rectly approaches a refractive object, outlier rejection methods are much less reliable because
the second assumption is broken [Kompella and Sturm, 2011]. Therefore, traditional feature
correspondence methods may not work reliably for features that pass through refractive objects.
3.2 Visual Servoing
Visual servoing (VS) is a form of closed-loop feedback control that uses a camera in the loop
to directly control robot motion. The term VS was introduced in 1979 by Hill & Park [Hill, 1979]
to distinguish their approach from the common “look-then-move”, or equivalently “sense-then-act”, approach
to robotics, and has since covered a wide range of applications, from
controlling robot manipulators in manufacturing and agricultural fruit/vegetable picking [Mehta
and Burks, 2014,Baeten et al., 2008,Han et al., 2012], to flying quadrotors [Bourquardez et al.,
2009], and even docking of planetary rovers [Tsai et al., 2013]. VS is a promising technique
for robotics because it does not necessarily require a 3D geometric model of its target, and the
accuracy of its operations does not entirely depend on accurate robot control and calibration.
Historically, the simplicity of the VS approach has led to faster interaction in docking,
manipulation and grasping tasks, as well as shorter cycle times in sensing the environment,
which have translated to more reliable robot performance.
Hutchinson et al. were some of the first researchers to clearly distinguish the different types of
VS systems in 1996 [Hutchinson et al., 1996]. This classification was based on how the visual
input was used and what computation was involved, grouping them into either position-based
visual servoing (PBVS) or image-based visual servoing (IBVS) systems. In this section, we
provide a comparison and review of PBVS and IBVS systems.
3.2.1 Position-based Visual Servoing
The purpose of PBVS is to minimise the relative pose error between the target (some desired
pose), and the camera’s pose. Image features are extracted from the image and used with a
geometric model of the target and a known camera model to estimate the relative pose of the
target with respect to the camera, as shown in Fig. 3.1a. Feedback is then computed to reduce
the error in the estimated relative pose. PBVS is traditionally referred to as position-based
VS, although the approach may be more accurately described as pose-based VS. The main
advantage of PBVS is that it is straightforward to incorporate physical constraints, spatial
knowledge and direct manoeuvres (such as obstacle avoidance).
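The feedback step of PBVS can be sketched as follows (a simplified version of the standard law described in [Chaumette and Hutchinson, 2006]; sign conventions and the gain are assumptions for illustration):

```python
import numpy as np

def pbvs_control(R, t, lam=0.5):
    """One step of a basic PBVS law (a sketch; gains and frames simplified).

    Given the estimated pose of the goal frame in the current camera frame
    (rotation R, translation t, e.g. from a model-based pose estimator),
    drive both errors to zero: translational velocity proportional to t,
    angular velocity proportional to the axis-angle vector theta*u of R.
    """
    # Axis-angle (log map) of R via the trace formula.
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if np.isclose(theta, 0.0):
        thetau = np.zeros(3)
    else:
        thetau = theta / (2.0 * np.sin(theta)) * np.array(
            [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    v = lam * t              # translate towards the goal
    omega = lam * thetau     # rotate towards the goal orientation
    return v, omega
```

The key point is that this loop consumes a full pose estimate (R, t), which is exactly what requires a geometric model of the target and is so hard to obtain for refractive objects.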
PBVS requires an estimate of the target object pose in order to derive feedback control in the
task space. The approach can be computationally demanding, sensitive to noise and highly
dependent on camera calibration. Most research involving PBVS has focused on Lambertian
scenes, i.e. scenes that are predominantly Lambertian and so do not contain refractive objects,
specular reflections, or other surfaces or materials that cause non-Lambertian light transfer.
PBVS has been demonstrated in full 6 DOF control by Wilson et al. [Wilson et al., 1996] and
in real time using object models with a monocular camera by Drummond et al. [Drummond
and Cipolla, 1999]. More recently Tsai et al. implemented PBVS using a stereo camera for a
tether-assisted docking system [Tsai et al., 2013]. Teulière et al. demonstrated successful PBVS
Figure 3.1: Architectures for (a) position-based visual servoing and (b) image-based visual
servoing, which does not require explicit pose estimation. Image courtesy of [Corke, 2017].
using an RGB-D camera even when partial occlusions were present [Teulière and Marchand,
2014].
PBVS towards refractive objects was recently considered by Mezouar et al. for transparent pro-
tein and crystal manipulation under a microscope [Mezouar and Allen, 2002]. However, the 2D
nature of the microscope workspace greatly simplified the visual servoing process. More im-
portantly, the microscope and the backlighting reduced the image processing to a thresholding
problem, making the objects’ refractive nature irrelevant.
Recently, Bergeles et al. used PBVS for controlling the pose of a microrobotic device inside
a transparent human eye for surgery by accounting for the visual distortions caused by the
eye [Bergeles et al., 2012]. Their method required extremely precise model calibration of both
the eye and the robot in order to avoid potential injury. In our application of servoing towards
refractive objects of more general shapes, models of the cameras are not always accurate, and
prior models of the objects are not necessarily available and can be difficult to obtain. Because
PBVS depends on such models, these methods are sometimes referred to as model-based
VS [Kragic and Christensen, 2002].
PBVS is not commonly used in practice because the visual features used for servoing are not
guaranteed to stay in the FOV during the approach, and more importantly, it requires estimation
of the target pose, which in turn requires a geometric model of the target object and model of
the camera. As we will discuss in Section 3.3.2, 3D information of refractive objects, and in
particular their 3D models and 3D pose information is extremely difficult to obtain. Experi-
mental setups that can obtain the required 3D measurements on the refractive objects are likely
too bulky for mobile robot applications such as VS. Additionally, monocular pose estimation
is poorly conditioned numerically [Kragic and Christensen, 2002]. Therefore, there is real in-
terest in compact IBVS systems that tend to keep the target in the FOV by the very nature of
the algorithm, that avoid the ill-conditioned pose estimation, and do not necessarily require 3D
geometric models of the refractive objects.
3.2.2 Image-based Visual Servoing
In IBVS, robot control values are directly computed based on image features, as shown in
Fig. 3.1b. Typically, image features from the current view of the robot are detected and ex-
tracted. These image feature vectors are matched to a set of goal image feature vectors. The
image feature error is computed as the difference between the two image feature sets. Then the
estimated camera velocity that attempts to drive the image feature error to zero is computed.
This cycle is repeated until the image feature error is sufficiently small. The negative feedback
helps to reduce system fluctuations and promotes settling to equilibrium, which makes IBVS
more robust to uncertainty, noise and camera/robot modelling and calibration errors that often
plague traditional open-loop sense-then-act approaches. IBVS works because the camera pose
is implicit in the image feature values. This eliminates the need for an explicit 3D geomet-
ric model of the goal object, as well as an explicit pose-based motion planner [Chaumette and
Hutchinson, 2006].
3.2.2.1 Image Jacobian for Monocular IBVS
At the core of IBVS systems is the interaction matrix, sometimes referred to as a visual-motor
model but more commonly known as the image Jacobian [Kragic and Christensen,
2002]. The image Jacobian J represents a first-order partial derivative function that relates the
rate of change of image features to camera velocity. Consider
$$\dot{p} = J(p, {}^{c}P; K)\,{}^{c}\nu, \tag{3.1}$$

where ${}^{c}P \in \mathbb{R}^{3}$ is the coordinate of a world point in the camera reference frame, $p \in \mathbb{R}^{2}$ is
its image plane projection, $K \in \mathbb{R}^{3 \times 3}$ is the camera intrinsic matrix, and ${}^{c}\nu = [v; \omega] \in \mathbb{R}^{6}$ is
the camera's spatial velocity in the camera reference frame, which is the concatenation of the
camera's translational velocity $v = [v_x, v_y, v_z]^{T}$ and rotational velocity $\omega = [\omega_x, \omega_y, \omega_z]^{T}$ in the
camera reference frame.
The control problem is defined by the initial (observed) and desired image coordinates, $p^{\#}$ and
$p^{*}$ respectively, from which the required optical flow

$$\dot{p}^{*} = \lambda(p^{*} - p^{\#}) \tag{3.2}$$
can be determined, where λ > 0 is a constant. This equation implies straight line motion in the
image because the image feature error is only taken as the difference between initial and desired
image coordinates. Combining both equations we can write
$$J(p, {}^{c}P; K)\,\nu = \lambda(p^{*} - p^{\#}), \tag{3.3}$$
which relates camera velocity to observed and desired image plane coordinates. It is important
to note that VS is a local method based on $J$, the linearisation of the perspective projection
equation. In practice it is found to have a wide basin of attraction.
The monocular image Jacobian for an image point feature $p = (u, v)$ is given as [Chaumette
and Hutchinson, 2006]

$$J = \begin{bmatrix} -\dfrac{f_x}{P_z} & 0 & \dfrac{u}{P_z} & \dfrac{uv}{f_y} & -\dfrac{f_x^2 + u^2}{f_x} & \dfrac{f_x v}{f_y} \\[2mm] 0 & -\dfrac{f_y}{P_z} & \dfrac{v}{P_z} & \dfrac{f_y^2 + v^2}{f_y} & -\dfrac{uv}{f_x} & -\dfrac{f_y u}{f_x} \end{bmatrix}, \tag{3.4}$$
where fx, fy are the x and y focal lengths1, respectively, and Pz is the depth of the point. We
note that the first three columns of J depend on depth, implying that image feature velocity in
the image plane is inversely proportional to depth, while the feature velocity due to the angular
velocity of the camera is largely unaffected by depth.
¹ Typically, fx and fy are equal. These terms are in units of pixels, i.e. pixel size is included.
Equation (3.3) suggests we can solve for the camera velocity ν, but for a single point the system
is under-determined. Thus it is not possible to uniquely determine the elements of ν for a single
observation p. To address this issue, the typical approach is to stack (3.4) for each of N image
features,
$$\begin{bmatrix} J(p_1, {}^{c}P_1; K) \\ \vdots \\ J(p_N, {}^{c}P_N; K) \end{bmatrix} \nu = \lambda \begin{bmatrix} p_1^{*} - p_1^{\#} \\ \vdots \\ p_N^{*} - p_N^{\#} \end{bmatrix} \tag{3.5}$$
and if N ≥ 3 we can solve uniquely for ν
$$\nu = -\lambda \begin{bmatrix} J_1 \\ \vdots \\ J_N \end{bmatrix}^{+} \begin{bmatrix} p_1 - p_1^{*} \\ \vdots \\ p_N - p_N^{*} \end{bmatrix}, \tag{3.6}$$
where $J^{+}$ represents the Moore-Penrose pseudo-inverse of the stacked Jacobian. Equation (3.6) is similar
to the classical proportional control law for VS [Hutchinson et al., 1996], except that we use
the pseudo-inverse because noisy observations form a (generally non-square) over-determined system;
the pseudo-inverse yields the least-squares solution, minimising the norm of the image feature
velocity error. The constant λ is the control loop's gain, which scales the resulting control.
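One iteration of the control law in (3.4) and (3.6) can be sketched as follows (an illustrative implementation; the pixel-coordinate convention, known depths, and the gain value are assumptions):

```python
import numpy as np

def point_jacobian(u, v, Pz, fx, fy):
    """Image Jacobian (3.4) for one point feature (u, v) at depth Pz.

    (u, v) are pixel coordinates relative to the principal point; fx, fy
    are focal lengths in pixels. Columns 1-3 (translation) scale with
    1/Pz; the rotational columns are largely depth-independent.
    """
    return np.array([
        [-fx / Pz, 0, u / Pz, u * v / fy, -(fx**2 + u**2) / fx, fx * v / fy],
        [0, -fy / Pz, v / Pz, (fy**2 + v**2) / fy, -u * v / fx, -fy * u / fx],
    ])

def ibvs_step(p, p_star, Pz, fx, fy, lam=0.5):
    """One IBVS iteration as in (3.6): stack the per-point Jacobians and
    solve for the camera velocity with the Moore-Penrose pseudo-inverse."""
    J = np.vstack([point_jacobian(u, v, z, fx, fy)
                   for (u, v), z in zip(p, Pz)])
    err = (np.asarray(p) - np.asarray(p_star)).reshape(-1)
    return -lam * np.linalg.pinv(J) @ err
```

With the image feature error at zero, the commanded velocity is zero; any residual error produces a corrective 6-DOF twist.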
There are two important issues with (3.4) and (3.6) with respect to lack of depth information
and stability. First, (3.4) depends on depth Pz. Any method that uses this form of Jacobian
must therefore estimate or approximate Pz. However monocular cameras do not measure depth
directly. A common assumption is to fix Pz, which then acts as a control gain on the translational
velocities [Chaumette and Hutchinson, 2006]. A variety of other approaches exist to estimate
depth online [Papanikolopoulos and Khosla, 1993, Jerian and Jain, 1991, De Luca et al., 2008];
however, monocular depth estimation techniques are often nonlinear and difficult to solve be-
cause they are typically ill-posed [Kragic and Christensen, 2002]. Moreover, visual servoing
stability issues can arise from these approaches because variable depth can lead to local minima
and ultimately unstable behaviour of the robot system [Chaumette, 1998].
Second, Chaumette showed that the IBVS system is only guaranteed to be stable near the goal
configuration, since J is a linear approximation to the nonlinear robotic vision system [Chaumette and Hutchinson,
2006]. Local asymptotic stability is possible for IBVS, but global asymptotic stability cannot
be ensured. Determining the size of the neighbourhood where stability and convergence are
ensured is still an open issue, even though this neighbourhood is large in practice. Furthermore,
(3.6) relies on stacking N image point feature Jacobians Ji, each of which may
have a different Pz,i depending on the scene geometry. Malis et al. showed that incorrect Pz,i can
cause the system to fail [Malis and Rives, 2003]. In other words, the depth distribution affects
IBVS convergence and stability, and in the case of unknown target geometry, accurate depth
estimates are actually needed.
One example of undesirable behaviours in IBVS is camera retreat, where the camera may move
backwards for large rotations [Chaumette, 1998]. Camera retreat is caused by the coupled
nature of the rotation and translation components in the image Jacobian. This poses a perfor-
mance issue because in real systems, such backwards manoeuvres may not be feasible. Corke
et al. showed that camera retreat was a consequence of requiring straight line motion on the
image plane with a rotating camera (as in (3.3)). This was then addressed by decoupling the
translation motion components from the z-axis rotation components into two separate image
Jacobians [Corke and Hutchinson, 2001]. Recently, Keshmiri et al. proposed to decouple all
six of the camera’s velocity screw elements [Keshmiri and Xie, 2017]. Their approach enables
better Cartesian trajectory control compared to traditional IBVS systems at the cost of more
computation.
Almost all IBVS methods rely on accurate image feature correspondence in order to accurately
compute image feature error. McFadyen et al. recently proposed an IBVS method that jointly
solves the image feature correspondence and motion control problems in an optimal control
framework [McFadyen et al., 2017]. Image feature error is computed for different feature cor-
respondence permutations. As the robot moves closer to the desired pose, the system converges
towards smaller error and the correct permutation. However, their approach is exhaustive in
the number of image features and thus does not scale well for a large
number of image features, such as the natural features typically used in
most robotic vision algorithms.
3.2.2.2 IBVS on Non-Lambertian Objects
An interesting approach to IBVS on featureless objects was proposed by Pages et al. whereby
coded, structured light was projected into the scene to create geometric visual features for fea-
ture correspondence [Pages et al., 2006]. By defining the projection pattern as a particular grid
of coloured dots, many point features were quickly and unambiguously detected and matched.
However, the structured light required that the ambient light did not overpower the projector,
limiting usage to indoor applications. Additionally, this method may not work reliably for re-
fractive objects, because the projected pattern would be severely distorted, scaled, or flipped,
which would greatly complicate the feature detection and correspondence problem.
Recently, Marchand and Chaumette used planar mirrored reflections to overcome the limited
FOV of a single camera in IBVS [Marchand and Chaumette, 2017]. They derived the image
Jacobian for servoing the mirror relative to the camera to track an object. However, only Lam-
bertian features were tracked through the mirror and it was assumed that the image features
were always within the mirror (thus all the reflected features always showed consistent motion).
Furthermore, image feature distortion that could arise from non-planar mirrors, somewhat simi-
lar to the distortion from refractive objects, was also not considered. In summary, this approach
may not be directly transferable to tracking image features through refractive objects because
nonlinear image feature motion—potentially caused by inconsistent feature/mirror motion or
non-planar mirrors—was not considered in their approach.
3.2.2.3 IBVS using Multiple Cameras
IBVS has been extended to stereo and multi-camera vision systems. Assuming the pose be-
tween both cameras is known, each camera’s Jacobian can be transformed into a common ref-
erence frame. Stacking the same type of image features from both cameras and solving the
system yields camera motion [Chaumette and Hutchinson, 2007]. Malis et al. [Malis et al.,
2000] extended this concept to multiple cameras with a similar stacking of image features;
more cameras yielded more features. Comport et al. derived an IBVS framework for gener-
alised cameras [Comport et al., 2011], though the focus was on non-overlapping FOV camera
configurations, rather than the overlapping FOV camera configurations of LF cameras. Ad-
ditional IBVS systems were discussed in Section 3.1.2. All of these previous works rely on
accurate feature correspondence. They assume Lambertian point correspondences, which do
not necessarily apply in the case of refractive objects. Therefore, we expect that none of these
systems would perform reliably in the presence of refractive objects.
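The Jacobian-stacking scheme for multi-camera IBVS can be sketched as follows. This is an illustrative reconstruction, not code from any cited work: the inter-camera pose (R21, t21), the feature lists, the error vector, and the gain are made-up values, and the 6x6 velocity transform is the standard twist transform.

```python
import numpy as np

def skew(t):
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def interaction_matrix(x, y, Z):
    # 2x6 image Jacobian for one normalised point feature at depth Z.
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def velocity_transform(R, t):
    # 6x6 twist transform taking the reference camera's velocity screw into
    # the second camera's frame, given that camera's relative pose (R, t).
    V = np.zeros((6, 6))
    V[:3, :3], V[3:, 3:] = R, R
    V[:3, 3:] = skew(t) @ R
    return V

# Point features (x, y, Z) seen by each camera -- made-up values.
feats1 = [(0.1, 0.2, 1.5), (-0.3, 0.1, 2.0), (0.2, -0.2, 1.8)]
feats2 = [(0.0, 0.1, 1.2), (0.3, -0.1, 2.2), (-0.1, 0.3, 1.6)]
R21, t21 = np.eye(3), np.array([0.1, 0.0, 0.0])   # known inter-camera pose
V = velocity_transform(R21, t21)

L1 = np.vstack([interaction_matrix(*f) for f in feats1])
L2 = np.vstack([interaction_matrix(*f) for f in feats2]) @ V
L_stacked = np.vstack([L1, L2])        # one 12x6 system in a common frame

gain = 0.5
error = np.ones(12) * 0.01             # stacked image-feature error
v = -gain * np.linalg.pinv(L_stacked) @ error   # commanded velocity screw
```

Note that the whole construction presumes that the stacked feature errors refer to correctly matched Lambertian points; a refracted feature would contribute rows that are inconsistent with the rigid-scene model.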
In the area of VS, Malis et al. first proposed to use 3D geometric image features, but referred
to this concept as 2.5D visual servoing [Malis et al., 1999]. Using 3D geometric image features
does not necessarily require any geometric 3D model of the target object and is less limited by
the relatively small convergence domain and depth issues that plague monocular image-based
visual servoing (M-IBVS). In a slightly different manner, Chaumette proposed that it may be
advantageous for robot systems to plan large steps using PBVS, while small intermediate steps
are maintained by IBVS [Chaumette and Hutchinson, 2007].
For stereo vision systems, Cervera et al. used the 3D coordinates of points [Cervera et al., 2003],
and Bernardes et al. used 3D lines [Bernardes and Borges, 2010] for visual servoing. Malis et
al. used homographies [Malis et al., 1999, Malis and Chaumette, 2000], and both Mariottini et al.
and Cai et al. used epipolar geometry [Mariottini et al., 2007, Cai et al., 2013] in visual servoing.
Both homography and epipolar-based approaches determine a geometric relationship between
the current and desired views to control robot motion. The geometric relationship is either
the homography matrix or the fundamental matrix, both of which can be determined using
corresponding feature points from different views. However, decomposing the homography
matrix only applies to planar scenes and stereo epipolar geometry becomes ill-conditioned for
short baselines as well as planar scenes [López-Nicolás et al., 2010].
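Both the homography and the fundamental matrix are estimated from corresponding feature points, which is why Lambertian correspondence failures propagate directly into these servoing schemes. A minimal, unnormalised DLT sketch for the homography case (no Hartley normalisation, no RANSAC; the matrix and points below are synthetic values) might look like:

```python
import numpy as np

def fit_homography(src, dst):
    """Unnormalised DLT: estimate the 3x3 homography H with dst ~ H @ src
    from n >= 4 point correspondences (each an (n, 2) array)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y, -u])
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))   # null vector of A is vec(H)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pts):
    # Apply H to inhomogeneous 2D points.
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

# Recover a known homography from five synthetic correspondences:
H_true = np.array([[1.1, 0.02, 5.0], [-0.01, 0.95, -3.0], [5e-4, 2e-4, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
dst = apply_h(H_true, src)
H_est = fit_homography(src, dst)
```

With noisy, mismatched correspondences of the kind refractive objects induce, the same least-squares machinery produces an arbitrarily wrong H, which is the failure mode discussed above.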
Recently, Zhang et al. developed a trifocal tensor-based approach for visual servoing [Zhang
et al., 2018]. In simulation, a trinocular camera system was used to estimate the trifocal ten-
sor based on point feature correspondences, as in [Hartley and Zisserman, 2003]. Instead of
directly computing the camera pose via singular value decomposition (SVD), the authors chose
to use elements of the trifocal tensor, augmented with elements of scale and rotation, as visual
features. However, these methods relied on accurate feature correspondences, which are fun-
damentally based on Lambertian assumptions. Therefore, these approaches are not likely to
perform reliably in the presence of non-Lambertian scenes, such as those containing refractive
objects.
3.3 Refractive Objects in Robotic Vision
Refractive objects are particularly challenging in computer and robotic vision because these
objects do not have any obvious visible features of their own. Their appearance tends to be
largely dependent on the background, the object’s shape and the lighting conditions. Although
refractive objects have been largely ignored by the bulk of the robotics community, we review
the previous research on detecting and recognizing refractive objects, and on reconstructing their
shape.
Although shape reconstruction is not an explicit goal of this thesis, observed structure and
camera motion are integrally linked, and it is important to review what information has been
extracted from refractive objects.
3.3.1 Detection & Recognition
There have been a variety of approaches to detecting and recognizing refractive objects. In
this review, we have divided the different approaches into model- and image-based approaches,
based on whether or not the method in question relies on a prior 3D geometric model of the
refractive objects.
3.3.1.1 Model-based Approaches
One of the earliest model-based approaches to refractive object detection was proposed by Choi
and Christensen, where a database of 2D edge templates of projected 3D refractive object mod-
els with known poses was used to match edge contours from 2D images [Choi and Christensen,
2012]. Image edges were extracted and matched using particle filters to provide coarse
pose estimates, which were refined via RANSAC. The authors achieved real-time refractive
object detection and tracking with 3D pose information. However, this approach required a
large database of edge templates for every conceivable model and pose, which does not scale
well for general purpose robots, although this is becoming less significant with the increasing
computational abilities of modern computers.
Most subsequent approaches adopted RGB-D cameras as a means of making putative refractive
object detections. While depth measurements of refractive objects from RGB-D cameras were
known to be inconsistent, partial depth around the refractive objects was usually observed in
the RGB-D images. Luo et al. applied a variety of morphological operations to identify 3D
regions of inconsistent depth, which were assumed to be refractive [Luo et al., 2015]. These 3D
regions were then compared to 3D object models for recognition. However, Luo obtained the
3D models of the refractive objects by first painting them so that the refractive objects became
Lambertian, which is not a practical approach for most robotic applications.
Recently, LF cameras have been considered for refractive object recognition with models. Wal-
ter et al. also used an RGB-D camera for object recognition, but combined their system with
an LF camera array to detect and replace the inconsistent depth estimates caused by specular
reflections on glass objects [Walter et al., 2015]. This was accomplished by comparing a known
3D model of the refractive object to the observed depth measurements in order to identify the in-
consistent depths. Given that LF cameras implicitly encode depth, it is possible that the RGB-D
camera was redundant in this approach.
In a particularly recent and interesting work, Zhou et al. developed an LF-based depth descriptor
for object recognition and grasping [Zhou et al., 2018]. For a Lambertian point, the light field
yields one highly redundant depth estimate, but for a refracted image feature, the light field
can yield a wide distribution of depths. Zhou proposed to use a 3D array of depth likelihoods
within a certain image region and depth range, creating a 3D descriptor for the refractive object.
By comparing this depth-based descriptor to a 3D geometric model, refractive object pose was
estimated using Monte Carlo localization. This method was sufficiently accurate for coarse
manipulation of glass objects in water and Lambertian objects behind a stained-glass window.
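The depth-distribution idea can be illustrated with a toy sketch. This is our own simplification, not Zhou et al.'s implementation: per-view depth estimates for each pixel are histogrammed over a depth range, so a Lambertian pixel yields a peaked likelihood and a refracted pixel a spread one. All array shapes and depth values below are hypothetical.

```python
import numpy as np

def depth_likelihood_descriptor(depth_votes, depth_bins):
    """Build an (H, W, n_bins) volume of per-pixel depth likelihoods from
    per-view depth estimates depth_votes of shape (n_views, H, W). A
    Lambertian pixel concentrates its votes in one bin; a refracted pixel
    spreads them over many bins."""
    n_views, H, W = depth_votes.shape
    vol = np.zeros((H, W, len(depth_bins) - 1))
    for i in range(H):
        for j in range(W):
            hist, _ = np.histogram(depth_votes[:, i, j], bins=depth_bins)
            vol[i, j] = hist / n_views   # normalise counts to likelihoods
    return vol

bins = np.linspace(0.5, 3.0, 11)                 # 10 depth bins (metres)
lambertian = np.full((9, 4, 4), 1.4)             # 9 views that all agree
disagreeing = np.tile(np.linspace(0.6, 2.9, 9).reshape(9, 1, 1), (1, 4, 4))
peaked = depth_likelihood_descriptor(lambertian, bins)
spread = depth_likelihood_descriptor(disagreeing, bins)
```

Comparing such a volume against one predicted from a 3D model is what allows pose estimation despite the unreliable single-valued depth.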
However, all of the previously-mentioned methods required prior accurate 3D geometric models
of the refractive objects. For a small set of simple objects this approach may be feasible, but
in general, models of refractive objects are challenging to acquire, potentially time-consuming
and expensive to obtain, or simply not available [Ihrke et al., 2010a]. Therefore, there is great
interest in methods that do not rely on 3D geometric models.
3.3.1.2 Image-based Approaches
Early work on detecting refractive objects in 2D images started with Adelson and Anandan
in 1990, focusing on finding occluding edges caused by refractive objects [Adelson and
Anandan, 1990]. However, their method was limited to 2D layered scenes with untextured,
planar refractive shapes, such as circles and triangles. Szeliski et al. extended this concept of
layered depth images to detect reflective and refractive objects in more general images [Szeliski
et al., 2000]; however, their approach was still limited to scenes that could be described as a
collection of planar layers. McHenry et al. noted that refractive objects tended to distort and blur
image edges, as well as appear slightly darker in the image [McHenry et al., 2005]. Thus, their
method focused on finding image edges and then compared the image gradients and overall
intensity values on either side of the edge to detect refractive parts of the image. Snake contours
were then used to merge components of refractive object edges into overall refractive object
segments. However, their method assumed that the background was similar on all sides of the
glass edges, which was not true for very refractive elements or those containing bright surface
highlights.
Kompella et al. extended this work by finding image regions that exhibited additional visual
characteristics of refractive objects [Kompella and Sturm, 2011]. In addition to the reduced
image intensity and blurred image gradients, their method also searched for an abundance of
highlights and caustics, caused by the specular surface of most refractive objects, and for lower
saturation values, since some light and colour is lost as it passes through refractive objects.
These characteristics were combined into a scoring function to detect and avoid refractive objects
during navigation. However, their method only provided very coarse estimates of where the refractive
objects were located in the image and still assumed that the background was similar on all sides
of the glass edges. Therefore, we anticipate that their approach would not perform well unless
the object was in front of a uniform background, an assumption that is impractical for mobile
robots working in cluttered scenes.
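A cue-combination detector of this kind might be sketched as below; the cue definitions, weights, and test images are our own illustrative choices, not the published method.

```python
import numpy as np

def glass_score(region_rgb, background_rgb, w=(0.4, 0.3, 0.3)):
    """Heuristic 'refractiveness' score for a candidate image region,
    combining three cues against the surrounding background: reduced
    intensity, reduced saturation, and reduced gradient energy (blur).
    The cue definitions and weights here are illustrative only."""
    def intensity(im):
        return im.mean()
    def saturation(im):
        mx, mn = im.max(axis=-1), im.min(axis=-1)
        return np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-9), 0.0).mean()
    def grad_energy(im):
        g = im.mean(axis=-1)
        return np.abs(np.diff(g, axis=0)).mean() + np.abs(np.diff(g, axis=1)).mean()
    cues = np.array([
        max(0.0, intensity(background_rgb) - intensity(region_rgb)),
        max(0.0, saturation(background_rgb) - saturation(region_rgb)),
        max(0.0, grad_energy(background_rgb) - grad_energy(region_rgb)),
    ])
    return float(np.dot(w, cues))

# A red-checkered background versus a dim, grey, blurred candidate region:
bg = np.zeros((8, 8, 3))
bg[..., 0] = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)
region = np.full((8, 8, 3), 0.1)
score_same = glass_score(bg, bg)        # identical region: no evidence
score_glass = glass_score(region, bg)   # dim, grey, smooth: high score
```

The sketch also makes the stated limitation concrete: every cue is a comparison against the surrounding background, so it is only meaningful when that background is roughly uniform around the object.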
Recently, Klank et al. used an RGB-D camera for detection, but noted that most refractive
objects appeared much darker in the depth images when placed on a flat table [Klank et al.,
2011]. This was likely due to the different absorption properties of glass. They segmented dark
regions in the depth images as candidate refractive objects and then identified depth inconsis-
tencies within the dark regions as refractive. However, dark regions in depth images do not
necessarily correspond to glass objects. Depending on the type of RGB-D camera used, dark
regions in depth images can also appear at surfaces that are actually farther away (since intensity
is correlated to depth), at other material types, such as felt, and sometimes at occlusion
boundaries. Thus, their algorithm may not perform well in cluttered and occluded scenes
containing refractive objects.
LF cameras have only recently been considered for image-based refractive object detection.
Maeno et al. proposed to model an object’s refraction pattern as image distortion, based on
differences in corresponding points in the 4D LF [Maeno et al., 2013]. However, the authors
noted poor performance due to changes in appearance from both the specular reflections on the
refractive objects and the camera viewing pose. Xu et al. built on Maeno's work to develop a
transparent object image segmentation method from a single light field capture [Xu et al., 2015].
However, as we will discuss in more detail in Ch. 5, their method does not fully describe how
a 3D point manifests in the light field. We address this to improve detection and recognition
rates.
3.3.2 Shape Reconstruction
Although shape reconstruction is not an explicit goal of this thesis, observed structure and mo-
tion are intricately linked; thus it is important to understand what has been done in this area.
Shape reconstruction of refractive objects is a particularly challenging task. Ihrke et al. proposed
a taxonomy of objects according to their increasing complexity with respect to light transport
(reflections, refractions, sub-surface scattering, etc.) [Ihrke et al., 2010a]. Most techniques
have focused on opaque objects (Class 1) and have demonstrated good performance using a
sequence of images from a monocular camera relying on dense pixel correspondences [Engel
et al., 2014, Newcombe et al., 2011]. However, shiny and transparent objects are still diffi-
cult for the state-of-the-art because these methods assume Lambertian surfaces. Additionally,
traditional methods rely on rejecting inconsistent correspondences using RANSAC [Fischler
and Bolles, 1981], which can be robust to a few small specular highlights but is insufficient
for dealing with more complex light transport phenomena (Class 3+), including refractive
objects [Ihrke et al., 2010a, Tsai et al., 2019], as we will show in Ch. 5. In order to reliably deal with
shiny and transparent objects, researchers have developed a variety of methods to reconstruct
the shape of refractive objects.
3.3.2.1 Shape from Light Path Triangulation
Kutulakos et al. presented the seminal work on using light-ray correspondences to estimate the
shape of refractive objects [Kutulakos and Steger, 2007]. The shape of specular and transparent
objects, defined by depths and surface normals, can be estimated by mapping the light rays that
enter and exit the object. As shown in Fig. 3.2, we can consider a convex, two-interface
refractive object and draw a ray originating from a background point P (two parameters) in some
direction r (two parameters²) for some distance d_PA (one parameter). At A′, the ray intersects
the refractive object and changes direction. We estimate this direction change using Snell's law,
which requires an estimate of the surface normal N_A (two parameters) and the ratio of refractive
indices n1/n2 (one parameter). The light ray then travels a distance d_AB (one parameter) through
the object and changes direction again at the exit interface at B′, which is defined by surface
normal N_B (two parameters). Finally, the light ray travels a distance d_BL (one parameter) to the
camera. Altogether, a basic light path can be described by a minimum of twelve scene parameters.
Alternatively, one can describe the light path as three rays (four parameters each) linked in series,
which also requires a minimum of twelve parameters. As we will describe below, many approaches,
such as shape from distortion [Ben-Ezra and Nayar, 2003] and shape from reflection [Han et al.,
2015], apply assumptions which limit or define many of these parameters to simplify shape recovery.
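The direction change at each interface follows the vector form of Snell's law, which can be sketched as follows; this is the standard formulation, with illustrative ray and normal values.

```python
import numpy as np

def refract(d, n, eta):
    """Vector form of Snell's law. d: unit incident direction; n: unit
    surface normal pointing towards the incoming ray; eta = n1/n2.
    Returns the refracted unit direction, or None on total internal
    reflection."""
    d, n = np.asarray(d, dtype=float), np.asarray(n, dtype=float)
    cos_i = -np.dot(n, d)
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # total internal reflection
    return eta * d + (eta * cos_i - np.sqrt(1.0 - sin2_t)) * n

# Air-to-glass refraction of a ray arriving at 45 degrees to the normal:
d = np.array([np.sin(np.pi / 4), 0.0, -np.cos(np.pi / 4)])
n = np.array([0.0, 0.0, 1.0])
t = refract(d, n, 1.0 / 1.5)   # n1 = 1.0 (air), n2 = 1.5 (glass)
```

Applying this once at entry and once at exit, with a travel distance between each event, reproduces the twelve-parameter light path described above.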
Shape from distortion is an approach based on capturing multiple images from known poses,
finding visual features that correspond to the same 3D point from behind the refractive object,
and then examining how the light path has been distorted by the refractive object. For example,
²Recall that a ray can be described by four parameters.
Figure 3.2: Light paths can describe the behaviour of light as it passes through a refractive ob-
ject. Most methods rely on light path correspondence and triangulation to solve for the depths
and surface normals of the refractive object. In general, for two-interface refractive objects, light
paths are described by twelve or more parameters, from the point of origin, through the intersections
at the refractive object boundaries, to the camera sensor. Many approaches apply assumptions or
constraints to simplify the problem.
Kim et al. acquired the shape of axially-symmetric transparent objects, such as wine glasses,
by placing an LCD display monitor in the background and emitting several known lighting
patterns [Kim et al., 2017]. However, most methods rely on a bulky device to project a calibrated
pattern through the object [Murase, 1990, Hata et al., 1996, Kim et al., 2017] and so are not
immediately applicable to mobile robotics. Ben-Ezra and Nayar tracked features over a sequence
of monocular camera images to capture the distortion pattern [Ben-Ezra and Nayar, 2003]. Starting
with an unknown parametric model, shape and pose were simultaneously found in an iterative,
nonlinear, multi-parameter optimisation scheme. However, their method could only handle
quadratic-shaped refractive objects and, importantly, the features were manually tagged, because
automatically detecting and matching image points through a refractive medium from single
images was considered a very hard problem.
Alternatively, shape-from-reflection or shape-from-refraction approaches typically solve light-ray
correspondences by controlling the background behind the refractive object. Han et al. used a
single camera fixed in position, with a refractive object placed in front of a checkerboard background [Han
et al., 2015, Han et al., 2018]. The method only required two images, with the background pattern
in two different known positions; however, a change of refractive index was required, which meant
immersing the object in water, a major limitation for most robots.
In addition to background scene control, constraints on the refractive object itself can further
simplify the light path correspondence problem. For example, Tsai et al. imposed a planar
surface constraint on one side of a refractive object. With a monitor controlling the background
image, they were able to reconstruct a diamond’s shape with a single monocular image [Tsai
et al., 2015] without having to place the object in water.
Without explicit control of the background, shape can also be obtained by controlling the incom-
ing light rays using a mobile light source. Morris et al. used a static monocular camera with a
grid of known moving lights to map different reflectance values to the same surface point, from
which they reconstructed very challenging shiny and transparent structures [Morris and Kutu-
lakos, 2007]. Miyazaki and Ikeuchi used a rotating polariser in front of a monocular camera
to capture multiple images of different polarisation settings, but also required a known back-
ground surface and known lighting distribution to estimate the shape of the transparent object
[Miyazaki and Ikeuchi, 2005]. However, both Morris’ and Miyazaki’s methods require known
light sources with bulky configurations that are impractical for mobile robotic applications.
The majority of state-of-the-art methods for refractive object shape reconstruction based on
light paths rely on feature correspondence between multiple views to find common features for
triangulation. Because of the complexity and sheer number of unknowns in the problem, most of
these approaches apply assumptions and constraints to make the problem more tractable. In doing
so, the application window of these methods becomes too narrow, making them too fragile and
unreliable for practical robot applications that must contend with many conditions and
environments; alternatively, the methods require equipment too bulky to be considered for most
mobile robot applications.
3.3.2.2 Shape from Learning
Recent work in robotics has seen an explosion in the area of learning features using convolutional
neural networks (CNNs). CNNs use a large number of images to train several layers of parameters
to minimise some cost function. CNNs apply the convolution operation to input images to
approximate how neurons in the brain respond to visual stimuli in the receptive field of the
visual cortex [Krizhevsky et al., 2012]. Given large training sets of images and an objective
function, a CNN is able to “learn” the visual stimuli relevant for a given task (such as image
classification or object detection).
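The convolution operation itself is simple; a minimal 'valid'-mode sketch in NumPy, with a fixed edge-detecting kernel standing in for a learned one, is:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid'-mode 2D cross-correlation: the core operation a CNN layer
    applies to its input (learned kernels replace the fixed one below)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel fires where intensity changes from left to right:
img = np.zeros((5, 6))
img[:, 3:] = 1.0
sobel_x = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
response = conv2d(img, sobel_x)
```

A trained network stacks many such filtered responses, interleaved with nonlinearities, rather than hand-picking a single kernel.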
Deep learning approaches use many layers to handle more complex tasks and to advance recog-
nition performance. Deep learning has achieved state-of-the-art performance for
many classification and recognition tasks, but few have explored their use for refractive objects.
Saxena et al. demonstrated a data-driven method for recognizing grasping points on a variety of
objects, including some refractive objects [Saxena et al., 2008]. However, recovering the shape
of such objects remains a challenge due to the large number of ground-truth images required
to train CNNs. For learning approaches on opaque objects, ground truth comes from
RGB-D cameras; however, RGB-D cameras are unable to provide reliable depth information
on refractive objects and 3D models of refractive objects are not always available.
3.3.2.3 Shape (Structure) from Motion
Shape estimation techniques based on multiple viewpoints are closely related to structure from
motion techniques [Wei et al., 2013]. For shape estimation, scene depth is usually determined
given the viewing pose for each viewpoint (although surface normals are also often computed).
On the other hand, for SfM, scene depth and viewing pose are simultaneously computed from
multiple 2D images (SfM is also closely related to visual servoing, which we review in
Section 3.2). SfM is generally considered to be a well-understood problem in
theory [Hartley and Zisserman, 2003]. The typical pipeline of SfM includes detecting image
features, establishing image correspondences, filtering outliers, estimating camera poses and
locations of 3D points, followed by optional refinement with bundle adjustment [Triggs et al.,
2000]. However, classical SfM does not produce reliable results for refractive objects because
of poor performance with feature correspondence [Ihrke et al., 2010b].
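The triangulation step at the heart of this pipeline can be sketched with the standard linear (DLT) method for two views of known pose; the camera matrices and the 3D point below are synthetic values.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: normalised image points."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # homogeneous solution is the null vector
    X = Vt[-1]
    return X[:3] / X[3]

# Two views with identity intrinsics; camera 2 offset along x (synthetic).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
X_true = np.array([0.3, -0.1, 2.0])
x1 = X_true[:2] / X_true[2]
h2 = P2 @ np.append(X_true, 1.0)
x2 = h2[:2] / h2[2]
X_est = triangulate(P1, P2, x1, x2)
```

The step silently assumes both image points observe the same Lambertian 3D point along straight rays; a refracted observation violates exactly that assumption, which is why classical SfM degrades around refractive objects.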
Ham et al. present a shape estimation method that may be loosely described as structure from
motion on the occluding edges of refractive objects [Ham et al., 2017]. The authors use multiple
views with known pose to extract the position and orientation of occluding edge features. Oc-
cluding edge features are visible edges in an image that lie on the boundary of an occlusion or
depth discontinuity. They appear as edges in the image and unlike textural edges (flat patterns
on a surface), they are view-dependent and their surfaces are tangential to the camera view.
Ham’s method can handle very general object shapes and does not require pre-existing knowl-
edge of the object. Bulky equipment setups are not required. However, their method relies on
a monocular camera that must be moved to different poses to acquire multiple views, which
may make dynamic scenes more challenging. An LF camera may be able to capture a sufficient
number of views in a single shot from a single sensor position (i.e., without moving
the camera). Furthermore, Ham's method is focused on reconstructing the scene and requires
full pose information, whereas our methods, which aim to detect refracted features and servo
towards them, are entirely image-based and thus do not require full pose information.
3.3.2.4 Shape from Light Fields
Using LF cameras for estimating the shape of objects is a relatively recent development. Most
research has been focused on reconstructing Lambertian objects. Tao et al. recently used cues
from both defocus and correspondence within a single LF exposure to obtain depth. The two
measures of depth were combined to provide more accurate dense depth maps than from either
method alone [Tao et al., 2013]. Luke et al. provided a framework to estimate depth by working
directly with the 4D LF in terms of gradients, as opposed to other methods that only exploited
2D epipolar slices of the 4D LF [Luke et al., 2014]. Wanner & Goldluecke formulated a struc-
ture tensor for each pixel to give local estimates of the lines of slopes from the epipolar plane
images. A global optimisation method was used to combine these local depth estimates in a
consistent manner [Wanner and Goldluecke, 2012]. Their approach yielded high quality, dense
depth maps, but required significant computation time, easily over four hours for a single light
field, which may not be practical for online robotics applications. Recently, Strecke et al. devel-
oped a method to jointly estimate depth and normal maps from a 4D light field on Lambertian
surfaces using focal stacks generated by a single light field [Strecke et al., 2017]. However,
none of these methods considered refractive objects.
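The slope-depth relationship these EPI methods exploit can be illustrated on a synthetic example. This is our own sketch: fB, the sampling, and the Gaussian line profile are assumed values, and the slope is recovered from the gradient structure tensor of the EPI.

```python
import numpy as np

# A Lambertian point at depth Z traces a straight line across a 2D
# epipolar-plane image (EPI); the line's slope is proportional to f*B/Z,
# so estimating the slope recovers depth.
fB = 0.1                        # focal length x view spacing (assumed units)
Z_true = 0.5
m = fB / Z_true                 # EPI slope: u-shift per unit viewpoint shift
s = np.arange(9.0)              # 9 viewpoints
u = np.linspace(0.0, 20.0, 201)
S, U = np.meshgrid(s, u, indexing="ij")
epi = np.exp(-((U - 10.0 - m * S) ** 2) / (2.0 * 1.0 ** 2))

# Structure-tensor slope estimate: for I(s, u) = g(u - m*s) we have
# dI/ds = -m * dI/du, so m = -<Is*Iu> / <Iu*Iu>. Interior samples only,
# to avoid the one-sided boundary differences of np.gradient.
Is, Iu = np.gradient(epi, s, u)
inner = (slice(1, -1), slice(1, -1))
m_est = -(Is[inner] * Iu[inner]).sum() / (Iu[inner] ** 2).sum()
Z_est = fB / m_est
```

A refracted feature breaks the single-line model: its EPI trace curves or splits, so no single slope (and hence no single depth) explains it, which is the situation the methods above do not handle.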
Wanner et al. were the first to recover the shape of planar specular and transparent surfaces
from an LF [Wanner and Goldluecke, 2013]. They assumed that the observed light was a linear
combination of light from the real surface and from the reflected or refracted image. The epipolar-
plane image can then be described as a superposition of two lines whose slopes are related to depth. Both
depths were determined and used to separate the scene into a layer closer to the camera and a
layer farther from the camera. However, Wanner’s method was limited to single reflection cases
and planar reflective or transparent surfaces. Our interest is in interacting with more general
object shapes.
Furthering the work on slightly more general shapes, Wetzstein et al. reconstructed the shape
of transparent surfaces based on the distortion of the light field’s background light paths [Wet-
zstein et al., 2011]. Their method relied on a light-field probe that consisted of a lenslet array in
front of a monitor to encode two dimensions in position and two dimensions in direction. Thus
a monocular camera could measure a 4D LF in a single 2D image. The thin refractive object
was placed between the probe and the monocular camera. Since the starts of the light paths
were known by calibration, the difference between the incoming and exiting angles θi and θo was
computed, assuming known refractive indices of the two media. The surface normals were sub-
sequently determined. However, this approach relied on placing the light-field probe behind the
object while photographing its front, and it only applied to thin objects. Thus the general place-
ment of refractive objects in cluttered scenes and mobile applications would be problematic for
this approach.
Recently, Ideguchi et al. proposed an interesting approach to transparent shape estimation based
on comparing the different disparities between sub-images for a given visual feature in the light
field, which they called light-field convergency [Ideguchi et al., 2017]. It is known that as a
visual feature approaches an occluding edge of a refractive object in a light field, it appears in-
creasingly Lambertian. A deeper analysis of their approach suggests that their method performs
an approximate Lambertian depth estimate similar to focus stacking and then fills in inconsis-
tent or missing depths using traditional hole-filling methods that assume smooth surfaces. This
approximation is only valid near occluding edges of refractive objects; thus, their method was
unable to handle thick and wide shapes, such as spheres.
Overall, the bulk of shape-estimation techniques using LF cameras has focused on Lambertian
scenes, leaving the topic of refractive objects little explored. Those works that have
addressed refractive objects have been limited in terms of the types of objects they apply to, or
require bulky equipment that is not practical for mobile robots.
3.4 Summary
In summary, we have reviewed the topics that have been explored in the realm of features and visual
servoing in the context of LF cameras and refractive objects. Our motivation is to enable visual
control around refractive objects using LF cameras.
Most image features in robotic vision have been limited to 2D and 3D and rely heavily on the
Lambertian assumption. Recent 4D LF-specific features have been proposed, but still predomi-
nantly only consider Lambertian or occluded scenes. LF features in relation to refractive objects
are not yet well explored.
For visual servoing, PBVS methods appear to be impractical because they require a model of the
refractive object. Various IBVS methods have been developed, but the focus has been largely
on Lambertian scenes. To the best of our knowledge, IBVS in the context of refractive objects
or LF cameras remains unexplored.
Finally, model-based solutions for refractive object detection have been explored; however, 3D
geometric models of refractive objects are time-consuming and difficult to obtain accurately or
simply not available. Thus there is interest in approaches that do not require models. Image-
based detection methods are so far limited in their application, unreliable for changes in viewing
pose, or incomplete in describing a refracted feature’s behaviour in the light field. Additionally,
most solutions require bulky equipment that is impractical for mobile robotic platforms, while
others rely on assumptions that significantly narrow their application window. Clearly there is
a gap for methods that are compact and apply to a wide variety of object shapes.
Chapter 4
Light-Field Image-Based Visual
Servoing
In the background section, we introduced LF cameras and saw that they were good for capturing
scene texture, depth and view-dependent lighting effects, such as occlusion, specular reflection
and refraction. In the following chapters, we will elaborate on how we will use them to reliably
perceive refractive objects and servo towards them for grasping and manipulation. However,
the first practical issue that must be addressed is how to actually perform visual servoing (VS)
with an LF camera in Lambertian scenes. This chapter focuses on how to directly control
robot motion using observations from an LF camera via image-based visual servoing (IBVS)
for Lambertian scenes. This work was published in [Tsai et al., 2017].
4.1 Light-Field Cameras for Visual Servoing
VS is a robot control technique that makes direct use of visual information by placing the camera
in the control loop. VS is widely applicable and generally robust to errors in camera calibration,
robot calibration and image measurement [Hutchinson et al., 1996, Chaumette, 1998, Cai et al.,
2013]. Most VS techniques fall into one of two categories. Position-based visual servoing
(PBVS) uses observed image features and a geometric object model to estimate the camera-
object relative pose and adjust the camera pose accordingly; however, geometric object models
are not always available. In contrast, image-based visual servoing (IBVS) uses observed image
features and a reference image, from which a set of reference image features are extracted, to
directly estimate the required rate of change of camera pose, which does not necessarily require
a geometric model.
However, most IBVS algorithms are focused on conventional monocular cameras that inher-
ently suffer from lack of depth information, provide limited observations of small or distant
targets relative to the camera’s FOV, and struggle with occlusions, specular highlights and re-
fractive objects. LF cameras offer a potential solution to these problems. As a first step in
exploring LF for IBVS, this chapter considers the multiple views and depth information im-
plicit in the LF structure. To the best of our knowledge, light-field image-based visual servoing
(LF-IBVS) has not yet been proposed.
The main contributions of this chapter are as follows:
• We provide the first derivation, implementation and experimental validation of LF-IBVS.
• We derive image Jacobians for the LF.
• We define an appropriate compact representation for LF features that is close to the form
measured directly by LF cameras.
• In addition, we take a step towards truly 4D plenoptic feature extraction by enforcing LF
geometry in feature detection and correspondence.
We assume a Lambertian scene and sufficient scene texture for classical 2D image features,
such as SIFT and SURF. We validate our proposed method for LF-IBVS using both a simu-
lated camera array and a custom LF camera adapter, shown in Fig. 4.1, which we refer to as
Figure 4.1: (a) MirrorCam mounted on the Kinova MICO robot manipulator. Nine mirrors
of different shape and orientation reflect the scene into the upwards-facing camera to create 9
virtual cameras, which provides video frame-rate LFs. (b) A whole image captured by the Mir-
rorCam and (c) the same decoded into a light-field parameterisation of 9 sub-images, visualized
as a 2D tiling of 2D images. The non-rectangular sub-images allow for greater FOV overlap.
MirrorCam, mounted on a robot manipulator. We describe MirrorCam in detail in Appendix A.
Finally, we show that LF-IBVS outperforms conventional monocular and stereo IBVS for ob-
jects occupying the same FOV and in the presence of occlusions.
The remainder of this chapter is organized as follows. Section 4.2 discusses the related work,
formulates the VS problem and explains the LF parameterisation. Section 4.4 explains the
derivations for LF image Jacobians, features, correspondence and the control system. Sec-
tion 4.5 describes our experimental setup with the MirrorCam. Section 4.6 shows our results,
and provides a comparison to conventional monocular and stereo IBVS. Lastly, in Section 4.7,
we conclude the chapter and explore future work for LF-IBVS.
4.2 Related Work
LF cameras offer extra capabilities for robotic vision. Table 4.1 compares conventional and LF
camera systems for different capabilities and tolerances related to VS, given similar configu-
rations, such as sensor size and number of pixels. Notably, stereo provides depth for a single
baseline along a single direction (typically horizontally), but multi-camera and LF systems pro-
vide more detailed depth information. They can have both small and long baselines, and baselines
in multiple directions (typically vertical and horizontal). LF cameras have an advantage over
conventional multi-camera systems in tolerating occlusions and specular reflections (or, more
generally, non-Lambertian surfaces). This is largely due to their regular sampling, and because
only LF cameras natively capture refraction, transparency and specular reflections. As such, LF
cameras can benefit from methods that exploit these capabilities [Dansereau,
2014].
Table 4.1: Comparison of camera systems' capabilities and tolerances for VS

System                | Perspectives | Field of View | Baseline | Baseline Direction | Aperture Problem | Occlusion Tolerance | Specular Tolerance
Conventional cameras
  Mono                | 1            | wide          | zero     | none               | significant      | no                  | no
  Stereo              | 2            | wide          | wide     | single             | moderate         | weak                | no
  Trinocular          | 3            | wide          | wide     | two                | moderate         | moderate            | no
  Multiple cameras    | n            | wide          | wide     | multiple           | minor            | moderate            | no
Light-field cameras
  Array               | n^2          | wide          | wide     | multiple           | minor            | strong              | yes
  MLA (a)             | n^2          | wide          | narrow   | multiple           | minor            | moderate            | yes
  MirrorCam (b)       | n^2          | narrow        | wide     | multiple           | minor            | strong              | yes

(a) Based on n^2 pixels per lenslet. (b) Based on n^2 mirrors.
Johannsen et al. recently applied LFs in structure from motion [Johannsen et al., 2015]. They
derived a linear relationship using the LF to solve the correspondence problem and compute a
3D point cloud. They achieved an increase in accuracy and robustness, although their 3D-3D
approach did not take full advantage of the 4D LF. Dong et al. focused on Simultaneous Lo-
calization and Mapping (SLAM), and demonstrated that an optimally-designed low-resolution
LF camera allowed them to develop a SLAM implementation that is more computationally
efficient, and more accurate than SLAM for a single high-resolution camera [Dong et al.,
2013]. Dansereau et al. derived “plenoptic flow” for closed-form, computationally efficient
visual odometry with a fixed operation time regardless of scene complexity [Dansereau et al.,
2011]. Zeller et al. extended Dansereau's plenoptic flow to narrow-FOV visual odometry and
showed how LF cameras can enable SLAM for narrow FOV systems, where monocular SLAM
normally fails [Zeller et al., 2015]. That work also showed that using LF cameras with their
visual odometry method improved the depth estimation error by an order of magnitude. Re-
cently, Walter et al. used LF cameras to analyse specular reflection and detect features specific
to specular reflections, which enabled robots to interact with glossy objects, and outperform
their stereo counterparts [Walter et al., 2015]. These works motivate the application of LFs to
robotics and to LF-IBVS.
4.3 Lambertian Light-Field Feature
Recall from Section 2.7 that the rays emanating from a point in space, cP = [Px, Py, Pz]T, follow
a pair of linear relationships [Bolles et al., 1987, Dansereau and Bruton, 2007], as shown in
Figs. 2.21 and 2.22,

$$\begin{bmatrix} u \\ v \end{bmatrix} = \frac{D}{P_z}\begin{bmatrix} P_x - s \\ P_y - t \end{bmatrix}, \qquad (4.1)$$

where each equation describes a hyperplane in 4D, F(s, t, u, v) ∈ R3, and their intersection
describes a plane L(s, t, u, v) ∈ R2.
We define our LF feature with respect to the central view of the LF as W = [u0, v0, w]T, where
(u0, v0) is the direction of the ray entering the central view of the LF, i.e.

$$\begin{bmatrix} u_0 \\ v_0 \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix}\Bigg|_{s,t=0} = \frac{D}{P_z}\begin{bmatrix} P_x \\ P_y \end{bmatrix}. \qquad (4.2)$$
As discussed in Section 2.7.4, the slope w relates the image-plane coordinates of all rays
emanating from a point in the scene. Fig. 2.21 shows the geometry of the LF for a single view
of cP. As the viewpoint changes, that is, as s and t change, the image-plane coordinates vary
linearly according to (4.1), as in Fig. 2.22. The slope w of this line comes directly from (4.1),
and is given by

$$w = -\frac{D}{P_z}, \qquad (4.3)$$
noting that this slope is identical in the s, u and t, v planes. We exploit this aspect of the LF in
the feature matching and correspondence process, described in Section 4.5.1. By working with
slope, akin to disparity from stereo algorithms, we deal more closely with the structure of the
LF.
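The feature W = [u0, v0, w] of (4.1)–(4.3) can be sketched in a few lines. This is an illustrative sketch, not the thesis code; the function names and numbers are ours.

```python
import numpy as np

def lf_feature(P, D):
    """Feature W = (u0, v0, w) of a 3D point P in the central-view frame,
    for reference-plane separation D (Eqs. 4.2 and 4.3)."""
    Px, Py, Pz = P
    u0 = D * Px / Pz      # central-view ray direction
    v0 = D * Py / Pz
    w = -D / Pz           # slope, identical in the s,u and t,v planes
    return u0, v0, w

def project(P, s, t, D):
    """Eq. (4.1): image-plane coordinates of P seen from view (s, t)."""
    Px, Py, Pz = P
    return D * (Px - s) / Pz, D * (Py - t) / Pz

# The slope predicts how the image-plane coordinates shift with viewpoint:
# from (4.1), u(s) = u0 + w*s, so the prediction matches direct projection.
P, D = (0.1, -0.05, 0.8), 0.01
u0, v0, w = lf_feature(P, D)
u_s, _ = project(P, 0.02, 0.0, D)
assert abs((u0 + w * 0.02) - u_s) < 1e-12
```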
Our LF feature is similar to the Augmented Image Space of [Jang et al., 1991] for perspective
images where the image plane coordinates are augmented with Cartesian depth. Also similar are
the plenoptic disk features developed for the calibration of lenslet-based LF cameras in [O’Brien
et al., 2018]. In plenoptic disk features, image feature coordinates are augmented with the radius
of the plenoptic disk, which is related (by similar triangles) to a Lambertian point’s depth.
4.4 Light-Field Image-Based Visual Servoing
In this section, we derive the image Jacobians for our LF feature, which are used for image-
based visual servoing. Image Jacobians relate image feature velocity (in image space) to camera
velocity in translation and rotation. We first consider the continuous-domain, where s, t, u, v are
distances. Then we consider the discrete-domain, where i, j and k, l are discrete versions of s, t
and u, v, and typically correspond to different views and pixels, respectively.
4.4.1 Continuous-domain Image Jacobian
Following the derivation for conventional IBVS, we wish to relate the camera’s velocity to the
resulting change in an observed feature W through a continuous-domain image Jacobian JC
$$\dot{W} = J_C\,\nu, \qquad (4.4)$$
where ν = [v;ω] ∈ R6 is the camera spatial velocity in the camera reference frame. ν is the
concatenation of the camera’s translational velocity v = [vx, vy, vz]T and rotational velocity
ω = [ωx, ωy, ωz]T in the camera reference frame.
Differentiation of (4.2) and (4.3) yields

$$\dot{u}_0 = D(\dot{P}_x P_z - P_x\dot{P}_z)/P_z^2, \qquad (4.5)$$
$$\dot{v}_0 = D(\dot{P}_y P_z - P_y\dot{P}_z)/P_z^2, \qquad (4.6)$$
$$\dot{w} = D\dot{P}_z/P_z^2, \qquad (4.7)$$

where the dotted quantities are the feature velocities with respect to the central camera frame.
We can write the apparent motion of a 3D point as

$${}^c\dot{P} = -(\omega \times {}^cP) - v, \qquad (4.8)$$

yielding the three components of the point velocity expressed in terms of cP and ν. Substituting
these expressions into (4.5)–(4.7) allows us to factor out the continuous-domain Jacobian
$$J_C = \begin{bmatrix}
w & 0 & -\dfrac{w u_0}{D} & \dfrac{u_0 v_0}{D} & -D - \dfrac{u_0^2}{D} & v_0 \\
0 & w & -\dfrac{w v_0}{D} & D + \dfrac{v_0^2}{D} & -\dfrac{u_0 v_0}{D} & -u_0 \\
0 & 0 & -\dfrac{w^2}{D} & \dfrac{w v_0}{D} & -\dfrac{w u_0}{D} & 0
\end{bmatrix}. \qquad (4.9)$$
While conventional image Jacobians require an estimate of depth, we note that JC instead has
slope w—an inverse measure of depth, which we can observe directly in the LF. The slope w
is explicit in all columns of (4.9) except the last one, because the LF camera array spans both
the x- and y-axes, and can therefore only observe motion parallax with respect to the camera’s
x- and y-axes. The optical flow for the final column is due to rotation about the optical axis,
and is therefore invariant to depth. In contrast, depth is not explicit in the monocular image
Jacobian for rotations about the x- and y-axes. Trinocular and multi-camera system image
Jacobians would have similar depth dependencies to JC . Multiple views make parallax, and
thus depth, observable in rotations about the x- and y-axes for the LF camera array. We note
that the derivation for JC is for the central view of the LF camera array. Jacobians derived for
the off-axis in-plane views would contain elements of slope in the last column. Additionally, JC
has a rank of three, which implies that the stacked image Jacobian (as in (3.5)) will be full rank
with a minimum of two points for LF-IBVS, in contrast to a minimum of three image points for
monocular IBVS (M-IBVS).
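The continuous-domain Jacobian above can be sketched directly; this is an illustrative sketch (function name and numbers are ours, not the thesis code), which also checks the rank-three property and the depth-invariance of the final column noted in the text.

```python
import numpy as np

# Hedged sketch of the continuous-domain image Jacobian J_C of Eq. (4.9).
# Rows are the derivatives of (u0, v0, w); columns correspond to
# [vx, vy, vz, wx, wy, wz].
def jacobian_continuous(u0, v0, w, D):
    return np.array([
        [w,   0.0, -w*u0/D,  u0*v0/D,      -(D + u0**2/D),  v0],
        [0.0, w,   -w*v0/D,  D + v0**2/D,  -u0*v0/D,       -u0],
        [0.0, 0.0, -w**2/D,  w*v0/D,       -w*u0/D,         0.0],
    ])

D = 0.01                                   # reference-plane separation [m]
J = jacobian_continuous(0.002, 0.001, -0.0125, D)
assert J.shape == (3, 6)
assert np.linalg.matrix_rank(J) == 3       # one feature constrains 3 DOF
assert J[2, 5] == 0.0                      # z-rotation leaves slope unchanged
```

The zero in the last entry of the third row reflects the observation above: optical flow due to rotation about the optical axis is invariant to depth, so the slope does not change.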
4.4.2 Discrete-domain Image Jacobian
In the discrete domain, we define i, j and k, l as the discrete versions of s, t and u, v, in units
of "views" and pixels, respectively. We observe our discrete-domain feature M as the discrete
position and slope M = [k0, l0, mx, my]T, where [k0, l0] are observations taken from the central
view in i, j, with separate slopes mx in the i, k dimensions and my in j, l. The general plenoptic
camera is described by an intrinsic matrix H relating a homogeneous ray φ = [s, t, u, v, 1]T to
the corresponding sample in the LF, n = [i, j, k, l, 1]T, as in

$$\phi = H\,n, \qquad (4.10)$$
where in general H is of the form
$$H = \begin{bmatrix}
h_{11} & 0 & h_{13} & 0 & h_{15} \\
0 & h_{22} & 0 & h_{24} & h_{25} \\
h_{31} & 0 & h_{33} & 0 & h_{35} \\
0 & h_{42} & 0 & h_{44} & h_{45} \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}, \qquad (4.11)$$
and the matrix H is found through plenoptic camera calibration [Dansereau et al., 2013].
However, we limit our development to the case of a rectified camera array, for which only
diagonal entries and the final column are nonzero [Dansereau, 2014]. In this case h11 and h22
are the horizontal and vertical camera array spacing, in meters, and h33 and h44 are given by
D/fx and D/fy, i.e. the inverse of the horizontal and vertical focal lengths of the cameras,
expressed in pixels, scaled by the reference plane separation. The final column encodes the
centre of the LF, e.g. for Nk samples in k, h15 = -h11(Nk/2 + 1/2) and k = Nk/2 + 1/2 is the
centre sample in k. We also note that mx and my encode the same information following the
relationship
$$m_x = \frac{h_{11} h_{44}}{h_{22} h_{33}}\, m_y. \qquad (4.12)$$
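The rectified-array intrinsic matrix just described can be sketched as follows; the numbers are invented for illustration, not a calibrated camera, and the function name is ours.

```python
import numpy as np

# Sketch of the rectified-array plenoptic intrinsic matrix: only diagonal
# entries and the final column are nonzero. h11, h22 are the camera spacing
# [m]; h33, h44 are D over the focal length in pixels; the final column
# centres the LF, e.g. h15 = -h11 (N_k/2 + 1/2).
def rectified_H(spacing, D, f_pix, N_views, N_pix):
    h11 = h22 = spacing
    h33 = h44 = D / f_pix
    H = np.diag([h11, h22, h33, h44, 1.0])
    H[0, 4] = -h11 * (N_views / 2 + 0.5)
    H[1, 4] = -h22 * (N_views / 2 + 0.5)
    H[2, 4] = -h33 * (N_pix / 2 + 0.5)
    H[3, 4] = -h44 * (N_pix / 2 + 0.5)
    return H

H = rectified_H(spacing=0.02, D=0.01, f_pix=800, N_views=3, N_pix=400)
# A sample n = [i, j, k, l, 1] maps to a ray phi = H n as in Eq. (4.10);
# the centre sample maps to the central ray (s, t, u, v) = 0.
n_centre = np.array([2.0, 2.0, 200.5, 200.5, 1.0])
phi = H @ n_centre
assert np.allclose(phi[:4], 0.0)
assert phi[4] == 1.0
```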
We wish to express the image Jacobian of (4.4) in the discrete domain,

$$\dot{M} = [\dot{k}_0, \dot{l}_0, \dot{m}_x]^T = J_D\,\nu, \qquad (4.13)$$

where the observation is expressed relative to the LF centre, k0 ← k0 + h35/h33, l0 ← l0 + h45/h44.
From (4.10), we can relate the discrete and continuous-domain observations as

$$u_0 = h_{33} k_0, \qquad v_0 = h_{44} l_0, \qquad w = \frac{h_{33}}{h_{11}} m_x = \frac{h_{44}}{h_{22}} m_y, \qquad (4.14)$$

from which it is trivial to express the derivatives of the discrete observation in terms of the
continuous variables:

$$\dot{k}_0 = h_{33}^{-1}\dot{u}_0, \qquad \dot{l}_0 = h_{44}^{-1}\dot{v}_0, \qquad \dot{m}_x = \frac{h_{11}}{h_{33}}\dot{w}, \qquad \dot{m}_y = \frac{h_{22}}{h_{44}}\dot{w}. \qquad (4.15)$$
Substituting the continuous-domain derivatives of (4.4) and (4.9), together with the
discrete/continuous relationships (4.14), into (4.15) allows us to factor out the discrete-domain
Jacobian

$$J_D = \begin{bmatrix}
\dfrac{m_x}{h_{11}} & 0 & -\dfrac{h_{33} k_0 m_x}{h_{11} D} & \dfrac{h_{44} k_0 l_0}{D} & -\dfrac{h_{33} k_0^2}{D} - \dfrac{D}{h_{33}} & \dfrac{h_{44}}{h_{33}} l_0 \\
0 & \dfrac{m_y}{h_{22}} & -\dfrac{h_{44} l_0 m_y}{h_{22} D} & \dfrac{h_{44} l_0^2}{D} + \dfrac{D}{h_{44}} & -\dfrac{h_{33} k_0 l_0}{D} & -\dfrac{h_{33}}{h_{44}} k_0 \\
0 & 0 & -\dfrac{h_{33} m_x^2}{h_{11} D} & \dfrac{h_{44} l_0 m_x}{D} & -\dfrac{h_{33} k_0 m_x}{D} & 0
\end{bmatrix}. \qquad (4.16)$$
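The discrete-domain Jacobian of Eq. (4.16), stacked over several features, yields the velocity command ν = λ J⁺ (f* − f). A hedged sketch follows; the calibration constants and feature values are invented for illustration, and the function name is ours.

```python
import numpy as np

# Sketch of the discrete-domain Jacobian J_D of Eq. (4.16) for a rectified
# camera array, with features M = [k0, l0, mx] (my carries the same
# information as mx for a rectified array).
def jacobian_discrete(k0, l0, mx, my, D, h11, h22, h33, h44):
    return np.array([
        [mx/h11, 0, -h33*k0*mx/(h11*D), h44*k0*l0/D, -(h33*k0**2/D + D/h33), h44*l0/h33],
        [0, my/h22, -h44*l0*my/(h22*D), h44*l0**2/D + D/h44, -h33*k0*l0/D, -h33*k0/h44],
        [0, 0, -h33*mx**2/(h11*D), h44*l0*mx/D, -h33*k0*mx/D, 0],
    ])

# Illustrative calibration: 2 cm camera spacing, D = 1 cm, f = 800 px.
h11 = h22 = 0.02
h33 = h44 = 1.25e-5
D = 0.01

# Two observed features (k0, l0, mx, my) and their goal values.
features = [(40.0, -25.0, -0.4, -0.4), (-60.0, 10.0, -0.25, -0.25)]
goals    = [(35.0, -20.0, -0.38, -0.38), (-55.0, 12.0, -0.24, -0.24)]

# Stack the per-feature Jacobians and compute the velocity command.
J = np.vstack([jacobian_discrete(*f, D, h11, h22, h33, h44) for f in features])
err = np.hstack([np.subtract(g[:3], f[:3]) for f, g in zip(features, goals)])
nu = 0.1 * np.linalg.pinv(J) @ err   # camera velocity [vx vy vz wx wy wz]
assert nu.shape == (6,)
```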
4.5 Implementation & Experimental Setup
In this section, we discuss the implementation details of our LF-IBVS approach, including
how we exploit the LF structure for feature matching and correspondence. We then validate our
proposed derivation of LF-IBVS using closed loop control and the experimental setup described
below.
4.5.1 Light-Field Features
To our knowledge all prior work on LF features operates by applying 2D feature detectors to
2D slices in the u, v dimensions [Johannsen et al., 2015]. In this chapter, we do the same. Our
implementation employs Speeded-Up Robust Features (SURF) [Bay et al., 2008], though the
proposed method is agnostic to feature type. However, as a first step towards truly 4D features,
we augment the 2D feature location with the local light-field slope, implicitly encoding depth.
Operating on 2D slices of the LF, feature matches are found between the central view and
all other sub-images. Each pair of matched 2D features is treated as a potential 4D feature. A
single feature pair yields a slope estimate, which defines an expected feature location in all other
sub-images. We introduce a tunable constant that determines the maximum distance between
observed and expected feature locations, in pixels, and reject all matches exceeding this limit.
We also reject features that break the point-plane correspondence discussed in Section 4.3. By
selecting only features that adhere to the planar relationship (4.1), we can remove spurious and
inconsistent detections.
A second constant NMIN imposes the minimum number of sub-images in which feature matches
must be found. In the absence of occlusions, this can be set to require feature matches in all
sub-images. Any feature that is below the maximum distance criterion in at least NMIN images is
accepted as a 4D feature, and a mean slope estimate is formed based on all passing sub-images.
NMIN was set to 4 out of 8 sub-image matches for our experiments.
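The slope-consistency filtering described above can be sketched as follows. This is a hedged illustration of the geometric check only (the 2D detector is abstracted away, and the function name, constants and example values are ours, not the thesis code).

```python
import numpy as np

MAX_DIST = 1.5   # tunable constant: max distance from predicted location [px]
N_MIN = 4        # minimum number of sub-images with a consistent match

def filter_lf_feature(kl0, matches):
    """kl0: (k, l) of the feature in the central view.
    matches: dict mapping view offset (i, j) -> detected (k, l).
    Returns (k, l, mx) with the mean consistent slope, or None."""
    # A single matched pair yields a slope estimate (identical in i,k and j,l)
    (i, j), (k, l) = next(iter(matches.items()))
    m = (k - kl0[0]) / i if i != 0 else (l - kl0[1]) / j
    good = []
    for (i, j), (k, l) in matches.items():
        # The slope predicts the feature location in every other sub-image.
        predicted = (kl0[0] + m * i, kl0[1] + m * j)
        if np.hypot(k - predicted[0], l - predicted[1]) <= MAX_DIST:
            good.append((k - kl0[0]) / i if i != 0 else (l - kl0[1]) / j)
    if len(good) < N_MIN:
        return None
    return (kl0[0], kl0[1], float(np.mean(good)))

# Four consistent views (slope -2 px/view) and one spurious match:
matches = {(1, 0): (-2.0, 0.0), (2, 0): (-4.0, 0.0),
           (0, 1): (0.0, -2.0), (0, 2): (0.0, -4.0), (-1, 0): (9.0, 0.0)}
feat = filter_lf_feature((0.0, 0.0), matches)
assert feat is not None and abs(feat[2] + 2.0) < 1e-9  # outlier rejected
```

Rejecting matches that violate the planar relationship (4.1) in this way removes spurious and inconsistent detections while averaging the slope over the surviving sub-images.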
Feature matching between two LFs again starts with conventional 2D methods. A conventional
2D feature match finds putative correspondences between the central sub-images of the two
LFs. Outlier rejection is performed using the M-estimator Sample Consensus algorithm [Torr
and Zisserman, 2000].
4.5.2 Mirror-Based Light-Field Camera Adapter
There is a scarcity of commercially available LF cameras appropriate for robotics applications.
Notably, no commercial camera delivers 4D LFs at video frame rates. Therefore, we constructed
our own LF video camera, the MirrorCam, by employing a mirror-based adapter based on pre-
vious work [Fuchs et al., 2013,Song et al., 2015]. The MirrorCam is depicted in Fig. 4.1a. The
MirrorCam design, optimisation, construction, calibration, and image decoding processes are
described in the Appendix A [Tsai et al., 2016]. This approach splits the camera’s field of view
into sub-images using an array of planar mirrors, as shown in Fig. 4.1c. By appropriately posi-
tioning the mirrors, a grid of virtual views with overlapping fields of view can be constructed,
effectively capturing an LF. We 3D-printed the mount based on our optimization, and populated
it with laser-cut acrylic mirrors. Note that the LF-IBVS method described in this chapter does
not rely on this particular LF camera design, and applies to 4D LFs in general.
4.5.3 Control Loop
The proposed LF-IBVS control loop is depicted in Fig. 4.2. Notably, this control loop is similar
to that of standard VS. Goal light-field features f ∗ ∈ R3 are compared to observed light-field
features f ∈ R3 to produce a light-field feature error. The camera spatial velocity ν can then be
calculated as in (3.6) by multiplying the light-field feature error with the pseudo-inverse of the
stacked image Jacobians and then multiplying it by a gain λ.
Velocity control is formulated in (3.6). We assume infinitesimal motion to convert ν into a
homogeneous transform cT that we use to update the camera’s pose. A motion controller moves
the robot arm. After finishing the motion, a new light field is taken and the feedback loop repeats
until the light-field feature error converges to zero.
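The loop just described can be sketched for a single continuous-domain feature. This is our simplification for illustration, restricted to the three translational degrees of freedom, with the "camera" replaced by a direct observation of the feature W = [u0, v0, w] of a fixed point; it is not the thesis implementation.

```python
import numpy as np

def observe(P_cam, D):
    """Feature W = [u0, v0, w] of a point expressed in the camera frame."""
    x, y, z = P_cam
    return np.array([D * x / z, D * y / z, -D / z])

def jacobian(u0, v0, w, D):
    """Translational columns of the continuous-domain Jacobian, Eq. (4.9)."""
    return np.array([[w, 0, -w * u0 / D],
                     [0, w, -w * v0 / D],
                     [0, 0, -w**2 / D]])

D, lam = 0.01, 0.5
P = np.array([0.05, -0.02, 0.6])               # point in the camera frame
goal = observe(np.array([0.0, 0.0, 0.5]), D)   # goal feature f*

for _ in range(100):
    f = observe(P, D)
    v = lam * np.linalg.pinv(jacobian(*f, D)) @ (goal - f)  # nu = lam J+ e
    P = P - v                                  # apparent point motion, dt = 1

assert np.linalg.norm(observe(P, D) - goal) < 1e-6  # error converges to zero
```

With the small gain, the feature error shrinks by roughly a factor (1 − λ) per iteration, mirroring the smooth convergence sought in the experiments.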
An important consideration in LF-IBVS is the feature representation, because the choice of
feature representation in IBVS influences the Cartesian motion of the camera [Mahony et al.,
2002]. We have the option of computing the 3D positions of the points obtained from the
LF; however, this would be very similar to PBVS. Instead, we chose to work closer to the native
LF representation, using the projected feature position augmented by slope. Doing so avoids
unnecessary computation, and is more numerically stable, as depth computation
involves inverting slope.
Figure 4.2: The control loop for the VS system. Goal features f* are given; f* and f are
compared, the pseudo-inverse Jacobian J+ is computed, and the camera velocity ν is determined
with gain λ and converted into a motion cT. A motion controller moves the robot arm. After
finishing the motion, a new image is taken and the feedback loop repeats until the image features
match.
We define the terminal condition for LF-IBVS as a threshold on the root mean square (RMS)
error between all of the observed LF features and the goal LF features. We combine all of M ,
and note that (u0, v0) are in meters, and (k0, l0) are in pixels, but the slope w is unit-less. This
issue can be addressed by weighting the components; however, for the discrete case, in practice
we found that mx and my had similar relative magnitudes. The relative magnitudes of the light-
field feature elements are important because they define the error term, which in turn drives
the system to minimise light-field feature error. Extremely large magnitudes for slope could
potentially place more emphasis on z-axis or depth-related camera motion than on x- or y-axis
camera motion. Additionally, we typically use a small λ of 0.1 in order to generate
a smooth trajectory towards the goal view.
For the robotic manipulator, we found that the manufacturer’s built-in inverse kinematics soft-
ware became unresponsive for small pose adjustments1. Therefore we implemented a resolved-
rate motion control method using a manipulator Jacobian to command camera spatial velocities
to desired joint velocities [Corke, 2013]. We also changed the proportional, integral and deriva-
tive controller gains for all joints to KP = 2.0, KI = 4.8, and KD = 0.0, respectively. With
these implementations, we achieved sufficient positional accuracy and resolution to demonstrate
LF-IBVS.
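The resolved-rate motion control step can be sketched as below; the manipulator Jacobian here is a random stand-in, not the MICO's, and the function name is ours [cf. Corke, 2013].

```python
import numpy as np

def resolved_rate_step(J_manip, nu, q, dt):
    """Map a desired camera spatial velocity nu to joint velocities via the
    pseudo-inverse manipulator Jacobian, then integrate one control step."""
    qdot = np.linalg.pinv(J_manip) @ nu
    return q + qdot * dt

rng = np.random.default_rng(0)
J_manip = rng.standard_normal((6, 6))     # stand-in 6-DOF manipulator Jacobian
nu = np.array([0.01, 0, 0, 0, 0, 0.02])   # commanded camera velocity
q = np.zeros(6)
q_next = resolved_rate_step(J_manip, nu, q, dt=0.05)

# The commanded Cartesian velocity is recovered by J qdot:
assert np.allclose(J_manip @ ((q_next - q) / 0.05), nu)
```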
4.6 Experimental Results
In this section, we present our experimental results from camera-array simulation and arm-mounted
experiments using a custom mirror-based light-field camera. First, we show LF camera
and light-field feature trajectories over the sequence of a typical visual servoing manoeuvre in
simulation. Second, we compare LF-IBVS to monocular and stereo IBVS in a typical unoc-
cluded scene. Finally, we compare the same three VS systems in an occluded scene.
1 Limits were determined experimentally and confirmed by the manufacturer.
4.6.1 Camera Array Simulation
In order to verify our LF-IBVS algorithm, we first simulated a 3 × 3 array of pinhole camera
models from the Machine Vision Toolbox [Corke, 2013]. Four planar world points in 3D were
projected into the image planes of the 9 cameras. A typical example of LF-IBVS is shown in
Fig. 4.3. For this example, a small gain λ = 0.1 was used to enforce small steps and produce
smooth plots as shown in Fig. 4.3a. The Cartesian positions and orientations relative to the goal
pose converge smoothly to zero, as shown in Fig. 4.3b. Similarly, the camera velocity profiles in
Fig. 4.3c converge to zero. Fig. 4.3d shows the image Jacobian condition number first increases,
and then decreases to a constant lower value, indicating that the Jacobian becomes worse and
then better conditioned, as the features move closer and then further apart, respectively. To-
gether, these figures show the system converges, indicating that LF-IBVS was successful in
simulation. Similar to conventional IBVS, a large λ results in a faster convergence, but a less
smooth trajectory.
Fig. 4.4a shows the view of the central camera, and the image feature paths as the camera array
servos to the goal view. We see that the image feature paths are almost straight due to the
linearisation of the Jacobian. Fig. 4.4b shows the trajectories of the top-left corner of the target
relative to the goal features, which also converge to zero. We note the slope profile matches the
inverse of the z-position profile in the top red line of Fig. 4.3b, as it encodes depth.
For large initial angular displacements, we note that, like regular IBVS, this formulation of
LF-IBVS exhibited the camera-retreat issue. Instead of taking the direct screw motion towards the
goal, the camera retreats backwards before moving forwards to reach the goal view.
In these situations, the Jacobian linearisation is no longer valid, since the image feature paths
are optimally curved, rather than linear. This poses a performance issue because in real systems,
such backwards manoeuvres may not be feasible; however, retreat can be addressed [Corke and
Hutchinson, 2001] by decoupling the translation components from the z-axis rotation into two
separate image Jacobians, and will be considered in future work.
Figure 4.3: Simulation of LF-IBVS, with (a) error (RMS of f − f ∗) decreasing over time,
(b) camera motion profiles relative to the goal pose, (c) Cartesian velocities, and (d) image
Jacobian’s condition number for λ = 0.1. Ideally, the condition number is low (or decreases
over time), which means the system is well-conditioned and therefore less sensitive to changes
or errors in the input. Error, relative pose and velocities all converge to zero.
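The condition number reported in panel (d) is the ratio of the Jacobian's largest to smallest singular value; a small sketch, with toy diagonal Jacobians whose values only echo the orders of magnitude seen in Figs. 4.3d and 4.6d:

```python
import numpy as np

def condition_number(J):
    """Ratio of largest to smallest singular value of the image Jacobian."""
    s = np.linalg.svd(J, compute_uv=False)
    return s[0] / s[-1]

# Toy Jacobians: the first echoes the ~50 range of Fig. 4.3d, the
# second the ~2500 range of Fig. 4.6d (values are illustrative only).
J_good = np.diag([5.0, 4.0, 3.0, 2.5, 2.0, 0.1])
J_bad = np.diag([5.0, 4.0, 3.0, 2.5, 2.0, 0.002])
print(condition_number(J_good))  # ~50
print(condition_number(J_bad))   # ~2500
```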
110 4.6. EXPERIMENTAL RESULTS
[Figure 4.4 plots: (a) feature paths in the u–v image plane [pix]; (b) position error (k0, l0) [pix] and slope error (mx) [pix/pix] versus time steps.]
Figure 4.4: Simulation showing (a) the view of the initial target points (blue), servoing along the image-plane feature paths (green) to the target goal (red), and (b) the feature trajectory profile of M −M ∗, corresponding to the top-left corner of the target, which converges to zero.
4.6.2 Arm-Mounted MirrorCam Experiments
We also validated LF-IBVS using the MirrorCam mounted to the end of a Kinova MICO arm
robot, shown in Fig. 4.1a. The robot arm and camera were controlled using the architecture
outlined in Fig. 4.2. We then performed two experiments each for monocular image-based visual servoing (M-IBVS), stereo image-based visual servoing (S-IBVS), and LF-IBVS. The first involved a typical approach manoeuvre in a Lambertian scene, to evaluate the nominal performance of our LF-IBVS system. The second added occlusions after the goal image/light field was captured, in order to explore the effect of occlusions on LF-IBVS.
4.6.2.1 Lambertian Scene Experiment
We first tested the MirrorCam on a scene similar to Fig. 4.1b, with complex motion involving
all 6 DOF from the initial pose in a Lambertian scene. In a typical VS sequence, we move the
robot to a goal pose, record the camera pose and goal features, then move the robot to an initial
pose and use the features to servo back to the goal.
Fig. 4.5 shows the performance of our LF-IBVS algorithm for the scene with λ = 0.15. Fig. 4.5a
shows the error decreasing over time as the camera approaches the goal view, and converges
after 20 time steps. We attribute the non-zero error to the arm’s limited performance, which we
address at the end of this section. Fig. 4.5b shows the relative pose of the camera to the goal in
the camera frame converging smoothly to zero. Note that the goal pose is never the objective of
LF-IBVS; rather, the image features captured at the goal pose drive LF-IBVS. Fig. 4.5c shows
the commanded camera velocities also converge to zero. Fig. 4.5d shows the condition number
for the image Jacobian, which decreases slightly as the system converges. We also note that the system converged despite only an approximate camera-to-end-effector calibration, which suggests robustness against modelling errors.
We compared LF-IBVS against conventional M-IBVS and S-IBVS. From the MirrorCam sub-images in Fig. 4.1c, we took the view through the central mirror for M-IBVS, and the two views horizontally adjacent to the centre for S-IBVS. This maintained the same FOV and pixel resolution. Implementations were based on [Corke, 2013, Chaumette and Hutchinson, 2006]. The average scene depth was provided for M-IBVS and S-IBVS to compute the Jacobian, although we note that depth (or disparity) can be measured directly from stereo. All three IBVS methods were tested ten times on the same goal scene and initial pose.
A typical case for S-IBVS is shown in Fig. 4.6. The image feature error does not decrease uniformly at the start, but eventually converges after 25 time steps. The camera moves erratically in the x- and y-axes at the start, yet still converges to the goal pose, as seen in the relative pose trajectories and camera velocities in Figs. 4.6b and 4.6c. This is probably not because λ was too high for S-IBVS; smaller gains were tested, but yielded the same poor performance.
Instead, we observe that the S-IBVS Jacobian condition number in Fig. 4.6d was an order of magnitude higher than that of LF-IBVS, producing an almost rank-deficient Jacobian; such a Jacobian yields inaccurate spatial velocities and hence erratic motion.
Figure 4.5: Experimental results of LF-IBVS with MirrorCam on the robot arm, illustrating
(a) the error (RMS of M − M ∗) that converges after 20 time steps, (b) the camera motion
profiles relative to the goal, which converge to zero, (c) the camera velocity profiles, which
converge to zero, and (d) the image Jacobian condition number. Referring also to Fig. 4.6,
we note that LF-IBVS outperforms S-IBVS; the motion profiles are much smoother, and the
velocities and condition numbers are an order of magnitude smaller than those from S-IBVS.
Figure 4.6: Experimental results of S-IBVS with narrow FOV sub-images from the MirrorCam,
on the robot arm, illustrating the performance in (a) the error (RMS of p − p∗) that eventually
converges after 25 time steps; however, the scale is almost double compared to Fig. 4.5.a, (b)
the camera motion profiles relative to the goal that show an erratic trajectory at the start, (c)
the camera velocity profiles that also vary greatly, and (d) the extremely large image Jacobian
condition number, indicating a potentially unstable system (it can exhibit very large changes in
camera velocity output for very small changes in image feature error).
We attribute this poor performance to the narrow FOV of the MirrorCam, which is approximately 20 degrees horizontally. A narrow FOV provides little of the perspective change required to differentiate rotation from translation, particularly about the x- and y-axes.
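Returning to the condition number: the sensitivity noted in the Fig. 4.6 caption can be reproduced in a toy example (not the experimental Jacobians) where one singular value is nearly zero:

```python
import numpy as np

# One nearly-zero singular value makes the pseudoinverse huge along that
# direction, so a tiny feature-error change swings the commanded velocity.
J = np.diag([1.0, 1e-3])          # toy 2x2 "image Jacobian"
J_pinv = np.linalg.pinv(J)

e1 = np.array([1.0, 0.01])
e2 = e1 + np.array([0.0, 0.01])   # a 0.01-unit change in feature error

dv = np.linalg.norm(J_pinv @ e2 - J_pinv @ e1)
print(dv)  # ~10: the velocity change is 1000x the input change
```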
During the experiments, M-IBVS exhibited much worse performance than S-IBVS, to the extent
that such erratic motion caused the robot to completely lose view of the goal scene within two
or three time steps. Therefore, M-IBVS velocity profiles are not shown in the results.
Equivalently, the projected scale of the object being servoed against affects the performance of IBVS: smaller or more distant objects yield poorly-conditioned image Jacobians. These observations are not new or surprising [Dong et al., 2013]. LF-IBVS outperformed both of our constrained implementations of M-IBVS and S-IBVS, converging with a smooth trajectory despite the narrow FOV of the MirrorCam. These improvements were likely due to the much lower Jacobian condition number of LF-IBVS, which we attribute to the LF camera providing the perspective change required to differentiate rotation from translation, unlike the stereo and monocular systems. The narrow-FOV constraint of the MirrorCam therefore generalises to other camera systems as targets that are small or distant relative to the camera, where increasing the FOV would not help the system converge to the target.
4.6.2.2 Occluded Scene Experiment
Experiments with occlusions were also conducted using a series of black wires to partially
occlude the scene. The setup is illustrated in Fig. 4.7 and 4.8. The goal, or reference image,
was captured without the occlusions at a specified goal pose. An example image is shown in
Fig. 4.8a. Next, the robot was moved to an initial pose, where the occlusions did not obscure the
scene. Then the robot was allowed to servo towards the goal, along a path where the occlusions
gradually obscured the goal view. The final goal image was partially occluded, as shown in
Fig. 4.8b. M-IBVS, S-IBVS and LF-IBVS were run using the same setup. With the partially
occluded views, M-IBVS and S-IBVS failed, whereas the LF-IBVS method servoed to the original goal pose.
Fig. 4.9 compares the number of features matched by LF-IBVS, M-IBVS, and S-IBVS in the
occlusion experiment. Without any occlusions, we note that all three methods have a similar
number of matched features at the goal view, although stereo and mono have slightly more
matches than LF-IBVS throughout the experiment. This is likely because all three methods used similar 2D feature detection methods; however, our LF-IBVS approach also rejected features that were inconsistent with LF geometry. In our experiment with occlusions, M-IBVS failed at time step 5, when it was unable to match sufficient features. Similarly, the performance of S-IBVS quickly degraded at time step 10, as the occlusions covered most of the left view and significant portions of the right view.
On the other hand, in the presence of occlusions, LF-IBVS had fewer matches than in the unoccluded case, but still matched a consistent and sufficient number of features throughout its trajectory to converge. It was therefore apparent that LF-IBVS could utilize the LF camera's multiple views and baseline directions to handle partial occlusions. To further illustrate this, consider a scene where a 3D point is occluded from one of the LF camera's sub-views, but still visible in at least one other sub-view. A single LF is captured; thus there is no physical camera motion. Conventional image feature matching would fail for stereo vision systems in this situation, because the 3D point is occluded from one of the two views, and therefore not viable for matching. However, our LF-camera-based method would still be able to perform matching using the other unoccluded views, provided a sufficient baseline. By setting a minimum number of views in which an image feature must be visible (NMIN), we make it harder for image features to be matched (thus there are fewer matches), but those that are matched are more consistent for motion estimation applications, such as visual servoing. Thus, our feature extension from 2D to 4D enables our method to better deal with the presence of occlusions. Trinocular camera systems may also benefit from the occlusion tolerance that we demonstrated in Fig. 4.9 (albeit with far less tolerance due to significantly fewer
[Figure 4.7 annotations: unoccluded initial view, camera trajectory, MirrorCam, field of view, scene features, partial occlusions, occluded goal view.]
Figure 4.7: Occlusion experimental setup, showing the initial view of the scene (red) with no occlusions, the camera trajectory along which the view gradually becomes more occluded, and convergence to the goal view with partial occlusions (green).
Figure 4.8: Occlusion experiments showing (a) the goal view with no occlusions from the
MirrorCam, and (b) the goal view, partially occluded by a box of black wires. The arm was able
to reach the partially-occluded goal view using LF-IBVS, but not M-IBVS or S-IBVS. Images
shown are flipped vertically.
views—three compared to n × n views, where n is typically three or greater), but would lack
tolerance to specular highlights and other non-Lambertian surfaces as discussed in Table 4.1.
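The NMIN test described above amounts to a simple per-feature filter on sub-view matches; a minimal sketch with made-up feature identifiers and view sets:

```python
# Occlusion-tolerant matching sketch: a feature is retained only if it is
# matched in at least N_MIN of the n x n sub-views of the light field.
# 'matches' maps a feature id to the set of sub-views it was found in;
# all data below are made up for illustration.
N_MIN = 4

matches = {
    "f1": {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)},  # widely visible
    "f2": {(1, 1), (2, 1)},                          # mostly occluded
    "f3": {(0, 2), (1, 2), (2, 2), (2, 1)},          # just enough views
}

kept = [fid for fid, views in matches.items() if len(views) >= N_MIN]
print(sorted(kept))  # ['f1', 'f3']
```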
4.7 Conclusions
In this chapter, we have proposed the first derivation, implementation, and validation of light-field image-based visual servoing. We have derived the image Jacobian for LF-IBVS based on an LF feature representation that is augmented by the local light-field slope. We have exploited the LF in our feature detection, correspondence, and matching processes. Using a basic VS control loop, we have shown in simulation and on a robotic platform that LF-IBVS is viable for controlling robot motion. Further research into alternative feature types and Jacobian decoupling strategies may address camera retreat and improve the performance of LF-IBVS.
Our unoptimized MATLAB implementation takes 5 seconds per frame; the decoding and correspondence processes are the current bottlenecks. Through optimization, real-time LF-IBVS should be possible.
Our experimental results demonstrate that LF-IBVS is more tolerant than monocular and stereo methods to narrow FOV constraints and partially-occluded scenes. Robotic applications operating in narrow, constrained and occluded environments, or those aimed at small or distant targets, would benefit from LF-IBVS; examples include household grasping, medical robotics, and in-orbit satellite servicing. In future work, we will investigate other LF camera systems, how to further exploit the 4D nature of the light-field features, and explore LF-IBVS in the context of refractive objects, where the method should benefit significantly from the light field.
118 4.7. CONCLUSIONS
Figure 4.9: Experimental results for the number of features matched over time with occlusions (dashed) and without (solid), for LF-IBVS (red), S-IBVS (blue), and M-IBVS (black). The monocular and stereo methods fail at time steps 5 and 10, respectively, but LF-IBVS maintains enough feature matches to converge to the goal pose, which demonstrates that LF-IBVS is more robust to occlusions.
Chapter 5
Distinguishing Refracted Image
Features with Application to Structure
from Motion
Robots for the real world will inevitably have to perceive, grasp and manipulate refractive objects. However, refractive objects are particularly challenging for robots because these objects are difficult to perceive—they are often transparent and their appearance is essentially a distorted view of the background, which can change significantly with respect to small changes in viewpoint. The amount of distortion depends on the scene geometry, as well as the shape and refractive indices of the objects involved. As the robot approaches the refractive object, the refracted background can move differently compared to the rest of the non-refracted scene. Intuitively, the key to detecting refractive objects is to understand and characterise the background distortion caused by the refractive object.
This chapter is concerned with discriminating image features that have been distorted by a refractive object—refracted image features—from the surrounding Lambertian features. This is because robots will need to reliably operate in scenes with refractive objects in a variety of applications. Unfortunately, refractive objects can cause many robotic vision algorithms, such as structure from motion (SfM), to become unreliable or even fail. This is because these algorithms assume a Lambertian world, and have no way of knowing not to use refracted image features when estimating structure and motion.
Outlier rejection methods such as RANSAC have been used to remove refracted image features (outliers with respect to the perceived relative motion of the robot) when the number of refracted image features is small relative to the number of Lambertian image features. However, there is a trade-off between computation and robustness when dealing with outlier-rich image feature sets1: more computation is required as image feature sets become increasingly outlier-rich. With limited computation, outlier rejection may return a sub-optimal inlier set, potentially leading to failure of the robotic vision system. Therefore, starting with a higher-quality set of image features for applications such as SfM is preferred, to reduce computation, power consumption and the probability of failure.
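The trade-off between computation and robustness can be quantified with the standard RANSAC iteration count N = log(1 − p)/log(1 − (1 − ε)^s), a textbook result not specific to our method, which grows rapidly with the outlier ratio ε:

```python
import math

# Iterations needed to draw at least one all-inlier minimal sample with
# confidence p, given outlier ratio eps and minimal sample size s
# (s = 8 for an eight-point fundamental-matrix fit).
def ransac_iterations(eps, p=0.99, s=8):
    w = (1.0 - eps) ** s  # probability a random sample is all inliers
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w))

for eps in (0.1, 0.3, 0.5, 0.7):
    print(eps, ransac_iterations(eps))  # 9, 78, 1177, then tens of thousands
```

The jump from tens to thousands of iterations as outliers dominate is exactly why starting from a cleaner feature set is preferable.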
In this chapter, we propose a novel method to distinguish between refracted and Lambertian image features using a light-field camera. Compared to previous refracted-feature detection methods, which are limited to light-field cameras with large baselines relative to the refractive object, our method achieves state-of-the-art performance. We also extend these capabilities to light-field cameras with much smaller baselines than previously considered, where we achieve up to 50% higher refracted-feature detection rates. Specifically, we propose to use textural cross-correlation to characterise apparent feature motion in a single LF, and compare this motion to its Lambertian equivalent based on 4D light-field geometry.
1Outliers are by definition samples that differ significantly from other observations; normally they appear with low probability at the far end of distributions. Thus the term “outlier-rich” may appear contradictory, since it implies that many of the removed image feature points do not follow the assumed distribution. By “outlier-rich”, we mean that the concentration of outliers is much higher than normal. In our context, we might obtain an outlier-rich image feature set when the cameras’ views are dominated by a refractive object, such that a large number of image features are refracted, and only a few follow image motion consistent with the Lambertian assumption within the light field, or with the robot’s own motion.
CHAPTER 5. DISTINGUISHING REFRACTED IMAGE FEATURES 121
We show the effectiveness of our discriminator in the application of structure from motion (SfM)
when reconstructing scenes containing a refractive object, such as Fig. 5.1. Structure from
motion is a technique to recover both scene structure and camera pose from 2D images, and
is widely applicable to many systems in computer and robotic vision [Hartley and Zisserman,
2003, Wei et al., 2013]. Many of these systems assume the scene is Lambertian, in that a
3D point’s appearance in an image does not change significantly with viewpoint. However,
non-Lambertian effects, including specular reflections, occlusions, and refraction, violate this
assumption, which can cause these systems to become unreliable or even fail. We demonstrate
that rejecting refracted features using our discriminator yields lower reprojection error, lower
failure rates, and more accurate pose estimates when the robot is approaching refractive objects.
Our method is a critical step towards allowing robots to operate in the presence of refractive
objects. This work has been published in [Tsai et al., 2019].
Figure 5.1: (Left) An LF camera mounted on a robot arm was used to distinguish refracted features in a scene in SfM experiments. (Right) SIFT features distinguished as Lambertian (blue) and refracted (red), revealing the presence of the refractive cylinder in the middle of the scene.
In this chapter, our main contributions are the following.
• We extend previous work to develop a light-field feature discriminator for refractive
objects. In particular, we detect the differences between the apparent motion of non-
Lambertian and Lambertian features in the 4D LF to distinguish refractive objects more
reliably than previous work.
122 5.1. RELATED WORK
• We propose a novel approach to describe the apparent motion of a feature observed within
the 4D light-field based on textural cross-correlation.
• We extend refracted feature distinguishing capabilities to lenslet-based LF cameras that
are limited to much smaller baselines by considering non-Lambertian apparent motion in
the LF. All LFs captured for these experiments are available at
https://tinyurl.com/LFRefractive.
• We show that by distinguishing and rejecting refracted features with our discriminator,
SfM performs better in scenes that include refractive objects.
The main limitation of our method is that it requires background visual texture that is distorted by the refractive object. Our method's effectiveness depends on the extent to which the appearance of the object is warped in the LF, which in turn depends on the geometry, shape, and refractive indices of the object involved.
The remainder of this chapter is organized as follows. Section 5.1 describes the related work and
Section 5.2 provides background on LF geometry. In Section 5.3, we explain our method for
discriminating refracted features in the LF. We show our experimental results for detection with
different LF cameras, and validation in the context of monocular SfM, in Section 5.4. Lastly, in Section 5.5, we conclude the chapter and explore future work for the detection of refracted features.
5.1 Related Work
A variety of strategies for detecting and reconstructing refractive objects using vision have been investigated [Ihrke et al., 2010a]. For example, reflectivity has been used to reconstruct refractive object shape: a single monocular camera, with a light source moved to points in a square grid, has been used to densely reconstruct complex refractive objects by tracing the specular reflections from different, known lighting positions over multiple monocular images [Morris and Kutulakos, 2007]. Additionally, light refracted by transparent objects tends to be polarised, and thus a rotating polariser in front of a monocular camera has been used to reconstruct the camera-facing front surface of glass objects [Miyazaki and Ikeuchi, 2005]; however, this method requires prior knowledge of the object's refractive index, the shape of the back surface, and the illumination distribution, which are not necessarily available to a robot. Refractive object shape has also been obtained by measuring the distortion of background light paths, using a monocular camera image of the refractive object placed in front of a special optical sheet and lighting system known as a light-field probe [Wetzstein et al., 2011], but this method also requires knowledge of the initial direction of the light rays emitted from a planar background. Furthermore, many of these methods require known light sources in bulky configurations that are impractical for robotic applications in everyday environments.
Recent work has been aimed at finding refractive objects within a single monocular image.
SIFT features and a learning-based approach have been used to detect refractive objects [Fritz
et al., 2009]. They trained a linear, binary support-vector machine to classify glasses versus a
Lambertian background. Their approach required many hand-labelled training images from a
variety of refractive objects under different lighting environments and backgrounds, and only
returned a bounding box, providing little to no insight into the nature of the refractive object
itself. Monocular image sequences from moving cameras have been used to recover refractive
object shape and pose [Ben-Ezra and Nayar, 2003]; however, image feature correspondence
was established manually throughout camera motion, emphasizing the difficulty of automatically identifying and tracking refracted image features due to the severe magnification of the
background and image distortion from the object.
LFs have been used to obtain better depth maps for Lambertian and occluded scenes [Johannsen
et al., 2017]; however, their depth estimation performance suffers for refractive objects. Jachnik et al. considered using light fields to estimate scene lighting configurations and then remove
specular reflections from images of planar surfaces [Jachnik et al., 2012]. Tao et al. recently applied a similar concept using LF cameras to simultaneously estimate depth and remove specular reflections from more general 3D surfaces (not limited to planar scenes) [Tao et al., 2016]. Wanner et al. recently considered planar refractive objects and reconstructed two different depth layers [Wanner and Goldluecke, 2013]. For example, their method provided the depth of a thin sheet of frosted glass and the depth of the background Lambertian scene. In another example, their method provided the depth of a reflective mirror and the apparent depth of the reflected scene. However, this work was limited to thin planar surfaces and single reflections. Although our work does not determine the dense structure of the refractive object, our approach can distinguish image features from objects that significantly distort the LF.
Refractive object recognition is the problem of finding or identifying a refractive object from vision. In this area, Maeno et al. proposed a light-field distortion feature (LFD), which models an object's refraction pattern as image distortion, based on differences in the corresponding image points between the multiple views encoded within the LF, captured by an LF camera array with a large baseline relative to the refractive object [Maeno et al., 2013]. Several LFDs were combined in a bag-of-words representation for a single refractive object. However, the authors observed significantly degraded recognition performance due to specular reflections, as well as changes in camera pose. Xu et al. used the LFD as a basis for refractive object image segmentation [Xu et al., 2015]. Corresponding image features from all views in the LF (s, t, u, v) were fitted to the normal of a 4D hyperplane using singular value decomposition (SVD). The smallest singular value was taken as a measure of error to the hyperplane of best fit, to which a threshold was applied to distinguish refracted image features. However, we will show that a 3D point cannot be described by a single hyperplane in 4D. Instead, it manifests as a plane in 4D that has two orthogonal normal vectors. Our approach builds on Xu's method and solves for both normals to find the plane of best fit in 4D, allowing us to discriminate refractive objects more reliably.
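The two-normal observation can be checked numerically: samples of a Lambertian point across the sub-views lie on a 2D plane in 4D, so the two smallest singular values of the mean-centred sample matrix vanish. A toy sketch (made-up slope, offsets and noise; not our actual fitting pipeline):

```python
import numpy as np

# Samples (s, t, u, v) of a Lambertian point: the image location shifts
# linearly with view index (slope m encodes depth), so the samples span
# only a 2D plane in 4D, leaving TWO near-zero singular values.
def plane_residual(samples):
    X = samples - samples.mean(axis=0)
    sv = np.linalg.svd(X, compute_uv=False)
    return sv[-2:]  # the two smallest singular values

s, t = np.meshgrid(np.arange(3.0), np.arange(3.0))
m, u0, v0 = 0.4, 100.0, 50.0  # made-up slope and image offsets
lamb = np.column_stack([s.ravel(), t.ravel(),
                        u0 - m * s.ravel(), v0 - m * t.ravel()])

rng = np.random.default_rng(1)
noise = rng.normal(0.0, 2.0, lamb.shape) * [0, 0, 1, 1]  # distort (u, v) only
refracted = lamb + noise

print(plane_residual(lamb).max() < 1e-8)  # True: consistent with a plane
print(plane_residual(refracted).max())    # clearly non-zero: plane fit fails
```

Thresholding both residual singular values, rather than only the smallest, is what lets the two-normal fit separate Lambertian from refracted samples.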
A key difficulty in image feature-based approaches in the LF is obtaining the corresponding
image feature locations between multiple views. It is possible to use traditional multi-view geometry approaches for image feature correspondence, such as epipolar geometry, optical flow
and RANSAC. In fact, both Maeno and Xu used optical flow between two views for correspondence. However, these approaches do not exploit the unique geometry of the LF, which can
lead to algorithmic simplifications or reduced computational complexity [Dansereau, 2014].
We propose a novel textural cross-correlation method to associate image features in the LF by
describing their apparent motion in the LF, which we refer to as image feature curves. This
method directly exploits LF geometry and provides insight on the 4D nature of image features
in the LF.
Our interest in LF cameras stems from robot applications that often have mass, power and
size constraints. Thus, we are interested in employing compact lenslet-based LF cameras to
deal with refractive objects. However, most previous works have utilized gantries [Wanner and
Goldluecke, 2013], or large camera arrays [Maeno et al., 2013, Xu et al., 2015]; their results
do not reliably transfer to LF cameras with much smaller baselines, where distortion is less
apparent, as we show later. We demonstrate the performance of our method using two different
LF camera architectures with different baselines. Ours is the first method, to our knowledge,
capable of identifying RFs using lenslet-based LF cameras.
For LF cameras, LF-specific image features have been investigated. SIFT features augmented
with “slope”, an LF-based property related to depth, were proposed by the author of this thesis
for visual servoing using an LF camera [Tsai et al., 2017]; however, in Chapter 4, refractive ob-
jects were not considered. Ghasemi proposed a scale-invariant global image feature descriptor
based on a modified Hough transform [Ghasemi and Vetterli, 2014]; however, we are interested
in local image features whose positions encode the distortion observed in the refracted back-
ground. More recently, Tosic developed a scale-invariant, single-pixel-edge detector by finding
local extrema in a combined scale, depth, and image space [Tosic and Berkner, 2014]. However,
these LF image features did not differentiate between Lambertian and refracted image features,
nor were they designed for reliable matching between LFs captured from different viewpoints.
Recent work by Teixeira et al. projected SIFT features found in all views into their correspond-
ing epipolar plane images (EPIs). Example EPIs are shown in Fig. 2.15. These projections were
filtered and grouped onto straight lines in their respective EPIs and then counted. Features with
higher counts were observed in more views, and thus considered to be more reliable Lambertian
image features [Teixeira et al., 2017]. However, this approach only retained SIFT features that
projected consistently onto straight lines in their respective EPIs, intentionally filtering out any
nonlinear image feature behaviour. In contrast, our method aims to detect these non-Lambertian
image features and is focused on characterising them. Clearly, there is a gap in the literature for identifying
and characterising refracted image features using LF cameras.
In this chapter, we detect unique image features that allow us to reject distorted content and
work well for SfM. This could be useful for many other common feature-based algorithms,
including recognition, segmentation, visual servoing, simultaneous localization and mapping,
visual odometry, and SfM, making these algorithms more robust to the presence of refractive
objects. We are interested in exploring the impact of our refracted image feature discriminator
in an SfM framework. While there has been significant development in SfM in recent years for
conventional monocular and stereo cameras [Wei et al., 2013], Johannsen et al. were the first to
consider LFs in the SfM framework [Johannsen et al., 2015]. Although our work does not yet
explore LF-based SfM, we investigate SfM’s performance with respect to RFs, which has not
yet been fully explored. We show that rejecting RFs reduces reprojection error and failure rate
near refractive objects, improving camera pose estimates.
5.2 Lambertian Points in the Light Field
In this section, we provide a brief reminder of the LF geometry background provided in Sec-
tion 2.7.3; however, we have re-written (2.34) and (2.35) in the context of our refracted im-
age feature discriminator. Using the two-plane parameterisation, a ray φ can be described by
φ = [s, t, u, v]^T ∈ R⁴. A Lambertian point in space P = [Px, Py, Pz]^T ∈ R³ emits rays in many
directions, which follow the linear relationship

$$\begin{bmatrix} u \\ v \end{bmatrix} = \frac{D}{P_z}\begin{bmatrix} P_x - s \\ P_y - t \end{bmatrix}, \qquad (5.1)$$
where each row describes a hyperplane in 4D. A hyperplane in 4D is a 3D manifold and can be
described by a single equation
$$n_1 s + n_2 t + n_3 u + n_4 v + n_5 = 0, \qquad (5.2)$$

where n = [n1, n2, n3, n4]^T is the normal of the hyperplane. A plane is defined as a 2D manifold
and can be spanned by two linearly-independent vectors. In 4D, a plane can be described by the
intersection of two 4D hyperplanes
$$n_1 s + n_2 t + n_3 u + n_4 v + n_5 = 0 \qquad (5.3)$$

$$m_1 s + m_2 t + m_3 u + m_4 v + m_5 = 0, \qquad (5.4)$$
where m is the normal of a second hyperplane in 4D. Equations (5.3) and (5.4) can be written
in matrix form,
$$\begin{bmatrix} n_1 & n_2 & n_3 & n_4 \\ m_1 & m_2 & m_3 & m_4 \end{bmatrix}\begin{bmatrix} s \\ t \\ u \\ v \end{bmatrix} = \begin{bmatrix} -n_5 \\ -m_5 \end{bmatrix}. \qquad (5.5)$$
Equation (5.1) can be written into a similar form as (5.5) as
$$\underbrace{\begin{bmatrix} \frac{D}{P_z} & 0 & 1 & 0 \\ 0 & \frac{D}{P_z} & 0 & 1 \end{bmatrix}}_{N}\begin{bmatrix} s \\ t \\ u \\ v \end{bmatrix} = \begin{bmatrix} \frac{D P_x}{P_z} \\ \frac{D P_y}{P_z} \end{bmatrix}, \qquad (5.6)$$
where N contains the two linearly-independent normals to the plane in 4D. The plane is defined
as the set of all s, t, u, v that follow (5.6). Therefore, a Lambertian point in 3D induces a plane
in 4D, which is characterised by two linearly-independent normal vectors that each define a
hyperplane in 4D. In the literature, this relationship is sometimes referred to as the point-plane
correspondence, as discussed in Section 2.7.3, because a point in 3D corresponds to a plane in
4D.
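As a quick numerical check (a minimal sketch with an assumed point P and plane separation D, not values from the thesis), rays generated by (5.1) satisfy both hyperplane equations of (5.6):

```python
import numpy as np

# Hypothetical Lambertian point P and reference-plane separation D.
D = 1.0
Px, Py, Pz = 0.2, -0.1, 2.0

# Rays from a grid of viewpoints (s, t), generated with (5.1).
s = np.linspace(-0.05, 0.05, 5)
t = np.linspace(-0.05, 0.05, 5)
S, T = np.meshgrid(s, t, indexing="ij")
U = (D / Pz) * (Px - S)
V = (D / Pz) * (Py - T)
rays = np.stack([S.ravel(), T.ravel(), U.ravel(), V.ravel()], axis=1)

# The two hyperplane normals of (5.6): N [s t u v]^T = [D*Px/Pz, D*Py/Pz]^T.
N = np.array([[D / Pz, 0.0, 1.0, 0.0],
              [0.0, D / Pz, 0.0, 1.0]])
rhs = np.array([D * Px / Pz, D * Py / Pz])

# Every ray from the Lambertian point satisfies both hyperplane equations.
residual = rays @ N.T - rhs
print(np.max(np.abs(residual)))  # ~0, up to floating-point error
```

The residual vanishes for every sampled ray, which is exactly the point-plane correspondence: all rays from the 3D point lie on the intersection of the two 4D hyperplanes.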
The light-field slope w relates the rate of change of image-plane coordinates, with respect to
viewpoint position, for all rays emitted from a point in the scene. In the literature, slope is
sometimes referred to as “orientation” [Wanner and Goldluecke, 2013], and other works com-
pute slope as an angle [Tosic and Berkner, 2014]. We recall that the slope comes directly from
(2.43) as
$$w = -D/P_z, \qquad (5.7)$$
and is clearly related to depth.
5.3 Distinguishing Refracted Image Features
Epipolar plane images (EPIs) graphically illustrate the apparent motion of a feature across
multiple views. If the entire light field is given as L(s, t, u, v) ∈ R4, the central view is an
image I(u, v) = L(s0, t0, u, v), and is equivalent to what a monocular camera would provide
from the same camera viewpoint. EPIs represent a 2D slice of the 4D LF. A horizontal EPI
is given as L(s, t∗, u, v∗), and a vertical EPI is denoted as L(s∗, t, u∗, v), where ∗ indicates a
variable is fixed while others may vary.
In practice, we construct the EPI by plotting all u pixels for view s, as illustrated in Fig. 2.15.
Then we plot all u pixels for view s + 1, stacking the row of pixels on top of the previous
plot, and repeating for all s. As each view is horizontally shifted by some baseline, the scene
captured by the u pixels shifts accordingly. As shown in Fig. 5.2a, image features or rays from a
Lambertian point are linearly distributed with respect to viewpoint due to the uniform sampling
of the LF camera.
Points with similar depths yield lines with similar slopes in the EPI. Points with different depths
yield lines with different slopes. Similar behaviour is observed considering the vertical viewing
direction along t and v. Equivalently, linear parallax motion manifests itself as straight lines for
Lambertian image features. Image features for highly-distorting refractive objects are nonlinear,
as illustrated in Fig. 5.2b. We can thus compare this difference in apparent motion between
Lambertian and non-Lambertian features to distinguish RFs.
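The slicing described above can be sketched with a synthetic LF (all dimensions, scales and the rendering of the point below are illustrative assumptions, not the thesis's data):

```python
import numpy as np

# Synthetic 4D light field L(s, t, u, v): 9x9 views of 64x64 pixels,
# containing one Lambertian point rendered with slope w = -D/Pz.
S, T, U, V = 9, 9, 64, 64
L = np.zeros((S, T, U, V))
D, Pz, Px, Py = 1.0, 2.0, 0.0, 0.0
for si in range(S):
    for ti in range(T):
        s = (si - S // 2) * 0.1                         # viewpoint coordinates
        t = (ti - T // 2) * 0.1
        u = int(round(32 + 100 * (D / Pz) * (Px - s)))  # pixel coordinates
        v = int(round(32 + 100 * (D / Pz) * (Py - t)))
        if 0 <= u < U and 0 <= v < V:
            L[si, ti, u, v] = 1.0

# Central view I(u, v) = L(s0, t0, u, v).
I = L[S // 2, T // 2]

# Horizontal EPI: fix t = t*, v = v*, and stack the u rows over all views s.
epi_h = L[:, T // 2, :, 32]                             # shape (S, U)

# The Lambertian point traces a straight line in the EPI: its u position
# shifts by a constant amount per view.
print(np.argmax(epi_h, axis=1))                         # [52 47 42 37 32 27 22 17 12]
```

The constant per-view shift of the peak is the straight EPI line; a refracted point would instead produce a non-constant shift.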
Fig. 5.3a shows the central view and an example EPI of a crystal ball LF (large baseline) from
the New Stanford Light-Field Archive, captured by a camera array. The physical size of cameras
often necessitates larger baselines for LF capture. A Lambertian point forms a straight line in
the EPI, shown in Fig. 5.3b. The relation between slope and depth is also apparent in this EPI.
Refracted image features appear as nonlinear curves in the EPI, as seen in Fig. 5.3b. Refracted
image feature detection in the LF simplifies to finding image features that violate (5.1) via iden-
tifying nonlinear feature curves in the EPIs and/or inconsistent slopes between two linearly-
independent EPI lines (ie, EPIs sampled from two linearly-independent motions), such as the
vertical (along t) and horizontal (along s) EPIs. We note that occlusions and specular reflections
also violate (5.1), and so can potentially cause many vision algorithms to fail as well. Occlu-
sions appear as straight lines, but have intersections in the EPI, indicated in green. Edges of the
Figure 5.2: A Lambertian point emits rays of light that are captured by the LF camera. (a)
Projection of the linear behaviour of a Lambertian image feature (orange), and (b) the nonlinear
behaviour of a refracted image feature with respect to linear motion along the viewpoints of an
LF (blue).
refractive objects, and objects with low distortion also appear Lambertian. Specular reflections
appear as a superposition of lines in the EPI, which may be addressed in future work.
5.3.1 Extracting Image Feature Curves
In this section, we discuss how we extract these 4D image feature curves and how we identify
refracted image features. For a given image feature from the central view (s0, t0) at coordinates
(u0, v0), we must determine the feature correspondences (u′, v′) from the other views, which
is equivalent to finding the feature’s apparent motion in the LF. In this chapter, we start by
detecting SIFT features [Lowe, 2004] in the central view, although the proposed method is
agnostic to any scale-based image feature type.
Next, we select a template surrounding the feature which is k-times the feature’s scale. We
determined k = 5 to yield the most consistent results. 2D Gaussian-weighted normalized cross-
correlation (WNCC) is used across views to yield correlation images, such as Fig. 5.4. To
reduce computation, we only apply WNCC along the central row and column of LF views.
For Lambertian image features, peaks in the correlation space for each view correspond to the
feature’s image coordinates in that view. We create another EPI by plotting the image feature’s
Figure 5.3: (a) The central view of the crystal ball LF from the New Stanford Light Field
Archive. (b) A vertical EPI sampled from a column of pixels (yellow), in which the nonlinear
apparent motion caused by the crystal ball is seen in the middle (blue). Straight lines correspond
to Lambertian features (orange). Occlusions (green) appear as intersections of straight lines.
correlation response with respect to the views, which we call the correlation EPI. Illustrated in
Fig. 5.4, the ridge of the correlation EPI will have the same shape as the image feature curve
from original EPI.
For refracted image features, we hypothesize that the distortion of the feature’s appearance
between views will not be so strong as to make the correlation response unusable. Thus, the
correlation response will be sufficiently strong that the ridge of the correlation EPI will still
correspond to the desired feature curve. Our textural cross-correlation method allows us to
focus on the image structure, as opposed to the image intensities. Our method can be applied to
any LF camera, and directly exploits the geometry of the LF.
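A minimal sketch of this idea (a simplified, hypothetical WNCC over one row of views on synthetic data; the thesis's actual implementation details may differ):

```python
import numpy as np

def gaussian_weighted_ncc(template, row_views):
    """Sketch of the textural cross-correlation step along one EPI
    direction. Returns the per-view u-coordinate of the correlation
    peak, i.e. a simplified image feature curve. Helper name is ours."""
    h, w = template.shape
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = w / 3.0
    g = np.exp(-((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * sigma ** 2))
    tw = (template - template.mean()) * g          # weighted, zero-mean template

    curve = []
    for view in row_views:
        H, W = view.shape
        r0 = (H - h) // 2                          # row containing the feature
        best_u, best_score = 0, -np.inf
        for u in range(W - w + 1):                 # slide template along the row
            patch = view[r0:r0 + h, u:u + w]
            pw = (patch - patch.mean()) * g
            denom = np.linalg.norm(tw) * np.linalg.norm(pw)
            score = (tw * pw).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_u = score, u
        curve.append(best_u + w // 2)              # peak = feature position
    return np.array(curve)

# Synthetic check: a textured patch shifting 1 pixel/view (linear parallax).
rng = np.random.default_rng(0)
pattern = rng.random((7, 7))
views = []
for k in range(5):
    img = np.zeros((15, 40))
    img[4:11, 10 + k:17 + k] = pattern
    views.append(img)
template = views[2][4:11, 12:19]                   # patch around the feature
curve = gaussian_weighted_ncc(template, views)
print(curve)                                       # [13 14 15 16 17]
```

Because the correlation is normalised and zero-mean, the peak tracks the patch's texture rather than its absolute intensities; a refracted feature would yield a nonlinear curve instead of the constant unit shift seen here.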
There are many other strategies to find and match image features across a sequence of views,
such as stereo matching and optical flow. Such approaches have previously been used in LF-
related work [Maeno et al., 2013, Xu et al., 2015]. However, both stereo matching and optical
flow typically rely on pair-wise image comparisons and must therefore be iterated across other
views of the LF. It is often more efficient and robust against noise to consider all views simulta-
Figure 5.4: The image feature curve extraction process. (Top left) The simulated horizontal
views of a yellow circle and (top right) the corresponding horizontal EPI taken along the middle
of the views, from the green pixel row. A feature template is taken and used for textural cross-
correlation (bottom left). The resulting cross-correlation response is computed and shown as
the cross-correlation views for a typical scene; yellow indicates a high response, while blue
indicates a low response. The resultant correlation EPI (bottom right) is created by stacking the
red pixel row of adjacent views. The ridge (yellow) along this correlation EPI corresponds to
the desired, extracted image feature curve (red). Note that only 3 views are shown, but the
simulated LF actually contains 9 views.
neously when attempting to characterise a trend across an image sequence. Although we only
consider 2D EPIs in the LF in this chapter, we are interested in considering full 4D approaches
for image feature extraction in future work.
5.3.2 Fitting 4D Planarity to Image Feature Curves
For Lambertian image features, the image feature disparities are linear with respect to linear
camera translation, as in (5.7). The disparities from refracted image features deviate from this
linear relation. In this section, we explain that fitting these disparities in the least squares sense
to (5.1) yields the plane of best fit in 4D. The plane in 4D can be estimated from the image
feature curves that we extracted in the previous section. The error of the 4D planar fit provides
a measure of whether or not our image feature is Lambertian.
Similar to [Xu et al., 2015], we consider the ray passing through the central view φ0(s0, t0, u0, v0).
The corresponding ray coordinates in the view (s, t) are defined as φ(s, t, u, v). The LFD is then
defined as the set of relative differences between φ0 and φ as in [Maeno et al., 2013]:
$$\mathrm{LFD}(u_0, v_0) = \{(s, t, \Delta u, \Delta v) \;|\; \forall\, (s, t) \neq (s_0, t_0)\}, \qquad (5.8)$$
where Δu = u(s, t) − u0 and Δv = v(s, t) − v0 are image feature disparities. We note that the
LFD uses φ from all other views (s, t) ≠ (s0, t0). This differs from our proposed image feature
curves extracted from EPIs that only sample views along two orthogonal viewing directions
from which the EPIs are first created, which represents a minimal sampling of the LF in order
to discriminate against refracted image features.
As discussed in Section 5.2, our plane in 4D has two linearly-independent normals, n and m.
Then considering the LFD, we compare central view ray φ0 to ray φ. Recall that each ray is
represented by a point in 4D. Substituting each coordinate into (5.3), we can write
$$n_1 s_0 + n_2 t_0 + n_3 u_0 + n_4 v_0 = -n_5 \qquad (5.9)$$

$$n_1 s + n_2 t + n_3 u + n_4 v = -n_5. \qquad (5.10)$$

Subtracting (5.9) from (5.10) yields

$$n_1 s + n_2 t + n_3 \Delta u + n_4 \Delta v = 0, \qquad (5.11)$$

which is expressed in terms of the LFD. Recall that s0 = 0 and t0 = 0.
We can write this in matrix form as

$$\begin{bmatrix} n_1 & n_2 & n_3 & n_4 \end{bmatrix}\begin{bmatrix} s \\ t \\ \Delta u \\ \Delta v \end{bmatrix} = 0. \qquad (5.12)$$
We can estimate n by fitting rays according to

$$\underbrace{\begin{bmatrix} s & t & \Delta u & \Delta v \end{bmatrix}}_{N}\underbrace{\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{n} = 0. \qquad (5.13)$$
Note that the constants on the right-hand side of (5.6) cancel out in (5.13) because we consider
the differences relative to u0. We also note that N is a matrix of rank one. We require a
minimum of four additional rays (equivalently, four views of the image feature) relative to φ0
to estimate n by solving the system Nn = 0.
For an LF that can be represented by an M × N camera array, we can use all MN views to
estimate n. However, to reduce the required computation involved in the image feature curve
extraction process, we can use all N views from the horizontal image feature curve, which were
extracted from the horizontal EPI, u = fh(s; tj, vl − v0). This represents the set of all values of
u that follow the horizontal image feature curve as a function of s, given the constant tj , vl and
v0. Similarly, the vertical image feature curve can be expressed as v = fv(t; si, uk − u0), for
constant si, uk and u0.
We can substitute the image feature curve fh into N as a set of stacked rays,

$$\underbrace{\begin{bmatrix} s_1 & t_j & \Delta u_1 & v_l - v_0 \\ \vdots & \vdots & \vdots & \vdots \\ s_N & t_j & \Delta u_N & v_l - v_0 \end{bmatrix}}_{N}\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix} = 0. \qquad (5.14)$$
The matrix N is a singular N×4 matrix, and (5.14) is an overdetermined system. We can solve
this system using SVD to estimate n in the least squares sense.
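As a sketch (with synthetic, assumed Lambertian disparities and uncalibrated view units), the hyperplane fit of (5.14) reduces to a standard SVD null-space problem:

```python
import numpy as np

# Synthetic horizontal feature curve of a Lambertian point: disparities
# are linear in the viewpoint, du = -(D/Pz) * s, per (5.16).
D, Pz = 1.0, 2.0
s = np.linspace(-0.4, 0.4, 9)              # horizontal viewpoints s_1..s_N
du = -(D / Pz) * s                         # disparities relative to u0
tj, dvl = 0.0, 0.0                         # fixed row t_j and offset v_l - v_0

# Stack the rays (s_i, t_j, du_i, v_l - v_0) as rows of N, as in (5.14).
N = np.column_stack([s, np.full_like(s, tj), du, np.full_like(s, dvl)])

# Solve N n = 0 in the least-squares sense: the right singular vector
# with the smallest singular value is the estimated hyperplane normal.
_, sing_vals, Vt = np.linalg.svd(N)
n = Vt[-1]
print(sing_vals[-1])                       # ~0: the curve lies on a hyperplane
```

For a perfectly Lambertian curve the smallest singular value is numerically zero; noise or refraction inflates it, which is what the planarity measure below exploits.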
It is interesting to note that for a Lambertian point, we can reduce each row of N to a function
of the other rows. We know that tj and vl − v0 are constant columns, and the column s1, ..., sN
is simply the increment of the horizontal viewpoints, which are linearly spaced in the LF.
Using (5.1), we can write

$$\Delta u = u - u_0 = \frac{D}{P_z}(P_x - s) - \frac{D}{P_z}(P_x - s_0) = -\frac{D}{P_z}(s - s_0) = -\frac{D}{P_z}\Delta s \qquad (5.15)$$

$$\frac{\Delta u}{\Delta s} = -\frac{D}{P_z}. \qquad (5.16)$$
The change in u is linear with respect to changes in s, which matches our expression for LF
slope in (5.7). Therefore, N has a rank of 1 and can only yield a single hyperplane.
However, recall that a Lambertian point can be described by two hyperplanes. Equations (5.11)
and consequently (5.12) must hold for both hyperplanes for a Lambertian point in 3D. We are
interested in estimating the hyperplane normals n and m from the LFD. Therefore, we can write (5.11) in
matrix form as the 4D plane containing φ0 and φ,

$$\underbrace{\begin{bmatrix} n_1 & n_2 & n_3 & n_4 \\ m_1 & m_2 & m_3 & m_4 \end{bmatrix}}_{[\,n \; m\,]^T}\begin{bmatrix} s \\ t \\ \Delta u \\ \Delta v \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad (5.17)$$
where (·)^T denotes the transpose. The positions for s, t can be obtained by calibration [Dansereau et al.,
2013], although the nonlinear behaviour still holds when working with uncalibrated units of
“views”. We can then write
$$\underbrace{\begin{bmatrix} s & t_j & \Delta u & v_l - v_0 \\ s_i & t & u_k - u_0 & \Delta v \end{bmatrix}}_{A}\underbrace{\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{n} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad (5.18)$$
where the first row (s, tj, ∆u, vl − v0) represents a single ray from fh, and the second row
(si, t, uk − u0, ∆v) represents a single ray from fv. A is a singular matrix and has a rank of at
least two. We still require a minimum of five rays to solve (5.18) (four plus φ0). As before, we
can stack all MN rays over the entire LF; however, we use a smaller set of M + N rays from
fh and fv. The system of equations can be written,
$$\underbrace{\begin{bmatrix} s_1 & t_j & \Delta u_1 & v_l - v_0 \\ \vdots & \vdots & \vdots & \vdots \\ s_N & t_j & \Delta u_N & v_l - v_0 \\ s_i & t_1 & u_k - u_0 & \Delta v_1 \\ \vdots & \vdots & \vdots & \vdots \\ s_i & t_M & u_k - u_0 & \Delta v_M \end{bmatrix}}_{A}\underbrace{\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{n} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}. \qquad (5.19)$$
Equation (5.19) is of the form An = 0, which is a homogeneous system of equations. Since A
is an (M + N) × 4 matrix, the system is overdetermined. We use SVD to solve the system in a
least-squares sense, computing the four singular vectors ξi and corresponding singular values
λi, i = 1, ..., 4, where the λi are sorted in ascending order of magnitude.
Additionally, A has a rank of two for a Lambertian point. We can show this by following
a similar argument to that used for the rank of N, applied to the rays from fv. Thus we expect two non-zero
singular values and two trivial solutions for a noise-free system. With image noise and noise
from the image feature curve extraction process, it is possible to obtain four non-zero singular
values, in which case the magnitudes of λ1 and λ2 are much smaller than λ3 and λ4. Importantly,
distortion caused by a refractive object can also cause non-zero singular values, and it is this
effect that we are primarily interested in.
The two smallest singular values, λ1 and λ2 and their corresponding singular vectors are related
to the two normals n and m that best satisfy (5.19) in the least-squares sense. The magnitude of
these singular values provides a measure of error of the planar fit. Smaller errors imply stronger
linearity, while larger errors imply that the feature deviates from the 4D plane.
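To illustrate (with synthetic numbers; the helper name `planar_fit_errors` is ours, not the thesis's), stacking horizontal and vertical feature curves into A and reading off the two smallest singular values:

```python
import numpy as np

def planar_fit_errors(s_views, du_h, t_views, dv_v,
                      tj=0.0, dvl=0.0, si=0.0, duk=0.0):
    """Stack horizontal and vertical feature-curve rays into A, as in
    (5.19), and return the two smallest singular values (lambda1, lambda2),
    the error of the 4D planar fit. Hypothetical helper; viewpoint units
    are uncalibrated 'views'."""
    rows_h = np.column_stack([s_views, np.full_like(s_views, tj),
                              du_h, np.full_like(s_views, dvl)])
    rows_v = np.column_stack([np.full_like(t_views, si), t_views,
                              np.full_like(t_views, duk), dv_v])
    A = np.vstack([rows_h, rows_v])
    sv = np.linalg.svd(A, compute_uv=False)   # sorted descending
    return sv[-1], sv[-2]                     # lambda1 <= lambda2

s = np.linspace(-0.4, 0.4, 9)                 # horizontal viewpoints
t = np.linspace(-0.4, 0.4, 9)                 # vertical viewpoints
w = -0.5                                      # Lambertian slope -D/Pz

# Lambertian feature: disparities exactly linear in viewpoint.
lam1, lam2 = planar_fit_errors(s, w * s, t, w * t)

# Feature distorted along the horizontal axis only (cubic term).
lam1_r, lam2_r = planar_fit_errors(s, w * s + 0.3 * s**3, t, w * t)

# lam1_r stays near zero (a single hyperplane still fits well), but
# lam2_r is clearly non-zero: checking both singular values exposes
# distortion that the smallest value alone would miss.
print(lam2, lam2_r)
```

This sketch also previews why two error measures matter: one-axis distortion can leave the smallest singular value near zero while inflating the second.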
5.3.3 Measuring Planar Consistency
From the two smallest singular values λ1 and λ2, we have two measures of error for the planar
fit. The Euclidean norm of λ1 and λ2, √(λ1² + λ2²), may be taken as a single measure of
planarity; however, doing so masks the case where λ1 ≫ λ2 or λ1 ≪ λ2. This can occur when
observing a feature through a 1D refractive object (e.g. a glass cylinder) that causes severe distortion
along one direction, but relatively little along the other. Therefore, we reject those features that
have large errors in either of the two hyperplanes in a manner similar to a logical OR gate. This
planar consistency, along with the slope consistency discussed in the following section, make
the proposed method more sensitive to distorted texture than prior work that considered only
the smallest singular value, which we refer to as hyperplanar consistency [Xu et al., 2015].
5.3.4 Measuring Slope Consistency
Equation (5.1) shows that a Lambertian point has a single value of slope for both hyperplanes.
However, the case for a refracted image feature can be locally approximated as

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} w_1 (P_x - s) \\ w_2 (P_y - t) \end{bmatrix}, \qquad (5.20)$$
where w1 and w2 are the two slopes for the same image feature. Each row in (5.20) is still a
hyperplane in 4D. The intersection of these two hyperplanes also represents a plane in 4D, as
the normals are still linearly-independent.
We are interested in the horizontal and vertical hyperplanes, which are aligned to the hori-
zontal and vertical viewpoints along t0 and s0, respectively. We can compute the slopes for
each hyperplane given their normals. For the first hyperplane, we solve for the in-plane vector
q = [qs, qu]^T, which must be orthogonal to the corresponding elements of the two normals n
and m from (5.17):

$$\begin{bmatrix} n_1 & n_3 \\ m_1 & m_3 \end{bmatrix}\begin{bmatrix} q_s \\ q_u \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad (5.21)$$
where q is constrained to the s, u plane, because we choose the first and third elements of n and
m. This system is solved using SVD, and the minimum singular vector yields q. The slope for
the horizontal hyperplane is then wsu = qs/qu. The slope for the vertical hyperplane wtv is
similarly computed from the second and fourth elements of n and m.
We define slope consistency c as a measure of how different the slopes are between the two
hyperplanes for a given image feature. It is possible to compute this difference as
$$c = (w_1 - w_2)^2. \qquad (5.22)$$
However, in practice, we plot the EPIs as ds/du (rather than du/ds) because there are signif-
icantly fewer views in typical LF cameras than there are pixels per view. We thus measure the
inverses of w1 and w2, which can approach infinity as the EPI lines become vertical. We therefore
convert each slope to an angle

$$\sigma_1 = \tan^{-1}(w_1), \qquad \sigma_2 = \tan^{-1}(w_2), \qquad (5.23)$$

and compute c as the Euclidean norm of the difference of the two slope angles,

$$c = \sqrt{(\sigma_1 - \sigma_2)^2} = |\sigma_1 - \sigma_2|. \qquad (5.24)$$
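The slope computation of (5.21)–(5.24) can be sketched as follows (the `slope_consistency` helper and the example normals are our own illustrative assumptions):

```python
import numpy as np

def slope_consistency(n, m):
    """Sketch of (5.21)-(5.24): slope consistency c from the two
    hyperplane normals n, m (4-vectors). Helper names are ours."""
    def inverse_slope(i, j):
        # In-plane vector q for the chosen 2D plane: the minimum right
        # singular vector of the 2x2 system (5.21).
        M = np.array([[n[i], n[j]], [m[i], m[j]]])
        _, _, Vt = np.linalg.svd(M)
        qs, qu = Vt[-1]
        return qs / qu
    w_su = inverse_slope(0, 2)          # horizontal: 1st and 3rd elements
    w_tv = inverse_slope(1, 3)          # vertical: 2nd and 4th elements
    # Convert to angles so near-vertical EPI lines remain bounded.
    return abs(np.arctan(w_su) - np.arctan(w_tv))

# Lambertian point: the two normals from (5.6) share one slope w = -D/Pz.
D, Pz = 1.0, 2.0
n = np.array([D / Pz, 0.0, 1.0, 0.0])
m = np.array([0.0, D / Pz, 0.0, 1.0])
print(slope_consistency(n, m))          # ~0: consistent slopes

# Refracted feature, locally approximated with two different slopes (5.20).
n_r = np.array([0.5, 0.0, 1.0, 0.0])    # horizontal slope magnitude 0.5
m_r = np.array([0.0, 0.9, 0.0, 1.0])    # vertical slope magnitude 0.9
print(slope_consistency(n_r, m_r))      # clearly non-zero: inconsistent
```

The arctangent keeps the consistency measure bounded even when one of the measured inverse slopes grows large.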
Large values of c imply a large difference in slopes between the horizontal and vertical EPIs,
which in turn implies a refractive object. Overall, image features with large planar errors and
inconsistent slopes are identified as belonging to a highly-distorting refractive object.
Two thresholds for planar consistency tplanar and slope consistency tslope are used to determine
if an image feature has been distorted. If true, we refer to it as a refracted image feature,
$$\text{refracted image feature} = \begin{cases} 1, & \text{if } (\lambda_1 > t_{\text{planar}}) \lor (\lambda_2 > t_{\text{planar}}) \lor (c > t_{\text{slope}}) \\ 0, & \text{otherwise}, \end{cases} \qquad (5.25)$$
where ∨ is the logical OR operator. Note that our method is not limited to detecting distortion
aligned with the horizontal and vertical axes of the LF. Although not implemented in this work,
we can further check for λ1, λ2 and c along other axes by rotating the LF’s s, t, u, v frame and
repeating the check. In future work, we aim to consider all of the LF, in order to estimate this
rotation.
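The decision rule (5.25) itself is a simple predicate (the threshold values below are illustrative placeholders, not the tuned values from the experiments):

```python
def is_refracted(lam1, lam2, c, t_planar=1e-3, t_slope=0.05):
    """Decision rule (5.25): a feature is flagged as refracted if either
    hyperplane-fit error or the slope inconsistency exceeds its threshold.
    Threshold values here are hypothetical placeholders."""
    return (lam1 > t_planar) or (lam2 > t_planar) or (c > t_slope)

print(is_refracted(1e-6, 1e-6, 0.01))   # False: Lambertian feature
print(is_refracted(1e-6, 2e-3, 0.01))   # True: distortion along one axis only
print(is_refracted(1e-6, 1e-6, 0.20))   # True: inconsistent slopes
```

Testing λ1 and λ2 separately (rather than their norm) is what makes the rule behave like a logical OR over the two hyperplanes, catching one-axis distortion.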
5.4 Experimental Results
In this section, we present our experimental setup for refracted image feature detection and
show how our methods extend from large-baseline LF camera arrays to small-baseline lenslet-
based LF cameras. Finally, we use our method to reject refracted image features for monocular
SfM in the presence of refractive objects, and demonstrate improved reconstruction and pose
estimates.
5.4.1 Experimental Setup
To obtain LFs captured by a camera array, we used the Stanford New Light Field Archive²,
which provided LFs captured from a gantry with a 17 × 17 grid of rectified 1024 × 1024-pixel
images that were down-sampled to 256 × 256 pixels to reduce computation. We focused on
two LFs that captured the same scene of a crystal ball surrounded by textured tarot cards. The
first LF was captured with a large baseline (16.1 mm/view over 275 mm), which exhibited
significant distortions in the LF caused by the crystal ball. The second LF was captured with
a smaller baseline (3.7 mm/view over 64 mm). This allowed us to compare the effect of LF
camera baseline for refracted image feature discrimination.
Smaller baselines were considered using a lenslet-based LF camera. These LF cameras are
of interest in robotics due to their simultaneous capture of multiple views and their smaller
size and mass compared to LF camera arrays and gantries. In this section, the Lytro Illum
was used to capture LFs with 15 × 15 views, each 433 × 625 pixels. Dansereau’s Light-Field
Toolbox [Dansereau et al., 2013] was used to decode and rectify the LFs from raw LF imagery
to the 2PP, thereby converting the Illum to an equivalent camera array with a baseline of 1.1
mm/view over 16.6 mm. To compensate for the extreme lens distortion of the Illum, we removed
the outer views, reducing our LF to 13 × 13 views. The LF camera was fixed at 100 mm
² The (New) Stanford Light Field Archive is available at http://lightfield.stanford.edu/lfs.html.
focal length. All LFs were captured in ambient indoor lighting conditions without the need for
specialized lighting equipment. The refractive objects were placed within a textured scene in
order to create textural details for SIFT features. For repeatability, the lenslet-based camera was
mounted to the end-effector of a 6-DOF Kinova Jaco robotic manipulator, shown in Fig. 5.1.
The arm was controlled using the Robot Operating System (ROS) framework.
It is important to remember that our results depend on a number of factors. First, the geometry
and refractive index of a transparent object affect its appearance: higher curvature and thickness
yield more distortion. Second, the distance between the LF camera and the refractive object, as
well as the distance between the refractive object and the background, directly affects how much
distortion can be observed. Similarly, a larger camera baseline captures more distortion. Where
possible, these factors were held constant throughout the different experiments.
5.4.2 Refracted Image Feature Discrimination with Different
LF Cameras
In this section, we provide a qualitative comparison of our discrimination methods for the large-
baseline and small-baseline LF camera setups. Then we provide quantitative results over a larger
variety of LFs for our refracted image feature discriminator.
5.4.2.1 Large-Baseline LF Camera Observations
The large-baseline crystal ball LF was captured by a camera array. Lambertian image features
were captured by our textural cross-correlation approach as straight lines, while refracted image
features were captured as nonlinear curves, as shown in Fig. 5.5. We observed that while the
refracted image feature’s WNCC response was weaker compared to the Lambertian case, local
maxima were observed near the image feature’s corresponding location in the central view.
![Page 164: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/164.jpg)
142 5.4. EXPERIMENTAL RESULTS
Thus, taking the local maxima of the correlation EPI yielded the desired feature curves. Our
textural cross-correlation method enables us to extract image feature curves without relying
directly on raw image intensities.
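The extraction step can be sketched in a few lines of NumPy: slide a zero-mean template over each row (view) of an EPI and keep, per view, the location of the strongest windowed normalized cross-correlation response. This is a minimal 1-D illustration under our own simplifications, not the thesis implementation; the function names are ours.

```python
import numpy as np

def wncc(template, signal):
    """Normalized cross-correlation of a zero-mean template against
    every same-length window of a 1-D signal."""
    t = template - template.mean()
    n = len(t)
    out = np.zeros(len(signal) - n + 1)
    for i in range(len(out)):
        w = signal[i:i + n] - signal[i:i + n].mean()
        denom = np.linalg.norm(t) * np.linalg.norm(w)
        out[i] = (t * w).sum() / denom if denom > 0 else 0.0
    return out

def feature_curve(epi, template):
    """One point per view: the position of the strongest correlation
    response in each row of the EPI.  A Lambertian feature gives a
    linearly shifting maximum (a straight curve); a refracted feature
    gives a nonlinear one."""
    return np.array([np.argmax(wncc(template, row)) for row in epi])
```

For a feature that shifts by one pixel per view, the extracted curve is the straight line expected of a Lambertian point.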
5.4.2.2 Small-Baseline LF Camera Observations
Fig. 5.6 shows the horizontal and vertical EPIs for a refracted image feature taken from the
small-baseline crystal ball LF. The image feature curves appear straight, despite being distorted
by the crystal ball. However, we observed that the slopes of the horizontal and vertical curves
were inconsistent with each other, which could still be used to discriminate refracted image features.
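One way to quantify this slope-consistency cue is to fit a line to the feature curve in each of the horizontal and vertical EPIs and compare the two slopes, since a Lambertian point induces the same disparity per view in both. A sketch with an illustrative tolerance value:

```python
import numpy as np

def epi_slope(views, positions):
    """Least-squares slope of an image-feature curve: image position
    as a function of view index (straight for a Lambertian point)."""
    A = np.vstack([views, np.ones_like(views)]).T
    slope, _intercept = np.linalg.lstsq(A, positions, rcond=None)[0]
    return slope

def slope_consistent(h_curve, v_curve, views, tol=0.1):
    """For a Lambertian feature, both EPI slopes encode the same
    depth, so they should agree to within a small tolerance."""
    return abs(epi_slope(views, h_curve) - epi_slope(views, v_curve)) < tol
```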
5.4.2.3 Discrimination of Refracted Image Features
To discriminate refracted image features, thresholds for planarity and slope consistency were
selected by exhaustive search over a set of training LFs and evaluated on a different set of
LFs, with the exception of the crystal ball LFs, where only one LF was available for each baseline
b from the New Stanford Light Field Archive. For comparison to the state of the art, the parameter
search was performed independently for Xu's method and for our method, to allow for the
best performance of each method.
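The exhaustive search can be sketched as a grid search. We assume an F1 objective and an OR-combination of the two cues here; both are illustrative choices, not the thesis's exact procedure:

```python
import itertools

def f1_score(tp, fp, fn):
    """F1 from confusion counts (0.0 when there are no true positives)."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def threshold_search(measurements, labels, planarity_grid, slope_grid):
    """Exhaustively try every (planarity, slope-consistency) threshold
    pair and keep the one with the best F1 on the training features.
    `measurements` holds (planarity_error, slope_difference) pairs and
    `labels` marks the ground-truth refracted features."""
    best_pair, best_f1 = None, -1.0
    for th_p, th_s in itertools.product(planarity_grid, slope_grid):
        pred = [p > th_p or s > th_s for p, s in measurements]
        tp = sum(p and t for p, t in zip(pred, labels))
        fp = sum(p and not t for p, t in zip(pred, labels))
        fn = sum(not p and t for p, t in zip(pred, labels))
        score = f1_score(tp, fp, fn)
        if score > best_f1:
            best_pair, best_f1 = (th_p, th_s), score
    return best_pair, best_f1
```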
The ground truth refracted image features were identified via hand-drawn masks in the central
view. It was assumed that all visible features passing through the refractive object were
distorted. Detecting a refracted image feature was considered positive, while returning a Lam-
bertian image feature was considered negative. Thus, a true positive (TP) is a correctly identified
refracted image feature, while a true negative (TN) is a correctly identified Lambertian image feature.
A false positive (FP) is an incorrectly identified refracted image feature. A false negative (FN)
is an incorrectly identified Lambertian image feature, as shown in Fig. 5.7.
![Page 165: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/165.jpg)
CHAPTER 5. DISTINGUISHING REFRACTED IMAGE FEATURES 143
Figure 5.5: Comparison of sample image feature curves extracted for a Lambertian (top) and
refracted (bottom) feature from the large-baseline LF. (a) Sample Lambertian SIFT feature with
template used for WNCC (red). (b) A 3D view of the vertical correlation EPI overlaid with the
straight Lambertian image feature curve (red). (c) The same straight Lambertian feature curve
(red) overlaid in the original vertical EPI. (d) Sample refracted SIFT feature with template used
for WNCC (red). (e) The refracted image feature curve (red) in the vertical correlation EPI
can still be extracted, despite more complex “terrain”, and still matches (f) the refracted image
feature curve, which exhibits nonlinear behaviour in the original vertical EPI. For reference, the
image feature location is shown at (t0, v0) by the red dot in the vertical EPIs.
![Page 166: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/166.jpg)
144 5.4. EXPERIMENTAL RESULTS
Figure 5.6: Sample (a) horizontal and (b) vertical EPIs from the crystal ball LF with small
baseline. From the image feature’s location (u0, v0) in the central view (red), extracted image
feature curves (green) match the plane of best fit (dashed blue). In the small baseline LF,
refracted image features appear almost linear and are thus much more difficult to detect.
Figure 5.7: Illustrating true positive, true negative, false positive and false negative in the con-
text of refracted image feature discrimination.
From these definitions, we can compute precision and recall as performance measures. Preci-
sion is the fraction of features identified as refracted that are truly refracted,

Pr = TP / (TP + FP).   (5.26)

Recall is the fraction of truly refracted image features that are correctly identified,

Re = TP / (TP + FN).   (5.27)
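These definitions map directly to code. A minimal sketch, where `True` means classified (or labelled) as refracted:

```python
def confusion(pred, truth):
    """TP/TN/FP/FN counts for refracted-feature discrimination."""
    tp = sum(p and t for p, t in zip(pred, truth))
    tn = sum(not p and not t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(not p and t for p, t in zip(pred, truth))
    return tp, tn, fp, fn

def precision_recall(tp, fp, fn):
    pr = tp / (tp + fp) if tp + fp else 0.0  # Eq. (5.26)
    re = tp / (tp + fn) if tp + fn else 0.0  # Eq. (5.27)
    return pr, re
```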
![Page 167: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/167.jpg)
CHAPTER 5. DISTINGUISHING REFRACTED IMAGE FEATURES 145
Table 5.1: Comparison of our method and the state of the art using two LF
camera arrays and a lenslet-based camera for discriminating refracted image
features

                              State of the Art [Xu et al., 2015]   |  Proposed
                 b [mm]   TPR   TNR   FPR   FNR   Pr    Re         |  TPR   TNR   FPR   FNR   Pr    Re
array
  crystal ball    275     0.58  0.97  0.02  0.41  0.83  0.59       |  0.66  0.95  0.05  0.34  0.71  0.66
  crystal ball     68     0.42  0.91  0.08  0.89  0.35  0.42       |  0.63  0.94  0.05  0.37  0.55  0.63
lenslet
  sphere          1.1     0.43  0.36  0.64  0.58  0.18  0.08       |  0.48  0.95  0.04  0.52  0.79  0.83
  cylinder        1.1     0.08  0.80  0.20  0.92  0.72  0.43       |  0.82  0.81  0.13  0.24  0.97  0.48
Two LF camera setups were used for the crystal ball LF: a 275 mm baseline and a 68 mm base-
line. For the lenslet-based camera, ten LFs with a variety of different backgrounds were used
for each object type. The discrimination results are shown in Table 5.1 and discussed in the
following paragraphs. Fig. 5.8 shows sample views of refracted features (red) and Lambertian
features (blue).
Large-baseline LF Cameras For large-baseline LF cameras, such as the LF camera array
with a 275 mm baseline, our approach performed comparably to the state of the art: 14% lower
precision, but an 11% increase in recall. For large baselines, a significant
amount of apparent motion for many of the refracted image features was observed in the EPIs;
thus, refracted image features yielded nonlinear curves that strongly deviated from both 4D
hyperplanes. Therefore, a single threshold (that only accounted for a single hyperplane) was
sufficient to discriminate refracted image features.
The FPs included some occlusions, which appeared nonlinear in the EPI [Wanner and Goldluecke,
2014], but were not separated from refracted features by our implementation. However, this may
still be beneficial, as occlusions often cause unreliable depth estimates and are thus undesirable for
![Page 168: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/168.jpg)
146 5.4. EXPERIMENTAL RESULTS
most robotic vision feature-based algorithms. Sampling from all the views in the LF would
likely improve the results for both methods, as more data would improve the planar fit. Interest-
ingly, more accurate depth estimation near occlusions is a common motivation for using LF cameras
over conventional vision sensors [Ham et al., 2017, Tao et al., 2013].
Small-baseline LF Cameras For small-baseline LF cameras, such as the LF camera array
with a 68 mm baseline, and the lenslet-based plenoptic camera, we observed improved perfor-
mance with our method over the state of the art. For the crystal ball LF, our method had up to a
50% higher TP rate (TPR), up to a 58% lower FN rate (FNR), similar FP rates (FPR) and TN
rates (TNR), and generally better precision and recall compared to Xu’s method for the camera
array. We attributed these improvements to more accurately fitting the plane in 4D, as opposed
to a single hyperplane.
For the lenslet-based LF camera, we investigated two different types of refractive objects: a
glass sphere and an acrylic cylinder, shown in the bottom two rows of Fig. 5.8. The sphere
exhibited significant distortion along both the horizontal and vertical viewing axes, while the
cylinder only exhibited significant distortion perpendicular to its longitudinal axis.
When using the small-baseline lenslet-based LF camera, we observed significant improvement
in performance over the state of the art for all object types. As shown in Table 5.1, Xu's method
was unable to detect the refractive cylinder (TPR of 0.08), while our method succeeded with a
TPR 10 times higher. Our method had a 3.4 times increase in precision and 9.4 times increase
in recall for the sphere. The higher precision and recall imply that our method provides fewer
incorrect detections and misses fewer correct refracted image features compared to previous
work. We attribute this to accounting for slope consistency, which Xu’s method did not address.
In shorter-baseline LFs, the nonlinear characteristics of refracted image feature curves were
much less apparent, as in Fig. 5.6, but could still be distinguished by their inconsistent slopes.
![Page 169: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/169.jpg)
CHAPTER 5. DISTINGUISHING REFRACTED IMAGE FEATURES 147
Figure 5.8: Comparison of the state of the art (Xu's method) (left) and our method (right) for
discriminating between Lambertian (blue) and refracted (red) SIFT features. The top row shows
the crystal ball captured with a large-baseline LF (cropped). Both methods detect refracted
image features; however, our method outperforms Xu's. The second and third rows show a cylinder
and a sphere captured with a small-baseline lenslet-based LF camera. Our method successfully
detects more refracted image features with fewer false positives and negatives.
We observed that features located close to the edge of the sphere appeared more linear,
and thus were not always detected. Other FPs were due to specular reflections that appeared
like well-behaved Lambertian points. Finally, there were some FNs near the middle of the
sphere, where there is identical apparent motion in the horizontal and vertical hyperplanes.
This is a degenerate case for the current method, due to the symmetry of the refractive object.
Principal rays that are directly aligned with the camera are not significantly refracted (their
hyperplanes therefore appear linear and consistent with each other). However, the image of
these features appears flipped, and the scale of the object is also often changed. These indicators
may be considered in future work to address this issue.
![Page 170: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/170.jpg)
148 5.4. EXPERIMENTAL RESULTS
5.4.3 Rejecting Refracted Image Features for Structure
from Motion
Since too many refracted image features in a set of input image features can cause SfM to fail,
we examine the impact of rejecting refracted image features in an SfM pipeline. We captured
10 sequences of LFs in which the camera gradually approached a refractive object using the same
lenslet-based LF camera. These sequences were captured on a robot so that they were
repeatable and the ground truth of the LF camera poses was known. An OptiTrack motion-capture
system was used for the ground-truth camera poses. We used COLMAP, a publicly available
SfM implementation that includes its own outlier rejection and bundle adjustment [Schoenberger
and Frahm, 2016]. Incremental monocular SfM using the central view of the LF was
performed on the sequences of images. Each successive image had an increasing number of re-
fracted image features, making it increasingly difficult for SfM to converge. If SfM converged,
a sparse reconstruction was produced, and the estimated poses were further analysed. The scene
is shown in Fig. 5.1a with a textured, slanted background plane behind a refractive cylinder.
For each LF, SIFT features in the central view were detected, creating an unfiltered set of fea-
tures, some of which were refracted. Our discriminator was then used to remove refracted
image features, creating a filtered set of (ideally) only Lambertian features. Both sets were
imported separately into the SfM pipeline. This produced respective “unfiltered” and “filtered”
SfM results for comparison. The unfiltered case used all of the available image features, while
our method was applied to the filtered case to remove most of the refracted image features from
the SfM pipeline.
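The unfiltered/filtered split can be sketched as follows; the `Feature` container, the threshold values, and the `is_refracted` predicate stand in for this chapter's discriminator and are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    u: float                 # image position of the SIFT feature
    v: float
    planarity_error: float   # residual of the 4D planar fit
    slope_difference: float  # horizontal vs vertical EPI slope gap

def is_refracted(f, planarity_thresh=0.2, slope_thresh=0.1):
    # Illustrative thresholds only -- the thesis selects them by
    # exhaustive search over training LFs.
    return (f.planarity_error > planarity_thresh
            or f.slope_difference > slope_thresh)

def split_feature_sets(features):
    """Build the 'unfiltered' set (all detected features) and the
    'filtered' set (refracted features removed), which are fed
    separately into the SfM pipeline."""
    unfiltered = list(features)
    filtered = [f for f in unfiltered if not is_refracted(f)]
    return unfiltered, filtered
```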
We note that outlier rejection schemes, such as RANSAC, are often used to reject inconsistent
features, including refracted image features. While RANSAC successfully rejected many re-
fracted image features, we observed that more than 53% of the inlier features used for reconstruction
were actually refracted image features in some unfiltered cases. This suggested that, in the
presence of refractive objects, RANSAC is insufficient on its own for robust and accurate structure
and motion estimation.
We measured the ratio of refracted image features r = i_r / i_t, where i_r is the number of refracted
image features in the image, and i_t is the total number of features detected in the image. We
then considered how the reprojection error varied with r. As shown in Fig. 5.9, the error for the
unfiltered case was consistently higher than for the filtered case (up to 42.4% higher for r < 0.6 in
the red case). Additionally, the unfiltered case often failed to converge where the filtered case
succeeded. Sample scenes that caused the unfiltered SfM to fail
are shown in Fig. 5.10a and 5.10b. These scenes could not be used for SfM without our method
to find consistent image features for reconstruction.
For the monocular SfM, scale was obtained by solving the absolute orientation problem using
Horn's method between the estimated poses p_s and ground truth poses p_g, and retaining only
the scale component. Fig. 5.11a shows example pose trajectories reconstructed by SfM for a filtered
and unfiltered LF sequence, together with the ground truth. The filtered trajectory had a more accurate absolute
pose over the entire sequence of images. Fig. 5.11b and 5.11c show the relative instantaneous
pose error e_i, computed as

e_i = (p_{s,i} - p_{s,i-1}) - (p_{g,i} - p_{g,i-1})   (5.28)
for image i, split into translation and rotation components. To do this, we considered the po-
sition of the camera origin at image i as h_i = [P_x, P_y, P_z]^T. We can then write the translation
error e_tr for a sequence of images as the L2-norm of the instantaneous translation errors,

e_tr = sqrt( Σ_{i=1}^{n_LF} | (h_i - h_{i-1}) - (h_{g,i} - h_{g,i-1}) |² ),   (5.29)
where nLF is the number of LFs in the image sequence, and hg,i is the ground truth position at
image i. Similarly, we consider the orientation of the camera at image i as θ_i = [θ_r, θ_p, θ_y]^T for
![Page 172: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/172.jpg)
150 5.4. EXPERIMENTAL RESULTS
Figure 5.9: Rejecting refracted image features with our method yielded lower reprojection er-
rors and better convergence for the same image sequences. SfM reprojection error vs. refracted
image feature ratio for the unfiltered case containing all the features, including refracted image
features (dashed), and the filtered case excluding refracted image features (solid). The spike in error
at r = 0.6 for filtered sequence 2 was due to insufficient inlier matches for SfM to provide reliable
results.
roll, pitch and yaw in Euler angles (XYZ ordering). The rotation error e_rot for a sequence of
images is then the L2-norm of the instantaneous rotation errors,

e_rot = sqrt( Σ_{i=1}^{n_LF} | (θ_i - θ_{i-1}) - (θ_{g,i} - θ_{g,i-1}) |² ).   (5.30)
Although e_rot remained small (≈ 0.02°), e_tr for the unfiltered case was up to 0.01 m higher than
for the filtered case. This suggested that filtering for refracted image features yielded more accurate
pose estimates from SfM.
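Both the scale recovery and the relative pose errors reduce to a few NumPy lines. In the sketch below, inputs are (n_LF, 3) arrays of positions (or Euler angles); only the scale term of Horn's solution is shown, in its symmetric form (the ratio of RMS spreads about the centroids), which is our reading of "only using the scale":

```python
import numpy as np

def horn_scale(p_est, p_gt):
    """Scale aligning estimated to ground-truth positions: the
    symmetric scale of Horn's absolute-orientation solution, i.e.
    the ratio of RMS spreads about the respective centroids
    (invariant to rotation and translation)."""
    a = p_est - p_est.mean(axis=0)
    b = p_gt - p_gt.mean(axis=0)
    return np.sqrt((b ** 2).sum() / (a ** 2).sum())

def relative_pose_error(x_est, x_gt):
    """Eqs. (5.29)/(5.30): root-sum-of-squares of the instantaneous
    increment errors of an (n_LF, 3) position or Euler-angle sequence."""
    d = np.diff(x_est, axis=0) - np.diff(x_gt, axis=0)
    return np.sqrt((np.linalg.norm(d, axis=1) ** 2).sum())
```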
In Table 5.2, we show that filtering refracted image features leads to relative instantaneous pose
errors that are, on average, 4.28 mm lower in e_tr and 0.48° lower in e_rot over five LF sequences
with different objects, poses and backgrounds, excluding Seq. 6, where the number of inlier feature
matches for SfM dropped below 50. The number of LFs in each sequence varied, because the unfiltered
case could not converge with the additional images at the end of the sequence, where r was higher. Seq. 7
and 8 are examples where only our filtered case converged, so that SfM produced a trajectory
![Page 173: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/173.jpg)
CHAPTER 5. DISTINGUISHING REFRACTED IMAGE FEATURES 151
Figure 5.10: Both (a) and (b) show example images for the refractive cylinder and sphere,
respectively, where SfM could not converge without filtering out refracted image features using
our method.
Table 5.2: Comparison of mean relative instantaneous
pose error for unfiltered and filtered SfM-reconstructed
trajectories

                     Unfiltered                        Filtered
Seq.  #LFs   e_tr [mm]  e_rot [°]  #inliers   e_tr [mm]  e_rot [°]  #inliers
 1     10      18.86      5.72       160         8.09      4.52       127
 2     10      10.45      4.66       285         7.10      4.29       140
 3     10      10.17      4.52       281         6.94      4.09       186
 4      9      11.13      4.70       296         7.50      4.37       224
 5      8       6.07      4.47       201         5.66      4.39       196
 6     10       6.52      0.74       207        15.21      1.58        50
 7     10       N/A       N/A        N/A         8.51      4.02       155
 8     10       N/A       N/A        N/A         6.95      4.16       230
for analysis. Thus, filtering refracted image features using our method yielded more consistent
(non-refracted) image features that improved the accuracy of the SfM pose estimates compared
to not filtering for refracted image features, and made SfM more robust in the presence of
refractive objects.
For the cases where SfM converged in the presence of refractive objects, we created a sparse
reconstruction of the scene of Fig. 5.1, which was primarily the Lambertian background plane,
![Page 174: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/174.jpg)
152 5.4. EXPERIMENTAL RESULTS
Figure 5.11: For cases where SfM converged, filtering out the refracted image features yielded
more accurate pose estimates. (a) Sample pose trajectory, with the filtered case (red) closer to
ground truth (blue) than the unfiltered case (green). Relative instantaneous pose errors for
translation (b) and rotation (c) are shown over a sample LF sequence, where the filtered case
was consistently lower than the unfiltered case. (d) With our method, the refractive feature ratio
for the filtered case was lower than for the unfiltered case.
![Page 175: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/175.jpg)
CHAPTER 5. DISTINGUISHING REFRACTED IMAGE FEATURES 153
since we attempted to remove refracted image features distorted by the cylinder. Sample recon-
structions for both the unfiltered and filtered cases are shown in Fig. 5.12. Both point clouds
were centered about the origin and rotated into a common frame. For visualization, an overlay
of the scene geometry’s best fit to the background plane is provided. The unfiltered case had to
be re-scaled according to the scene geometry (as opposed to via the poses as done in Fig. 5.12)
for comparison. Scaling via scene geometry resulted in severely worse pose trajectories for the
unfiltered case, although similar observations were made: with our method, there were fewer
points placed within the empty space between the refracted object and the plane. This is an
important difference since the absence of information is treated very differently from incor-
rect information in robotics. For example, estimated refracted points might incorrectly fill an
occupancy map, preventing a robot from grasping refractive objects.
5.5 Conclusions
In this chapter, we proposed a method to discriminate refracted image features based on a
planar fit in 4D and slope consistency. To achieve this, we introduced a novel textural cross-
correlation technique to extract feature curves from the 4D LF. Our approach demonstrated
higher precision and recall than previous work for LF camera arrays, and extended the detection
capability to lenslet-based LF cameras. For these cameras, slope consistency proved to be a
much stronger indicator of distortion than planar consistency. This is appealing for mobile
robot applications, such as domestic robots that are limited in size and mass, but will have to
navigate and eventually interact with refractive objects. Future work will examine in more detail
the impact of thresholds on the discriminator through the use of precision-recall curves, as well
as relate image feature slopes to surface curvature to aid grasping.
It is important to note that while we have developed a set of criteria for refracted image features
in the LF, these criteria are not necessarily limited to refracted image features. Depending on
the surface, specular reflections may also appear nonlinear in the EPI. Such image features are
![Page 176: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/176.jpg)
154 5.5. CONCLUSIONS
(a) Side view, unfiltered (b) Side view, filtered
(c) Top view, unfiltered (d) Top view, filtered
Figure 5.12: For the scene shown in Fig. 5.1a, (a,c) the unfiltered case resulted in a sparse
reconstruction where many points were generated between the refractive cylinder (red) and the
background plane (blue). In contrast, (b,d) the filtered case resulted in a reconstruction with
fewer such points, and the resulting camera pose estimates were more accurate. The cylinder
and plane are shown to help with visualization only. The camera (green) represents the general
viewpoint of the scene, not the actual position of the camera.
typically undesirable, and so we retain image features that are strongly Lambertian, and thus
good candidates for matching, which ultimately leads to more robust robot performance in the
presence of refractive objects.
Our experiments have shown that we can exclude refracted image features in a scene containing
spherical and cylindrical refractive objects; however, it is likely that not all planar objects, such
as windows, would be detected by our method. Some types of glass with a homogeneous
refractive index may not be detected by our method because, by design, they do not significantly
distort the LF; a flat glass rectangular prism is one example. However, features viewed through curved
surfaces or non-homogeneous refractive indices, such as those commonly seen through privacy
glass and stained glass windows, should be detected based on the nonlinearities created by the
distortions of the object.
In this chapter we have explored the effect of removing the refractive content from the scene.
We have demonstrated that rejecting refracted image features for monocular SfM yields lower
reprojection errors and more accurate pose estimates in scenes that contain refractive objects.
The ability to more reliably perceive refractive objects is a critical step towards enabling robots
to reliably recognize, grasp and manipulate refractive objects. In the next chapter, we exploit
the refractive content to control robot motion.
![Page 178: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/178.jpg)
156 5.5. CONCLUSIONS
![Page 179: LIGHT FIELD FEATURES FOR ROBOTIC VISION IN THE PRESENCE … Yu Peng_Tsai_Thesis.pdf · game theory applied to the birds and the bees, until I changed topics to light fields and robotic](https://reader034.vdocuments.net/reader034/viewer/2022042413/5f2cde1df3dbfb30cd545018/html5/thumbnails/179.jpg)
Chapter 6
Light-Field Features for Refractive
Objects
For an eye-in-hand robot manipulator, and a refractive object surrounded by Lambertian scene
elements, we can use the Lambertian elements in the scene to approach the refractive object
using the LF-IBVS for Lambertian scenes developed in Chapter 4. The refractive object can be
partially detected via a variety of methods, such as the refracted image features as in Chapter 5,
or a different technique, such as using the occluding edges of the refractive object [Ham
et al., 2017]. However, as the camera's FOV becomes increasingly dominated by the refractive
object, the Lambertian scene content occupies an ever smaller portion of the image, to the point
where it is no longer available. In this situation, we must consider using the refractive object itself (and
thus the refracted image features) for positioning control tasks, such as visual servoing. In
this chapter, we combine the two previous chapters to develop a refracted light-field feature—a
light-field feature whose rays have been distorted by a refractive object—that will enable control
tasks, such as visual servoing towards refractive objects.
6.1 Refracted LF Features for Vision-based Control
If we consider the physics of a two-interface refractive object, the light path traced from the
point of origin, along the intersecting lines at the refractive object's boundaries, to the cam-
era sensor can be described by over twelve characteristics (see Fig. 3.2). The problem of
completely reconstructing this light path is severely under-constrained for a single LF camera
observation. However, the problem is more constrained for the task of position control, where
only several DOFs need to be controlled with respect to the object (as opposed to recovering
the complete object/scene geometry). Therefore, we approximate the local surface curvature
in two orthogonal directions, which allows us to model that part of the refractive object as a
type of lens. With an LF camera, we can observe the background projections caused by this
lens. We can describe these observations with at least five parameters in the LF, which we use
as our refracted light-field feature for refractive objects. This local description of the refractive
object is much simpler than a complete surface reconstruction. While it may not be sufficient to
fully reconstruct the shape of a refractive object, it will be sufficient for vision-based position
control tasks, such as visual servoing.
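To make the toric-lens approximation concrete, a thin-lens (lensmaker's) sketch gives a different focal length along each principal meridian when the two surface radii differ. This is our simplified single-element, thin-lens illustration, not the model developed later in the chapter:

```python
def meridian_focal_lengths(n, r_h, r_v, r2=float('inf')):
    """Thin-lens focal length along each principal meridian of a
    locally toric front surface: 1/f = (n - 1)(1/R1 - 1/R2).
    r_h, r_v: front radii of curvature in two orthogonal directions;
    r2: back-surface radius (planar by default)."""
    def focal(r1):
        power = (n - 1.0) * (1.0 / r1 - 1.0 / r2)
        return 1.0 / power
    return focal(r_h), focal(r_v)
```

For n = 1.5 and front radii of 0.1 m and 0.2 m on a plano-convex element, the two meridian focal lengths differ by a factor of two; equal radii recover an ordinary spherical lens.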
The main contributions of this chapter are as follows:
• We propose a compact representation for a refracted LF feature, which is based on the
local projections of the background through the refractive object. We assume that the
surface of the refractive object can be locally approximated as having two orthogonal
surface curvatures. We can then model the local part of the refractive object as a toric
lens. The properties of the local projections can then be observed and extracted from the
light field.
• We provide an analysis of our refracted LF feature’s behaviour in the LF in simulation. In
particular, we illustrate the feature’s continuity with respect to LF camera pose. Doing so
shows the potential for the feature’s use in vision-based control tasks towards refractive
objects.
The rest of this chapter is organised as follows. We discuss related work in Section 6.2. In
Section 6.3, we discuss the optics of the lens elements that can describe the behaviour of our
refracted LF feature. The formulation of our refracted LF feature and the method of extraction
from observations captured by the LF camera are described in Section 6.4. In Section 6.5, we
describe our implementation and discuss experimental results that illustrate the continuity
and suitability of our feature for a variety of refractive objects in simulation, for the purposes of
visual servoing. Lastly, in Section 6.7, we conclude the chapter and explore future work.
6.2 Related Work
Grasping and manipulation of refractive objects have been considered in previous work. Choi et
al. developed a method to localise refractive objects in real-time with a monocular camera [Choi
and Christensen, 2012]. Their method extracted contours from a given image and efficiently
matched them against a database of refractive object contours with known poses.
Walter et al. did so with an LF camera combined with an RGB-D sensor [Walter et al., 2015].
Lysenkov et al. recognised and estimated the pose of rigid transparent objects using an RGB-D
(structured-light) sensor [Lysenkov, 2013]. Recently, Zhou et al. used an LF camera to recognise
and grasp a refractive object by developing a light-field descriptor based on the distribution of
depths observed by the LF camera [Zhou et al., 2018]. However, all of these previous works
rely on having a 3D model of the object a priori. Complete and accurate geometric models of
refractive objects are extremely difficult or time-consuming to acquire.
While the reconstruction of opaque surfaces with Lambertian reflectance is a well-studied prob-
lem in computer and robotic vision, reconstructing the shape of refractive objects poses
challenging problems. Ihrke et al. provide an excellent survey on transparent and specular object
reconstruction [Ihrke et al., 2010a]. Kutulakos et al. developed light path theory on refractive
objects and performed refractive object reconstruction on complex inhomogeneous refractive
objects [Kutulakos and Steger, 2007, Morris and Kutulakos, 2007]. If the light paths can be
fully determined, the shape reconstruction is solved. However, from this work, it is clear that
for a two-interface object, there are many more parameters needed than can be measured di-
rectly by an LF camera. We are left with an underdetermined system of equations, which is
insufficient for shape reconstruction.
Taking a slightly different approach, Ben-Ezra et al. used multiple monocular images to re-
cover a parameterised refractive object shape and pose [Ben-Ezra and Nayar, 2003], while
Wanner et al. used LF cameras to reconstruct planar reflective and refractive surfaces [Wanner
and Goldluecke, 2013]. There are many other prior works that rely on controlling background
patterns [Kim et al., 2017, Kutulakos and Steger, 2007, Morris and Kutulakos, 2007, Wetzstein
et al., 2011], and shape assumptions [Kim et al., 2017, Tsai et al., 2015]. Many of these ap-
proaches rely on known lighting systems, large displays behind the refractive object in question
and other bulky setups that are impractical for real-world robots in general unstructured scenes.
We are interested in an approach that does not require large apparatus surrounding the refractive
object and does not require models of the entire refractive object. Our work is different from
these previous works in that we are not focused on the problem of reconstructing refractive ob-
ject surfaces. Rather, we aim to develop a refracted LF feature that will enable us to use visual
servoing to approach refractive objects.
In Chapter 4, we developed the first light-field image-based visual servoing algorithm by using
a feature based on central view image coordinates, augmented with slope [Tsai et al., 2017];
however, like many previous works, the implementation was limited to Lambertian scenes. We
revisit the Lambertian light-field feature and LF-IBVS in the context of refractive objects by
proposing a novel LF feature for refractive objects. To the best of our knowledge, a refracted
light-field feature for image-based visual servoing towards refractive objects has not yet been
proposed.
For LF features, Tosic et al. developed LF-edge features [Tosic and Berkner, 2014]; however,
our interest is in keypoint features, which tend to be more uniquely identifiable and are more
commonly applied to visual servoing and structure-from-motion tasks. Teixeira et al. used EPIs to
detect reliable Lambertian image features [Teixeira et al., 2017]. Similarly, Dansereau recently
proposed the Light-Field Feature (LIFF) detector and descriptor [Dansereau et al., 2019], which
focuses on detecting and describing reliable Lambertian image features in a scale-invariant man-
ner. However, all of these LF features are designed for Lambertian scenes, and are not suitable
for describing refracted image features.
Maeno et al. proposed the light-field distortion feature (LFD) [Maeno et al., 2013]. Xu et al.
built on the LFD and used it for transparent object image segmentation, but only characterised
a refracted feature as a single hyperplane [Xu et al., 2015]. In Chapter 5, we then developed a
refracted feature classifier for refracted image features using an LF camera [Tsai et al., 2019].
A Lambertian point feature was identified as a planar structure in the 4D LF, which can be
described by the intersection of two 4D hyperplanes. The nature of this 4D planar structure
changes in the light field when distorted by a refractive object, and was used for discriminating
refracted image features. Previously, only a limited subset of views (the central cross of the
LF) was used to describe the 4D planar structure. In this chapter, we use feature correspondences
from all of the LF views and extend the theory of how we can observe, extract
and estimate the 4D planar structure of a refracted light-field feature in the LF for the purposes
of visual servoing.
6.3 Optics of a Lens
We first assume that a large, complex refractive object can be sufficiently approximated by sev-
eral smaller parts. These parts are smooth and we constrain the surface to directionally-varying
curvature by choosing two orthogonal directions on the surface. A surface defined in this man-
ner is similar to a type of astigmatic lens, known as a toric lens, which is commonly used by
optometrists to describe and correct astigmatisms [Hecht, 2002]. Thus, we can approximate
small local parts of the refractive object as a toric lens. In general, refractive objects can project
the background into space, and lenses do this in a predictable manner. In this section, we pro-
vide a brief background in the optics of a spherical, cylindrical and finally toric lens, in order to
better understand how the appearance of a feature may be distorted by such a lens, and how it
may be observed in the light field. We describe our reasons for choosing the toric lens for our
refracted LF feature in Section 6.4.
6.3.1 Spherical Lens
One of the most common and simple lenses is the spherical lens. A convex spherical lens surface
is derived from a slice of a sphere, such that it has equal focal lengths in all orientations (it has a
single focal length) and thus focuses collimated light to a single point. As in geometrical optics,
we assume the light acts as rays (no waves). We assume we are in air, such that the index of
refraction nair = 1. We assume the lens is thin and we assume paraxial rays. The lens formula
is then given as
\[
\frac{1}{f} = (n - 1)\left[\frac{1}{R_1} - \frac{1}{R_2} + \frac{(n-1)d}{n R_1 R_2}\right], \tag{6.1}
\]
where n is the index of refraction of the lens material, R1 and R2 are the radii of curvature of the
front and back surfaces, and d is the thickness of the lens. For thin lenses, d is much smaller than
R1 and R2 and approaches zero. Equation (6.1) is useful because it relates surface curvature
to focal length, and can be used to derive the equation describing image formation, sometimes
called the lensmaker’s formula. As discussed in Section 2.2.2, the lensmaker’s formula is given
as
\[
\frac{1}{f} = \frac{1}{z_o} + \frac{1}{z_i}, \tag{6.2}
\]
where zo and zi describe the distance of the object and image, respectively, along the optical
axis of the lens. Therefore, given focal length f and zo, we can determine zi formed by the lens.
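As a numeric illustration of the two formulas above, the following minimal sketch evaluates (6.1) and (6.2); the function names and example values are our own illustrative assumptions, not from the thesis.

```python
# Minimal numeric sketch of Eqs. (6.1) and (6.2). Function names and
# example values are illustrative assumptions, not thesis code.

def lens_focal_length(n, R1, R2, d=0.0):
    """Eq. (6.1): focal length from refractive index, radii and thickness.
    For a thin lens, d -> 0 and the last term vanishes."""
    return 1.0 / ((n - 1.0) * (1.0 / R1 - 1.0 / R2 + (n - 1.0) * d / (n * R1 * R2)))

def image_distance(f, z_o):
    """Eq. (6.2): 1/f = 1/z_o + 1/z_i, solved for the image distance z_i."""
    return 1.0 / (1.0 / f - 1.0 / z_o)

# A symmetric biconvex thin lens (n = 1.5, |R| = 0.1 m) has f = 0.1 m;
# an object 1 m away then images at z_i = 1/9 m.
f = lens_focal_length(1.5, 0.1, -0.1)
z_i = image_distance(f, 1.0)
```
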
6.3.2 Cylindrical Lens
Cylindrical lenses are sliced from the shape of a cylinder. Cylindrical lenses also have a single
focal length, but focus collimated light into a line instead of a point. We refer to this line as
the focal line. The focal line is parallel to the longitudinal axis of the lens. Effectively, the lens
compresses the image of the background in the direction perpendicular to the focal line. The
background image is unaltered in the direction parallel to the focal line.
6.3.3 Toric Lens
A toric lens has two focal lengths in two orientations perpendicular to each other. As shown in
Fig. 6.1, the surface of a toric lens can be formed from a slice out of a torus. The surface of
a torus can be formed by revolving a circle of radius R2, about a circle of radius R1. A slice,
shown in dashed red, forms the surface of a toric lens. The radii of curvature are related to the
focal length, as in (6.1). An astigmatic lens is the more general form of the toric lens, in which
the axes of the two focal lengths are not constrained to be perpendicular to each other.
The two focal lengths cause a toric lens to focus light at two different distances from the lens,
resulting in two focal lines. A toric lens has the same optical effect as two perpendicular cylin-
drical lenses combined. Visually, this is seen as a “flattening” of rays with respect to their
respective axes at these two distances [Freeman and Fincham, 1990]. The shape of the bundle
of rays passing through the astigmatic lens is known as an astigmatic pencil. Mathematician
Jacques Sturm (1838) investigated the properties of the astigmatic pencil, and thus the astig-
matic pencil is also known as Sturm’s conoid. The distance between the focal lines is known as
the interval of Sturm. The circular cross-section where the pencil has the smallest area is known
as the circle of least confusion. Fig. 6.2 shows a rendering of the visual effect of a toric lens on
a background circle.
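To make the interval of Sturm concrete, the sketch below treats each meridian of a thin toric lens in air as a plano-convex surface with 1/f = (n − 1)/R, and computes the two focal-line distances for collimated light. This per-meridian treatment, and the names used, are our own assumptions for illustration.

```python
# Illustrative sketch (our assumptions, not thesis code): each meridian of
# a thin toric lens in air acts like a plano-convex lens, 1/f = (n - 1)/R.
# Under collimated (background-at-infinity) light, the two focal lines sit
# at the two focal lengths, and the gap between them is the interval of Sturm.

def toric_focal_lengths(n, R_a, R_b):
    f1 = R_a / (n - 1.0)  # focal length of the first meridian
    f2 = R_b / (n - 1.0)  # focal length of the orthogonal meridian
    return f1, f2

def interval_of_sturm(f1, f2):
    # Distance between the two focal lines for collimated input.
    return abs(f2 - f1)

f1, f2 = toric_focal_lengths(1.5, 0.05, 0.08)  # f1 = 0.1 m, f2 = 0.16 m
gap = interval_of_sturm(f1, f2)                # interval of Sturm, ~0.06 m
```

As the two radii approach each other, the interval shrinks to zero and the lens degenerates to a spherical lens with a single focal point.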
(a) (b)
Figure 6.1: (a) A torus can be defined by two radii, R1 and R2. The surface of a toric lens can
be sliced (dashed red) from a torus. (b) The toric lens surface is defined by the two radii of
curvature, and therefore two focal lengths f1 and f2. The direction of these two curvatures are
perpendicular to each other.
6.4 Methodology
There are three reasons for choosing to use the toric lens for locally modeling a large, complex
refractive object. First, it is reasonable to assume local orthogonal surface curvatures as a first
order approximation to any Euclidean surface. Second, it is one of the simplest refractive objects
that we can unambiguously use to describe a feature in relation to camera pose. Third, the toric
lens is more descriptive than a spherical lens in terms of describing the location and orientation
of the image created by projecting a Lambertian point through the lens. In this case, a spherical
lens is ambiguous in its orientation. In this section, we propose our refracted LF feature, which
is based on the background projections of a toric lens. We first define the feature, and then
describe our method to extract it from the LF.
A Lambertian point P emits rays of light that pass through a toric lens and into the LF camera.
Toric lenses project the background into 3D space through two focal lines, located at two differ-
ent distances from the lens that depend on the local surface curvatures. We can recover where
these focal lines occur in 3D based on the ray observations captured by the LF camera. Fur-
thermore, we can show that these vary continuously with respect to LF camera viewing pose,
Figure 6.2: A rendering of the visual effect of a toric lens on a blue background circle. In this
scene, a toric lens is aligned with the principal axis of a camera. The camera is moved along
this axis towards the lens. The toric lens is the transparent circular disk in the middle of the
images (1-9). For reference, the background is a checkerboard with a blue circle in the centre.
Far away (1), the blue circle appears as a flattened ellipse. At (3), the image of the blue circle is
almost completely flattened, and appears as a line at one of the focal lengths of the lens. As the
camera progresses closer, the effect of the two focal lengths acting on orthogonal axes balances
out. Image (6) shows the blue dot as a circle at the circle of least confusion. Moving forwards,
the circle is stretched vertically at the second focal length of the toric lens at (9). Finally, the
image appears almost undistorted at (12) when the camera is directly in front of the toric lens.
which makes these measurements suitable for positioning control tasks, such as visual servo-
ing. In sum, we propose a refracted LF feature based on the projections produced by local toric
lenses, which will be suitable for vision-based position and control tasks in scenes dominated
by a refractive object.
For our approach, we assume that the local surface curvatures of the refractive object can be
described by a toric lens. The validity of this assumption, and thus the effectiveness of our
method, depends on how smooth the surface of the refractive object is compared to the base-
line of the LF camera. A high-frequency surface curvature may make the background image
unmatchable and not locally well approximated by a toric lens. We also assume a thin lens,
although thick lenses can be considered in future work for more general refractive objects. We
assume that the background is infinitely far from the refractive object, such that we are dealing
with collimated light. Lastly, we assume that there is sufficient background texture to facilitate
image feature correspondence within the LF (i.e., between sub-images of the LF), which applies
to most feature-based robotic vision methods.
6.4.1 Refracted Light-Field Feature Definition
As described in Section 2.7, a Lambertian point in 3D induces a plane in 4D. This plane can
be described by the intersection of two 4D hyperplanes. Mathematically, the relation between the 3D
point and the LF observations can be described by (4.1). Each hyperplane can be described by
a normal vector. In Chapter 5, we showed that these normal vectors are related to the light-field
slope, which is inversely proportional to the depth of the point. For a Lambertian point, the
apparent motion profiles of the feature in the LF are linear and the two slopes from the two
hyperplanes are consistent with each other—they are equal in magnitude.
However, for a refracted image feature, these two motion profiles can be nonlinear and/or the
slopes can be inconsistent with each other. The latter implies that they can have different
magnitudes. We showed this to be sufficient to discriminate Lambertian image features from refracted
image features in Chapter 5. Astonishingly, a Lambertian point projected through a toric lens
also yields a plane in 4D. Although the normals are not necessarily equal in magnitude, as in
the Lambertian case, the apparent motion profiles are still linear. We can therefore describe the
projections from a toric lens using two slopes. We can also include a measure of orientation
of the toric lens with respect to the LF camera. In this section, we show how the 4D plane is
still formed through the projections of a toric lens, and how we can use this insight to develop a
refracted LF feature.
6.4.1.1 Two Slopes
As in Chapters 4 and 5, we parameterise the LF using the relative two-plane parameterisation
(2PP) [Levoy and Hanrahan, 1996]. A light ray φ emitting from a point P in the scene, has
coordinates φ = [s, t, u, v], and is described by two points of intersection with two parallel
reference planes. An s, t plane is conventionally closest to the camera, and a u, v plane is
conventionally closer to the scene, separated by arbitrary distance D.
(a) (b)
Figure 6.3: (a) Light-field geometry for a point in space for a single view (black), and other
views (grey) in the xz-plane, whereby u varies linearly with s for all rays originating from
P (Px, Pz). (b) A 2D (xz-plane) illustration of a background feature P that gets projected
through a toric lens (blue). The lens is characterised by focal length f and converges at the
focal line C. Note that C appears as a point here because C is a line into the page along the
y-axis. C is created by the rays (red). From P to C, the image created by the lens is
upright, but from C to the LF camera, the image flips and an inverted image is observed by the
2PP of the LF camera (green). In relation to Fig. 6.3a, it is clear that the LF camera’s slope
measurements capture the depth of the toric lens’ formed image.
Considering the xz-plane, when a Lambertian point P is projected through a thin toric lens, it
forms a line at C, which is subsequently captured by the LF camera. Fig. 6.3 illustrates the rays
traced from P to the observations captured by the light-field camera. It is important to note that
in the xz-plane, C appears as a focal point; however, in 3D, C actually represents a focal line.
In relation to Fig. 6.3a, Fig. 6.3b shows that an LF camera captures the location of the toric lens’
image formation point C. The rays are arranged in such a way that the LF camera captures C’s
slope for both the xz- and yz-planes. Additionally, the position of C depends on the position of
P in the background behind the lens, as in (6.2).
Although much of this discussion has been focused on the positions of the two orthogonal
focal lines, we note that the light is focused on a continuum of distances from the toric lens
along Sturm’s conoid. However, the most salient aspects of Sturm’s conoid that can be directly
observed in the LF are its end points. Therefore, light rays emitted from P are refracted by the
toric lens and converge to two different and orthogonal focal lines. These focal lines occur at
two different depths from the LF camera’s perspective.
6.4.1.2 Orientation
We can describe the orientation of the focal lines with respect to the LF camera. In opthalmol-
ogy, the optical axis of the toric lens is typically aligned with the principal axis of the eye (the
LF camera in our situation). The lens’ orientation is then described with a single angle θ as the
rotation about the principal axis from the x-axis of the LF camera to the xy-axes of the toric
lens. Fig. 6.4 illustrates the orientation of the toric lens θ with respect to the LF camera. If we
define f1 and f2 as the two focal lengths of the toric lens, we note that as the difference between
f1 and f2, becomes small, the interval of Sturm approaches a point and the lens approaches a
spherical lens. The focal lines then intersect at a focal point, and the orientation information
becomes poorly-defined and unusable.
Figure 6.4: The blue ellipse represents the toric lens. The lens orientation θ is defined as the
angle between the refractive object frame (xr, yr) and the camera frame (xc, yc) for the
axis-aligned focal lines relative to the camera frame. For notation, s, t are aligned with xc, yc.
6.4.1.3 Combined Slopes and Orientation
Our previous LF feature for Lambertian points was p = [u0, v0, w] in Chapter 4, where u0 and
v0 were the image coordinates of the feature in the central view of the LF (s = 0, t = 0), and
w was the slope. Accounting for both slopes and orientation of the toric lens, we can augment
our Lambertian LF feature as a refracted LF feature described by
RLF = [u0, v0, w1, w2, θ], (6.3)
where w1 and w2 are the two slopes related to the distances to the two focal lines of the toric
lens from the LF camera.
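For illustration, the five components of (6.3) could be held in a small container like the following; the class and field names are hypothetical choices of ours, not from the thesis.

```python
from dataclasses import dataclass

# Hypothetical container for the refracted LF feature of Eq. (6.3).
# Class and field names are illustrative, not thesis code.

@dataclass
class RefractedLFFeature:
    u0: float     # central-view image coordinate u
    v0: float     # central-view image coordinate v
    w1: float     # slope related to the first focal line
    w2: float     # slope related to the second focal line
    theta: float  # toric-lens orientation about the principal axis [rad]

    def as_vector(self):
        """Return the feature as the 5-vector [u0, v0, w1, w2, theta]."""
        return [self.u0, self.v0, self.w1, self.w2, self.theta]
```
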
Notably, for the axis-aligned case, where the principal axis of the LF camera is aligned with
the toric lens’ optical axis, our refracted LF feature follows the chief ray1 from the centre of
the toric lens to the centre of the LF camera. For the off-axis case (where the two axes are not
necessarily aligned), the refracted LF feature follows the LF camera’s chief ray to the u0, v0 in
the image plane. Regardless, each focal line must intersect the optical axis of the toric lens. In
either case, we can determine the 3D location of each of the two points of intersection, C1 and
C2, using similar triangles. The rays passing through the focal lines and into the LF camera all
pass through the line segment C1C2, which is known as the interval of Sturm. The line segment
C1C2 may be sufficient for visual servoing, as illustrated in Fig. 6.5, because, as with many local
feature-based approaches, it is also possible to consider multiple refracted LF features at the
same time.
Additionally, our refracted LF feature is not limited to refractive objects.
For Lambertian points, the two slopes for the refracted LF feature are equal in magnitude.
The 3D line segment of the refracted LF feature therefore reduces to a 3D point. By ignoring
1In optics, the chief ray, or principal ray, is the ray that passes through the centre of the aperture. Thus, chief
rays are equivalent to rays observed by a pinhole camera.
Figure 6.5: A Lambertian point P emits a ray of light that passes through the toric lens (blue).
The ray reaches the central view of the LF camera at {L}. The refracted light field feature (red)
is shown as the 3D line segment created by the position of the two focal lines, rotated by an
orientation with respect to the LF camera’s xy-axes along the chief ray. The central view image
coordinates u0, v0, the slopes w1 and w2, as well as the orientation θ define our refracted LF feature.
the orientation, our refracted LF feature generalises the Lambertian LF feature developed in
Chapter 4.
6.4.2 Refracted Light-Field Feature Extraction
In this section, we explain our method to extract the refracted LF feature from the LF. Using the
observations captured by the LF camera, we solve for the 4D plane as a 2D projection matrix.
We then decompose the projection matrix into scaling and rotation components, which allow us
to extract the slopes and orientation of the projections formed by the toric lens.
6.4.2.1 LF Observations through a Toric Lens
For the scenario outlined in Fig. 6.3b, a Lambertian point P in the background emits rays
of light that project through a toric lens and produce a plane in the continuous-domain LF. In
the discrete domain where we sample s, t in a uniform grid of points, projections appear as a
rectangular grid of points on the uv plane. As in Ch. 5, we consider the Light-Field Distortion
feature [Maeno et al., 2013] as a set of u, v relative to (u0, v0), the image coordinates of an
image feature in the central view (s0, t0). Then we can generally write the projection of P
through a toric lens as
\[
\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix}
= A \begin{bmatrix} s \\ t \end{bmatrix}
= \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix}
\begin{bmatrix} s \\ t \end{bmatrix}, \tag{6.4}
\]
where A is a 2 × 2 matrix. We note that if we have a spherical lens, or simply a Lambertian
point, then (6.4) simplifies to
\[
\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix}
= A_L \begin{bmatrix} s \\ t \end{bmatrix}
= \begin{bmatrix} w & 0 \\ 0 & w \end{bmatrix}
\begin{bmatrix} s \\ t \end{bmatrix}, \tag{6.5}
\]
where w = −D/Pz. For the case of P projecting through a toric lens, in (6.4), we can factorise
A into three components of SVD as
\[
A = A_L \Sigma_A A_R^{T}, \tag{6.6}
\]
where AL is a 2 × 2 matrix, ΣA is a diagonal matrix with non-negative real numbers on the
diagonal, and AR is also a 2× 2 matrix. The diagonal entries of ΣA are the singular values of
A and represent the two slopes of the projections of the toric lens, as seen by the LF camera.
The columns of AL and AR are the left-singular and right-singular vectors of A, respectively.
Intuitively, we can interpret this factorisation as three geometric transformations: a rotation or
reflection (AL), a scaling (ΣA), followed by another rotation or reflection (AR). The orientation
from AL should be the same as that from AR. We can later extract the slopes and orientation from these
three matrices. Therefore, in order to extract the slopes and orientation of the toric lens, we
must first recover the projection matrix A.
6.4.2.2 Projection Matrix
We can write (6.4) in terms of the elements of A as
\[
\underbrace{\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix}}_{\mathbf{b}}
= \underbrace{\begin{bmatrix} s & t & 0 & 0 \\ 0 & 0 & s & t \end{bmatrix}}_{F}
\underbrace{\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}}_{\mathbf{x}}. \tag{6.7}
\]
F is a matrix of at most rank two, because s = kt, where k ∈ R, which means we can reduce
the columns of F to a minimum of two independent columns. This equation has the common
form Fx = b. We can stack LF observations of s, t,∆u and ∆v for each corresponding point
in all n × n views of the LF and estimate a1, a2, a3, and a4 in the least-squares sense. We can
then form A and subsequently solve for the two slopes and the orientation.
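The stacked least-squares solve described above can be sketched as follows; this is a minimal NumPy illustration, and the function name and array layout are our own assumptions.

```python
import numpy as np

# Minimal sketch of Sec. 6.4.2.2 (names and layout are assumptions): stack
# the per-view observations (s, t, Δu, Δv) into F x = b as in Eq. (6.7)
# and solve for x = [a1, a2, a3, a4] in the least-squares sense.

def estimate_projection_matrix(s, t, du, dv):
    """s, t, du, dv: 1D arrays, one entry per sub-view correspondence."""
    s, t, du, dv = map(np.asarray, (s, t, du, dv))
    m = len(s)
    F = np.zeros((2 * m, 4))
    b = np.empty(2 * m)
    F[0::2, 0], F[0::2, 1] = s, t   # rows for Δu = a1*s + a2*t
    F[1::2, 2], F[1::2, 3] = s, t   # rows for Δv = a3*s + a4*t
    b[0::2], b[1::2] = du, dv
    x, *_ = np.linalg.lstsq(F, b, rcond=None)
    return x.reshape(2, 2)          # A = [[a1, a2], [a3, a4]]
```

Each feature correspondence contributes two rows to F, so an n × n LF camera array yields a heavily overdetermined system that averages down observation noise.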
6.4.2.3 Slope Extraction
We can extract the slopes as the negated diagonal entries of ΣA. We note that the singular
values are non-negative because the matrix AᵀA has non-negative eigenvalues, and the singular
values are the square roots of those eigenvalues,
\[
\Sigma_A = \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}, \tag{6.8}
\]
where σ1 and σ2 are the singular values of A. However, we know that the slopes for a point in
front of the LF camera with a positive D should be negative, based on (5.7). Then the slopes of
the toric lens projections are given as
\[
w_1 = -\sigma_1, \tag{6.9}
\]
\[
w_2 = -\sigma_2. \tag{6.10}
\]
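The slope extraction itself is a one-line use of the SVD; the sketch below (function name is our assumption) negates the singular values of the estimated projection matrix A.

```python
import numpy as np

# Sketch of Eqs. (6.8)-(6.10) (function name is an assumption): the two
# slopes are the negated singular values of the projection matrix A.

def extract_slopes(A):
    sigma = np.linalg.svd(np.asarray(A), compute_uv=False)  # sigma1 >= sigma2 >= 0
    return -sigma[0], -sigma[1]   # (w1, w2)
```

For a Lambertian point, A is a scaled identity and the two extracted slopes coincide, recovering the single-slope feature of Chapter 4.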
6.4.2.4 Orientation Extraction
In order to extract θ, we must first consider a 2D rotation and a 2D reflection. A 2D rotation
matrix has the form
\[
\mathrm{Rot}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \tag{6.11}
\]
For a 2D reflection, we can generally reflect vectors perpendicularly over a line that makes an
angle γ with the positive x-axis. The 2D reflection matrix then has the form
\[
\mathrm{Ref}(\gamma) = \begin{bmatrix} \cos 2\gamma & \sin 2\gamma \\ \sin 2\gamma & -\cos 2\gamma \end{bmatrix}. \tag{6.12}
\]
In our case, θ = γ, so the combined reflection and rotation matrix R is given as
\[
R = \mathrm{Ref}(\theta)\,\mathrm{Rot}(\theta) = \mathrm{Ref}\!\left(\theta - \tfrac{1}{2}\theta\right). \tag{6.13}
\]
This reduces to
\[
R(\theta) = \begin{bmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{bmatrix}. \tag{6.14}
\]
Applying (6.14) to AR and AL yields two angles. The first angle represents a rotation and
reflection to the principal axes of the LF observations on the uv-plane. The singular values
represent scaling along the principal axes of the LF observations. The last angle represents the
same rotation and reflection back to the original LF observations. Since we are dealing with 2D
rotations, these two angles should be equal. Thus, we only have to extract a single angle θ.
Unlike previous work, where we only considered the central cross (horizontal and vertical) of
all the sub-views in the LF, in this work we consider all of the sub-views. This improvement
allows us to better characterise refractive objects of different orientations (which were not
accounted for in previous work), and lets us use more of the information captured by the LF,
reducing uncertainty in the fit.
6.5 Experimental Results
For position control tasks, we are primarily interested in feature continuity. Continuity implies
that there are no abrupt breaks or jumps in a function. For our refracted LF feature, continuity
means that u0, v0, w1, w2 and θ all vary smoothly with respect to viewing pose, locally on
the surface of the refractive object. Methods such as visual servoing typically rely on feature
continuity to incrementally step towards the goal pose. In this section, we describe the two
implementations and preliminary experimental results for investigating the continuity of our
refracted LF feature with respect to a variety of viewing poses and different refractive object
types.
6.5.1 Implementations
We developed two implementations for investigating refracted LF feature continuity. First, we
developed a single-point ray simulation for a single Lambertian point through a toric lens. Note
that this is not a ray-tracing method in the classic sense of propagating rays from the source
to the camera sensor. The purpose of this setup was to provide a useful figure to illustrate the
nature of the toric lens, the focal lines, and act as a proof of concept for the refracted LF feature.
CHAPTER 6. LIGHT-FIELD FEATURES FOR REFRACTIVE OBJECTS 175
Second, we performed a ray-tracing simulation of a background scene, refractive object and
LF camera using Blender, a popular and freely-available rendering tool. We used the Cycles
Renderer option, which performed physics-based ray tracing for accurate renderings through
refractive objects. Additionally, the Light-Field Blender Add-On [Honauer et al., 2016] was
used to capture a set of LF camera array views. Geometric models were rendered as refractive by
assigning Blender’s “glass BSDF” material property, which used an index of refraction of 1.450.
In our ray-tracing simulation, we attempted to assess the validity of the toric lens assumption
towards more general refractive object shapes in order to assess the limitations of the refracted
LF feature.
A rendered sample LF reduced to 3 × 3 views is shown in Fig. 6.6. In this environment, we
simulated and tested our method against a variety of different object types and poses, shown
later in Fig. 6.14. The background was kept up to 100 times farther from the LF camera than
the refractive object, in order to approximate collimated light from a point source. We used
a flat checkerboard background to provide a visual reference
of the amount of distortion caused by the refractive objects. However, our implementation is
agnostic to the background pattern because we rely on a uniquely-coloured, solid blue circle on
the top surface of the background plane in order to aid image feature correspondence between
different LFs captured from different poses. Future work will involve different backgrounds,
including more realistic, non-planar scenes.
We ensured that the tracked circle was visible in all the views of the LF through the refractive
object, as in Fig. 6.6. Segmentation of the refracted blue circle was accomplished by
transforming the red, green and blue (RGB) colour representation into the hue, saturation and value
(HSV) colour space. Thresholds were used to segment the angular value for the blue hue, which
ranged from approximately 240 to 300 degrees in the HSV colour space. The centre of mass of
the largest blue-coloured dot was used as the centre of the circle, which was taken as the same
background Lambertian point for feature correspondence.
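The segmentation step can be sketched as follows. The hue thresholds mirror the 240–300 degree range above, but the exact values and the saturation gate are illustrative; for brevity this sketch takes the centroid of all blue pixels rather than selecting the single largest blob.

```python
import numpy as np

def blue_centroid(rgb, hue_lo=240.0, hue_hi=300.0, sat_min=0.2):
    """Centroid (row, col) of blue pixels in an RGB float image in [0, 1].
    Hue is computed per pixel in degrees; thresholds are illustrative.
    (Selection of the single largest blob is omitted for brevity.)"""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)
    c = v - rgb.min(axis=-1)                    # chroma
    safe_c = np.where(c > 1e-9, c, 1.0)
    hue = np.zeros_like(v)
    nz = c > 1e-9
    i = nz & (v == r)
    hue[i] = ((60.0 * (g - b) / safe_c) % 360.0)[i]
    i = nz & (v == g)
    hue[i] = (60.0 * (b - r) / safe_c + 120.0)[i]
    i = nz & (v == b)
    hue[i] = (60.0 * (r - g) / safe_c + 240.0)[i]
    sat = np.where(v > 0, c / np.where(v > 0, v, 1.0), 0.0)
    mask = (hue >= hue_lo) & (hue <= hue_hi) & (sat >= sat_min)
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return rows.mean(), cols.mean()

# Synthetic check: a blue rectangle on a white background.
img = np.ones((20, 20, 3))
img[5:10, 8:14] = [0.0, 0.0, 1.0]
cy, cx = blue_centroid(img)
```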
Figure 6.6: Ray tracing of a refractive object using Blender. Here, a toric lens is simulated for
5 × 5 views as an LF, although only 3 × 3 views are shown here. The nature of the toric lens
is visible—the square checkerboard background is elongated in the v-direction, indicating the
longer focal length along the vertical axis of the toric lens. The large circular blue dot was used
to aid feature correspondence. The blue dot appears as an ellipse due to the nature
of the toric lens.
We note that the centre of mass of a blob (for example, the ellipses in Fig. 6.6) that has been
distorted by a refractive object does not always reflect the precise centre of the circle. There
may be cases where extreme curvature and inhomogeneous structures in the refractive object
(such as bubbles or holes) can result in significant distortion, such that the circle’s centre no
longer matches the blob’s centroid in the rendered image. However, for homogeneous (no
holes or bubbles) and relatively smooth refractive objects, the centroid provides a reasonable
approximation to the coordinates of the centre of the circle in the rendered image.
6.5.2 Feature Continuity in Single-Point Ray Simulation
For the single-point ray simulation, we know the location of the Lambertian point, as well as
the pose and optical properties of the toric lens and LF camera. We can therefore determine
the location of the focal lines. The rays can then be projected from the st viewpoints of the LF
camera, through which we know the chief rays must pass. We assume paraxial rays. Fig. 6.7
illustrates rays of light emitted from a Lambertian point projected through a toric lens and into
an LF camera. The pencil-like shape of the rays is known as an astigmatic pencil. The colours
of the rays are coded with the two focal lengths of the toric lens. The 2D side-views clearly
indicate the rays pass through the two focal lines according to the two focal lengths of the toric
lens. Feature correspondence is known because we are tracing the rays individually through the
scene from the camera viewpoints.
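A minimal paraxial sketch of this geometry, under an assumed thin-lens model with illustrative focal lengths (not the thesis code): collimated rays entering a toric lens at height h cross the axis at z = fx in the xz-plane and z = fy in the yz-plane, producing the two focal lines.

```python
# Thin-lens paraxial refraction for a toric lens: a ray entering at
# height (x, y) has its slope changed by -x/fx and -y/fy on refraction.
fx, fy = 0.4, 0.8   # illustrative focal lengths [m]

def refract(x, y, sx, sy):
    """Paraxial slope change through the toric lens."""
    return sx - x / fx, sy - y / fy

def axis_crossings(h):
    """Axial crossing distances of a collimated ray entering at height h."""
    sx, sy = refract(h, h, 0.0, 0.0)
    return -h / sx, -h / sy   # z positions of the two focal lines

zx, zy = axis_crossings(0.05)
```

The crossings are independent of h, which is exactly why the refracted rays collapse onto two focal lines rather than a single focal point.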
Fig. 6.8 shows the estimated slopes for a pure translation along the z-axis, which changes the
distance between the refractive object and the LF camera. In this motion sequence, the LF camera was
moved closer to the refractive object. The ground truth was calculated from the slope equations
in Fig. 6.5. As expected, both slopes increased in magnitude (but decreased due to the negative
sign) as the LF camera moved closer towards the focal lines, and matched the ground truth.
Orientation was correctly estimated as a constant and so is not shown. Translations in x and y
also yielded constant slopes, and are therefore not shown. Similarly, Fig. 6.9 shows the correct
estimated orientation for a pure rotation about the z-axis of the LF camera. The slopes were
also correctly estimated as constant and so are not shown. In all of these plots, the refracted
LF feature is continuous with camera pose. This experiment also demonstrated that we can
correctly extract the refracted LF feature from simulated LF observations.
[Panels (a)–(c): 3D view and xz-/yz-plane projections; axes x, y, z in metres.]
Figure 6.7: Single point ray trace simulation. (a) 3D view of a Lambertian point (black) ema-
nating light rays through the toric lens (light blue, blue). The rays are refracted and pass through
the focal lines (red, magenta). The rays pass through the uv-plane (green) and into the LF cam-
era viewpoints (blue). (b) The xz-plane showing all the light rays passing through the magenta
focus line induced by fx. (c) The yz-plane, showing all the light rays passing through the red
focus line induced by fy.
[Plot: slope versus LF camera displacement z [m]; curves w1,gt, w2,gt, w1,est, w2,est.]
Figure 6.8: Our method correctly estimates the two slopes, w1,est, w2,est of the refracted light-
field feature, compared to the ground truth w1,gt, w2,gt for changing z-translation of the LF
camera.
[Plot: estimated orientation θ1 [deg] versus ground-truth rotation θgt [deg]; ground truth and estimated curves.]
Figure 6.9: Our method correctly estimates the orientation θ1 of the refracted light-field feature
for changing z-rotation of the LF camera.
6.5.3 Feature Continuity in Ray Tracing Simulation
In the ray-tracing simulation experiments, we extracted our refracted LF feature from rendered
LFs. Similar to the plots from Section 6.5.2, we considered basic motion sequences of the LF
camera, and plotted the elements of the refracted LF feature with respect to camera motion to
show continuity. Fig. 6.10 depicts an LF camera starting from the left and a toric lens (blue) on
the right. The LF camera is approaching the lens in a straight line. The refracted LF feature is
shown in red. Of the eight poses in this sequence, only three are shown for brevity.
As the LF camera moves closer to the lens, the refracted LF feature
slopes decrease in magnitude accordingly; however, the feature’s position in 3D space remains
constant. This is because the decrease in slope (and thus decrease in distance of the feature
from the camera) is offset by the forwards motion of the position of the LF camera. Fig. 6.11
shows the corresponding two slopes as a function of LF camera displacement from the starting
position for the corresponding LF camera motion sequence. The trends in Fig. 6.11 matched
what we anticipated, based on Fig. 6.8. In this case, the refracted LF feature’s two slopes were
continuous with respect to forwards and backwards motion along the z-axis.
Figure 6.10: Refracted LF feature (red) for the approach of an LF camera (left, blue) towards
a toric lens (right, blue). For visualisation, a straight line connecting the refracted LF feature
and the LF camera is shown (dashed green). As the LF camera moves closer (top to bottom),
the feature’s 3D line segment position remains constant, as we are measuring the same pencil
of light rays. Only three of the eight positions from the sequence are shown.
Similarly, Fig. 6.12 shows the recovered orientation estimates for rotating an ellipsoid about the
principal axis of the LF camera. The ellipsoid was aligned with the same axis and rotated from
-30 to 30 degrees. In this graph, we note that although the correct relative angles are recovered,
the entire line is centred about 90 degrees, instead of zero. This was likely due to the inherent
ambiguity of the SVD, where a rotation of 30 degrees from one axis is equivalent to 60 degrees from
the other axis of the toric lens. This ambiguity may be addressed by considering the heuristics
[Plot: slope versus z; curves w1,est, w2,est.]
Figure 6.11: Slope estimates for the entire approach towards the toric lens that was illustrated
in Fig. 6.10. Again, w1,est and w2,est represent the two estimated slopes for the toric lens. As
we approach the toric lens (decreasing z), we expect the slope to decrease in magnitude, which
we observe. We also note that the slopes appear continuous for z-translation.
of the problem, or by only considering small changes in orientation, and will be addressed in
future work.
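One possible heuristic for this ambiguity (an assumption of ours, not the thesis method) is to exploit the smallness of inter-frame orientation changes: among the candidates offset by multiples of 90 degrees, choose the one closest to the previous estimate.

```python
import numpy as np

def disambiguate(theta_deg, theta_prev_deg):
    """Resolve the 90-degree axis ambiguity of the SVD orientation by
    picking the candidate closest to the previous estimate, assuming
    small inter-frame orientation changes (a heuristic sketch)."""
    cands = theta_deg + 90.0 * np.arange(-2, 3)
    return float(cands[np.argmin(np.abs(cands - theta_prev_deg))])

# An estimate of 95 degrees, when the previous estimate was 10 degrees,
# is re-interpreted as 5 degrees about the other axis.
theta_a = disambiguate(95.0, 10.0)   # -> 5.0
theta_b = disambiguate(0.0, 85.0)    # -> 90.0
```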
Fig. 6.13 shows refracted LF features (red/orange/yellow) for a toric lens (light blue, right)
plotted in 3D from a grid of LF camera poses (blue squares, left). Note that a single blue square
represents an entire LF camera, as opposed to a single monocular camera. The regularity of the
grid of LF camera poses was an experimental design choice. The dashed lines (green) connect
the LF camera to its corresponding refracted LF feature. The refracted LF features are between
the LF camera poses and the toric lens, as expected. Interestingly, in traditional robotic vision,
Lambertian features do not move in 3D on their own. They are anchored in space (or attached
to some object), and are therefore clearly useful for localisation and image registration, among
other tasks. In Fig. 6.13, and in many of the refracted LF feature visualisations shown
in the following section, we note that our refracted LF features are not simply stationary. They
appear to move with the LF camera pose.
[Panel (a): rendered ellipsoid. Panel (b): plot of orientation θ [deg] versus z rotation [deg].]
Figure 6.12: (a) An elongated ellipsoid that was rotated about the principal axis of the LF
camera to capture orientation change. (b) Orientation estimate, which reflects the orientation
of the principal axis of the ellipsoid relative to the horizontal. Here, the z-rotation is rotation
about the principal axis of the camera. We note that even though the ellipsoid is not an ideal
toric lens, the orientation was still correctly recovered and it was also continuous with respect
to the camera rotation.
However, the feature’s movement due to camera pose was well-defined. The slopes define the
distance of the feature to the LF camera, and these appear to be consistently at -0.2 m and 0.4
m on the z-axis. We note that the layout of the cluster of refracted LF features closely mirrors
the ray patterns of the astigmatic pencil from Fig. 6.7. The uniform grid of LF camera poses
mimics the sampling pattern of an LF camera array. The direction of the refracted LF features is
clearly dictated by the toric lens’ two orthogonal focal lines and the LF camera pose. Although
one can think of the interval of Sturm as simply a line segment along the principal axis of the
toric lens, Fig. 6.13 reminds us that the interval of Sturm is actually a collection of rays along a
continuum defined by the two focal lines of the toric lens. We also note that the direction of the
refracted LF feature appears to change in a continuous manner with camera pose. Therefore, the
alignment of our refracted LF feature implies a corresponding alignment of the LF camera pose
to the toric lens. A position and alignment task in this case could take the form of line-segment
alignment.
[Panels (a)–(d): 3D and projected views over axes x, y, z in metres.]
Figure 6.13: (a) Refracted LF feature (red/orange/yellow) for a toric lens (right, light blue)
from a grid of LF camera positions (left, dark blue). Note that each blue square represents an
entire LF camera, not a single monocular camera. (b) Central view of the central LF, showing
the view of the flattened blue circle by the lens. The centre of the blue circle, (red star), was the
image feature that was tracked across the different LFs. (c) The top and (d) side views of the
refracted LF feature that clearly illustrate the focal lines of the toric lens at z of -0.2 and 0.4 m.
Note that the scale of the z-axis is much larger than the x- and y-axes, in order to clearly show
the refracted LF feature.
6.5.3.1 Different Object Types
We considered a variety of refractive object types from a set of poses in order to visualise our
refracted LF feature in 3D along with the respective LF camera poses and the refractive object
itself. Fig. 6.14 shows several of the objects, a sphere, a cylinder, and a “tornado”, along with
their corresponding refracted LF features sampled by a sequence or grid of LF camera poses.
First, Fig. 6.14b shows the case of a refractive sphere. As expected, the sphere focused the
refracted LF feature into a single 3D point. A spherical lens model, used in place of the toric
lens model, would also yield a viable refracted LF feature. A spherical refracted LF feature would be analogous to
a 3D Lambertian point for position and control tasks; however, the refracted sphere would not
be valid or as accurate for as many refractive object surfaces as the toric refracted LF feature.
Second, Fig. 6.14d shows the refracted LF features for a horizontal translation in x along a
cylinder. The features spread out in a fan at z = 12 m, which is the location of the cylinder’s
focal line. The single focal line is due to the curvature of the cylinder. The cylinder acts as a 1D
refractive element, and therefore the other slope is simply a shifted measure of the Lambertian
background. For this reason, the end points of the refracted LF features are approximately the same.
Finally, a tornado-shaped refractive object was rendered in Blender to represent a more com-
plex, but still relatively smooth type of object, shown in Fig. 6.14f. The refracted LF features
were estimated to be in front of the tornado; however, the features did not appear to have
common focal lines, unlike the refracted LF features of the toric lens in Fig. 6.13. Despite our
initial intention, we also noticed that the tornado model was surprisingly bumpy in its
curvature. This led to significant distortion caused by the refractive object. Several times, the
bumpiness of the refractive object separated the blue circle’s image (through the refractive ob-
ject) into two or more separate blobs, which greatly impacted the centroid measurements. LF
camera poses were selected so as to minimise and avoid this impact in our experiments. Mul-
[Panels (a)–(f): renderings and 3D feature plots over axes x, y, z.]
Figure 6.14: (a) For a sphere, the centroid of the blue circle (red star) was tracked throughout
the LF as a means of feature correspondence. (b) The sphere, with equal focal lengths in all
directions, forms an image of the background blue circle at a single point in space, which is
shown in the refracted LF features (red) that also encapsulate a point. Note that each blue
square illustrated here represents a full LF camera, as opposed to a single monocular camera.
The dashed green lines indicate which refracted LF feature matches to which LF camera pose.
(c) For a cylinder, (d) the projections of the blue circle appear at the physical location for the
cylinder-aligned focal direction, as expected. (e) For a “tornado”, (f) the refracted LF features
from a grid of LF camera poses appear almost straight, as if the focal lines of the approximated
local toric lenses are far away. The tornado represented a complex refractive object, but still
yielded a continuous set of refracted LF features.
tiple projections of the same point, caused by internal reflection and total refraction, may also
need to be considered in future work on image feature correspondence through refractive objects.
It is important to note that although our refracted LF feature is based on the assumption of
local surface curvatures, we cannot solve for the surface curvatures themselves given only our
refracted LF feature. Considering the lensmaker’s equation in (6.2), our method yields the
distance of image formation zi from the lens. We know that focal length f is intrinsically linked
to the surface curvature r. Therefore, in order to recover f , we require zo, the distance of the
object to the lens along the lens’ optical axis. However, despite this lack of knowledge, our
refracted LF feature is sufficient for the purposes of position control with respect to refractive
objects.
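To make this dependency concrete, consider the thin-lens relation underlying (6.2) (with the sign convention assumed here): given only the measured image distance, the focal length cannot be recovered without also knowing the object distance.

```python
def focal_length(z_o, z_i):
    """Thin-lens relation 1/f = 1/z_o + 1/z_i (sign convention assumed).
    With only z_i measured by the LF camera, f is ambiguous without
    the object distance z_o."""
    return 1.0 / (1.0 / z_o + 1.0 / z_i)

# Two different object distances yield different focal lengths for the
# same measured image distance, illustrating the ambiguity.
f_a = focal_length(2.0, 1.0)
f_b = focal_length(4.0, 1.0)
```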
6.6 Visual Servoing Towards Refractive Objects
To put the refracted LF feature into context, in this section, we provide an illustrative example
of visual servoing towards a refractive object, shown in Fig. 6.15. This has not yet been
implemented, and is discussed further as future work. An LF camera is mounted at the
end-effector of a robotic manipulator. A refractive object is placed in the scene with sufficient
visual texture in the background. The LF camera is moved to the goal pose in order to capture
a (set of) goal refracted LF feature(s). Then the LF camera is moved to an initial pose that
is close to the goal pose, so that the relevant refracted LF feature(s) can still be observed
within the camera’s FOV. The robotic system uses a control loop similar to Fig. 4.2 in order to
visually servo towards the goal pose. At each iteration, a refracted LF feature Jacobian, which
relates LF feature changes to camera spatial velocities, is computed and used to iteratively
step towards the goal pose until the difference(s) between the current and goal refracted LF
feature(s) is/are sufficiently small, thereby completing a visual servo towards a refractive object.
Approaches for computing this refracted LF feature Jacobian are discussed as future work in
Section 6.7.
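The control loop can be sketched as follows. The Jacobian J, gain, and feature values are placeholders (the true refracted LF feature Jacobian is left for future work), and a constant linear model stands in for the real feature dynamics.

```python
import numpy as np

lam = 0.5                                # servoing gain
J = np.array([[1.0, 0.2],
              [0.0, 1.0]])               # placeholder feature Jacobian
s_star = np.zeros(2)                     # goal feature
s = np.array([1.0, -0.5])                # current feature

for _ in range(50):
    # classic IBVS-style law: v = -lambda * J^+ (s - s*)
    v = -lam * np.linalg.pinv(J) @ (s - s_star)
    s = s + J @ v                        # simulated feature response
```

Under this linear model, each iteration contracts the feature error by the factor (1 − λ), so the feature converges to the goal.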
Figure 6.15: Concept for visual servoing towards a refractive object. At the start pose (red), a
starting refracted LF feature (red) is captured by observing the distorted images of the yellow
rubber duck in the LF. The LF camera moves (green) in order to align the current and the
goal (blue) refracted LF feature. Owing to the continuity of the refracted LF feature, feature
alignment corresponds with pose alignment, enabling the robot to reach the goal pose without
requiring a 3D geometric model of the refractive object.
6.7 Conclusions
Overall, we have developed a refracted light-field feature that may be used for positioning tasks
in robotic vision, such as visual servoing. Our feature approximates the surface of a refractive
object with two local orthogonal surface curvatures. We can describe this part of the refractive
object’s surface with a toric lens. The locations of the focal lines created by such a lens can be
measured by an LF camera. We have demonstrated that the location of these focal lines can then
be extracted from rendered light fields. By illustrating the continuity of our refracted light-field
feature over a variety of LF camera poses and refractive objects, we have shown that this
feature can enable visual servoing and other positioning tasks without the need for a geometric
model of the refractive object.
For future work, we are interested in deriving Jacobians for our refracted LF feature, which
would allow us to close the loop for visual servoing towards refractive objects. Part of our
feature extraction process relies on SVD, which potentially complicates
the Jacobian derivation. It may be possible to derive an analytical expression for w1, w2 and θ
via analytical expressions for the derivatives of singular values and singular vectors [Magnus,
1985]. Numerical methods could also be employed to estimate the Jacobian online [Jägersand,
1995]. An alternative approach is to simply derive Jacobians for the 3D line segments induced
by the refracted LF feature, as we illustrated in Fig. 6.10, 6.13, and 6.14. Deriving analytical
expressions for 3D points and line segments is likely more intuitive and straightforward.
Further investigation into denser LF camera pose sweeps, illustrating feature continuity
graphically for a larger variety of refractive objects and surface curvatures, would be useful
for testing the limitations of the toric lens assumption. It is also worth noting that the slopes
recovered in this chapter are related to the position of focal lines, and that these focal lines are
a function of surface curvature. Thus, it may be possible to use our refracted LF features to
augment techniques for refractive object surface reconstruction. Finally, it may be possible to
extend our refracted LF feature concept to include reflections, which also induce multiple depth
observations (multiple slopes) in the LF, and our orientation already provides a measure for
reflection.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
At the start of this thesis, we identified an opportunity to advance robotic vision in the area of
perceiving refractive objects. Although many robotic vision algorithms have been successful
assuming a Lambertian world, the real world is far from Lambertian. Water, ice, glass and clear
plastic in a variety of shapes and forms are common throughout the environments that robots
must operate within. Our goal in this research was to help remove the Lambertian assumption
in order to broaden the range of operable scenes and perceivable objects for robots.
We considered light-field cameras as a technology unique in their ability to capture scene tex-
ture, depth and view-dependent phenomena, such as occlusion, specular reflection and refrac-
tion. Furthermore, image-based visual servoing was chosen as a particularly interesting robotic
vision technique for its wide range of applicability, robustness against modelling and calibra-
tion errors, and because it did not necessarily require a 3D geometric model of the target object
to perform positioning and control tasks. Thus, the overall aim of this thesis was to use LF
cameras to advance robotic vision in the area of visual servoing towards refractive objects.
We decomposed this broad goal into the more manageable and specific objectives of demon-
strating (1) image-based visual servoing using light-field cameras for Lambertian scenes; (2)
detecting refracted image features using LF cameras; and (3) developing refracted LF features
for visual servoing towards refractive objects.
In addressing these objectives, the key developments were a result of exploring the properties
of the LF and developing algorithms to exploit them. The first objective was accomplished in
Chapter 4. LF cameras were used for image-based visual servoing. Specifically, we proposed
a novel Lambertian light-field feature and used it to derive image Jacobians from the light field
that were then used to control robot motion. To deal with the lack of available real-time LF
cameras, we designed a custom mirror-based light-field camera adapter. To the best of our
knowledge, this was the first published light-field image-based visual servoing algorithm. Our
method enabled more reliable VS compared to monocular and stereo IBVS approaches for small
or distant targets that occupy a narrow part of the camera’s FOV and in the presence of occlu-
sions. Areas in robotics that may benefit from this contribution include vision-based grasping,
manipulation and docking problems in household, medical and in-orbit satellite servicing ap-
plications.
For the second objective, discrimination of refracted image features from Lambertian image
features was accomplished in Chapter 5. We developed a discriminator based on detecting the
differences between the apparent motion of non-Lambertian and Lambertian image features in
the LF using textural cross-correlation that was more reliable than previous work. We were
able to extend these distinguishing capabilities to lenslet-based LF cameras, which are typically
limited to much smaller baselines than conventional LF camera arrays. Using our method to
reject refracted image features, we also enabled monocular SfM in the presence of refractive
objects, where traditional methods would normally fail. Domestic robots that clean dishes or
serve glasses, as well as manufacturing robots attempting to interact with or near clear plastic
packaging or heavily distorting refractive objects, such as stained glass or bottles of water, may
benefit from this research.
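The discrimination cue can be illustrated with a simple slope comparison on epipolar-plane images (EPIs): a Lambertian feature traces straight lines of equal slope in the horizontal and vertical EPIs, while a refracted feature generally does not. The sketch below substitutes a structure-tensor slope estimate for the thesis's textural cross-correlation, and the threshold and synthetic data are illustrative only.

```python
import numpy as np

def epi_slope(epi):
    """Apparent-motion slope estimate for a 2D epipolar-plane image
    (rows: view index, columns: pixel index), from the structure tensor.
    The sign convention is shared by both EPIs, so only the horizontal
    versus vertical comparison matters."""
    gy, gx = np.gradient(epi.astype(float))
    jxx, jyy, jxy = (gx * gx).sum(), (gy * gy).sum(), (gx * gy).sum()
    theta = 0.5 * np.arctan2(2 * jxy, jxx - jyy)  # dominant orientation
    return np.tan(theta)

def looks_lambertian(epi_h, epi_v, tol=0.1):
    """Lambertian points show (near-)equal apparent motion in the horizontal
    and vertical EPIs; a large slope mismatch suggests a refracted feature."""
    return abs(epi_slope(epi_h) - epi_slope(epi_v)) < tol

# Synthetic check: a texture shifting equally across both view axes
# (Lambertian) versus one with no apparent motion along one axis.
base = np.sin(2 * np.pi * np.arange(128) / 32)
epi_h = np.stack([np.roll(base, 2 * i) for i in range(9)])
epi_v = epi_h.copy()
epi_flat = np.tile(base, (9, 1))  # refracted-like: no motion on this axis
```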
CHAPTER 7. CONCLUSIONS AND FUTURE WORK 193
Finally, for the third objective, development of a refracted light-field feature to enable visual
servoing towards a refracted object was accomplished in Chapter 6. In particular, we proposed
and extracted a novel refracted LF feature that could be described by the local projections of
the refractive object. We demonstrated that our feature’s characteristics were continuous with
respect to LF camera pose to show that our feature was suitable for visual servoing without
requiring a 3D geometric model of the target refractive object.
7.2 Future Work
Over the course of this thesis, we have only scratched the surface of the unknown, uncovering
more questions and ideas that might answer them. In this section, we propose directions
of future research that might build upon and improve the current state of the art for the robotic
vision community.
In Chapter 6, we demonstrated the viability of our refracted light-field image feature for visual
servoing towards refractive objects. Further research in this direction is needed to achieve
a complete visual servoing system. Following our development of LF-IBVS in Chapter 4,
derivations for the refracted light-field feature Jacobian need to be performed. LF-IBVS can
also be implemented on a lenslet-based LF camera for comparison. Together, these tasks will
finally close the loop on visual servoing towards refracted objects.
Additionally, we recognise that VS only addresses part of the problem in enabling robots to
work with refractive objects. VS does not touch upon the area of interaction—grasping and
manipulation. We consider recent works that have enabled grasping of refractive objects, such
as [Zhou et al., 2018], which describes a refractive object as a distribution of depths obtained
from an LF camera. Comparisons to a 3D geometric model are made for object localization
and grasping. Zhou’s method relies on 3D models of refractive objects, while our method does
not require such explicit 3D models. Thus, there is interest in combining our two contributions
to further the functionality of robotic perception for vision-based manipulation of refractive
objects.
The performance and behaviour of VS strongly depends on the choice of image feature. The
LF feature used in Chapter 4 for LF-IBVS constrains a Lambertian point in the scene to a plane
in the 4D LF with equal slopes in all directions. We showed in Chapter 5 that we can describe
a plane in 4D to discriminate refracted from Lambertian image features with nonlinear feature
curves and unequal slopes in the horizontal and vertical directions of the LF. In Chapter 6, we
extracted a more general 4D planar light-field feature from the entire LF as a point with multiple
depths and an orientation, and demonstrated the feature's potential use for VS. However, it may
be possible to servo directly on the more general 4D planar structure within the LF.
Specifically, servoing based on the parameters that describe the plane in 4D (such as the plane’s
two linearly independent normals) provides a larger structure to estimate and track, compared
to individual point features, which may make the approach more robust in low light (night time)
and low contrast (foggy) conditions. This may also lead to analytical expressions of image
Jacobians for visual servoing towards refractive objects. Furthermore, recent advances in LF-
specific features, such as the Light-Field Feature detector and descriptor (LiFF) [Dansereau
et al., 2019], may similarly lead to improved performance and accuracy in VS.
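The point-plane correspondence discussed above can be sketched as a linear fit: under the two-plane parameterisation, a Lambertian point observed at (u, v) from sub-aperture view (s, t) moves linearly with the view index, with a common slope in both directions. The parameterisation below is a deliberately simplified stand-in for the thesis's feature, with invented example numbers.

```python
import numpy as np

def fit_point_plane(obs):
    """Least-squares fit of the slopes of a 4D point-plane LF feature.

    obs: rows of (s, t, u, v) -- the image-plane position (u, v) of one
    feature seen from sub-aperture view (s, t). Returns (mu, mv, u0, v0).
    For a Lambertian point mu == mv (equal slopes in all directions);
    unequal slopes are the refraction cue exploited in Chapter 5.
    """
    s, t, u, v = obs.T
    (mu, u0), *_ = np.linalg.lstsq(
        np.column_stack([s, np.ones_like(s)]), u, rcond=None)
    (mv, v0), *_ = np.linalg.lstsq(
        np.column_stack([t, np.ones_like(t)]), v, rcond=None)
    return mu, mv, u0, v0

# A Lambertian point at fixed depth: equal slope -0.3 along s and t.
views = np.array([(s, t) for s in (-1, 0, 1) for t in (-1, 0, 1)], dtype=float)
obs = np.column_stack([views[:, 0], views[:, 1],
                       5.0 - 0.3 * views[:, 0], 2.0 - 0.3 * views[:, 1]])
mu, mv, u0, v0 = fit_point_plane(obs)
```

Servoing on the fitted plane parameters, rather than on individual point observations, is precisely the larger and potentially more robust structure suggested above.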
An interesting research direction is refractive object shape reconstruction using LF cameras.
Previous work has shown that occlusion boundaries provide reliable depth information for re-
fractive objects; however, these approaches have relied on monocular cameras and motion
to collect multiple views [Ham et al., 2017]. Occlusion boundaries of refractive objects may
provide areas in the LF where the depth can be estimated. Local surface curvatures may be
estimated by comparing the depths of the occlusion boundaries to the corresponding depths of
the refracted LF feature from Chapter 6. These local surface curvatures and occlusion boundary
depths may be combined to approximately reconstruct refractive object shape.
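As a rough illustration of the final step, local curvature along a sampled depth profile (occlusion-boundary depths with interior depths from the refracted feature between them) could be estimated by finite differences. This is only a 1D toy of the proposed idea, not a method from the thesis.

```python
import numpy as np

def profile_curvature(x, z):
    """Discrete curvature kappa = z'' / (1 + z'^2)^(3/2) of a depth
    profile z(x), approximated with central differences. A 1D toy of the
    surface-reconstruction idea sketched above."""
    dz = np.gradient(z, x)
    d2z = np.gradient(dz, x)
    return d2z / (1.0 + dz ** 2) ** 1.5

# A parabolic depth profile z = 0.5 x^2 has curvature 1 at its apex.
x = np.linspace(-1.0, 1.0, 201)
kappa = profile_curvature(x, 0.5 * x ** 2)
```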
Alternatively, a deep learning approach might be considered to relate the characteristic image
feature curves from Chapter 5 to object depth and surface curvature. Deep learning techniques
might also be used to separate diffuse, specular and refracted image features. Such approaches
are typically reliant upon large amounts of ground truth data; however, ground truth data for
refractive objects is difficult to obtain and often very labour intensive. To address this issue, it
may be possible to rely on simulated ground truth data that use realistic ray-tracing to form the
bulk of the training data and then rely on only a small amount of real-world data for fine-tuning
the network. We may draw on the literature from the sim-to-real field, where this approach is
referred to as a domain adaptation technique.
Another interesting direction of research is to use the LF camera for virtual exploration. In
this thesis, image Jacobians were computed analytically for Lambertian scenes based on point
features that ultimately relied on an approximate model of the LF camera. In visual servoing,
there exist a variety of methods to compute the image Jacobian online without prior camera
models using a set of “test movements”, which are not part of the manipulation task [Jägersand,
1995, Piepmeier et al., 2004]. However, LF cameras capture a small amount of virtual motion
by virtue of their multiple views, similar to a local image-based derivative of robot motion.
Recently, there are also a variety of deep learning approaches to monocular VS [Lee et al.,
2017, Bateux et al., 2018]. Thus, an LF camera may be used to estimate the image Jacobian
by comparing these multiple views and the central view of the LF to some goal image. In a
related project, our recent work demonstrated that gradients from a multi-camera array could
be used to servo towards a target object in highly-occluded scenarios [Lehnert et al., 2019],
although a non-planar grid of cameras was used, rather than the planar grid of a traditional LF
camera array. Further research into these avenues may result in faster and simpler
visual servoing algorithms that can still operate in cluttered and non-Lambertian environments,
possibly without the need for LF camera calibration.
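The idea of treating the array's known baselines as built-in "test movements" can be sketched with finite differences: the translational columns of an image Jacobian follow from how a feature shifts between neighbouring sub-aperture views of known separation. The views and numbers below are hypothetical.

```python
import numpy as np

def translation_columns(f_centre, f_right, f_up, baseline):
    """Finite-difference estimate of the x- and y-translation columns of an
    image Jacobian, using neighbouring sub-aperture views separated by
    `baseline` (metres) in place of physical test movements."""
    col_x = (f_right - f_centre) / baseline  # d(feature)/d(x translation)
    col_y = (f_up - f_centre) / baseline     # d(feature)/d(y translation)
    return np.column_stack([col_x, col_y])

# Pinhole point at depth Z = 2 m, unit focal length: translating the camera
# by b shifts the image point by -b/Z, so both columns should contain -1/Z.
b, Z = 0.05, 2.0
f_centre = np.array([0.30, 0.10])
f_right = f_centre + np.array([-b / Z, 0.0])
f_up = f_centre + np.array([0.0, -b / Z])
J_t = translation_columns(f_centre, f_right, f_up, b)
```

The remaining rotational columns would need views off the camera plane or an analytical model, which is part of what makes this an open research direction.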
This work has largely addressed problems in the context of robotic vision. Taking a
broader view outside the field of robotics, this research will hopefully inspire new directions
in other fields, such as cinematography, virtual reality, mixed or augmented reality, video games
and consumer photography. In particular, augmented reality is an emerging technology that
may rely on light-field imaging. Augmented reality systems must address a problem similar
to that faced by robots—how to perceive the real world using limited sensor technologies,
whilst still enabling safe and reliable interaction. However, as humans are an integral part of
augmented reality, these interactions must also be real-time and realistic in appearance. An
improved understanding of how refractive objects behave in the light field may lead to more
realistic and faster renderings of scenes with refractive objects, as well as safer and more reliable
interaction.
Bibliography
[Adelson and Anandan, 1990] Adelson, E. H. and Anandan, P. (1990). Ordinal characteristics
of transparency. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of
Technology.
[Adelson and Bergen, 1991] Adelson, E. H. and Bergen, J. R. (1991). The plenoptic function
and the elements of early vision. Computational models of visual processing, 91(1):3–20.
[Adelson and Wang, 1992] Adelson, E. H. and Wang, J. Y. A. (1992). Single lens stereo with
a plenoptic camera. IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 14(2):99–106.
[Andreff et al., 2002] Andreff, N., Espiau, B., and Horaud, R. (2002). Visual servoing from
lines. The International Journal of Robotics Research, 21(8):679–699.
[Baeten et al., 2008] Baeten, J., Donné, K., Boedrij, S., Beckers, W., and Claesen, E. (2008).
Autonomous fruit picking machine: A robotic apple harvester. In Field and Service Robotics,
pages 531–539. Springer.
[Bateux and Marchand, 2015] Bateux, Q. and Marchand, E. (2015). Direct visual servoing
based on multiple intensity histograms. In IEEE International Conference on Robotics and
Automation.
[Bateux et al., 2018] Bateux, Q., Marchand, E., Leitner, J., Chaumette, F., and Corke, P. (2018).
Training deep neural networks for visual servoing. In IEEE International Conference on
Robotics and Automation, pages 3307–3314.
[Bay et al., 2008] Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. (2008). Speeded-up robust
features (SURF). Computer Vision and Image Understanding, 110(3):346–359.
[Ben-Ezra and Nayar, 2003] Ben-Ezra, M. and Nayar, S. K. (2003). What does motion reveal
about transparency? In Intl. Conference on Computer Vision (ICCV). IEEE Computer Society.
[Bergeles et al., 2012] Bergeles, C., Kratochvil, B. E., and Nelson, B. J. (2012). Visually ser-
voing magnetic intraocular microdevices. IEEE Transactions on Robotics, 28(4):798–809.
[Bernardes and Borges, 2010] Bernardes, M. C. and Borges, G. A. (2010). 3D line estimation
for mobile robotics visual servoing. In Congresso Brasileiro de Automática (CBA).
[Bista et al., 2016] Bista, S. R., Giordano, P. R., and Chaumette, F. (2016). Appearance-based
indoor navigation by IBVS using line segments. IEEE Robotics and Automation Letters,
1(1):423–430.
[Bolles et al., 1987] Bolles, R., Baker, H., and Marimont, D. (1987). Epipolar-plane image
analysis: An approach to determining structure from motion. Intl. Journal of Computer
Vision (IJCV), 1(1):7–55.
[Bolles and Fischler, 1981] Bolles, R. C. and Fischler, M. A. (1981). A RANSAC-based approach
to model fitting and its application to finding cylinders in range data. In IJCAI, volume 1981,
pages 637–643.
[Bourquardez et al., 2009] Bourquardez, O., Mahony, R., Guenard, N., Chaumette, F., Hamel,
T., and Eck, L. (2009). Image-based visual servo control of the translation kinematics of a
quadrotor aerial vehicle. Trans. on Robotics, 25(3).
[Cai et al., 2013] Cai, C., Dean-Leon, E., Mendoza, D., Somani, N., and Knoll, A. (2013).
Uncalibrated 3D stereo image-based dynamic visual servoing for robot manipulators. In
Intl. Conference on Intelligent Robots and Systems (IROS), pages 63–70. IEEE.
[Calonder et al., 2010] Calonder, M., Lepetit, V., Strecha, C., and Fua, P. (2010). BRIEF: Binary
robust independent elementary features. In European Conference on Computer Vision, pages
778–792. Springer.
[Cervera et al., 2003] Cervera, E., Del Pobil, A. P., Berry, F., and Martinet, P. (2003). Improv-
ing image-based visual servoing with three-dimensional features. The International Journal
of Robotics Research, 22(10-11):821–839.
[Chan, 2014] Chan, S. C. (2014). Light field. In Computer Vision A Reference Guide, pages
447–453. Springer Link.
[Chaumette, 1998] Chaumette, F. (1998). Potential problems of stability and convergence in
image-based and position-based visual servoing. Lecture Notes in Control and Information
Sciences, 237:66–78.
[Chaumette, 2004] Chaumette, F. (2004). Image moments: a general and useful set of features
for visual servoing. IEEE Transactions on Robotics, 20(4):713–723.
[Chaumette and Hutchinson, 2006] Chaumette, F. and Hutchinson, S. (2006). Visual servo
control part 1: Basic approaches. IEEE Robotics and Automation Magazine, 13(4):82–90.
[Chaumette and Hutchinson, 2007] Chaumette, F. and Hutchinson, S. (2007). Visual servo
control part 2: Advanced approaches. IEEE Robotics and Automation Magazine, pages
109–118.
[Choi and Christensen, 2012] Choi, C. and Christensen, H. (2012). 3D textureless object de-
tection and tracking: An edge-based approach.
[Christensen, 2016] Christensen, H. I. (2016). A roadmap for US robotics (2016) from internet
to robotics.
[Civera et al., 2008] Civera, J., Davison, A. J., and Montiel, J. M. (2008). Inverse depth
parametrization for monocular SLAM. IEEE Transactions on Robotics, 24(5):932–945.
[Collewet and Marchand, 2009] Collewet, C. and Marchand, E. (2009). Photometry-based
visual servoing using light reflexion models. In 2009 IEEE International Conference on
Robotics and Automation, pages 701–706. IEEE.
[Collewet and Marchand, 2011] Collewet, C. and Marchand, E. (2011). Photometric visual
servoing. Trans. on Robotics, 27(4).
[Comport et al., 2011] Comport, A. I., Mahony, R., and Spindler, F. (2011). A visual servoing
model for generalised cameras: Case study of non-overlapping cameras. In 2011 IEEE
International Conference on Robotics and Automation, pages 5683–5688. IEEE.
[Corke, 2013] Corke, P. (2013). Robotics, Vision and Control. Springer.
[Corke and Hutchinson, 2001] Corke, P. and Hutchinson, S. (2001). A new partitioned ap-
proach to image-based visual servo control. Transactions on Robotics and Automation,
17(4):507–515.
[Corke, 2017] Corke, P. I. (2017). Robotics, Vision and Control. Springer, 2 edition.
[Dalal and Triggs, 2005] Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for
human detection. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR).
[Dansereau, 2014] Dansereau, D. G. (2014). Plenoptic Signal Processing for Robust Vision in
Field Robotics. PhD thesis, University of Sydney.
[Dansereau and Bruton, 2007] Dansereau, D. G. and Bruton, L. T. (2007). A 4-D dual-fan
filter bank for depth filtering in light fields. IEEE Transactions on Signal Processing (TSP),
55(2):542–549.
[Dansereau et al., 2019] Dansereau, D. G., Girod, B., and Wetzstein, G. (2019). LiFF: Light
field features in scale and depth. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 8042–8051.
[Dansereau et al., 2011] Dansereau, D. G., Mahon, I., Pizarro, O., and Williams, S. B. (2011).
Plenoptic flow: Closed-form visual odometry for light field cameras. In Intl. Conference on
Intelligent Robots and Systems (IROS), pages 4455–4462. IEEE.
[Dansereau et al., 2013] Dansereau, D. G., Pizarro, O., and Williams, S. B. (2013). Decoding,
calibration and rectification for lenselet-based plenoptic cameras. In Intl. Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1027–1034. IEEE.
[De Luca et al., 2008] De Luca, A., Oriolo, G., and Robuffo Giordano, P. (2008). Feature depth
observation for image-based visual servoing: Theory and experiments. The International
Journal of Robotics Research, 27(10):1093–1116.
[Dong et al., 2013] Dong, F., Ieng, S.-H., Savatier, X., Etienne-Cummings, R., and Benosman,
R. (2013). Plenoptic cameras in real-time robotics. The Intl. Journal of Robotics Research,
32(2):206–217.
[Dong and Soatto, 2015] Dong, J. and Soatto, S. (2015). Domain-size pooling in local de-
scriptors: DSP-SIFT. In Proceedings of the IEEE Conference on Computer Vision and Pattern
recognition, pages 5097–5106.
[Drummond and Cipolla, 1999] Drummond, T. and Cipolla, R. (1999). Visual tracking and
control using Lie algebras. In Proceedings. 1999 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (Cat. No PR00149), volume 2, pages 652–657.
IEEE.
[Engel et al., 2014] Engel, J., Schoeps, T., and Cremers, D. (2014). LSD-SLAM: Large-scale
direct monocular SLAM. European Conference on Computer Vision (ECCV).
[Fischler and Bolles, 1981] Fischler, M. and Bolles, R. (1981). Random sample consensus: a
paradigm for model fitting with applications to image analysis and automated cartography.
[Freeman and Fincham, 1990] Freeman, M. H. and Fincham, W. H. A. (1990). Optics. Butter-
worths, London, 10th edition.
[Fritz et al., 2009] Fritz, M., Bradski, G., Karayev, S., Darrell, T., and Black, M. (2009). An
additive latent feature model for transparent object recognition.
[Fuchs et al., 2013] Fuchs, M., Kächele, M., and Rusinkiewicz, S. (2013). Design and fabrica-
tion of faceted mirror arrays for light field capture. In Computer Graphics Forum, volume 32,
pages 246–257. Wiley Online Library.
[Gao and Zhang, 2015] Gao, X. and Zhang, T. (2015). Robust RGB-D simultaneous localization
and mapping using planar point features. Robotics and Autonomous Systems, 72:1–14.
[Georgiev et al., 2011] Georgiev, T., Lumsdaine, A., and Chunev, G. (2011). Using focused
plenoptic cameras for rich image capture. IEEE Computer Graphics and Applications,
31(1):62–73.
[Gershun, 1936] Gershun, A. (1936). Fundamental ideas of the theory of a light field (vector
methods of photometric calculations). Journal of Mathematics and Physics, 18.
[Ghasemi and Vetterli, 2014] Ghasemi, A. and Vetterli, M. (2014). Scale-invariant represen-
tation of light field images for object recognition and tracking. In IS&T/SPIE Electronic
Imaging. International Society for Optics and Photonics.
[Godard et al., 2017] Godard, C., Mac Aodha, O., and Brostow, G. J. (2017). Unsupervised
monocular depth estimation with left-right consistency. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pages 270–279.
[Gortler et al., 1996] Gortler, S., Grzeszczuk, R., Szeliski, R., and Cohen, M. (1996). The
lumigraph. In SIGGRAPH, pages 43–54. ACM.
[Grossmann, 1987] Grossmann, P. (1987). Depth from focus. Pattern recognition letters,
5(1):63–69.
[Gu et al., 1997] Gu, X., Gortler, S., and Cohen, M. (1997). Polyhedral geometry and the two-
plane parameterisation. In Proc. Eurographics Workshop on Rendering Techniques, pages
1–12. Springer.
[Gupta et al., 2014] Gupta, S., Girshick, R., Arbeláez, P., and Malik, J. (2014). Learning rich
features from RGB-D images for object detection and segmentation. In European Conference
on Computer Vision, pages 345–360. Springer.
[Ham et al., 2017] Ham, C., Singh, S., and Lucey, S. (2017). Occlusions are fleeting - texture
is forever: Moving past brightness constancy. In WACV.
[Han et al., 2015] Han, K., Wong, K.-Y. K., and Liu, M. (2015). A fixed viewpoint approach
for dense reconstruction of transparent objects. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 4001–4008.
[Han et al., 2018] Han, K., Wong, K.-Y. K., and Liu, M. (2018). Dense reconstruction of trans-
parent objects by altering incident light paths through refraction. International Journal of
Computer Vision, 126(5):460–475.
[Han et al., 2012] Han, K.-S., Kim, S.-C., Lee, Y.-B., Kim, S.-C., Im, D.-H., Choi, H.-K.,
and Hwang, H. (2012). Strawberry harvesting robot for bench-type cultivation. Journal of
Biosystems Engineering, 37(1):65–74.
[Harris and Stephens, 1988] Harris, C. and Stephens, M. (1988). A combined corner and edge
detector. In Alvey vision conference, volume 15, page 50.
[Hartley and Zisserman, 2003] Hartley, R. and Zisserman, A. (2003). Multiple View Geometry
in Computer Vision. Cambridge.
[Hata et al., 1996] Hata, S., Saitoh, Y., Kumamura, S., and Kaida, K. (1996). Shape extraction
of transparent object using genetic algorithm. In Proceedings of 13th International Confer-
ence on Pattern Recognition, volume 4, pages 684–688. IEEE.
[Hecht, 2002] Hecht, E. (2002). Optics. Addison-Wesley, 4th edition.
[Hill, 1979] Hill, J. (1979). Real time control of a robot with a mobile camera. In 9th Int. Symp.
on Industrial Robots, 1979, pages 233–246.
[Hinton, 1884] Hinton, C. H. (1884). What is the fourth dimension? Scientific Romances,
1:1–22.
[Honauer et al., 2016] Honauer, K., Johannsen, O., Kondermann, D., and Goldluecke, B.
(2016). A dataset and evaluation methodology for depth estimation on 4D light fields. In
Asian Conference on Computer Vision, pages 19–34. Springer.
[Hutchinson et al., 1996] Hutchinson, S., Hager, G., and Corke, P. (1996). A tutorial on visual
servo control. Transactions on Robotics and Automation, 12(5):651–670.
[Ideguchi et al., 2017] Ideguchi, Y., Uranishi, Y., Yoshimoto, S., Kuroda, Y., and Oshiro, O.
(2017). Light field convergency: Implicit photometric consistency on transparent surface.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Work-
shops, pages 41–49.
[Ihrke et al., 2010a] Ihrke, I., Kutulakos, K., Lensch, H., Magnor, M., and Heidrich, W.
(2010a). Transparent and specular object reconstruction. Computer Graphics forum,
29:2400–2426.
[Ihrke et al., 2010b] Ihrke, I., Wetzstein, G., and Heidrich, W. (2010b). A theory of plenoptic
multiplexing. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR),
pages 483–490. IEEE.
[Irani and Anandan, 1999] Irani, M. and Anandan, P. (1999). About direct methods. In Work-
shop on Vision Algorithms. Springer.
[Iwatsuki and Okiyama, 2005] Iwatsuki, M. and Okiyama, N. (2005). A new formulation of
visual servoing based on cylindrical coordinate system. IEEE Transactions on Robotics,
21(2):266–273.
[Jachnik et al., 2012] Jachnik, J., Newcombe, R. A., and Davison, A. J. (2012). Real-time
surface light field capture for augmentation of planar specular surfaces. In Mixed and Aug-
mented Reality (ISMAR), 2012 IEEE Intl. Symposium on, pages 91–97. IEEE.
[Jägersand, 1995] Jägersand, M. (1995). Visual servoing using trust region methods and esti-
mation of the full coupled visual-motor jacobian. image, 11:1.
[Jang et al., 1991] Jang, W., Kim, K., Chung, M., and Bien, Z. (1991). Concepts of augmented
image space and transformed feature space for efficient visual servoing of an “eye-in-hand
robot”. Robotica, 9:203–212.
[Jerian and Jain, 1991] Jerian, C. P. and Jain, R. (1991). Structure from motion-a critical anal-
ysis of methods. IEEE Transactions on systems, Man, and Cybernetics, 21(3):572–588.
[Johannsen et al., 2017] Johannsen, O. et al. (2017). A taxonomy and evaluation of dense light
field depth estimation algorithms. In CVPR Workshop.
[Johannsen et al., 2015] Johannsen, O., Sulc, A., and Goldluecke, B. (2015). On linear struc-
ture from motion for light field cameras. In Intl. Conference on Computer Vision (ICCV),
pages 720–728.
[Johnson and Hebert, 1999] Johnson, A. and Hebert, M. (1999). Using spin images for efficient
object recognition in cluttered 3D scenes.
[Kemp et al., 2007] Kemp, C. C., Edsinger, A., and Torres-Jara, E. (2007). Challenges for
robot manipulation in human environments.
[Keshmiri and Xie, 2017] Keshmiri, M. and Xie, W.-F. (2017). Image-based visual servoing
using an optimized trajectory planning technique. IEEE/ASME Transactions on Mechatronics,
22(1):359–370.
[Kim et al., 2017] Kim, J., Reshetouski, I., and Ghosh, A. (2017). Acquiring axially-symmetric
transparent objects using single-view transmission imaging. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 3559–3567.
[Klank et al., 2011] Klank, U., Carton, D., and Beetz, M. (2011). Transparent object detection
and reconstruction on a mobile platform. In 2011 IEEE International Conference on Robotics
and Automation, pages 5971–5978. IEEE.
[Kompella and Sturm, 2011] Kompella, V. R. and Sturm, P. (2011). Detection and avoidance
of semi-transparent obstacles using a collective-reward based approach. In 2011 IEEE Inter-
national Conference on Robotics and Automation, pages 3469–3474. IEEE.
[Kragic and Christensen, 2002] Kragic, D. and Christensen, H. (2002). Survey on visual ser-
voing for manipulation.
[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet
classification with deep convolutional neural networks.
[Krotkov and Bajcsy, 1993] Krotkov, E. and Bajcsy, R. (1993). Active vision for reliable rang-
ing: Cooperating focus, stereo, and vergence. Int. Journal of Computer Vision, 11(2):187–
203.
[Kurt and Edwards, 2009] Kurt, M. and Edwards, D. (2009). A survey of BRDF models for
computer graphics. ACM SIGGRAPH Computer Graphics, 43(2):4.
[Kutulakos and Steger, 2007] Kutulakos, K. N. and Steger, E. (2007). A theory of refractive
and specular 3D shape by light-path triangulation. 76(1).
[Le et al., 2011] Le, M.-H., Woo, B.-S., and Jo, K.-H. (2011). A comparison of SIFT and Har-
ris corner features for correspondence points matching. In 2011 17th Korea-Japan Joint
Workshop on Frontiers of Computer Vision (FCV), pages 1–4. IEEE.
[Lee et al., 2017] Lee, A. X., Levine, S., and Abbeel, P. (2017). Learning visual servoing with
deep features and fitted q-iteration. arXiv preprint arXiv:1703.11000.
[Lee, 2005] Lee, H.-C. (2005). Introduction to Color Imaging Science. Cambridge University
Press.
[Lehnert et al., 2019] Lehnert, C., Tsai, D., Eriksson, A., and McCool, C. (2019). 3D Move to
See: Multi-perspective visual servoing for improving object views with semantic segmenta-
tion. In Intl. Conference on Intelligent Robots and Systems (IROS).
[Levin and Durand, 2010] Levin, A. and Durand, F. (2010). Linear view synthesis using a
dimensionality gap light field prior. In Intl. Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1831–1838. IEEE.
[Levoy and Hanrahan, 1996] Levoy, M. and Hanrahan, P. (1996). Light field rendering. In
SIGGRAPH, pages 31–42. ACM.
[Levoy et al., 2000] Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L.,
Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., et al. (2000). The digital michelangelo
project: 3D scanning of large statues. In Proceedings of the 27th annual conference on
Computer graphics and interactive techniques, pages 131–144. ACM Press/Addison-Wesley
Publishing Co.
[Li et al., 2008] Li, H., Hartley, R., and Kim, J.-h. (2008). A linear approach to motion estima-
tion using generalized camera models. In 2008 IEEE Conference on Computer Vision and
Pattern Recognition, pages 1–8. IEEE.
[Lippmann, 1908] Lippmann, G. (1908). Épreuves réversibles. Photographies intégrales.
Comptes-Rendus Académie des Sciences, 146:446–451.
[López-Nicolás et al., 2010] López-Nicolás, G., Guerrero, J. J., and Sagüés, C. (2010). Vi-
sual control through the trifocal tensor for nonholonomic robots. Robotics and Autonomous
Systems, 58(2):216–226.
[Low et al., 2007] Low, E. M., Manchester, I. R., and Savkin, A. V. (2007). A biologically in-
spired method for vision-based docking of wheeled mobile robots. Robotics and Autonomous
Systems, 55(10):769–784.
[Lowe, 2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints.
Intl. Journal of Computer Vision (IJCV), 60(2):91–110.
[Luke et al., 2014] Luke, J., Rosa, F., Marichal, J., Sanluis, J., Dominguez Conde, C., and
Rodriguez-Ramos, J. (2014). Depth from light fields analyzing 4D local structure. Journal
of Display Technology.
[Lumsdaine and Georgiev, 2008] Lumsdaine, A. and Georgiev, T. (2008). Full resolution light
field rendering. Technical report, Adobe Systems.
[Lumsdaine and Georgiev, 2009] Lumsdaine, A. and Georgiev, T. (2009). The focused plenop-
tic camera. In Computational Photography (ICCP), pages 1–8. IEEE.
[Luo et al., 2015] Luo, R., Lai, P.-J., and Ee, V. W. S. (2015). Transparent object recognition
and retrieval for robotic bio-laboratory automation applications. Intl. Conference on Intelli-
gent Robots and Systems (IROS).
[Lysenkov, 2013] Lysenkov, I. (2013). Recognition and pose estimation of rigid transparent
objects with a Kinect sensor. Robotics: Science and Systems VIII, page 273.
[Lytro, 2015] Lytro (2015). Lytro Illum User Manual. Lytro Inc., Mountain View, CA.
[Maeno et al., 2013] Maeno, K., Nagahara, H., Shimada, A., and Taniguchi, R.-I. (2013). Light
field distortion feature for transparent object recognition. In Intl. Conference on Computer
Vision and Pattern Recognition (CVPR). IEEE.
[Magnus, 1985] Magnus, J. R. (1985). On differentiating eigenvalues and eigenvectors. Econo-
metric Theory, 1(2):179–191.
[Mahony et al., 2002] Mahony, R., Corke, P., and Chaumette, F. (2002). Choice of image fea-
tures for depth-axis control in image based visual servo control. In Intl. Conference on
Intelligent Robots and Systems (IROS), pages 390–395. IEEE.
[Malis and Chaumette, 2000] Malis, E. and Chaumette, F. (2000). 2 1/2 D visual servoing with
respect to unknown objects through a new estimation scheme of camera displacement. In-
ternational Journal of Computer Vision, 37(1):79–97.
[Malis et al., 1999] Malis, E., Chaumette, F., and Boudet, S. (1999). 2 1/2 D visual servoing.
IEEE Transactions on Robotics and Automation, 15(2):238–250.
[Malis et al., 2000] Malis, E., Chaumette, F., and Boudet, S. (2000). Multi-cameras visual
servoing. In Robotics and Automation (ICRA), pages 3183–3188. IEEE.
[Malis and Rives, 2003] Malis, E. and Rives, P. (2003). Robustness of image-based visual
servoing with respect to depth distribution errors. In 2003 IEEE International Conference
on Robotics and Automation (Cat. No. 03CH37422), volume 1, pages 1056–1061. IEEE.
[Marchand and Chaumette, 2017] Marchand, E. and Chaumette, F. (2017). Visual servoing
through mirror reflection. In 2017 IEEE International Conference on Robotics and Automa-
tion (ICRA), pages 3798–3804. IEEE.
[Mariottini et al., 2007] Mariottini, G. L., Oriolo, G., and Prattichizzo, D. (2007). Image-based
visual servoing for nonholonomic mobile robots using epipolar geometry. IEEE Transactions
on Robotics, 23(1):87–100.
[Marto et al., 2017] Marto, S. G., Monteiro, N. B., Barreto, J. P., and Gaspar, J. A. (2017).
Structure from plenoptic imaging. In 2017 Joint IEEE International Conference on Devel-
opment and Learning and Epigenetic Robotics (ICDL-EpiRob), pages 338–343. IEEE.
[McFadyen et al., 2017] McFadyen, A., Jabeur, M., and Corke, P. (2017). Image-based vi-
sual servoing with unknown point feature correspondence. IEEE Robotics and Automation
Letters, 2(2):601–607.
[McHenry et al., 2005] McHenry, K., Ponce, J., and Forsyth, D. (2005). Finding glass. In
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), volume 2, pages 973–979. IEEE.
[Mehta and Burks, 2014] Mehta, S. and Burks, T. (2014). Vision-based control of robotic ma-
nipulator for citrus harvesting. Computers and Electronics in Agriculture, 102:146–158.
[Mezouar and Allen, 2002] Mezouar, Y. and Allen, P. K. (2002). Visual servoed microposi-
tioning for protein manipulation tasks. In IEEE/RSJ International Conference on Intelligent
Robots and Systems, volume 2, pages 1766–1771. IEEE.
[Miyazaki and Ikeuchi, 2005] Miyazaki, D. and Ikeuchi, K. (2005). Inverse polarisation ray-
tracing: estimating surface shapes of transparent objects. Intl. Conference on Computer
Vision and Pattern Recognition (CVPR).
[Morris and Kutulakos, 2007] Morris, N. J. W. and Kutulakos, K. N. (2007). Reconstructing
the surface of inhomogeneous transparent scenes by scatter-trace photography. 76(1).
[Muja and Lowe, 2009] Muja, M. and Lowe, D. G. (2009). Fast approximate nearest neighbors
with automatic algorithm configuration. VISAPP (1), 2(331-340):2.
[Mukaigawa et al., 2010] Mukaigawa, Y., Tagawa, S., Kim, J., Raskar, R., Matsushita, Y., and
Yagi, Y. (2010). Hemispherical confocal imaging using turtleback reflector. In Computer
Vision–ACCV 2010, pages 336–349. Springer.
[Murase, 1990] Murase, H. (1990). Surface shape reconstruction of an undulating transparent
object. In [1990] Proceedings Third International Conference on Computer Vision, pages
313–317. IEEE.
[Neumann and Fermuller, 2003] Neumann, J. and Fermuller, C. (2003). Polydioptric camera
design and 3D motion estimation. Intl. Conference on Computer Vision and Pattern Recog-
nition (CVPR).
[Newcombe et al., 2011] Newcombe, R. A., Lovegrove, S., and Davison, A. J. (2011). DTAM:
dense tracking and mapping in real-time. In Intl. Conference on Computer Vision (ICCV),
pages 2320–2327.
[Ng et al., 2005] Ng, R., Levoy, M., Bredif, M., Duval, G., Horowitz, M., and Hanrahan, P.
(2005). Light field photography with a hand-held plenoptic camera. Technical report, Stan-
ford University Computer Science.
[O’Brien et al., 2018] O’Brien, S., Trumpf, J., Ila, V., and Mahony, R. (2018). Calibrating light
field cameras using plenoptic disc features. In 2018 International Conference on 3D Vision
(3DV), pages 286–294. IEEE.
[Pages et al., 2006] Pages, J., Collewet, C., Chaumette, F., and Salvi, J. (2006). An approach to
visual servoing based on coded light. In Proceedings 2006 IEEE International Conference
on Robotics and Automation, 2006. ICRA 2006., pages 4118–4123. IEEE.
[Papanikolopoulos and Khosla, 1993] Papanikolopoulos, N. P. and Khosla, P. K. (1993). Adap-
tive robotic visual tracking: Theory and experiments. IEEE Transactions on Automatic Con-
trol, 38(3):429–445.
[Pedrotti, 2008] Pedrotti, L. S. (2008). Fundamentals of Photonics.
[Perwass and Wietzke, 2012] Perwass, C. and Wietzke, L. (2012). Single lens 3D-camera with
extended depth-of-field. In IST/SPIE Electronic Imaging, pages 829108–829108. Interna-
tional Society for Optics and Photonics.
[Phong, 1975] Phong, B. T. (1975). Illumination for computer generated pictures. Communi-
cations of the ACM, 18(6):311–317.
[Piepmeier et al., 2004] Piepmeier, J. A., McMurray, G. V., and Lipkin, H. (2004). Uncali-
brated dynamic visual servoing. IEEE Transactions on Robotics and Automation, 20(1):143–
147.
[Quadros, 2014] Quadros, A. J. (2014). Representing 3D Shape in Sparse Range Images for
Urban Object Classification. Thesis, University of Sydney.
[Raytrix, 2015] Raytrix (2015). Raytrix Light Field SDK.
[Rosten et al., 2009] Rosten, E., Porter, R., and Drummond, T. (2009). Faster and better: A
machine learning approach to corner detection. IEEE Trans. Pattern Analysis and Machine
Intelligence (to appear).
[Rublee et al., 2011] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. (2011). ORB: An
efficient alternative to SIFT or SURF. In Intl. Conference on Computer Vision (ICCV).
[Salti et al., 2014] Salti, S., Tombari, F., and Stefano, L. D. (2014). SHOT: Unique signatures
of histograms for surface and texture description. Computer Vision and Image Understand-
ing, 125:251–264.
[Saxena et al., 2006] Saxena, A., Chung, S. H., and Ng, A. Y. (2006). Learning depth from
single monocular images. In Advances in neural information processing systems, pages
1161–1168.
[Saxena et al., 2008] Saxena, A., Driemeyer, J., and Ng, A. (2008). Robotic grasping of novel
objects using vision. International Journal of Robotics Research.
[Schlick, 1994] Schlick, C. (1994). A survey of shading and reflectance models. In Computer
Graphics Forum, volume 13, pages 121–131. Wiley Online Library.
[Schoenberger and Frahm, 2016] Schoenberger, J. and Frahm, J.-M. (2016). Structure-from-
motion revisited. CVPR.
[Schoenberger et al., 2017] Schoenberger, J., Hardmeier, H., Sattler, T., and Pollefeys, M.
(2017). Comparative evaluation of hand-crafted and learned local features. Intl. Confer-
ence on Computer Vision and Pattern Recognition (CVPR).
[Shafer, 1985] Shafer, S. A. (1985). Using color to separate reflection components. Color
Research & Application, 10(4):210–218.
[Shi and Tomasi, 1993] Shi, J. and Tomasi, C. (1993). Good features to track. Technical report,
Cornell University.
[Siciliano and Khatib, 2016] Siciliano, B. and Khatib, O. (2016). Springer handbook of
robotics. Springer.
[Smith et al., 2009] Smith, B. M., Zhang, L., Jin, H., and Agarwala, A. (2009). Light field
video stabilization. In Intl. Conference on Computer Vision (ICCV).
[Song et al., 2015] Song, W., Liu, Y., Li, W., and Wang, Y. (2015). Light-field acquisition using
a planar catadioptric system. Optics Express, 23(24):31126–31135.
[Strecke et al., 2017] Strecke, M., Alperovich, A., and Goldluecke, B. (2017). Accurate depth
and normal maps from occlusion-aware focal stack symmetry. In Intl. Conference on Com-
puter Vision and Pattern Recognition (CVPR).
[Sturm et al., 2011] Sturm, P., Ramalingam, S., Tardif, J.-P., Gasparini, S., Barreto, J., et al.
(2011). Camera models and fundamental concepts used in geometric computer vision. Foundations
and Trends® in Computer Graphics and Vision, 6(1–2):1–183.
[Szeliski et al., 2000] Szeliski, R., Avidan, S., and Anandan, P. (2000). Layer extraction from
multiple images containing reflections and transparency. In Proceedings IEEE Conference
on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), volume 1,
pages 246–253. IEEE.
[Tahri and Chaumette, 2003] Tahri, O. and Chaumette, F. (2003). Application of moment in-
variants to visual servoing. In 2003 IEEE International Conference on Robotics and Au-
tomation (Cat. No. 03CH37422), volume 3, pages 4276–4281. IEEE.
[Tao et al., 2013] Tao, M. W., Hadap, S., Malik, J., and Ramamoorthi, R. (2013). Depth
from combining defocus and correspondence using light field cameras. In Computer Vision
(ICCV), 2013 IEEE International Conference on, pages 673–680. IEEE.
[Tao et al., 2016] Tao, M. W., Su, J.-C., Wang, T.-C., Malik, J., and Ramamoorthi, R. (2016).
Depth estimation and specular removal for glossy surfaces using point and line consistency
with light field cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence,
38(6):1155–1168.
[Teixeira et al., 2017] Teixeira, J. A., Brites, C., Pereira, F., and Ascenso, J. (2017). Epipolar
based light field key-location detector. In Multimedia Signal Processing.
[Teulière and Marchand, 2014] Teulière, C. and Marchand, E. (2014). A dense and direct ap-
proach to visual servoing using depth maps. IEEE Transactions on Robotics, 30(5):1242–
1249.
[Tombari et al., 2010] Tombari, F., Salti, S., and Stefano, L. D. (2010). Unique signatures of
histograms for local surface description. ECCV.
[Torr and Zisserman, 2000] Torr, P. H. and Zisserman, A. (2000). MLESAC: A new robust
estimator with application to estimating image geometry. Computer Vision and Image Under-
standing, 78(1):138–156.
[Tosic and Berkner, 2014] Tosic, I. and Berkner, K. (2014). 3D keypoint detection by light
field scale-depth space analysis. In Image Processing (ICIP). IEEE.
[Triggs et al., 2000] Triggs, B., McLauchlan, P., Hartley, R., and Fitzgibbon, A. (2000). Bundle
adjustment - a modern synthesis. Vision Algorithms, pages 298–372.
[Tsai et al., 2015] Tsai, C.-Y., Veeraraghavan, A., and Sankaranarayanan, A. C. (2015). What
does a single light-ray reveal about a transparent object? In 2015 IEEE International Con-
ference on Image Processing (ICIP), pages 606–610. IEEE.
[Tsai et al., 2016] Tsai, D., Dansereau, D., Martin, S., and Corke, P. (2016). Mirrored Light
Field Video Camera Adapter. Technical report, Queensland University of Technology.
[Tsai et al., 2017] Tsai, D., Dansereau, D. G., Peynot, T., and Corke, P. (2017). Image-based
visual servoing with light field cameras. IEEE Robotics and Automation Letters, 2(2):912–
919.
[Tsai et al., 2019] Tsai, D., Dansereau, D. G., Peynot, T., and Corke, P. (2019). Distinguishing
refracted features using light field cameras with application to structure from motion. IEEE
Robotics and Automation Letters, 4(2):177–184.
[Tsai et al., 2013] Tsai, D., Nesnas, I., and Zarzhitsky, D. (2013). Autonomous vision-based
tether-assisted rover docking. In Intl. Conference on Intelligent Robots and Systems (IROS).
IEEE.
[Tuytelaars et al., 2008] Tuytelaars, T., Mikolajczyk, K., et al. (2008). Local invariant feature
detectors: a survey. Foundations and trends in computer graphics and vision, 3(3):177–280.
[Vaish et al., 2006] Vaish, V., Levoy, M., Szeliski, R., Zitnick, C., and Kang, S. (2006). Re-
constructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures.
In Intl. Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages
2331–2338. IEEE.
[Verdie et al., 2015] Verdie, Y., Yi, K. M., Fua, P., and Lepetit, V. (2015). TILDE: A temporally
invariant learned DEtector. Intl. Conference on Computer Vision and Pattern Recognition
(CVPR).
[Walter et al., 2015] Walter, C., Penzlin, F., Schulenburg, E., and Elkmann, N. (2015). En-
abling multi-purpose mobile manipulators: Localization of glossy objects using a light field
camera. In Conference on Emerging Technologies & Factory Automation (ETFA), pages 1–8.
IEEE.
[Wanner and Goldluecke, 2014] Wanner, S. and Goldluecke, B. (2014). Variational light
field analysis for disparity estimation and super-resolution. IEEE Trans. on Pattern Analysis
and Machine Intelligence, 36(3).
[Wanner and Goldluecke, 2012] Wanner, S. and Goldluecke, B. (2012). Globally consistent
depth labeling of 4D light fields. In Intl. Conference on Computer Vision and Pattern Recog-
nition (CVPR).
[Wanner and Goldluecke, 2013] Wanner, S. and Goldluecke, B. (2013). Reconstructing reflec-
tive and transparent surfaces from epipolar plane images. Proc. 35th German Conf. Pattern
Recog.
[Wei et al., 2013] Wei, Y., Kang, L., Yang, B., and Wu, L. (2013). Applications of structure
from motion: a survey. Journal of Zhejiang University-SCIENCE C (Computers & Electron-
ics), 14(7).
[Weisstein, 2017] Weisstein, E. W. (2017). Hyperplane. http://mathworld.wolfram.com/Hyperplane.html. [Online; accessed 19-July-2017].
[Wetzstein et al., 2011] Wetzstein, G., Roodnick, D., Heidrich, W., and Raskar, R. (2011). Re-
fractive shape from light field distortion. In Intl. Conference on Computer Vision (ICCV),
pages 1180–1186. IEEE.
[Wilburn et al., 2004] Wilburn, B., Joshi, N., Vaish, V., Levoy, M., and Horowitz, M. (2004).
High-speed videography using a dense camera array. In Intl. Conference on Computer Vision
and Pattern Recognition (CVPR), volume 2, pages II–294. IEEE.
[Wilburn et al., 2005] Wilburn, B., Joshi, N., Vaish, V., Talvala, E., Antunez, E., Barth, A.,
Adams, A., Horowitz, M., and Levoy, M. (2005). High performance imaging using large
camera arrays. ACM Transactions on Graphics (TOG), 24(3):765–776.
[Wilson et al., 1996] Wilson, W. J., Hulls, C. W., and Bell, G. S. (1996). Relative end-effector
control using cartesian position based visual servoing. IEEE Transactions on Robotics and
Automation, 12(5):684–696.
[Xu et al., 2015] Xu, Y., Nagahara, H., Shimada, A., and Taniguchi, R.-I. (2015). TransCut:
Transparent object segmentation from a light field image. Intl. Conference on Computer
Vision and Pattern Recognition (CVPR).
[Yamamoto, 1986] Yamamoto, M. (1986). Determining three-dimensional structure from im-
age sequences given by horizontal and vertical moving camera. Denshi Tsushin Gakkai
Ronbunshi (Transactions of the Institute of Electronics, Information and Communication
Engineers of Japan), pages 1631–1638.
[Yeasin and Sharma, 2005] Yeasin, M. and Sharma, R. (2005). Foveated vision sensor and im-
age processing–a review. In Machine Learning and Robot Perception, pages 57–98. Springer.
[Yi et al., 2016] Yi, K. M., Trulls, E., Lepetit, V., and Fua, P. (2016). LIFT: Learned invariant
feature transform. arXiv.
[Zeller et al., 2015] Zeller, N., Quint, F., and Stilla, U. (2015). Narrow field-of-view visual
odometry based on a focused plenoptic camera. In ISPRS Annals of the Photogrammetry,
Remote Sensing and Spatial Information Sciences.
[Zhang et al., 2018] Zhang, K., Chen, J., and Chaumette, F. (2018). Visual servoing with tri-
focal tensor. In 2018 IEEE Conference on Decision and Control (CDC), pages 2334–2340.
IEEE.
[Zhang et al., 2017] Zhang, Y., Yu, P., Yang, W., Ma, Y., and Yu, J. (2017). Ray space features
for plenoptic structure-from-motion. In Proceedings of the IEEE International Conference
on Computer Vision, pages 4631–4639.
[Zhou et al., 2018] Zhou, Z., Sui, Z., and Jenkins, O. C. (2018). Plenoptic Monte Carlo object
localization for robot grasping under layered translucency. In 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE.
Appendix A
Mirrored Light-Field Video Camera
Adapter
This appendix proposes the design of a custom mirror-based light-field camera adapter
that is cheap, simple to construct, and accessible. Mirrors of different shape and orientation
reflect the scene into an upwards-facing camera to create an array of virtual cameras with over-
lapping fields of view at specified depths, and deliver light fields at video frame rates. We
describe the design, construction, decoding and calibration processes of our mirror-based
light-field camera adapter in preparation for an open-source release to benefit the robotic
vision community.
The latest report, computer-aided design models, diagrams and code can be obtained from the
following repository:
https://bitbucket.org/acrv/mirrorcam.
A.1 Introduction
Light-field cameras are a new paradigm in imaging technology that may greatly augment the
computer vision and robotics fields. Unlike conventional cameras that only capture spatial
information in 2D, light-field cameras capture both spatial and angular information in 4D using
multiple views of the same scene within a single shot [Ng et al., 2005]. Doing so implicitly
encodes geometry and texture, and allows for depth extraction. Capturing multiple views of
the same scene also allows light-field cameras to handle occlusions [Walter et al., 2015] and
non-Lambertian (glossy, shiny, reflective, transparent) surfaces, which break many modern
computer vision and robotic techniques [Vaish et al., 2006].
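To make the 4D structure concrete, a decoded light field can be treated as an array L[s, t, u, v]: one 2D image (u, v) per angular sample (s, t). The sketch below (Python with NumPy; the dimensions and the shift-and-average refocusing routine are illustrative, not this thesis's implementation) shows two operations this representation makes trivial: extracting a sub-aperture view and synthetic refocusing.

```python
import numpy as np

# A 4D light field L[s, t, u, v] stores one 2D view (u, v) per
# angular sample (s, t). Dimensions here are illustrative.
S, T, U, V = 3, 3, 48, 64          # 3x3 views of 48x64 pixels
lf = np.random.rand(S, T, U, V)    # stand-in for decoded sensor data

# A single sub-aperture image is one fixed (s, t) slice: a
# conventional 2D photo taken from one virtual viewpoint.
centre_view = lf[S // 2, T // 2]

# Shifting each view in proportion to its angular offset and
# averaging synthesizes a photo focused at the chosen disparity:
# points at that depth align across views, everything else blurs.
def refocus(lf, disparity):
    S, T, U, V = lf.shape
    out = np.zeros((U, V))
    for s in range(S):
        for t in range(T):
            shift = (int(round((s - S // 2) * disparity)),
                     int(round((t - T // 2) * disparity)))
            out += np.roll(lf[s, t], shift, axis=(0, 1))
    return out / (S * T)

refocused = refocus(lf, disparity=1.0)
assert refocused.shape == (U, V)
```

Points at the chosen disparity stay sharp while everything else averages out, which is also the principle behind the synthetic-aperture occlusion handling cited above [Vaish et al., 2006].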
Robots must operate in continually changing environments on relatively constrained platforms.
As such, the robotics community is interested in low cost, computationally inexpensive, and
real-time camera performance. Unfortunately, there is a scarcity of commercially available
light-field cameras appropriate for robotics applications. Specifically, no commercial camera
delivers 4D light fields at video frame rates1. Building a full camera array instead brings
synchronization, bulk, input-output and bandwidth issues. The advantages of our mirror-based
approach are video frame-rate light-field capture allowing real-time performance, the ability
to customize the design to optimize the key performance metrics required for the application,
and ease of fabrication. The main disadvantages are a lower resolution, a smaller FOV2, and a
more complex decoding process.
Therefore, we constructed our own light-field video camera by employing a mirror-based adapter.
This approach splits the camera's field of view into sub-images using an array of planar mirrors.
By appropriately positioning the mirrors, a grid of virtual views with overlapping fields of view
can be constructed, effectively capturing a light field. We 3D-printed the mount based on our
design, and populated it with laser-cut acrylic mirrors.
1 Though one manufacturer provides video, it does not provide a 4D light field, only 2D, RGBD or raw lenslet images with no method for decoding to 4D.
2 A 3×3 array will have 1/3 the FOV of the base camera.
Figure A.1: (a) MirrorCam mounted on the Kinova MICO robot manipulator. Nine mirrors
of different shape and orientation reflect the scene into the upwards-facing camera to create 9
virtual cameras, which provide video frame-rate light fields. (b) A whole image captured by the
MirrorCam and (c) the same decoded into a light-field parameterisation of 9 sub-images, visualized
as a 2D tiling of 2D images. The non-rectangular sub-images allow for greater FOV overlap [Tsai
et al., 2017].
The main contribution of this appendix is the design and construction of a mirror-based adapter
like the one shown in Fig. A.1a, which we refer to as MirrorCam. We provide a novel optimiza-
tion routine for the design of the custom mirror-based camera that models each mirror using
a 3-Degree-of-Freedom (DOF) reflection matrix. The calibration step uses 3-DOF mirrors as
well; the design step allows non-rectangular projected images. We aim to make the design,
methodology and code open-source to benefit the robotic vision research community.
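To illustrate the 3-DOF mirror model, the sketch below (NumPy; a hypothetical sketch, not the thesis code) builds the homogeneous reflection matrix for a plane with unit normal n and offset d, and reflects the real camera's centre to obtain the corresponding virtual camera.

```python
import numpy as np

def mirror_reflection(n, d):
    """4x4 homogeneous reflection through the plane n . x = d.

    The plane has 3 DOF: a unit normal n (2 DOF) and an offset d
    (1 DOF), matching the per-mirror parameterisation used in the
    design and calibration steps.
    """
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    M = np.eye(4)
    M[:3, :3] -= 2.0 * np.outer(n, n)   # Householder reflection
    M[:3, 3] = 2.0 * d * n              # translation component
    return M

# Reflecting the real camera's pose through a mirror plane gives the
# pose of the corresponding virtual camera behind the mirror. Note a
# reflection flips handedness (det = -1), so virtual views are
# left-handed unless corrected.
M = mirror_reflection([0.0, 0.0, 1.0], 0.05)
cam_centre = np.array([0.0, 0.0, 0.0, 1.0])   # camera at the origin
virtual_centre = M @ cam_centre                # -> [0, 0, 0.1, 1]
```

Applying the same reflection twice returns the identity, which is a quick sanity check on any candidate mirror parameterisation.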
The remainder of this appendix is organized as follows. Section A.2 provides some background
on light-field cameras in relation to the MirrorCam. Section A.3 explains our methods for
designing, optimizing, constructing, decoding and calibrating the MirrorCam. Finally, in
Section A.4, we conclude the appendix and explore future work.
A.2 Background
Light-field cameras measure the amount of light travelling along each ray that intersects the
sensor by acquiring multiple views of a single scene. Doing so allows these cameras to obtain
geometry, texture, and depth information within a single light-field image. Some excellent
references for light fields are [Adelson and Wang, 2002, Chan, 2014, Dansereau, 2014].
Table A.1 compares some of the most common LF camera architectures. The most prevalent
are the camera array [Wilburn et al., 2005], and the micro-lens array (MLA) [Ng et al., 2005].
However, commercially available light-field cameras are insufficient for providing light fields
for real-time robotics. Notably, the Lytro Illum does not provide light fields at a video frame
rate [Lytro, 2015]. The Raytrix R10 captures the light field at 7–30 frames per second (FPS);
however, it uses lenslets with different focal lengths, which makes decoding the raw image
extremely difficult, and it only provides 3D depth maps [Raytrix, 2015]. Furthermore, the
light-field camera companies have not disclosed details on how to access and decode their
cameras' images, forcing researchers to hack solutions with
limited success. All of these reasons motivate a customizable, easy-to-access, easy-to-construct,
open-source video frame-rate light-field camera.
A.3 Methods
We constructed our own LF video camera by employing a mirror-based adapter, building on previ-
ous works [Fuchs et al., 2013, Song et al., 2015, Mukaigawa et al., 2010]. This approach slices
the original camera image into sub-images using an array of planar mirrors. Curved mirrors
may offer better optics, but they are difficult to manufacture; planar mirrors are much more
accessible and customizable. A grid of virtual views with overlapping fields of view can be
constructed by carefully aligning the mirrors. These multiple views effectively capture a light
field. Our approach differs from previous work by reducing the optimization routine to a single
tunable parameter, and by identifying the fundamental trade-off between depth of field and field
of view in the design of mirrored LF cameras. Additionally, we utilize non-square mirror shapes.
A.3.1 Design & Optimization
Because an array of mirrors has insufficient degrees of freedom to provide both perfectly over-
lapping FOVs and perfectly positioned projective centres, we employ an optimization algorithm
to strike a balance between these factors, as in [Fuchs et al., 2013]. A tunable parameter deter-
mines the relative importance of closeness to a perfect grid of virtual poses, and field of view
overlap, which is evaluated at a set of user-defined depths. The grid of virtual poses is allowed
to be rectangular, to better exploit rectangular camera FOVs.
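The balance between grid regularity and FOV overlap can be sketched as a single weighted cost. In the sketch below (Python; `virtual_centres`, `grid_centres`, `fov_overlap` and `alpha` are illustrative names under the assumptions stated in the comments, not the thesis's actual objective):

```python
import numpy as np

# A sketch of the design objective, under these assumptions:
# `virtual_centres` holds the optimized virtual camera centres,
# `grid_centres` the ideal (possibly rectangular) grid they should
# form, and `fov_overlap(depth)` returns the fractional FOV overlap
# across sub-views at that depth. `alpha` is the single tunable
# parameter trading grid regularity against overlap.
def design_cost(virtual_centres, grid_centres, fov_overlap, depths, alpha):
    # Term 1: closeness to a perfect grid of virtual viewpoints.
    grid_err = np.mean(np.linalg.norm(virtual_centres - grid_centres, axis=1))
    # Term 2: FOV overlap evaluated at the user-defined working
    # depths (e.g. 0.3 m and 0.5 m); more overlap lowers the cost.
    overlap = np.mean([fov_overlap(z) for z in depths])
    return alpha * grid_err + (1.0 - alpha) * (1.0 - overlap)
```

With `alpha` near 1 the optimizer prioritises an ideal grid of projective centres; near 0 it maximises overlap, mirroring the trade-off described above.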
The optimization routine begins with a faceted parabola at a user-defined scale and mirror count.
Optimization is allowed to manipulate the positions and normals of the mirror planes, as well
Table A.1: Comparison of Accessibility for Different LF Camera Systems

LF System               Sync    FPS1    Customizability    Open-Source
Camera Array            poor    7-30    significant        yes
MLA (Lytro Illum)       good    0.5     none               limited
MLA (Raytrix R8/R10)    good    7-30    minor              limited
MirrorCam               good    2-30    significant        yes

1 Frames per second
as their extents. Optimization constraints prevent mirrors from occluding their neighbours, and
allow a minimum spacing between mirrors to be imposed for manufacturability.
Fig. A.2 shows an example 3×3 mirror array before and after optimization. The FOV overlap
was evaluated at 0.3 and 0.5 m. Fig. A.1a shows an assembled model mounted on a robot arm,
and Fig. A.1b shows an example image taken from the camera. Note that the optimized design
does not yield rectangular sub-images, as permitting general quadrilateral shapes allows for
greater FOV overlap. In future work, we will explore the use of non-quadrilateral sub-images.
A.3.2 Construction
For the construction of the MirrorCam, we aimed to use easily accessible materials and methods.
We 3D-printed the mount based on our design, and populated it with laser-cut flat acrylic
mirrors. Figure A.3 shows a computer rendering of the MirrorCam before 3D printing.
The reflection of the 9 mirrors show the upwards-facing camera, which is secured at the base
of the MirrorCam. This design was built for the commonly available Logitech C920 webcam.
More detailed diagrams of the design are supplied in the Appendix.
Mirror thickness and quality proved to be an issue for the construction of the MirrorCam. Since
the mirrors are quite close to the camera, the thickness of the mirrors occludes a significant
portion of the image, greatly reducing the resolution of each sub-image. Thus, we opted
Figure A.2: (a) A parabolic mirror array reflects images from the scene at right into a camera,
shown in blue at bottom; each mirror yields a virtual view, shown in red (note that these are
far from an ideal grid); (b) the FOV overlap evaluated at 0.5 m, with the region of full overlap
highlighted in green; (c) and (d) the same after optimization, showing better virtual camera
placement and FOV overlap.
Figure A.3: Rendered image of the MirrorCam version 0.4C, (a) from the front showing the
single camera lens that is visible from all nine mirrored surfaces, and (b) an isometric view
showing how the camera is attached to the mirrors.
for thin mirrors, but encountered problems with warping and poor flatness in the cheap acrylic
mirrors. By inspecting the mirrors before purchase, and handling them very carefully (without
flexing them) during construction, cutting and adhesion, we were able to minimise image
warping and preserve flatness.
A.3.3 Decoding & Calibration
Our MirrorCam calibration has two steps: first, the base camera is calibrated following a
conventional intrinsic calibration, e.g. using MATLAB's built-in camera calibration tool. Next,
the camera is assembled with mirrors and the mirror geometry is estimated using a
Levenberg-Marquardt optimization of the error between expected and observed checkerboard
corner locations. Initialization of the mirror geometry is based on the array design, and
sub-image segmentation is manually specified.
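The Levenberg-Marquardt refinement step can be sketched as follows. This is a deliberately reduced toy, not the thesis code: it fits a single mirror plane (a normal angle plus an offset, rather than the full per-mirror geometry) to synthetic checkerboard observations, and the intrinsics and all numbers are assumed.

```python
import numpy as np
from scipy.optimize import least_squares

# Assumed pinhole intrinsics and planar checkerboard corners (toy values).
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
corners = np.array([[x, y, 1.0] for x in (-.1, 0, .1) for y in (-.1, 0, .1)])

def reflect(P, theta, d):
    # Householder reflection of points P about the plane n.P + d = 0,
    # where the normal n is parameterised by a single angle theta.
    n = np.array([np.sin(theta), 0.0, np.cos(theta)])
    return P - 2.0 * (P @ n + d)[:, None] * n

def project(P):
    uv = (K @ P.T).T
    return (uv[:, :2] / uv[:, 2:]).ravel()

# Synthetic "observed" corner pixels from a ground-truth mirror plane.
theta_true, d_true = 0.3, -2.0
observed = project(reflect(corners, theta_true, d_true))

# Levenberg-Marquardt over the mirror parameters, initialised near the
# design values (as in the calibration described above).
residuals = lambda p: project(reflect(corners, p[0], p[1])) - observed
fit = least_squares(residuals, x0=[0.25, -1.9], method='lm')
print(fit.x)   # recovers approximately [0.3, -2.0]
```

The real optimization stacks residuals over all mirrors and all checkerboard poses; the structure of the residual function is the same.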
Figure A.4: MirrorCam v0.4c Kinova mount. [Engineering drawing: mount for the Kinova MICO
arm, with an indentation used for approximate alignment of the mirror array; mounting to suit
the Logitech C920 camera; holes sized to self-tap with M5 metric bolts. Drawn by Steven
Martin, 11/04/2016; 3D-printed ABS; scale 1:2.]
One point of difference with prior work is that rather than employing a 6-DOF transformation
for each virtual camera view, our calibration models each mirror using a 3-DOF reflection
matrix. This reduces the DOF in the camera model and more closely matches the physical
camera, speeding convergence and improving robustness.
A limitation of our calibration technique is that the images taken without mirrors are only con-
sidered when initializing the camera intrinsics. A better solution, left as future work, would
jointly consider all images, with and without mirrors.
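The 3-DOF reflection model of each mirror can be sketched concretely: a plane is fully specified by a unit normal (two angles) and an offset (one scalar), and the induced reflection is a Householder-style 4×4 transform. The parameterisation below is an illustrative assumption consistent with the description above, not necessarily the thesis' exact convention.

```python
import numpy as np

# Build the 4x4 reflection transform for the plane n.P + d = 0, where the
# unit normal n is parameterised by azimuth and elevation (3 DOF total).
def reflection_matrix(azimuth, elevation, d):
    n = np.array([np.cos(elevation) * np.sin(azimuth),
                  np.sin(elevation),
                  np.cos(elevation) * np.cos(azimuth)])
    M = np.eye(4)
    M[:3, :3] -= 2.0 * np.outer(n, n)   # reflect directions: I - 2nn^T
    M[:3, 3] = -2.0 * d * n             # reflect the origin offset
    return M

# Reflecting the real camera centre (at the origin) gives the virtual
# camera centre behind the mirror plane z = 0.25:
M = reflection_matrix(0.0, 0.0, -0.25)
virtual_centre = M @ np.array([0., 0., 0., 1.])
print(virtual_centre[:3])   # -> [0. 0. 0.5]
```

A reflection is its own inverse (M @ M is the identity), which is part of why the 3-DOF model is better conditioned than a general 6-DOF pose per virtual view.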
Based on the calibrated mirror geometry, the nearest grid of parallel cameras is estimated, and
decoding proceeds as:
1. Remove 2D radial distortion,
2. Slice 2D image into a 4D array, and
3. Reproject each 2D sub-image into central camera view orientation.
Here, we assume the central camera view is aligned with the center mirror.
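The three-step decode above can be sketched in code. Only step 2, slicing the 2D image into a 4D light field, is shown concretely here; steps 1 and 3 would apply the calibrated distortion model and per-mirror homographies. The sub-image layout and sizes below are assumptions for illustration.

```python
import numpy as np

# Step 2 of the decode: slice the (undistorted) 2D image into a 4D light
# field L[s, t, v, u] using calibrated sub-image bounding boxes, one per
# mirror in the 3x3 array.
def slice_to_4d(image, boxes, sub_h, sub_w):
    """boxes: 3x3 array of (row, col) top-left corners per sub-image."""
    L = np.zeros((3, 3, sub_h, sub_w) + image.shape[2:], dtype=image.dtype)
    for s in range(3):
        for t in range(3):
            r, c = boxes[s, t]
            L[s, t] = image[r:r + sub_h, c:c + sub_w]
    return L

image = np.arange(90 * 120).reshape(90, 120)   # synthetic 2D raw image
boxes = np.array([[(30 * s, 40 * t) for t in range(3)] for s in range(3)])
L = slice_to_4d(image, boxes, 30, 40)
print(L.shape)   # -> (3, 3, 30, 40)
```

In the real pipeline the boxes come from the calibrated mirror geometry rather than a regular grid, and each sub-image is then warped by a 2D projective transformation into the central view's orientation.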
The final step corrects for rotational differences between the calibrated and desired virtual
camera arrays using 2D projective transformations. There is no compensation for translational
error, though in practice the cameras are very close to an ideal grid. An example input image
and decoded light field are shown in Fig. A.1c. Our calibration routine reported a 3D spatial
reprojection RMS error of 1.80 mm. The spatial reprojection error is the 3D distance from the
projected ray to the expected feature location during camera calibration, where pixel
projections are traced through the camera model into space. This small error confirms that the
camera design, manufacture and calibration have yielded observations close to an ideal grid
of virtual cameras.
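The spatial reprojection error just described reduces to a point-to-ray distance. A minimal sketch (variable names and the example numbers are illustrative):

```python
import numpy as np

# Perpendicular distance from a back-projected ray (origin + direction)
# to the expected 3D feature location.
def ray_point_distance(origin, direction, point):
    d = direction / np.linalg.norm(direction)
    v = point - origin
    return np.linalg.norm(v - (v @ d) * d)   # remove the along-ray component

# A ray traced along +z from the camera centre, and a feature 1.8 mm
# off-axis at 0.5 m depth:
dist = ray_point_distance(np.zeros(3), np.array([0., 0., 1.]),
                          np.array([0.0018, 0.0, 0.5]))
print(dist)   # -> 0.0018 (i.e. 1.8 mm)
```

The reported RMS figure is the root mean square of this distance over all checkerboard corners and views.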
It is important to note that our current calibration did not account for the manufacturing aspects
of the camera, such as the thickness of the acrylic mirrors, or the additional thickness of the
epoxy used to secure the mirrors to the mount. The acrylic mirrors we used also exhibited some
bending and rippling, causing image distortion unaccounted for in the calibration process.
A.4 Conclusions and Future Work
In this appendix, we have presented the design optimisation, construction, decoding and cali-
bration of a mirror-based light-field camera. We have shown that our 3D-printed MirrorCam,
optimized for overlapping FOV, produced observations close to those of an ideal camera
grid. This implies that the mirror-based LF
camera was a viable, low-cost, and accessible alternative to commercially available LF cameras.
Our implementation takes 5 seconds per frame as unoptimized MATLAB code; the decoding
and correspondence processes are the current bottlenecks. Through optimization, real-time
operation should be possible, pushing the envelope towards real-time light-field cameras for
robotics. In future work, we will validate the MirrorCam in terms of image refocusing, depth
estimation and perspective shift, in comparison to other commercially available light-field
cameras.
Figure A.5: MirrorCam v0.4c mirror holder. [Engineering drawing: mirrors mounted flush to
the surface using epoxy; optional M5 shroud mounting holes; centre line for alignment; M6
bolts with washers used to mount the mirror holder to the main assembly (not for vertical
alignment). Drawn by Steven Martin, 11/04/2016; 3D-printed ABS; scale 1:1.]
Figure A.6: MirrorCam v0.4c camera clip. [Engineering drawing: drawn by Steven Martin,
11/04/2016; 3D-printed ABS; scale 1:1.]