
LIGHT-FIELD FEATURES FOR ROBOTIC

VISION IN THE PRESENCE OF

REFRACTIVE OBJECTS

Dorian Yu Peng Tsai

MSc (Technology)

BASc (Engineering Science with Honours)

Submitted in fulfillment

of the requirements for the degree of

Doctor of Philosophy

2020

School of Electrical Engineering and Computer Science

Science and Engineering Faculty

Queensland University of Technology


Abstract

Robotic vision is an integral aspect of robot navigation and human-robot interaction, as well as

object recognition, grasping and manipulation. Visual servoing is the use of computer vision

for closed-loop control of a robot’s motion and has been shown to increase the accuracy and

performance of robotic grasping, manipulation and control tasks. However, many robotic vision

algorithms (including those focused on solving the problem of visual servoing) find refractive

objects particularly challenging. This is because these types of objects are difficult to perceive.

They are transparent and their appearance is essentially a distorted view of the background,

which can change significantly with small changes in viewpoint. What is often overlooked is

that most robotic vision algorithms implicitly assume that the world is Lambertian—that the

appearance of a point on an object does not change significantly with respect to small changes

in viewpoint1. Refractive objects violate the Lambertian assumption and this can lead to image

matching inconsistencies, pose errors and even failures of modern robotic vision systems.

This thesis investigates the use of light-field cameras for robotic vision to enable vision-based

motion control in the presence of refractive objects. Light-field cameras are a novel camera

technology that use multi-aperture optics to capture a set of dense and uniformly-sampled views

of the scene from multiple viewpoints. Light-field cameras capture the light field, which simul-

taneously encodes texture, depth and multiple viewpoints. Light-field cameras are a promising

alternative to conventional robotic vision sensors, because of their unique ability to capture

view-dependent effects, such as occlusion, specular reflection and, in particular, refraction.

First, we investigate using input from the light-field camera to directly control robot motion,

a process known as image-based visual servoing, in Lambertian scenes. We propose a novel

light-field feature for Lambertian scenes and develop the relationships between feature motion

and camera motion for the purposes of visual servoing. We also illustrate, both in simulation

and using a custom mirror-based light-field camera, that our method of light-field image-based

visual servoing is more tolerant to small and distant targets and partially-occluded scenes than

monocular and stereo-based methods.

Second, we propose a method to detect refractive objects using a single light field. Specifi-

cally, we define refracted image features as those image features whose appearance has been

distorted by a refractive object. We discriminate between refracted image features and the

surrounding Lambertian image features. We also show that using our method to ignore the re-

fracted image features enables monocular structure from motion in scenes containing refractive

objects, where traditional methods fail.

We combine and extend our two previous contributions to develop a light-field feature capable

of enabling visual servoing towards refractive objects without needing a 3D geometric model of

the object. We show in experiments that this feature can be reliably detected and extracted from

the light field. The feature appears to be continuous with respect to viewpoint, and is therefore

suitable for visual servoing towards refractive objects.

1This Lambertian assumption is also known as the photo-consistency or brightness constancy assumption.


This thesis represents a unique contribution toward our understanding of refractive objects in

the light field for robotic vision. Application areas that may benefit from this research include

manipulation and grasping of household objects, medical equipment, and in-orbit satellite ser-

vicing equipment. It could also benefit quality assurance and manufacturing pick-and-place

robots. The advances constitute a critical step to enabling robots to work more safely and reli-

ably with everyday refractive objects.


Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an

award at this or any other higher education institution. To the best of my knowledge and belief,

the thesis contains no material previously published or written by another person except where

due reference is made.

Dorian Tsai

March 2, 2020

QUT Verified Signature


Acknowledgements

To my academic advisors, Professor Peter Ian Corke, Dr. Donald Gilbert Dansereau and Asso-

ciate Professor Thierry Peynot, I would like to offer my most heartfelt gratitude. They shared

with me amazing knowledge, insight, creativity and enthusiasm. I am grateful for the re-

sources and opportunities they provided, as well as their guidance, support and patience.

In addition, I wish to convey my appreciation to Douglas Palmer and Thomas Coppin, my fellow plenopticists, for many helpful and stimulating discussions. Thanks to Dr. Steven

Martin, who helped with many of the technical engineering aspects of building and mounting

light-field cameras to various robots over the years. Thanks to Dominic Jack and Ming Xu for

being excellent desk buddies. Thanks to Prof. Tristan Perez, Associate Professor Jason Ford

and Dr. Timothy Molloy for helping to get me started on my PhD journey in inverse differential

game theory applied to the birds and the bees, until I changed topics to light fields and robotic

vision six months later.

Thanks to Kate Aldridge, Sarah Allen and all of the other administrative staff in the Australian

Centre for Robotic Vision (ACRV) for organising so many conferences and workshops, and

keeping things running smoothly.

This research was funded in part by the Queensland University of Technology (QUT) Post-

graduate Research Award, the QUT Higher Degree Tuition Fee Sponsorship, the QUT Excel-

lent Top-Up Scholarship, and the ACRV Top-Up Scholarship, as well as financial support in

the form of employment as a course mentor and research assistant. The ACRV scholarship was

supported in part by the Australian Research Council Centre of Excellence for Robotic Vision.

Lastly, a very special thanks goes to the many faithful friends, family and colleagues whose backing and constant encouragement sustained me through this academic marathon to graduation. I am especially indebted to Robin Tunley and Miranda Cherie Fittock for

their camaraderie and steady moral support. Thank you all very much.


Contents

Abstract

List of Tables vii

List of Figures ix

List of Acronyms xiii

List of Symbols xv

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Limitations of Robotic Vision for Refractive Objects . . . . . . . . . . 3

1.1.2 Seeing and Servoing Towards Refractive Objects . . . . . . . . . . . . 5

1.2 Statement of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.4 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.5 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Background on Light Transport & Capture 15

2.1 Light Transport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


2.1.1 Specular Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.1.2 Diffuse Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.1.3 Lambertian Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.1.4 Non-Lambertian Reflections . . . . . . . . . . . . . . . . . . . . . . . 17

2.1.5 Refraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Monocular Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2.1 Central Projection Model . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2.2 Thin Lenses and Depth of Field . . . . . . . . . . . . . . . . . . . . . 23

2.3 Stereo Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4 Multiple Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.5 Light-Field Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.5.1 Plenoptic Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.5.2 4D Light Field Definition . . . . . . . . . . . . . . . . . . . . . . . . 32

2.5.3 Light Field Parameterisation . . . . . . . . . . . . . . . . . . . . . . . 34

2.5.4 Light-Field Camera Architectures . . . . . . . . . . . . . . . . . . . . 36

2.6 4D Light-Field Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.7 4D Light-Field Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.7.1 Geometric Primitive Definitions . . . . . . . . . . . . . . . . . . . . . 44

2.7.2 From 2D to 4D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

2.7.3 Point-Plane Correspondence . . . . . . . . . . . . . . . . . . . . . . . 56

2.7.4 Light-Field Slope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3 Literature Review 61

3.1 Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

3.1.1 2D Geometric Image Features . . . . . . . . . . . . . . . . . . . . . . 62


3.1.2 3D Geometric Image Features . . . . . . . . . . . . . . . . . . . . . . 65

3.1.3 4D Geometric Image Features . . . . . . . . . . . . . . . . . . . . . . 66

3.1.4 Direct Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

3.1.5 Image Feature Correspondence . . . . . . . . . . . . . . . . . . . . . . 70

3.2 Visual Servoing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.2.1 Position-based Visual Servoing . . . . . . . . . . . . . . . . . . . . . 73

3.2.2 Image-based Visual Servoing . . . . . . . . . . . . . . . . . . . . . . 75

3.3 Refractive Objects in Robotic Vision . . . . . . . . . . . . . . . . . . . . . . . 81

3.3.1 Detection & Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 82

3.3.2 Shape Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

4 Light-Field Image-Based Visual Servoing 95

4.1 Light-Field Cameras for Visual Servoing . . . . . . . . . . . . . . . . . . . . . 95

4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

4.3 Lambertian Light-Field Feature . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.4 Light-Field Image-Based Visual Servoing . . . . . . . . . . . . . . . . . . . . 100

4.4.1 Continuous-domain Image Jacobian . . . . . . . . . . . . . . . . . . . 100

4.4.2 Discrete-domain Image Jacobian . . . . . . . . . . . . . . . . . . . . . 102

4.5 Implementation & Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 104

4.5.1 Light-Field Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4.5.2 Mirror-Based Light-Field Camera Adapter . . . . . . . . . . . . . . . 105

4.5.3 Control Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

4.6.1 Camera Array Simulation . . . . . . . . . . . . . . . . . . . . . . . . 108


4.6.2 Arm-Mounted MirrorCam Experiments . . . . . . . . . . . . . . . . . 110

4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

5 Distinguishing Refracted Image Features 119

5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.2 Lambertian Points in the Light Field . . . . . . . . . . . . . . . . . . . . . . . 126

5.3 Distinguishing Refracted Image Features . . . . . . . . . . . . . . . . . . . . . 128

5.3.1 Extracting Image Feature Curves . . . . . . . . . . . . . . . . . . . . . 130

5.3.2 Fitting 4D Planarity to Image Feature Curves . . . . . . . . . . . . . . 132

5.3.3 Measuring Planar Consistency . . . . . . . . . . . . . . . . . . . . . . 137

5.3.4 Measuring Slope Consistency . . . . . . . . . . . . . . . . . . . . . . 138

5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

5.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

5.4.2 Refracted Image Feature Discrimination with Different LF Cameras . . 141

5.4.3 Rejecting Refracted Image Features for Structure from Motion . . . . . 148

5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

6 Light-Field Features for Refractive Objects 157

6.1 Refracted LF Features for Vision-based Control . . . . . . . . . . . . . . . . . 158

6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

6.3 Optics of a Lens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

6.3.1 Spherical Lens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

6.3.2 Cylindrical Lens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

6.3.3 Toric Lens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

6.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164


6.4.1 Refracted Light-Field Feature Definition . . . . . . . . . . . . . . . . . 166

6.4.2 Refracted Light-Field Feature Extraction . . . . . . . . . . . . . . . . 170

6.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

6.5.1 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

6.5.2 Feature Continuity in Single-Point Ray Simulation . . . . . . . . . . . 177

6.5.3 Feature Continuity in Ray Tracing Simulation . . . . . . . . . . . . . . 179

6.6 Visual Servoing Towards Refractive Objects . . . . . . . . . . . . . . . . . . . 186

6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

7 Conclusions and Future Work 191

7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

Bibliography 197

A Mirrored Light-Field Video Camera Adapter I

A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II

A.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV

A.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V

A.3.1 Design & Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . V

A.3.2 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI

A.3.3 Decoding & Calibration . . . . . . . . . . . . . . . . . . . . . . . . . VIII

A.4 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . X


List of Tables

2.1 Minimum Number of Parameters to Describe Geometric Primitives from 2D to

4D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.1 Comparison of camera systems’ capabilities and tolerances for VS . . . . . . . 98

5.1 Comparison of our method and the state of the art using two LF camera arrays

and a lenslet-based camera for discriminating refracted image features . . . . . 145

5.2 Comparison of mean relative instantaneous pose error for unfiltered and filtered

SfM-reconstructed trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . 151

A.1 Comparison of Accessibility for Different LF Camera Systems . . . . . . . . . VI


List of Figures

1.1 Robot applications with refractive objects . . . . . . . . . . . . . . . . . . . . 3

1.2 An example of unreliable RGB-D camera output for a refractive object . . . . . 5

1.3 Light-field camera as an array of cameras . . . . . . . . . . . . . . . . . . . . 6

1.4 Gradual changes in a refractive object’s appearance in an image can be pro-

grammatically detected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 Surface reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2 Lambertian and non-Lambertian reflections . . . . . . . . . . . . . . . . . . . 18

2.3 Non-Lambertian reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.4 Snell’s law of refraction at the interface of two media. . . . . . . . . . . . . . 19

2.5 The central projection model . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.6 Image formation for a thin lens . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.7 Depth of field and focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.8 Epipolar geometry for a stereo camera system . . . . . . . . . . . . . . . . . . 27

2.9 The plenoptic function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.10 The two-plane parameterisation of the 4D LF . . . . . . . . . . . . . . . . . . 34

2.11 Example 4D LF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.12 Light-field camera architectures . . . . . . . . . . . . . . . . . . . . . . . . . 37


2.13 Monocular versus plenoptic camera . . . . . . . . . . . . . . . . . . . . . . . 40

2.14 Raw plenoptic imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

2.15 Visualization of the light-field . . . . . . . . . . . . . . . . . . . . . . . . . . 43

2.16 A line in 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.17 4D point example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

2.18 4D line example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

2.19 4D hyperplane example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

2.20 4D plane example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

2.21 Illustrating the depth of a point in the 2PP . . . . . . . . . . . . . . . . . . . . 58

2.22 Light-field slope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.1 Architectures for visual servoing . . . . . . . . . . . . . . . . . . . . . . . . . 73

3.2 Light path through a refractive object . . . . . . . . . . . . . . . . . . . . . . . 87

4.1 MirrorCam setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

4.2 Visual servoing control loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

4.3 Simulation results for LF-IBVS . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.4 Simulation of views for LF-IBVS . . . . . . . . . . . . . . . . . . . . . . . . 110

4.5 Experimental results of LF-IBVS trajectories . . . . . . . . . . . . . . . . . . 112

4.6 Experimental results of stereo-IBVS . . . . . . . . . . . . . . . . . . . . . . . 113

4.7 Setup for occlusion experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 116

4.8 Example views from occlusion experiments . . . . . . . . . . . . . . . . . . . 116

4.9 Experimental results from occlusion experiments . . . . . . . . . . . . . . . . 118

5.1 LF camera mounted on a robot arm . . . . . . . . . . . . . . . . . . . . . . . . 121

5.2 Lambertian versus non-Lambertian feature in the . . . . . . . . . . . . . . . . 130


5.3 Example epipolar planar images . . . . . . . . . . . . . . . . . . . . . . . . . 131

5.4 Extraction of the image feature curve from the correlation EPI using simulated

data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

5.5 Example Lambertian and refracted image feature curves . . . . . . . . . . . . 143

5.6 Example Lambertian and refracted feature curves from a small-baseline LF

camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

5.7 Discrimination of refracted image features . . . . . . . . . . . . . . . . . . . . 144

5.8 Refracted image features detected in sample images . . . . . . . . . . . . . . . 147

5.9 Rejecting refracted image features for SfM . . . . . . . . . . . . . . . . . . . . 150

5.10 Sample images where monocular SfM failed by not rejecting refracted features 151

5.11 Comparison of camera trajectories for monocular structure from motion . . . . 152

5.12 Point cloud reconstructions of scenes with refracted objects . . . . . . . . . . . 154

6.1 Toric lens cut from a torus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

6.2 The visual effect of the toric lens on a background circle . . . . . . . . . . . . 165

6.3 Light-field geometry depth and projections of a lens into a light field . . . . . . 167

6.4 Orientation for the toric lens . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

6.5 Illustration of 3D line segment projected by a toric lens . . . . . . . . . . . . . 170

6.6 Ray tracing of a refractive object using Blender . . . . . . . . . . . . . . . . . 176

6.7 Single point ray trace simulation . . . . . . . . . . . . . . . . . . . . . . . . . 178

6.8 Slope estimates for changing z-translation of the LF camera . . . . . . . . . . 179

6.9 Orientation estimates for changing z-rotation of the LF camera . . . . . . . . . 179

6.10 Refracted LF feature approach towards a toric lens . . . . . . . . . . . . . . . 180

6.11 Refracted light-field feature slopes during approach towards a toric lens . . . . 181


6.12 Orientation estimate from a Blender simulation of an ellipsoid that was rotated

about the principal axis of the LF camera. . . . . . . . . . . . . . . . . . . . . 182

6.13 Refracted light-field features for a toric lens . . . . . . . . . . . . . . . . . . . 183

6.14 Refracted light-field features for different objects . . . . . . . . . . . . . . . . 185

6.15 Concept for visual servoing towards a refractive object . . . . . . . . . . . . . 187

A.1 MirrorCam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III

A.2 MirrorCam field of view overlap . . . . . . . . . . . . . . . . . . . . . . . . . VII

A.3 Rendering of MirrorCam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIII

A.4 MirrorCam v0.4c Kinova arm mount diagrams . . . . . . . . . . . . . . . . . . IX

A.5 MirrorCam v0.4c mirror holder diagrams . . . . . . . . . . . . . . . . . . . . XII

A.6 MirrorCam v0.4c camera clip diagrams . . . . . . . . . . . . . . . . . . . . . XIII


Acronyms

2PP two-plane parameterisation.

BRIEF binary robust independent elementary features.

CNN convolutional neural networks.

DOF degree-of-freedom.

DSP-SIFT domain-size pooled SIFT.

FAST features from accelerated segment test.

FOV field of view.

GPS global positioning system.

HoG histogram of gradients.

IBVS image-based visual servoing.

IOR index of refraction.

LF light-field.

LF-IBVS light-field image-based visual servoing.

LIDAR light detection and ranging.


M-IBVS monocular image-based visual servoing.

MLESAC maximum likelihood estimator sampling and consensus.

ORB oriented FAST and rotated BRIEF.

PBVS position-based visual servoing.

RANSAC random sample consensus.

RMS root mean square.

S-IBVS stereo image-based visual servoing.

SfM structure from motion.

SIFT scale invariant feature transform.

SLAM simultaneous localisation and mapping.

SURF speeded-up robust feature.

SVD singular value decomposition.

VS visual servoing.


List of Symbols

θi angle of incidence

θr angle of reflection

N surface normal

n index of refraction

zi distance to image along the camera’s z-axis

zo distance to object along the camera’s z-axis

d disparity

b camera baseline

P 3D world point

Px 3D world point’s x-coordinate

Py 3D world point’s y-coordinate

Pz 3D world point’s z-coordinate

CP world point with respect to the camera frame of reference

p 2D image coordinate

p∗ initial/observed image coordinates

p# desired/goal image coordinates

p homogeneous image plane point

f focal length

R radius of curvature

K camera matrix


T translation vector

R rotation matrix

F fundamental matrix

J Jacobian

J+ left Moore-Penrose pseudo-inverse of the Jacobian

ν camera spatial velocity

v translational velocity

ω rotational velocity

NMIN minimum number of sub-images in which feature matches must be found

KP proportional control gain

KI integral control gain

KD derivative control gain

s light-field horizontal viewpoint coordinate

t light-field vertical viewpoint coordinate

u light-field horizontal image coordinate

v light-field vertical image coordinate

D light-field plane separation distance

w light-field slope from the continuous domain (unit-less)

m light-field slope from the discrete domain (views/pixels)

σ light-field slope as an angle

s0 light-field central view horizontal viewpoint coordinate

t0 light-field central view vertical viewpoint coordinate

u0 light-field central view horizontal image coordinate

v0 light-field central view vertical image coordinate

L(s, t, u, v) 4D light field

I(u, v) 2D image

()∗ indicates a variable is fixed while others may vary

W Lambertian light-field feature


H intrinsic matrix for an LF camera

Rn real coordinate space of n dimensions

Π1 a plane

φ a ray

n normal of a 4D hyperplane

∆u pixel differences between u and u0

ξ singular vector from SVD

λ singular value from SVD

c slope consistency

tplanar threshold for planar consistency

tslope threshold for slope consistency

ei relative instantaneous pose error

etr instantaneous translation pose error

erot instantaneous rotation pose error

C the focal point or focal line

A the projection of point P through a toric lens

ΣA the scaling matrix, containing the singular values of A


Chapter 1

Introduction

In this chapter, we introduce the motivation for this research and outline the research goals

and questions this thesis seeks to address. Then we provide our list of contributions, their

significance and an overview of this thesis.

1.1 Motivation

Robots are changing the world. Their use for automating the dull, dirty and dangerous tasks

in the modern world has increased economic growth, improved quality of life and empowered

people. For example, robots assist in manipulating heavy car components in the automotive

manufacturing industry. Robots are being used to survey underwater ruins, sewage pipes, col-

lapsed buildings, and other planets in space. At home, robots are also starting to be used for de-

livery services, home cleaning, and assisting people with reduced mobility [Christensen, 2016].

Traditionally, many robots have operated in isolation from humans, but through the gradual

availability of inexpensive computing, user interfaces, integrated sensors, and improved algo-

rithms, robots are quickly improving in function and capability. The confluence of technologies

is enabling a robot revolution that will lead to the adoption of robotic technologies for all as-


pects of daily life. As such, robots are gradually venturing into less constrained environments to

work with humans and an entirely new set of challenging objects to interact with. These more

complex and unstructured working environments require richer perceptual information for safer

interaction.

Historically, roboticists have had success with a variety of sensing modalities, from light detec-

tion and ranging (LIDAR), global positioning system (GPS), radar, acoustic imaging, infrared

range-finding sensors, to time-of-flight and structured-light depth sensors, as well as cameras.

In particular, cameras uniquely measure both dense colour and textural detail that other sensors

do not normally provide, which enables robots to use vision to perceive the world. Vision as a

robotic sensor is particularly useful because it mimics human vision and allows for non-contact

measurement of the environment. Much of the human world has been engineered around our

sense of sight, and a significant amount of our communication and interaction relies on vision.

Robotic vision has proven effective in terms of object detection, localization, and scene un-

derstanding for robotic grasping and manipulation [Kemp et al., 2007]. Furthermore, directly

using visual feedback information extracted from the camera to control robot motion, a tech-

nique known as visual servoing (VS), has proven useful for real-time, high-precision robotic

manipulation tasks [Kragic and Christensen, 2002]. However, refractive objects, which are

common throughout the human environment, are one of the areas where modern robotic vision

algorithms and conventional camera systems still encounter difficulties [Ihrke et al., 2010a].

One novel camera technology that may enable robots to better perceive refractive objects is the

light-field (LF) camera, which uses multiple-aperture optics to implicitly encode both texture

and depth. In this thesis, we explore LF cameras as a means of seeing and servoing

towards refractive objects. By seeing, we refer to detecting refractive objects using only the LF.

By servoing, we refer to visual servoing using LF camera measurements to directly control the

camera’s relative pose. Combining the two, this research may enable a robotic manipulator to

detect and move towards, grasp, and manipulate refractive objects—for example a glass of beer

or wine. The principal motivation for this topic lies in improving our understanding of how


refractive objects behave in the LF and how to exploit this knowledge to enable more reliable

motion towards refractive objects.

1.1.1 Limitations of Robotic Vision for Refractive Objects

Robots for the real world will inevitably interact with refractive objects, as in Fig. 1.1. Future

robots will contend with wine glasses and clear water bottles in domestic applications [Kemp

et al., 2007]; glass objects and clear plastic packaging for quality assessment and packing in

manufacturing [Ihrke et al., 2010a]; glass windows throughout the urban environment, as well

as water and ice for outdoor operations [Dansereau, 2014]. For example, a household robot

must be able to pick up, wash and place glassware; a bartender robot must serve drinks from

bottles of wine and spirits; an outdoor maintenance robot may want to avoid falling into the

swimming pool or nearby fountains. Other examples of robots interacting with refractive ob-

jects include medical robots performing ophthalmic (eye) surgery, or robots servicing satellites with telescopic lenses or shiny and transparent surface coverings. Automating these applications typically requires knowledge of object structure, robot motion, or both. Yet objects

such as those just described are particularly challenging for robots, largely because they are

transparent.

(a) (b)

Figure 1.1: Robots will have to interact with refractive objects. (a) In domestic applications,

such as cleaning and putting away dishes. (b) In manufacturing, assessing the quality of glass

objects, or picking and placing such objects in warehouses.


Refractive objects are particularly challenging for robots primarily because these types of ob-

jects do not typically have texture of their own. Instead their appearance depends on the object’s

shape and the surrounding background and lighting conditions. Robotic methods for localiza-

tion, manipulation and control exist to deal with refractive objects when accurate 3D geometric

models of the refractive objects themselves are available [Choi and Christensen, 2012, Luo

et al., 2015, Walter et al., 2015, Zhou et al., 2018]. However, these models are often difficult,

time-consuming and expensive to obtain, or simply not available [Ihrke et al., 2010a]. When

3D geometric models of the refractive objects are not available, localization, manipulation and

control around refractive objects become much harder problems.

In robotic vision, a common approach when no models are available, regardless of whether the

scene contains refractive objects, is to use features. Features are distinct aspects of interest in the

scene that can be repeatedly and reliably identified from different viewpoints. Image features

are those features recorded in the image as a set of pixels by the camera. Image features can then

be automatically detected and extracted as a vector of numbers, which we refer to as the image

feature vector. Features are often chosen because their appearances do not change significantly

with small changes in viewpoint. The same features are matched from image to image as the

robot moves, which enables the robot to establish a consistent geometric relationship between

its observed image pixels and the 3D world.
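To make this concrete, the sketch below detects and matches image features between two views using OpenCV's ORB detector. This is a generic illustration of feature matching, not the light-field feature defined later in this thesis, and the image file names are placeholders.

```python
# Sketch: detecting and matching 2D image features between two views.
# Generic illustration only (OpenCV ORB); the file names are placeholders.
import cv2

img1 = cv2.imread("view_a.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_b.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)           # feature detector and descriptor
kp1, des1 = orb.detectAndCompute(img1, None)  # keypoints and binary descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with cross-checking for more reliable matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Each match links a pixel in one view to a pixel in the other; these
# correspondences are what SfM, SLAM and visual servoing build on, and they
# are exactly what a refractive object quietly corrupts.
for m in matches[:10]:
    print(kp1[m.queryIdx].pt, "->", kp2[m.trainIdx].pt)
```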

Feature-based matching strategies form the basis for many robotic vision algorithms, such as

object recognition, image segmentation, structure from motion (SfM), VS, and simultaneous

localisation and mapping (SLAM). However, many of these algorithms implicitly assume that

the scene or object is Lambertian—that the object’s (or feature’s) appearance remains the same

despite moderate changes in viewpoint. Instead, refractive objects are non-Lambertian because

their appearance often varies significantly with viewpoint. The violation of the Lambertian as-

sumption can cause inconsistencies, errors and even failures for modern robotic vision systems.


1.1.2 Seeing and Servoing Towards Refractive Objects

Humans are able to discern refractive objects visually by looking at the objects from different

perspectives and observing that the appearance of refractive objects changes differently from the

rest of the scene. The robotic analogue of human eyes is the camera, which has proven ex-

tremely useful as a low-cost and compact sensor. Monocular vision systems are by far the most

common amongst robots today, but suffer from the ambiguity that small and close objects appear

the same size as distant and large objects. Moreover, a single view from a monocular camera

does not provide sufficient information to detect the presence of refractive objects. Stereo cam-

eras, which provide two views of a scene, do not work well with refractive objects without prior

knowledge of the scene, because triangulation relies heavily on appearance matching. RGB-D

cameras and LIDAR sensors do not work reliably on refractive objects because the emitted light

is either partially reflected or travels through these objects. Fig. 1.2 shows an example of unre-

liable depth measurements from an RGB-D camera for a refractive sphere. Robots can move to

gain a better understanding of a refractive object over time; however, physically moving a robot

can be time-consuming, expensive and potentially hazardous. A more efficient approach would

be to instantaneously capture multiple views of the refractive object.

(a) (b)

Figure 1.2: An example of unreliable RGB-D camera (Intel Realsense D415) output for a re-

fractive sphere. This RGB-D camera uses the structured-light approach to measure depth, which

works well for Lambertian surfaces, but not for refractive objects. (a) The colour image, which

a monocular camera would also provide. (b) Incorrect and missing depth information around

the refractive sphere.


The light field describes all the light flowing in every direction through every point in free

space at a certain instance in time [Levoy and Hanrahan, 1996]. LF cameras are a novel camera

technology that use multi-aperture optics to measure the LF by capturing a dense and uniformly-

sampled set of views of the same scene from multiple viewpoints in a single capture from a

single sensor position. We refer to a view as what one would see from a particular viewing pose

or viewpoint. Conceptually, unlike looking through a peephole, an LF “image” is similar to

an instantaneous window that one can look through to see how a refractive object’s appearance

changes smoothly with viewpoint. As illustrated in Fig. 1.3, compared to a monocular camera,

which uses a single aperture to capture a single view of the scene from a single viewpoint, an

LF camera is analogous to having an array of cameras all tightly packed together which provide

multiple views of the scene from multiple viewpoints. Within the LF camera array, a single

view can be selected, and switching from one view to another can be described as virtual motion within the single shot of the LF.
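To make selecting views, and the associated notion of virtual motion, concrete, the sketch below treats a decoded light field as a 4D array indexed as L[s, t, u, v], matching the L(s, t, u, v) notation used later in this thesis; the array dimensions and data are synthetic placeholders.

```python
# Sketch: selecting sub-aperture views from a decoded 4D light field.
# L is assumed to be an array indexed as L[s, t, u, v] (greyscale for
# simplicity), matching the L(s, t, u, v) notation of this thesis; the
# dimensions and data below are synthetic placeholders.
import numpy as np

n_s, n_t, n_u, n_v = 9, 9, 128, 128        # a 9 x 9 grid of 128 x 128-pixel views
L = np.random.rand(n_s, n_t, n_u, n_v)     # placeholder light-field data

s0, t0 = n_s // 2, n_t // 2                # central viewpoint indices
central_view = L[s0, t0]                   # a single 2D image I(u, v)

# "Virtual motion": stepping horizontally through viewpoints without
# physically moving the sensor.
horizontal_sweep = [L[s, t0] for s in range(n_s)]

# An epipolar-plane image (EPI): fix t and v, and stack one row of pixels
# across all horizontal viewpoints. Lambertian points trace straight lines
# here, while refracted points generally do not (see Chapter 5).
epi = L[:, t0, :, n_v // 2]                # shape (n_s, n_u)
print(central_view.shape, len(horizontal_sweep), epi.shape)
```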

(a) (b) (c)

Figure 1.3: (a) A monocular camera acts as a peephole with a single aperture to capture a

single view of the scene. Light from the scene (yellow) passes through the aperture (red) and is

recorded on the image sensor (green). (b) An LF camera can be thought of as a window, or an

equivalent camera array that uses multi-aperture optics to capture multiple views of the scene.

As a result, the LF camera can capture much more information of the scene from a single sensor

capture than a monocular camera. (c) An example LF camera array by Wilburn [Wilburn et al.,

2004].

For example, compared to Fig. 1.4a, Fig. 1.4b shows the gradual change in appearance of the

refractive sphere from a much denser and more regular or uniform sampling of views from

an LF. Perhaps one of the reasons why humans can somewhat reliably perceive refractive

objects is because we may unconsciously move a little bit side-to-side using our continuous


stream of vision—which is a very dense sampling of the scene. Humans may be able to detect

the inconsistent motions of the background caused by the refractive object with respect to their

viewpoints.

Therefore, the dense sampling of the LF camera captures the behaviour of the refractive object

with a high level of redundancy that is needed to differentiate refractive objects from normal

scene content. The uniform sampling pattern of the LF camera induces patterns and algorithmic

simplifications that would be unavailable to a set of non-uniformly-sampled views1. Addition-

ally, while the same set of images could be obtained with a single moving camera, LF cameras

can capture this information from a single sensor position, reducing the amount of motion re-

quired by the robot to perceive a refractive object. Therefore, LF cameras could allow robots to

more reliably and efficiently capture the behaviour of refractive objects.

(a)

(b)

Figure 1.4: In this scene, a refractive sphere has been placed amongst cards. A camera has

captured images of the scene along a horizontal rail at (a) 3 cm intervals, and (b) 1 cm in-

tervals. The end images (blue border) are taken from the same positions. In (a), the change

in appearance of the refractive sphere is significant and perhaps very challenging to recognize

without the prior knowledge that there is a refractive sphere in the middle of the scene. In (b), a

more frequent sampling of the scene reveals the gradual change in appearance of the refractive

sphere, which may be programmatically detected. Images from the New Stanford Light Field

Archive.

1Consider a conventional monocular camera and its dense and uniformly-sampled array of pixels that produce a

detailed 2D image. Often, more pixels yield more detail in a single image. Additionally, if the pixels were oriented

in different directions and at a variety of positions, interpreting the scene would be a much more complex task.


Returning to the original theme of this section, robots must not just perceive refractive objects,

they must be able to precisely control their relative pose around these objects as well. In the

traditional open-loop “look then move” approach, the accuracy of the operation depends directly

on the accuracy of the visual sensor and robot end-effector. VS is a robot control technique that

uses the camera output to directly control the robot motion in a feedback loop, which is referred

to as a closed-loop approach. VS has proven to be reliable at controlling robot motion with

respect to visible objects without requiring a geometric model of the object, or an accurate robot.

While refractive objects are challenging because the objects are not always directly visible, they

leave fingerprints based on how the background is distorted. LF cameras capture some of this

distortion, which we show can be exploited to visual servo towards refractive objects for further

grasping and manipulation.
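As a minimal sketch of this closed-loop idea, the toy loop below applies a generic proportional image-based control law of the form ν = -K_P J⁺(p - p#). It is not the light-field formulation derived in Chapter 4; the Jacobian, gain and feature values are illustrative placeholders, and only 3 degrees of freedom are shown for brevity.

```python
# Sketch: a generic image-based visual servoing loop with proportional control.
# The "world" here is a toy linear model f_dot = J @ nu; in a real system the
# feature measurement and image Jacobian come from the vision pipeline
# (Chapter 4 derives the light-field version). All values are illustrative.
import numpy as np

J = np.array([[-1.0,  0.0,  0.2],     # toy image Jacobian: maps camera velocity
              [ 0.0, -1.0,  0.1],     # to feature rates
              [ 0.3,  0.1, -1.0]])
f = np.array([10.0, -4.0, 2.0])       # observed feature vector (p*)
f_goal = np.zeros(3)                  # desired feature vector (p#)
gain, dt = 0.5, 0.1                   # proportional gain K_P and time step

for k in range(300):
    error = f - f_goal
    if np.linalg.norm(error) < 1e-3:
        break
    nu = -gain * np.linalg.pinv(J) @ error   # nu = -K_P * J^+ * (f - f_goal)
    f = f + dt * (J @ nu)                    # toy scene: features respond to motion

print("stopped after", k, "iterations; residual error", np.linalg.norm(error))
```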

1.2 Statement of Research

Based on the previous section, there exists a clear opportunity to advance robotic vision in the

area of visual control with respect to refractive objects using LF cameras. The main research

question of this thesis is thus:

How can we enable robotic vision systems to visually control their motion around refractive

objects in real-world environments, using a light-field camera and without prior models of the

objects?

The primary research question can be decomposed into sub-questions:

1. How can we visually servo using a light-field camera?

2. How can we detect refractive objects using a light-field camera?

3. How can we servo towards a refractive object?


Our hypothesis is that we can develop a novel light-field feature based on the dense and uniform

observations of a Lambertian point captured by an LF camera, which we refer to as a Lamber-

tian light-field feature. We can use this Lambertian light-field feature to perform visual servoing

in Lambertian scenes. We can observe how this light-field feature becomes distorted by a re-

fractive object and use these changes to distinguish refracted image features from Lambertian

image features. Using insight from visual servoing with the Lambertian light-field feature and

distinguishing refracted image features in the LF, we can propose a novel refracted light-field

feature to directly control the robot pose with respect to a refractive object, without needing a

prior model of the object. We define a refracted light-field feature as the projection of a feature

in the LF that has been distorted by a refractive object. The key challenges in showing this will

be in understanding how the Lambertian light-field feature changes with respect to camera pose,

how to characterise the changes in our light-field feature caused by a refractive object, and how

the LF changes as the robot moves towards a refractive object.

1.3 Contributions

The broad topics addressed in this thesis are (1) image-based visual servoing using a light-field

camera, (2) detecting refracted features, and (3) visual servoing towards refractive objects. The

specific contributions are as follows:

Light-field image-based visual servoing – partially published as [Tsai et al., 2017]

1. We propose the first derivation, implementation and experimental validation of light-field

image-based visual servoing (LF-IBVS). In particular, we define an appropriate compact

representation of an LF feature that is close to the form measured directly by LF cameras

for Lambertian scenes. We derive continuous- and discrete-domain image Jacobians for

the light field. Our LF feature enforces LF geometry in feature detection and correspon-

dence. We experimentally validate LF-IBVS in simulation and on a custom LF camera

adapter, called the MirrorCam, mounted on a robot arm.


2. We show that our method of LF-IBVS outperforms conventional monocular and stereo

image-based visual servoing in the presence of occlusions.

Distinguishing refracted image features – partially published as [Tsai et al., 2019]

1. We develop an LF feature discriminator for refractive objects. In particular, we develop

a method to distinguish a Lambertian image feature from a feature whose rays have been

distorted by a refractive object, which we refer to as a refracted image feature. Our

discriminator can distinguish refractive objects more reliably than previous work. We also

extend refracted image feature discrimination capabilities to lenslet-based LF cameras

which typically have much smaller baselines than conventional LF camera arrays.

2. We show that using our method to reject most of the refracted image feature content

enables monocular SfM in scenes containing refractive objects, where traditional methods

otherwise fail.

Light-field features for refractive objects

1. We define a representation for a refracted LF feature that approximates the local surface

area of the refractive object as two orthogonal surface curvatures. We can then model the

local part of the refractive object as a toric lens. The properties of the local projections

can then be observed by and extracted from the light field.

2. We evaluate the feature’s continuity with respect to LF camera pose for a variety of dif-

ferent refractive objects to demonstrate the potential for our refracted LF feature’s use in

vision-based control tasks, such as visual servoing.

1.4 Significance

This research is significant because it will provide robots with hand-eye coordination skills for

objects that are difficult to perceive. It is a critical step towards enabling robots to see and


interact with refractive objects. Specifically, with an improved understanding of how refractive

objects behave in a single light field, robots can now distinguish refractive objects and reject

the refracted feature content. Robots can then move in scenes containing refractive objects

without having their pose estimates corrupted by the refracted scene content. Being able to

describe refractive objects in the light field and then servo towards them enables more advanced

grasping and manipulation tasks for robots.

Furthermore, applications of understanding the behaviour of refractive objects in the light field as robots move are not limited to structure from motion and visual servoing. This theory could

help improve visual navigation and even SLAM applications for domestic and manufacturing

robots. Ultimately, this research will enable manufacturing robots to quickly manipulate objects

encased in clear plastic packaging. Domestic robots will be able to more reliably clean glasses

and serve drinks. Medical robots will more safely operate on transparent objects, such as human

eyes. Overall, this research is a significant step towards opening up an entirely new class of

objects for manipulation that have been largely ignored by the robotics community until now.

1.5 Structure of the Thesis

This thesis in robotic vision draws on theory from both computer vision and robotics research

communities. Chapter 2 provides the necessary background relevant to the remainder of this

thesis, including a description of light transport and light capture. Specifically, we explain the

difference between specular and diffuse reflections, as well as Lambertian and non-Lambertian

reflections and refraction. We discuss image formation with respect to monocular, stereo, mul-

tiple camera and LF camera systems. We then discuss visualization of 4D LFs and 4D LF

geometry, which are built on in the following chapters.

In Chapter 3, we provide a review of the relevant literature surrounding three topics, image

features, VS and refractive objects. Because VS systems typically rely on tracking image fea-


tures in a sequence of images, we first include a review of image features, how they have been

used in VS systems and how image features have been used in LF cameras. Second, we discuss

the major classes of VS systems, position-based and image-based systems, in the context of LF

cameras and refractive objects. Third, we review a variety of methods that have been explored

to automatically detect and perceive refractive objects in both computer and robotic vision. Altogether, this chapter explains that traditional image features are insufficient for dealing with refractive objects, that LF cameras have not yet been considered for VS systems, and that other methods for perceiving refractive objects are impractical for most mobile robotic platforms or rely on assumptions that significantly narrow their application window. Thus, there is a gap for methods that do not rely on 3D geometric models of refractive objects and that apply

to a wide variety of object shapes. Using LF cameras for VS towards refractive objects therefore

carves out a niche in the research community that leaves room for scientific exploration.

As mentioned previously, LF cameras are of interest for VS because they can capture the be-

haviour of view-dependent light transport effects, such as occlusions, specular reflections and

refraction within a single shot. However, VS with an LF camera for basic Lambertian scenes

has not yet been explored. As an initial investigation, we first focus on using an LF camera to

servo in Lambertian scenes. Chapter 4 develops a light-field feature for Lambertian scenes,

which we later refer to as a Lambertian light-field feature. This feature exploits the fact that a

Lambertian point in the world induces a plane in the 4D LF. Afterwards, we derive the relations

between differential feature changes and resultant robot motion. Using this feature, we then

present the first development of light-field image-based VS for Lambertian scenes and compare

its performance to traditional monocular and stereo VS systems.

Next, Chapter 5 presents our method to distinguish a Lambertian image feature from a feature

whose rays have been distorted by a refractive object, which we refer to as a refracted feature.

We do this by characterising the apparent motion of an image feature in the light field and evaluating how well this apparent motion matches the model of an ideal Lambertian image feature (which is based upon the plane in the 4D LF). We apply this method to the problem of


SfM, allowing us to reject most of the refracted feature content, which enables monocular SfM

using the Lambertian parts of the scene, in scenes containing refractive objects where traditional

methods would normally fail.

In Chapter 6, we combine the Lambertian light-field feature definition and LF-IBVS frame-

work from Chapter 4, with the concept of the refracted image feature from Chapter 5, to explore

the concept of a refracted light-field feature. This chapter is largely focused on investigating

how the 4D planar structure of a light-field feature can be extended to refractive objects, ex-

tracted from a single light field, and how this structure changes with respect to viewing pose.

We demonstrate this feature’s suitability for VS with respect to pose change, and lay the ground-

work for a system to visual servo towards refractive objects.

The unifying theme underlying the contributions of this thesis is exploring and exploiting the

properties of the light field for robotic vision. In Chapter 4, we develop a Lambertian light-field feature for visual servoing, and in Chapter 5, we propose a method to detect refractive objects. Both of these investigations exploit the fact that a Lambertian point in the world induces a plane in the 4D LF. In Chapter 6, we use the induced plane to propose a method that enables visual servoing towards refractive objects. Throughout this thesis, the dense and uniform sampling of the light field induces patterns that we exploit to improve robotic vision algorithms.

Finally, conclusions and suggestions for further work are presented in Chapter 7.


Chapter 2

Background on Light Transport & Capture

This chapter begins with a background on how light is transported through scenes, including

reflection and refraction. We then discuss single image formation with conventional monocu-

lar cameras and extend the discussion to LF cameras. Finally, we illustrate how we typically

visualize LFs and discuss the theory of 4D LF geometry.

2.1 Light Transport

In order to understand refractive objects and LF cameras, it is important to first understand light

transport, the nature of light and how it interacts with matter. Light is an electromagnetic wave,

but when the wavelength of light is small relative to the size of the structures it interacts with,

we can neglect the more complex wave-like behaviours of light and focus on the particle-like

behaviours of light, where light can be described as rays that move in straight lines within a

constant medium [Pedrotti, 2008]. This approximation describes most phenomena measured

by human eyes, most cameras and most robotic vision systems.


2.1.1 Specular Reflections

When light rays hit a surface, light is reflected. The law of reflection states that for a smooth,

flat and mirror-like surface, the reflected light ray is on a plane formed by the incident light ray

and the surface normal. Additionally, the angle of reflection θr is equal to the angle of incidence

θi [Lee, 2005], as shown in Fig. 2.1a. If we know the surface geometry and the incident light

ray, then we can recover the direction of the reflected ray. Alternatively, if we know the incident

and reflected light rays, then we can determine the geometry (normals) of the reflective surface.

For surfaces that are not perfect mirrors, specular reflections can still occur, taking the form of a

narrow angular distribution of the reflected light. The ratio of reflected light to incident light is

known as the reflectance and values of more than 99% can be achieved through a combination of

surface polishing and advanced coatings [Freeman and Fincham, 1990]. Examples of specular

reflective materials are metal, mirrors, glossy plastics and shiny surfaces of transparent objects.

2.1.2 Diffuse Reflections

Most real surfaces are not perfect mirrors. Instead, they are often rough and produce diffuse

reflections. Light interacts with rough surfaces via penetration, scattering, absorption and re-emission from the surface. These surfaces are commonly modelled using a distribution of


Figure 2.1: (a) The law of reflection for a smooth surface. The angle of incidence θi is

equal to the angle of reflection θr about the surface normal N on the plane of incidence. The

reflection off a smooth surface illustrates a specular reflection. (b) The reflections from a rough

surface of micro-facets illustrate a diffuse reflection.


micro-facets. Each facet acts like a small smooth surface that has its own, single surface normal,

which varies from facet to facet, as in Fig. 2.1b. The extent to which the micro-facet normals

differ from the smooth surface normal is a measure of surface roughness. The distribution of

the micro-facet normals creates a broad angular distribution of reflected light, which is known as

a diffuse reflection. Some examples of diffuse materials include wood and felt.

2.1.3 Lambertian Reflections

The Lambertian surface model is often referred to as the isotropic radiance constraint, the

brightness constancy assumption, or the photo consistency assumption in computer graphics

and robotic vision. Each point on a Lambertian surface reflects light with a cosine angular

distribution, as shown in Fig. 2.2a, where θ is the viewing angle relative to the surface normal.

However, when a surface is viewed with a finite field of view (FOV), the surface area seen by the

observer is proportional to 1/ cos θ. As θ approaches 90◦, more surface points become visible

to the observer. The observed radiance (amount of reflected light) comes from the reflected in-

tensity from each surface point (∝ cos θ) multiplied by the number of points seen (∝ 1/ cos θ),

which cancels out and is thus independent of θ. This results in the observed radiance being

roughly equal in all directions [Lee, 2005], as shown in Fig. 2.2b. The Lambertian model is

very common in computer graphics, and often implicitly assumed in many robotic vision al-

gorithms. However, this assumption is invalid for specular reflections and refractive objects,

which motivates us to consider non-Lambertian reflections and refraction.
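Written compactly, the cancellation described above is the following (a restatement of the prose, with I0 denoting the intensity reflected by a single surface point at normal incidence):

```latex
% Observed radiance of a Lambertian surface as a function of viewing angle theta
L_{\mathrm{obs}}(\theta)
  \;\propto\;
  \underbrace{I_0 \cos\theta}_{\text{intensity per surface point}}
  \times
  \underbrace{\tfrac{1}{\cos\theta}}_{\text{surface points seen}}
  \;=\; I_0 ,
  \qquad 0 \le \theta < 90^{\circ}.
```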

2.1.4 Non-Lambertian Reflections

Although the Lambertian assumption has been shown to work quite well in practice for most

scenes and surfaces [Levin and Durand, 2010], there remains a variety of surfaces, such as those

polished or shiny, that reflect light in a manner that does not follow the Lambertian model. These


Figure 2.2: (a) The cosine distribution of a Lambertian reflection at a point with an observer at

viewing angle θ. (b) A Lambertian reflection has an observed radiance approximately equal in

all directions. The appearance of the ray stays the same regardless of viewing angle.

non-Lambertian reflections occur when at least part of the reflected light is dependent on viewing

angle, as shown in Fig. 2.3. Such surfaces are not perfectly smooth because of the molecular

structure of materials; however, when the irregularities are less than the wavelength of incident

light, the reflected light becomes increasingly specular. This means that even rough surfaces can

exhibit some degree of non-Lambertian reflections when viewed at a sufficiently sharp angle.

Shiny surfaces involve both specular and diffuse surface reflections. A common approach to

dealing with these non-Lambertian surfaces is to use the dichromatic reflection model [Shafer,

1985], which separates the reflections into specular and diffuse components. The diffuse com-

ponent is modelled as Lambertian and the rest is attributed to a non-Lambertian reflection. The

relative amount of these two components depends on material properties, geometry of light

source, observer viewing pose and surface normal [Corke, 2017]. The model is valid for mate-

rials such as woods, paints, papers and plastics, but excludes materials, such as metals. In the

graphics community, there are more advanced models. Schlick [Schlick, 1994], Lee [Lee, 2005]

and Kurt [Kurt and Edwards, 2009] are good reference surveys of modelling non-Lambertian

light reflection.


Figure 2.3: A non-Lambertian reflection has an uneven reflected light distribution. The reflected

intensity, and thus appearance of the ray changes with viewing angle.

2.1.5 Refraction

Refractive objects pose a major challenge for robotic vision because they typically do not have

an appearance of their own. Rather, they allow light to pass through them and in the process

distort or change the direction of the light. When light passes through an interface—a boundary

that separates one medium from another—light is partially reflected and transmitted. Refraction

occurs when light rays are bent at the interface. Assuming the media are isotropic, the amount

of bending is determined by the media’s index of refraction (IOR) n and Snell’s Law.

Snell’s law of refraction, illustrated in Fig. 2.4, relates the sines of the angles of incidence θi

and refraction θr at an interface between two optical media based on their IOR,

ni sin θi = nr sin θr, (2.1)

Figure 2.4: Snell’s law of refraction at the interface of two media.


where θi and θr are measured with respect to the surface normal, and ni and nr are the IOR of

the incident and refracting medium, respectively. The IOR of a medium is defined as the ratio

of the speed of light in a vacuum c over the speed of light in the medium v, given by n = c/v.

For air and most gases, n is taken as 1.0, while solid materials have higher values; for common glass, n ≈ 1.52.

As light passes from a lower to higher n, the light ray is bent towards the normal, while light is

bent away from the normal when it passes from higher to lower n.
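As a concrete illustration, the following is a minimal Python sketch of Snell's law (Eq. 2.1); the function name and the default IOR values (air and common glass, as above) are our own choices:

```python
import math

def refract_angle(theta_i_deg, n_i=1.0, n_r=1.52):
    """Angle of refraction from Snell's law; returns None on total internal reflection."""
    s = (n_i / n_r) * math.sin(math.radians(theta_i_deg))
    if abs(s) > 1.0:                      # incident angle exceeds the critical angle
        return None
    return math.degrees(math.asin(s))

print(refract_angle(30.0))               # air -> glass: bent towards the normal (~19.2 deg)
print(refract_angle(30.0, 1.52, 1.0))    # glass -> air: bent away from the normal (~49.5 deg)
print(refract_angle(60.0, 1.52, 1.0))    # beyond the critical angle (~41.1 deg): None
```

The last call illustrates the total internal reflection case discussed later in this section.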

Because the bending of light depends on the surface normal, the shape of an object as well as

its IOR play an important role in the appearance of visual features on the surface of a refractive

object. The larger the angle between the incident light and the object’s surface normal, the

larger the change in direction of the refracted light. And the thicker an object is, the longer the

light can travel through the refractive medium, resulting in a larger change in appearance.

To complicate matters more, the surfaces of transparent objects are often both reflective and

refractive. This means that a portion of the light is reflected at the surface, while another portion

is refracted through the surface. Fresnel’s equations describe the reflection and transmission of

light at the boundary of two different optical media [Hecht, 2002]. The amount of reflected

light depends on the media’s n and angle of incidence.

Furthermore, when light travels from a medium of higher to lower n, internal reflections can

occur within refractive objects. Light moving along the interface's surface normal (at normal incidence)

does not change direction. Light moving at an angle large enough to cause the refracted ray

to bend 90◦ from the normal travels along the interface itself. Such an angle is referred to as

the critical angle θc. Any incident light that has an angle greater than θc is totally reflected

back into the original medium, as per the law of reflection. This phenomenon is known as total

internal reflection, which is typically exploited in propagating light through fibre optics. Internal

reflection can cause light sources to appear within transparent objects from unexpected angles

and even disappear entirely. This further adds to the viewpoint-dependent nature of refractive

objects.


In all, refractive objects are texture-less on their own. They can refract, magnify (scale), flip

and distort the background, and even cause it to vanish at certain angles. All of these effects

depend heavily on the refractive object’s surface normals, thickness, material properties, as well

as the object’s background, so it is no surprise that refractive objects easily confuse most robotic

vision techniques that do not account for more than simple Lambertian reflections.

2.2 Monocular Cameras

Cameras are excellent sensors for robotic vision; they are compact, affordable sensors that have

low power consumption and provide a wealth of visual information. Camera systems are very

flexible in application, owing to the variety of computer vision and image processing algorithms

available. In this section, we look at monocular cameras, the central projection model and the

loss of depth information. We do this to better understand LF cameras, which can sometimes

be considered as an array of monocular cameras.

2.2.1 Central Projection Model

Image formation using a conventional monocular camera projects a 3D world onto a 2D surface.

The central projection model is often used to perform this transformation and is also referred

to as the central perspective or pinhole camera model. An illustration of how the model works

is shown in Fig. 2.5. It assumes an infinitely small aperture for light to pass through to the

image plane and sensor. The camera’s optical axis is defined as the centre of the field of view.

The geometry of similar triangles describes the projective relationships for world coordinates

P = (Px, Py, Pz) onto the image plane p = (x, y) as

x = f Px/Pz, y = f Py/Pz. (2.2)


The image plane point can be written in homogeneous form p = (x′, y′, z′) as

x′ = f Px, y′ = f Py, z′ = Pz, (2.3)

so that x = x′/z′ and y = y′/z′.

If we consider the homogeneous world coordinates P ′ = (Px, Py, Pz, 1), then the central pro-

jection model can be written linearly in matrix form as

p = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} P_x \\ P_y \\ P_z \\ 1 \end{bmatrix} = KP′, (2.4)

where K is a 3×4 matrix known as the camera matrix and p is the homogeneous image-plane coordinate of the point, expressed with respect to the camera frame [Hartley and Zisserman, 2003].

Figure 2.5: The central projection model. The image plane lies a focal length f in front of the camera's origin. A non-inverted image of the scene is formed: world point P (Px, Py, Pz) is captured at image point p(x, y) on the image plane.


The central projection model is relatively simple, requires no lenses to focus the light, and is thus commonly used throughout robotic vision [Siciliano and Khatib, 2016]. However,

the infinitely small aperture from the central projection model has the practical problem of not

letting much light from the scene onto the image sensor. This may result in dark images that

may not be useful, or impractically long exposure times for many robotic applications.

In practice, most modern cameras use optical lenses to achieve reasonable image exposure.

However, the central projection model does not include geometric distortions or blurred effects

caused by lenses and finite-sized apertures. Thus, the central projection model is often aug-

mented with additional terms to account for the image distortion caused by lenses. See Sturm et

al. for a survey on other camera models, including models for catadioptric and omnidirectional

cameras [Sturm et al., 2011].

2.2.2 Thin Lenses and Depth of Field

The infinitely small aperture of the central projection model is a mathematical approximation

only and does not physically exist. In fact, as the aperture size approaches a certain limit (related

to the wavelength of the observed light and the shape of the aperture), diffraction blur increases (resulting in a blurrier image), making arbitrarily small apertures impractical. Thus all

apertures have a nonzero diameter. And in practice, optical lenses are used to allow for a much

larger aperture so that more light from the scene can reach the image sensor.

It is typical to assume that the axial thickness of the lens is small relative to the radius of

curvature of the lens, which means that the lens is “thin”. It is also common to assume that

the angles the light rays make with the optical axis of the lens are small, which is known as

the paraxial ray approximation. Thus, assuming thin lenses and paraxial rays, the mathematics

describing the behaviour of lenses can be significantly simplified. As shown in Fig. 2.6, the

light rays emitting from a point P in the scene pass through the lens and converge to a point


behind the lens based on the thin lens formula,

1/zi + 1/zo = 1/f, (2.5)

where zo is the distance to the subject, zi is the distance to the image, and f is the focal length

of the lens. Therefore, we can determine the distance along the z-axis of P , given the lens’

focal length and the distance of the image formed by the lens.
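A minimal Python sketch of Eq. (2.5), rearranged to recover either the image distance or the subject depth (the 50 mm focal length is an assumed value):

```python
def image_distance(z_o, f):
    """Distance behind the lens at which a point at depth z_o comes into focus (Eq. 2.5)."""
    return 1.0 / (1.0 / f - 1.0 / z_o)

def object_distance(z_i, f):
    """Depth of a point, given the focal length and where its image forms (Eq. 2.5)."""
    return 1.0 / (1.0 / f - 1.0 / z_i)

f = 0.05                              # assumed 50 mm lens
z_i = image_distance(2.0, f)          # a subject 2 m away focuses ~51.3 mm behind the lens
print(z_i, object_distance(z_i, f))   # -> 0.05128..., 2.0
```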

The trade-off with the larger aperture from the lens is that the incoming light rays can only be

focused at a certain depth. Objects at different depths produce rays that converge at different

points behind the lens. A cone of light rays that converges to a point on the image plane is considered to be in focus at that point of convergence, as shown in Fig. 2.7. When the point of convergence does not lie on the image plane, the rays occupy an area on the image plane and

appear blurred. This area is known as the circle of confusion c and is useful for describing how

“sharp” or in focus a world point appears in an image.

Real lenses are not able to focus all rays to perfect points and the smallest circle of confusion

that a lens can produce that is indistinguishable from a point on the image plane is often referred

to as the circle of least confusion, which also depends on pixel size. If c is smaller than a pixel, it

is usually indistinguishable from a point on the image plane and thus considered to be in focus,

even if the focused light does not converge to a point that strictly lies on the image plane. This

leads cameras to have a nonzero depth of field, the range of distances at which objects in the

scene appear in focus on a discrete, digital sensor.
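A hedged sketch of this behaviour, combining the thin-lens formula with a blur-spot diameter obtained from similar triangles (aperture diameter A, in-focus image distance z_i, sensor plane at z_s); the f-number, pixel pitch and depths are assumed values:

```python
def circle_of_confusion(z_o, z_s, f, aperture):
    """Blur-spot diameter for a point at depth z_o when the sensor sits at z_s behind the lens."""
    z_i = 1.0 / (1.0 / f - 1.0 / z_o)       # thin-lens image distance, Eq. (2.5)
    return aperture * abs(z_s - z_i) / z_i  # similar triangles through the aperture

f, A = 0.05, 0.05 / 2.8                     # 50 mm lens at f/2.8
z_s = 1.0 / (1.0 / f - 1.0 / 2.0)           # sensor placed to focus a subject at 2 m
pixel = 5e-6                                # 5 micron pixel pitch
for z_o in (1.5, 1.9, 2.0, 2.5):
    c = circle_of_confusion(z_o, z_s, f, A)
    print(z_o, round(c * 1e6, 1), "um ->", "in focus" if c <= pixel else "blurred")
```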

A small aperture provides a large depth of field, which is desirable to keep more of the scene

in focus. However, the small aperture admits less light. Less light can lead to issues with noise

and these issues cannot always be compensated for by simply increasing the exposure time,

due to motion blur. There is therefore an integral link between depth of field, motion blur and

signal-to-noise ratio, with relationships determined by exposure duration and aperture diameter.


Figure 2.6: Image formation for a thin lens, shown as a 2D cross section. By convention, the

camera’s optical axis is the z-axis with the origin at the centre of the thin lens.

2.2.2.1 Monocular Depth Estimation

As the 3D world is projected onto a 2D surface, the mapping is not one-to-one and depth

information is lost. A unique inverse of the central projection model does not exist. Given an

image point p(x, y), we cannot uniquely determine its corresponding world point P (Px, Py, Pz).

In fact, P can lie at any distance along the projecting ray CP in Fig. 2.5. This is known as the

scale ambiguity and is a significant challenge for robots striving to interact in a 3D world using

only 2D images.

A variety of strategies can be applied to compensate for this loss, such as active vision [Krotkov

and Bajcsy, 1993], depth from focus [Grossmann, 1987,Krotkov and Bajcsy, 1993], monocular

SfM [Hartley and Zisserman, 2003,Schoenberger and Frahm, 2016], monocular SLAM [Civera

et al., 2008], and learnt monocular depth estimation [Saxena et al., 2006, Godard et al., 2017].

However, without prior geometric models, very few of these methods apply to refractive objects

as we will later discuss in Chapter 3. It is therefore worth considering stereo and other camera

systems to exploit more views for depth information.


Figure 2.7: Diagram illustrating the circle of confusion c for a point source passing through a

lens of diameter d. The point source is focused behind the lens (top), in focus (middle) and

focused in front of the lens (bottom).

2.3 Stereo Cameras

Stereo camera systems use two cameras and the known geometry between them to obtain depth

through triangulation in a single sensor capture. Given the corresponding image points p1, p2

and both camera poses, the 3D location of world point P can be determined. Epipolar geometry

defines the geometric relationship between the two images captured by the stereo camera system

is illustrated in Fig. 2.8, and can be used to simplify the stereo matching process required for

stereo triangulation.

As in Fig. 2.8, the centre of projection for each camera is given as {1} and {2}. The 3 points,

P , {1} and {2} define a plane known as the epipolar plane. The intersection of the epipolar

plane and the image plane for cameras 1 and 2 define the epipolar lines, l1 and l2, respectively.

These lines constrain where P is projected into each image at p1 and p2. Given p1, we seek p2

in I2. Rather than searching the entire image, we need only search along l2. Conversely, given

p2, we can find p1 on l1.


Figure 2.8: Epipolar geometry used for stereo camera systems. The epipolar plane is defined by

point P and camera centres of {1} and {2}. Note that {1} and {2} define the reference frames

of cameras 1 and 2, respectively. The intersection of the epipolar plane with the two image

planes I1 and I2 define the epipolar lines l1 and l2, respectively. Given 1p, knowledge of l1 and l2 can reduce the search for the corresponding image point in the second image plane from

a 2D to a 1D problem. Seven pairs of corresponding image points p are required to estimate

the fundamental matrix, in order to recover the translation 1T2 and rotation 1R2 of {2} in the

reference frame of {1}.

This important geometrical relationship can be encapsulated algebraically in a single matrix

known as the fundamental matrix F [Corke, 2013],

F = K^{-T} [T]_× R K^{-1}, (2.6)

where K is the camera matrix, [T]_× is the skew-symmetric matrix of the translation vector T, and R is the rotation between the two camera poses. For any pair of corresponding image points 1x and 2x, F satisfies

2x^T F 1x = 0. (2.7)


The fundamental matrix is a 3 × 3 matrix with 7 degrees of freedom (DOF). Thus a minimum of 7 unique pairs of corresponding points is required to compute F.

A typical stereo camera arrangement is for two cameras with parallel optical axes, both or-

thogonal to their baseline. This yields horizontally-aligned epipolar lines and further simplifies

the correspondence search from image lines to image rows. For this setup, it is assumed that

both cameras have the same focal length f and a baseline separation b. In the case of a typical

horizontally-aligned stereo camera system, for image points p1(u1, v1) and p2(u2, v2), the dis-

parity d is given as d = u2 − u1. The disparity is a measure of motion parallax. The depth Z can then be computed using

Z = fb/d, (2.8)

which shows that d is inversely proportional to depth.
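A minimal Python sketch tying Eqs. (2.6)-(2.8) together under assumed values: shared intrinsics K, a pure horizontal baseline, and the convention that a point X1 in camera-1 coordinates maps to camera-2 coordinates as X2 = R X1 + T:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

K = np.array([[800.0, 0.0, 320.0],          # assumed shared intrinsics (pixels)
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                               # parallel optical axes
T = np.array([0.1, 0.0, 0.0])               # assumed 10 cm horizontal baseline

# Fundamental matrix, Eq. (2.6): F = K^-T [T]x R K^-1
F = np.linalg.inv(K).T @ skew(T) @ R @ np.linalg.inv(K)

# Verify the epipolar constraint, Eq. (2.7), with a synthetic world point.
P1 = np.array([0.3, -0.2, 2.0])             # point in camera-1 coordinates (metres)
P2 = R @ P1 + T
p1 = K @ (P1 / P1[2])                       # homogeneous pixel coordinates
p2 = K @ (P2 / P2[2])
print(p2 @ F @ p1)                          # ~0 up to numerical precision

# Depth from disparity for the rectified case, Eq. (2.8): Z = f*b/d
f, b = K[0, 0], np.linalg.norm(T)
d = p2[0] - p1[0]                           # disparity d = u2 - u1 (pixels)
print(f * b / d)                            # recovers Z = 2.0 m
```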

However, stereo methods are limited to a single fixed baseline. In this configuration, edges

parallel to this baseline do not yield easily observable disparity (especially when the edge spans

the width of the image), and thus their depths are not easily determined. Additionally, having only two views of the same scene means that feature correspondence must be performed on just two

images. In Lambertian scenes, this is sufficient; however, stereo vision can fail under significant

appearance changes, for example in the presence of occlusions and non-Lambertian objects.

2.4 Multiple Cameras

Additional cameras yield more views and with them more redundancy; however, the configura-

tion of their relative poses is important. Introducing a third camera creates a trinocular camera

system. While stereo uses the 3×3 fundamental matrix, tri-camera methods use a 3×3×3 ten-

sor, known as the trifocal tensor. These tensors can be determined from a set of corresponding

image points from 2 and 3 views, respectively. These tensors can then be decomposed into the

cameras’ projection matrices, after which triangulation can be used to recover the 3D positions


of the points. According to Hartley and Zisserman, the quadrifocal tensor exists for 4 views,

but it is difficult to compute, and the tensor method does not extend to n views [Hartley and

Zisserman, 2003].

Furthermore, multi-camera vision systems are not necessarily limited to regularly-sampled grids

aimed at the same scene. For example, a common commercial multi-camera configuration is

to have six 90◦ FOV cameras mounted together to provide a 360◦ panoramic view. In this

configuration, there is very little scene overlap between the cameras, which provides very little

redundancy. As we will later show in Ch. 5, redundancy of views from different perspectives at

regular intervals is extremely important for characterising the appearance of refractive objects

as a function of viewpoint.

2.5 Light-Field Cameras

LF cameras are based on the idea of computational photography, in which a large part of the

image capture process is performed by software rather than hardware. LF cameras belong to

the greater class of generalised cameras [Li et al., 2008, Comport et al., 2011]. In this section,

we begin by introducing the plenoptic function as a means of modelling light from all possible

views in space. Under certain assumptions and restrictions, we explain how we can reduce

the plenoptic function to a 4D LF that captures a more manageable representation of multiple

views. We then discuss the most common LF parameterisation, the architectures of cameras that capture LFs, how captured LFs are decoded from raw sensor measurements into this parameterisation, and why we expect light-field cameras to be suitable for dealing with refractive

objects.


2.5.1 Plenoptic Function

Light is more than a “2D image plus depth” for a single perspective. Light is a much higher-

dimensional phenomenon. Adelson and Bergen [Adelson and Bergen, 1991] introduced the

plenoptic function as a means of representing light and encapsulating all possible views in

space. The term “plenoptic” was coined from the root words for “all” and “seeing”, so the

plenoptic function conceptualizes all the properties of light in a scene. Light is modelled as

rays, each of which can be described using seven parameters: three spatial coordinates (px, py, pz) that define the ray's position (of the source), two orientation coordinates (θ, φ) that define the ray's elevation and azimuth, the wavelength λ that accounts for the colour of light, and time t.

Together, these 7 parameters yield the plenoptic function,

P (px, py, pz, θ, φ, λ, t), (2.9)

which is the intensity of the ray as a function of space, time and colour, also illustrated in

Fig. 2.9. Thus the plenoptic function represents all the light flowing through every point in

a scene through all space and time. The significance of the plenoptic function is best put in

Adelson and Bergen’s own words [Adelson and Bergen, 1991]:

The world is made of three-dimensional objects, but these objects do not commu-

nicate their properties directly to an observer. Rather, the objects fill the space

around them with the pattern of light rays that constitutes the plenoptic function,

and the observer takes samples from this function. The plenoptic function serves

as the sole communication link between physical objects and their corresponding

retinal images. It is the intermediary between the world and the eye.

To explain how cameras typically sample the plenoptic function, we consider a monocular,

monochrome camera. First, time is sampled by setting a small shutter time on the camera. The

camera’s photosensor integrates over a small amount of time as the photosites are exposed to


Figure 2.9: The plenoptic function models all the light flowing through a scene in 7 dimensions

of the plenoptic function, 3 for position, 2 for direction, 1 for time and 1 for wavelength.

the scene, and incoming photons are counted by the sensor. Exposure time and aperture size

directly affect the exposure of an image by establishing a trade-off between depth of field, image

brightness and motion blur.

Second, the wavelength is sampled from the plenoptic function by integrating the incoming

light over a small band of wavelengths. Each photosite uses a filter to select a specific range of

wavelengths (typically red, green, or blue), although in practice, the luminosity curves overlap,

especially for red and green.

Third, the position is sampled in a camera by setting an aperture. The aperture determines

the positions of the rays seen by the camera. This is typically idealized as an infinitely small

pinhole, whereby all the light in the scene passes through to project an inverted image of the

scene at one focal length from the pinhole. The location of this pinhole is the camera origin, a

3D point known as the nodal point.

Finally, direction is sampled from the plenoptic function in a conventional camera. Each pixel

integrates the scene luminance over a range of both direction angles. The range of directions

that the camera can capture is called the FOV. The parameters that determine the FOV are the

focal length and the pixel size and number in both the x- and y-directions. As we integrate

over the directions, we also project the scene onto the camera sensor, which is where scale


information is lost. Additionally, only unoccluded objects are projected onto the sensor in a

Lambertian scene. Occluded objects are therefore not measured by a conventional camera, with

the exception of transparency and translucency.
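For reference, a small hedged sketch of how focal length, pixel size and pixel count determine the FOV (all values assumed):

```python
import math

def fov_deg(focal_length, pixel_size, num_pixels):
    """Full angular field of view along one sensor axis."""
    sensor_extent = pixel_size * num_pixels
    return math.degrees(2.0 * math.atan(sensor_extent / (2.0 * focal_length)))

print(fov_deg(0.008, 5e-6, 1280))   # horizontal FOV of an 8 mm lens, 5 um pixels, 1280 columns (~43.6 deg)
print(fov_deg(0.008, 5e-6, 960))    # vertical FOV with 960 rows (~33.4 deg)
```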

Although not part of the original 7D plenoptic function, the 8D plenoptic function includes

polarisation and is worth mentioning [Georgiev et al., 2011]. Linear polarisation is when light

waves have vibrations only in a single plane, while unpolarised light has vibrations that occur

in many planes, or equivalently many directions that can also change rapidly. Unpolarised

light may be better thought of as a mixture of randomly polarised light. In terms of sampling

polarisation, a camera can measure different polarisations by employing polarisers that sample

only certain polarisations. Since humans do not have any natural polarisation filters built into

their eyes, regular cameras are not typically built with polarisers. Thus most cameras measure

unpolarised light by integrating over all the different polarisations.

Therefore, with a monocular camera we are integrating over small intervals of position, time and

wavelength and over a larger range of direction. This means we are sampling only 2 dimensions

of the 7D plenoptic function, which results in a 2D image. We note that RGB cameras provide

colour images, which may seem like a sampling over wavelength. Wavelength is measured at

3 (relatively) small intervals that are unevenly spaced and overlap to some degree; however,

wavelength is not sampled in the signal processing sense of multiple, regularly-spaced mea-

surements along the spectrum of wavelengths. Thus we consider images from RGB cameras

as 2D images. In order to overcome the aforementioned limitations of 2D images, we must

consider how to capture multiple views within a single sensor capture. This can be achieved by

capturing the light field.

2.5.2 4D Light Field Definition

The light field was first defined by Gershun in 1936 as the amount of light travelling in every

direction through every point in space [Gershun, 1936], but was only reduced from the 7D


plenoptic function to the more tractable 4D LF as a function of both position and direction

in free space by Levoy and Hanrahan [Levoy and Hanrahan, 1996] and Gortler et al. [Gortler

et al., 1996] in 1996. Interestingly, the 4D LF was initially developed by the computer graphics

community to render new views of a scene given several views of the existing scene, without

involving the complexities associated with geometric, lighting and surface models [Levoy and

Hanrahan, 1996]. However, the 4D LF has recently proved useful for computer vision and

robotics for solving the inverse problem: extracting scene structure given several images of the

scene.

In order to reduce the 7D plenoptic function to a 4D light field, we first integrate over small

intervals of time and wavelength. This reduces the plenoptic function to 5D. Since the radiance

along rays in free space is constant in non-attenuating media, we can further reduce the plenoptic

function from 5D to 4D. This means that the light rays do not change their value as they pass

through the scene, implying that rays do not pass through objects¹ and do not change in their intensity as they pass through the air.

¹Seemingly, this implies that the 4D LF cannot capture the behaviour of refractive objects; however, in the subsequent chapters of this thesis, we will show that this is not the case. We can look at the relative changes between views in the LF to infer the distortion caused by refractive objects.

An alternate geometric way of understanding the dimensional reduction from 5D to 4D is to

consider a ray defined by a point in 3D space, and a normalized direction. The ray is thus

defined by 5 parameters. In free space, the value of the ray does not change as we move the

point along the ray’s axis and thus the value of the plenoptic function is the same for many

combinations of these 5 parameters. If we fix our point to be on the xy-plane, i.e. set z = 0,

then we have 4 independent parameters that describe the ray; thus we have reduced the 7D

plenoptic function to a 4D LF representation.

The 4D LF is the smallest sample of the plenoptic function needed to encode multiple views

of the scene. Multiple views are of interest because they contain much more information about

the scene. As illustrated in Fig. 1.3, the classic pinhole camera model can be considered a tiny


peephole through a wall that grants only a single view of the scene from a single viewpoint,

while an LF can be thought of as a window through the wall that grants multiple views of the

scene as we move behind the window. In relation to conventional cameras, an LF image can be

thought of as a set of 2D images of the same scene, taken from a range of 3D positions in space.

Typically these 3D positions are constrained to a planar array for simplicity. The LF is valid for

non-attenuating medium. Novel views can be rendered from the LF. Occlusions are reproduced

correctly in the LF, but we cannot render views behind occluding objects.

2.5.3 Light Field Parameterisation

There are many different parameterisations of the LF, but the simplest and most common is

the two-plane parameterisation (2PP) [Levoy and Hanrahan, 1996]. With this parameterisation,

a ray of light is described by a set of coordinates φ = [s, t, u, v]T, which are the ray’s points

of intersection with two parallel reference planes separated by an arbitrary distance D. The T

represents the vector transpose. The two reference planes are denoted by (s, t) and (u, v). By

convention, the (s, t) plane is closest to the camera and the (u, v) plane is closer to the scene [Gu

et al., 1997], shown in Fig. 2.10.

Figure 2.10: The two-plane parameterisation (2PP) of the 4D LF. Shown here is the relative

parameterisation, where u, v are defined relative to s, t between the two planes, separated by

distance D. From a Lambertian point P , a light ray φ passes through both planes, and can be

represented by the four coordinates from the two planes, s, t, u and v.


In the relative parameterisation, u and v are expressed relative to s and t, respectively. In the

absolute parameterisation, u and v are expressed in absolute coordinates. We note that all four

dimensions are required to define position and direction. It is a matter of convention to discuss

which plane defines position or direction. For the purposes of this work, we choose (s, t) as

spatial (position) and (u, v) as angular (direction) dimensions, respectively. In this sense, s, t

fix a ray’s position and u, v fix its direction. A convenient way to interpret the 2PP is as an array

of cameras with parallel optical axes and orthogonal baselines, as illustrated in Fig. 1.3. The

camera apertures are on the s, t plane facing the u, v plane. The s, t plane can be thought of

as a collection of all the viewpoints available within the LF camera. If the separation distance

D is chosen to be the focal length f of the cameras, then (u, v) correspond to the image plane

coordinates of the physical camera sensor.
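A minimal Python sketch of the relative 2PP: a ray φ = [s, t, u, v] starts at (s, t, 0) on the s, t plane and passes through (s + u, t + v, D) on the u, v plane (the plane separation D and the sample values are assumed):

```python
import numpy as np

def ray_from_2pp(phi, D):
    """Origin and unit direction of the ray phi = [s, t, u, v] under the relative 2PP."""
    s, t, u, v = phi
    origin = np.array([s, t, 0.0])     # intersection with the s,t plane
    direction = np.array([u, v, D])    # towards (s + u, t + v, D) on the u,v plane
    return origin, direction / np.linalg.norm(direction)

origin, direction = ray_from_2pp([0.01, 0.0, 0.002, -0.001], D=0.05)
print(origin, direction)
```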

Therefore, we can consider the 4D LF as a 2D array of 2D images, as shown in Fig. 2.11. In the

literature, these 2D images are sometimes referred to as sub-views, sub-images, or sub-aperture

images in the LF. Each view looks at the same scene, but from a slightly shifted viewpoint. The

key intuition is that in comparison to other robotic vision sensors like monocular cameras, stereo

cameras and RGB-D cameras, LFs can more efficiently capture the behaviour of refractive

objects in these multiple views that we might exploit.

There are alternate parameterisations to characterise the LF, such as the spherical-Cartesian

parameterisation [Neumann and Fermuller, 2003]. This method describes the ray position based

on its point of intersection with a plane and its direction with two angles, which also yields a

4D LF. The advantage of this parameterisation is that it can describe rays passing through the plane in all directions, and it may be well-suited to the design of wide-FOV cameras. Although the

2PP cannot describe rays that pass parallel to the reference planes, the 2PP is most common

because of its simplicity and the parameterisation is easily transferable to traditional camera

design and robotic vision [Chan, 2014]. A possible solution to this limitation is to use multiple

2PP’s, oriented perpendicular to each other.


Figure 2.11: Example 4D LF as a 2D array of 2D images for a refractive sphere amongst a pile

of cards. Here, only 3 × 3 images are shown, while the actual light-field is a 17 × 17 array of

2D images. The views are indexed by s and t. Typically, we refer to the view for s0 = 0 and

t0 = 0 as the central view of the LF camera. The pixels within each view are indexed by their

u, v image coordinates. Therefore, a single light ray emitting from the scene can be indexed by

four numbers, s, t, u and v. Light field courtesy of the New Stanford Light Field Archive.

2.5.4 Light-Field Camera Architectures

Light-field cameras capture multiple views of the same scene from slightly different viewpoints

in a dense and regularly-sampled manner. The most common LF camera architectures are the

light-field gantry, the camera array and the plenoptic camera, shown in Fig. 2.12.


2.5.4.1 Light-Field Camera Gantry

The camera gantry captures the LF using a single camera, moving it to different positions over

time. Thus the positions of the camera map to s, t and the image coordinates of each 2D image

map to u, v. Yamamoto was one of the earliest to consider camera gantries for 3D reconstruction [Yamamoto, 1986], while Levoy and Hanrahan were some of the first to consider

computer-assisted camera gantries for recording light fields [Levoy and Hanrahan, 1996]. The

camera gantry used to help digitize ten statues by Michelangelo is shown in Fig. 2.12a [Levoy

et al., 2000]. The camera gantry can offer much finer angular resolution in the LF than camera

arrays, because camera positioning is only limited by the mechanical precision of its actuators,

while the spatial sampling interval in a camera array is limited by the physical size of the

cameras. Additionally, there is only one camera to calibrate. However, there are high precision

requirements for camera placement and in particular, the LF is not captured within a single shot.

This usually limits the camera gantry to static scenes.


Figure 2.12: Different light-field camera architectures, (a) a camera gantry [Levoy et al., 2000],

(b) a camera array [Wilburn et al., 2005], and (c) a lenslet-based camera [Ng et al., 2005]. These

architectures all capture 4D LFs.


2.5.4.2 Light-Field Camera Array

The camera array is probably the most easily understood architecture for light-field cameras.

The array uses multiple cameras arranged in a grid to capture the LF. 2D images are collected

in an array, which straightforwardly maps to a 4D LF with camera position s, t and pixel position u, v in each camera image. A typical configuration is to arrange the cameras on a plane

with regular spacing. This architecture was first developed by Wilburn et al. [Wilburn et al.,

2005], shown in Fig. 2.12b. Camera arrays do not require special optics like plenoptic cameras;

however, there are synchronization, bandwidth, calibration and image correction challenges to

contend with. The discrete nature of the image capture can also cause aliasing artefacts in the

rendered images. Camera arrays have been historically physically large, requiring several dis-

crete sensors, although this also allows for relatively large baselines in comparison to plenoptic

cameras.

Additionally, arrays of cameras can be created virtually. A single monocular camera pointed at

an array of mirrors has been used to capture LFs [Fuchs et al., 2013, Song et al., 2015]. This

LF camera design trades off mass, bandwidth and synchronization issues for a different set of calibration issues and a limited FOV, depending on the design. In Chapter 4, we use an array of planar mirrors to create a virtual array of cameras to collect LFs for visual servoing. The use of an array of small lenses to create virtual camera arrays leads the discussion to

plenoptic cameras.

2.5.4.3 Lenslet-based LF Camera

The lenslet-based LF camera, which is sometimes referred to as a plenoptic camera, is a type

of light-field camera that has an array of micro-lenses, often referred to as lenslets, mounted

between the main lens and the image sensor, which splits the image from the main aperture into

smaller components, based on the incoming direction of the light rays, as shown in Fig. 2.13.


Lippmann first proposed to use microlenses to create crude integral photographs in 1908 [Lipp-

mann, 1908]. It was not until 1992 that Adelson and Wang placed the microlenses at the focal

plane of the camera’s main lens [Adelson and Wang, 1992]. Ng et al. [Ng et al., 2005] designed

and commercialized the “standard plenoptic camera” design of the lenslet-based LF camera,

making it hand-held and accessible to a large user base.

In the standard plenoptic camera, the main lens focuses the scene onto the lenslet array and the

lenslet array focuses the pixels at infinity. Fig 2.13b shows how the angular components of the

incoming light rays are divided by the lenslets. Each pixel underneath each lenslet corresponds

to part of the image from a particular direction. This arrangement results in a virtual camera

array in front of the main lens. For a camera with N ×N pixels underneath each lenslet, there

are N ×N virtual cameras. This yields a series of lenslet images, as in Fig. 2.14, which can be

decoded into discrete sub-views to obtain the 4D LF structure previously discussed [Dansereau

et al., 2013].
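As an illustration, a hedged sketch of this decoding step, assuming an idealized sensor in which the lenslets lie exactly on a pixel grid with N × N pixels behind each lenslet (a practical decoder, e.g. [Dansereau et al., 2013], must additionally estimate lenslet centres, rotation and colour):

```python
import numpy as np

def decode_lenslet_image(raw, n):
    """raw: (H, W) greyscale sensor image; n: pixels per lenslet side. Returns a 4D LF."""
    H, W = raw.shape
    lv, lu = H // n, W // n      # number of lenslets vertically / horizontally
    raw = raw[:lv * n, :lu * n]  # crop to a whole number of lenslets
    # Group pixels as (lenslet row, row within lenslet, lenslet col, col within lenslet), then
    # bring the within-lenslet indices (the viewpoint samples) to the front.
    lf = raw.reshape(lv, n, lu, n).transpose(1, 3, 0, 2)
    return lf                    # shape (n, n, lv, lu): (view row, view col, v, u)

raw = np.zeros((13 * 100, 13 * 100), dtype=np.float32)   # e.g. 100 x 100 lenslets, 13 x 13 pixels each
lf = decode_lenslet_image(raw, 13)
central_view = lf[6, 6]                                  # central sub-view, cf. Fig. 2.14(c)
print(lf.shape, central_view.shape)                      # (13, 13, 100, 100) (100, 100)
```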

One of the drawbacks of the standard plenoptic camera is that the final resolution of each de-

coded sub-view is limited by the number of lenslets. Georgiev and Lumsdaine developed the fo-

cused plenoptic camera, also known as the plenoptic camera 2.0, which also places the lenslets

behind the main lens, but the main lens focuses the scene inside the camera before the light

reaches the lenslets [Lumsdaine and Georgiev, 2008]. The focused plenoptic camera forms a

focused sub-image on the sensor, allowing for higher spatial resolutions at the cost of angular

resolution. Equivalently, there are fewer s, t samples in exchange for more u, v samples. Although the lower angular resolu-

tion can produce undesirable aliasing artefacts, the key contribution of this camera design was

to decouple the trade-off between the number of lenslets and the achievable resolution [Lums-

daine and Georgiev, 2009]. Commercial cameras that utilize the plenoptic camera 2.0 design

include the Raytrix [Perwass and Wietzke, 2012].


Figure 2.13: (a) For a conventional monocular camera, light rays from a point source are all

integrated over all the directions that pass through the main aperture into a single pixel value,

such that the pixel’s value depends only on pixel position. (b) For a lenslet-based LF camera, a

microlens array is placed in front of the sensor, such that pixel values depend on pixel position

as well as incoming ray angle. Decoded sub-images equivalent to that of an LF camera array can

be obtained by combining pixels from similar ray directions behind each microlens (or lenslet).

2.5.4.4 Light-Field Cameras vs Stereo & Multi-Camera Systems

The difference between LF cameras and general multi-camera or stereo systems is not apparent at first glance. Stereo systems, multi-camera systems and LF camera arrays typically

use multiple cameras to capture multiple views of the scene. The main difference is the level of

sampling of the plenoptic function. Stereo only samples the plenoptic function twice along one

direction with a fixed baseline. This means stereo can only measure motion parallax (and thus

depth) along the direction of its fixed baseline. Multi-camera systems and LF cameras sample

the plenoptic function from multiple viewpoints, and so have both short and long baselines,

as well as baselines in multiple directions (typically vertically and horizontally). This yields

depth measurements with more redundancy and thus more reliability. However, the density and

uniformity of sampling the plenoptic function matters. Multi-camera systems are not limited to

physical camera configurations where each camera is aimed at the same scene in a regular and

tightly-spaced manner.

On the other hand, LF cameras sample densely and uniformly. This simplifies the processing in

the same way that uniformly sampled signals are easier to process than non-uniformly sampled

signals. For example, consider 2D imaging devices: non-uniform 2D imaging devices are ex-

tremely rare. A few designs have been proposed, such as the foveated vision sensor, where the


Figure 2.14: (a) A raw plenoptic image of a climber’s helmet captured using a Lytro Illum. This

cropped section consists roughly of 100×100 lenslets. (b) Zoomed in on the raw plenoptic im-

age, each lenslet is visible. Each pixel in the lenslet image (roughly 13×13 pixels) corresponds

to the directional component of a measured light ray. (c) A decoded 100×100 pixel sub-image

from the light-field: the central view of the 4D light-field, which is roughly comprised of the

central pixel from each lenslet image across the entire raw image. There are 13×13 decoded

sub-images in this 4D light-field.

pixel density is varied similarly to the non-uniform distribution of cones in the human eye [Yeasin

and Sharma, 2005]. However, such designs are not common in industrial applications or the

consumer marketplace. The dominant 2D imaging devices use a rectangular, uniform distribu-

tion of pixels, which are much simpler to manufacture and process algorithmically. Therefore,

LF cameras can be considered a specific class of multi-camera systems that exploit the camera

geometry to simplify the image processing.

In particular, the dense and regular sampling of LF cameras motivates their use for visual ser-

voing and dealing with refractive objects. As we will show in Ch. 4, LF cameras can be used for visual servoing towards small and distant targets in Lambertian scenes and enable better performance in occluded scenes. Later, in Ch. 5, we show that capturing these slightly different views is sufficient to distinguish appearance changes due to camera motion from distortion caused by refractive objects. Finally, in Ch. 6, we show that LF cameras can be used to servo towards

refractive objects.


2.6 4D Light-Field Visualization

Visualizing the data is an important part of understanding the problem. While visualizing 2D

and 3D data has become common in modern robotics research, visualizing 4D data is signif-

icantly less intuitive. In order to examine the characteristics of the 4D LF, the conventional

approach is to slice the LF into 2D images. For example, a u, v slice of the LF fixes s and t to

depict the LF as u varies with respect to v. Recalling the 2PP in Fig. 2.10, it is clear that this

2D slice is analogous to viewpoint selection and corresponds to what is captured by a single

camera in a camera array. Nine different examples of u, v slices depicting the 4D LF as a 3× 3

grid of 2D images are shown in Fig. 2.15 for different values of s and t, although the actual LF

is comprised of 17× 17 2D images.

Further insight can be gained from the LF by considering different pairings of dimensions from

the LF. Consider the horizontal s, u slice, shown at the top of Fig. 2.15. This 2D image is taken

by stacking the rows of image pixels (all the u) from the highlighted yellow, red and green lines

(all the s), while holding t and v constant. Similarly, the vertical t, v slice is taken by stacking

all of the columns of pixels (all the v) from the highlighted turquoise, blue and purple lines (all

the t), while holding s and u constant, shown on the right side of Fig. 2.15.
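To make this slicing concrete, the following minimal sketch extracts a u, v view and the two EPIs from a 4D LF stored as a NumPy array. The index order L[s, t, v, u] and the grid sizes are assumptions made purely for illustration; they are not tied to any particular decoding toolbox.

```python
import numpy as np

# Assume a monochrome 4D light field sampled on a 17 x 17 grid of viewpoints,
# each a 64 x 64-pixel view, indexed as L[s, t, v, u] (layout is an assumption).
n_st, n_uv = 17, 64
L = np.random.rand(n_st, n_st, n_uv, n_uv)

# A u,v slice: fix (s, t) to select one view, analogous to a single camera
# in a camera array. Here we take the central view.
s0 = t0 = n_st // 2
central_view = L[s0, t0, :, :]    # shape (64, 64)

# An s,u slice (horizontal EPI): fix t and v, and stack the same image row
# from every horizontal viewpoint s.
v0 = n_uv // 2
epi_su = L[:, t0, v0, :]          # shape (17, 64)

# A t,v slice (vertical EPI): fix s and u, and stack the same image column
# from every vertical viewpoint t.
u0 = n_uv // 2
epi_tv = L[s0, :, :, u0]          # shape (17, 64)

print(central_view.shape, epi_su.shape, epi_tv.shape)
```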

Visualizing slices using this stacking approach is only meaningful because of the uniform and dense sampling of the LF. This method was employed by Bolles et al. [Bolles et al., 1987] for a single monocular camera translating linearly while capturing a uniform and dense sequence of images. Their volume of light was 3D and they referred to the 2D slices of light as epipolar plane images (EPIs). They were able to simplify the image feature correspondence problem from performing multiple normalized cross-correlation searches across each image to simply finding lines in the EPIs. Furthermore, for Lambertian scenes, these lines are characteristic straight lines whose slopes, as discussed in Section 2.7, reflect depth in the scene. However, as we will show in Ch. 5, these lines can be distorted and made nonlinear by refractive objects, which can be exploited for refractive object detection.


Figure 2.15: Visualizing subsets of the 4D LF as 2D slices. The u, v slice can be seen as a conventional image from a camera positioned on the s, t plane. In this figure, there are 3 × 3 u, v slices for different s, t depicting the 4D LF; the full LF is a 17 × 17 grid of images. The s, u slice, illustrated by stacking the yellow/red/pink rows of image pixels of u with respect to s, and the t, v slice, depicted by stacking the turquoise/blue/purple columns of image pixels of v with respect to t, are sometimes referred to as EPIs. For Lambertian scenes, EPIs show characteristic straight lines with slopes that reflect depth. However, these lines can be distorted and made nonlinear by refractive objects, such as the refractive sphere in the centre of these images. LF courtesy of the New Stanford Light-Field Archive.


2.7 4D Light-Field Geometry

In this section, we discuss the geometry of the LF. We start by defining geometric primitives in

2D and follow their extensions to 3D and 4D. We then go into detail with the 2PP of the LF

and discuss the point-plane correspondence and the concept of slope and depth in the LF. We

show that a ray in 3D intersects the two planes of parameterisation at two points, defined by the two pairs of image coordinates (s, t) and (u, v), and subsequently that a Lambertian point in 3D induces a

plane in the 4D LF. This theory serves as the basis for understanding the properties of the 4D LF,

which we exploit throughout this thesis for the purposes of visual servoing and discriminating

against refractive objects.

2.7.1 Geometric Primitive Definitions

First, we provide the definitions of several typical geometric primitives, including a point, a

line, a plane and a hyperplane. The definitions of dimensions and manifolds are also included

for clarity.

• Dimension: The definition of dimension, or dimensionality, varies somewhat across

mathematics. The dimension of an object is often thought of as the minimum number

of coordinates needed to specify any point within the object. More formally, dimension is

defined in linear algebra as the cardinal number of a maximal linearly independent subset

for a vector space over a field, i.e. the number of vectors in its basis.

• Degree(s) of Freedom: (DOF) The number of degrees of freedom in a problem is the

number of parameters which may be independently varied. Informally, degrees of free-

dom are independent ways of moving, while dimensions are independent extents of space.

Thus, a rigid three-dimensional object can have zero DOF if it is not allowed to change its pose, six DOF if it is allowed to translate and rotate freely, or something in between for any restricted combination of translation and rotation.


• Point: A point is a 0-DOF object that can be specified in n-dimensions as an n-tuple of

coordinates. For example, a 2D point is defined as (x, y), a 3D point as (x, y, z), and

a 4D point as (x, y, z, w), which can be described by a minimum of 2, 3 and 4 parameters, respectively. Points are synonymous with coordinate vectors. Basic structures of geometry (e.g. lines, planes, etc.) are built from an infinite number of points in a particular

arrangement. One might go as far as to say life without geometry is pointless.

• Manifold: A manifold is a topological space that is locally Euclidean, in that around every point there is a neighbourhood that resembles Euclidean space.

• Line: A line is a 1-DOF object that has no thickness and extends uniformly and infinitely

in both directions. A line is a specific case of a 1D manifold. Informally, a line extends

in both directions with no wiggles.

• Plane: A plane is a 2-DOF object that is spanned by two linearly independent vectors. A

plane is a specific case of a 2D manifold.

• Hyperplane: In an n-dimensional space, a hyperplane is any vector subspace that has

n− 1 dimensions [Weisstein, 2017]. For example, in 1D, a hyperplane is a point. In 2D,

a hyperplane is a line. In 3D, a hyperplane is a plane. In 4D, the hyperplane has 3 DOF

and the standard form of a hyperplane is given as

ax+ by + cz + dw + e = 0. (2.10)

In n dimensions, for a space X = [x1, x2, · · · , xn], xi ∈ R, let a1, a2, . . . , an be scalars that are not all equal to 0. Then the hyperplane in Rn is given as

a1x1 + a2x2 + . . . + anxn = c, (2.11)

where c is a constant. There are n + 1 parameters, but we can divide through by c (or by any non-zero coefficient) for a minimum of n parameters to describe the hyperplane in nD.


2.7.2 From 2D to 4D

In this section, we describe the geometry of primitives in increasing dimension and discuss the

minimum number of parameters each primitive can be described by in the different dimensions.

The parameters and their primitives are categorized in Table 2.1. In the rest of the section, we

explain why each primitive requires a certain number of parameters to be described. Note that

the minimum number of parameters to fully describe a primitive is different from its DOF. For

example, a point has 0 DOF. In 2D, a point requires a minimum of two parameters, but in 4D, a

point requires four parameters to describe. We also discuss the equations used to describe these

geometric primitives.

Table 2.1: Minimum number of parameters to describe geometric primitives from 2D to 4D

Primitive     2D   3D   4D
Point          2    3    4
Line           2    4    6
Plane          —    3    6
Hyperplane     2    3    4

2.7.2.1 2 Dimensions

A Point in 2D In 2D, the space is defined by x and y. A 2D point is defined with two

equations

x = a, y = b, (2.12)

where a, b ∈ R. Thus a point in 2D requires a minimum of two parameters to be fully described.


A Line in 2D A line in 2D has the standard form of

ax+ by + c = 0, (2.13)

where a, b, c ∈ R are three parameters. We can re-write (2.13) as

(a/c)x + (b/c)y = −1, (2.14)

which has two free parameters if we consider a/c and b/c to be two parameters. Thus a line in

2D requires a minimum of two parameters.

Intersection of 2D Hyperplanes We note that a line in 2D is a hyperplane. Consider the

two lines,

ax+ by + c = 0, (2.15)

and

dx+ ey + f = 0, (2.16)

where a, b, c, d, e, f ∈ R. Thus, we can describe a 2D point by the intersection of 2 lines.

[ a  b ] [ x ]   [ −c ]
[ d  e ] [ y ] = [ −f ] .   (2.17)

This is a 2 × 2 system of equations for the 2D intersection of two 2D hyperplanes. Assuming that the two lines are neither coincident nor parallel, we can solve this system of equations to yield a 2D point. We will refer back to this observation as we journey through the intersection of three 3D hyperplanes and of four 4D hyperplanes.
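As a quick numerical check of (2.17), the sketch below solves the 2 × 2 system with NumPy; the coefficient values are arbitrary and chosen only so that the two lines actually intersect.

```python
import numpy as np

# Two 2D hyperplanes (lines): ax + by + c = 0 and dx + ey + f = 0.
a, b, c = 1.0, -2.0, 3.0
d, e, f = 2.0, 1.0, -4.0

A = np.array([[a, b],
              [d, e]])
rhs = np.array([-c, -f])

# Provided the lines are neither coincident nor parallel, A is invertible
# and the system yields the unique 2D intersection point.
point_2d = np.linalg.solve(A, rhs)
print(point_2d)   # [1. 2.] for these coefficients
```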


2.7.2.2 3 Dimensions

A point in 3D A point in 3D is defined by three equations as

x = a, y = b, z = c, (2.18)

where a, b, c ∈ R. Thus a point in 3D requires a minimum of three parameters to be completely

described.

A Line in 3D As illustrated in Fig. 2.16, a line in 3D can be described by two points p1 and

p2

x = p1 + (p2 − p1)k, (2.19)

where x = [x, y, z] ∈ R3, p1, p2 ∈ R3 and k ∈ R. With p1 and p2, we have six parameters to describe the line; however, these parameters are not independent. Since the line is one-dimensional, we can imagine sliding either p1 or p2 along the line while retaining the line's definition. There are infinitely many pairs of 3D points along the line that can describe the line. Thus, for each point, we can hold one of its three coordinates constant and still describe the same line without any loss of generality. Therefore, a line in 3D can be described by a minimum of four

parameters.

We can also describe a line in 3D as a point p1 and a direction (vector) r. In this case, a similar

argument holds: both the point and direction can be reduced to two parameters each, yielding a

total of four parameters.

Plücker coordinates have also been used to specify a line in 3D. Two points on the line can

specify the direction d of the line. Another vector p describes an arbitrary point on the line relative to the origin. The cross product of these two vectors is independent of the chosen point, and together with d it uniquely defines the line. Plücker coordinates are defined as the line's


direction vector d together with the cross product, given by

(d;p× d), (2.20)

where d is normalised to unit length, p × d is the cross product, often known as the ‘moment’, and p is an arbitrary point on the line. These six numbers are subject to two constraints, since d has unit length and d is orthogonal to the moment. Thus, even with Plücker coordinates, four parameters are required to

describe a line in 3D.
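The Plücker construction can be sketched numerically as follows; the point coordinates are arbitrary, and the final check simply confirms that the moment does not depend on which point of the line is used.

```python
import numpy as np

def plucker_from_points(p1, p2):
    """Plücker coordinates (d; m) of the 3D line through p1 and p2, where d is
    the unit direction and m = p x d is the moment for any point p on the line."""
    d = p2 - p1
    d = d / np.linalg.norm(d)
    m = np.cross(p1, d)
    return d, m

p1 = np.array([1.0, 0.0, 2.0])
p2 = np.array([3.0, 1.0, 2.0])
d, m = plucker_from_points(p1, p2)

# The moment is independent of the chosen point on the line:
p3 = p1 + 2.5 * d                       # another point on the same line
print(np.allclose(np.cross(p3, d), m))  # True
```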


Figure 2.16: Describing a line in 3D with (a) two points, (b) a point and a vector, and (c) the

intersection of two planes.

A Plane in 3D The standard form for a plane in 3D is

ax+ by + cz + d = 0, (2.21)

where a, b, c, d ∈ R. Similar to a line in 2D (2.14), we can describe the plane in 3D with

a minimum of three parameters. The direction of the plane's normal can be described by two parameters, and the plane's offset from the origin along that normal provides a third scalar; therefore, we have three parameters.


Intersection of 3D Hyperplanes We also note that a hyperplane in 3D is a 2D plane. As in the 2D case, consider two hyperplanes in 3D,

ax+ by + cz + d = 0, (2.22)

and

ex+ fy + gz + h = 0, (2.23)

where a, b, c, d, e, f, g, h ∈ R. We can then describe the intersection of these two hyper-

planes in 3D as a 2 × 3 system of equations

[ a  b  c ] [ x ]   [ −d ]
[ e  f  g ] [ y ] = [ −h ] .   (2.24)
            [ z ]

We can row-reduce this system of equations to

[ 1  0  c/a − (b/a)(ga − ce)/(fa − be) ] [ x ]   [ −d/a + (b/a)(ha − de)/(fa − be) ]
[ 0  1  (ga − ce)/(fa − be)            ] [ y ] = [ −(ha − de)/(fa − be)            ] .   (2.25)
                                         [ z ]

If we let

α = c/a − (b/a) (ga − ce)/(fa − be),   (2.26)
β = −d/a + (b/a) (ha − de)/(fa − be),  (2.27)
γ = (ga − ce)/(fa − be),               (2.28)
η = −(ha − de)/(fa − be),              (2.29)

then we can rewrite (2.25) as

[ 1  0  α ] [ x ]   [ β ]
[ 0  1  γ ] [ y ] = [ η ] .   (2.30)
            [ z ]

Clearly, the intersection of two non-coincident, non-parallel planes in 3D describes a line in 3D,

which depends on a minimum of four parameters. Fig. 2.16c shows the intersection of two such

planes in 3D, Π1 and Π2 ∈ R3, forming a line in 3D.

Another way to consider the minimum number of parameters to describe a line in 3D is that

from (2.21), each plane can be described by three parameters. If we consider the intersection

of Π1 and Π2, we have two equations (six parameters total); however, each equation constrains

the system by one. Thus, six minus two yields four parameters that are required to describe a

line in 3D.

Additionally, similar to the 2D case in Section 2.7.2.1, we can also describe the intersection of

three 3D hyperplanes as a 3× 3 system of equations, which intersect at a 3D point.

[ a1  b1  c1 ] [ x ]   [ −d1 ]
[ a2  b2  c2 ] [ y ] = [ −d2 ] ,   (2.31)
[ a3  b3  c3 ] [ z ]   [ −d3 ]

where the three hyperplanes are distinguished by their subscripts. In other words, three planes with linearly independent normals intersect at a single point in 3D. In Section 2.7.2.3, we will show that the intersection of two

hyperplanes in 4D describes a plane in 4D, and that the intersection of four hyperplanes in 4D

forms a 4D point.
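A short numerical sketch of these two cases follows; the plane coefficients are arbitrary. The direction of the intersection line is obtained as the cross product of the two plane normals, and the three-plane case is solved directly as a 3 × 3 system.

```python
import numpy as np

# Two 3D hyperplanes (planes) a x + b y + c z + d = 0; rows hold [a, b, c].
N2 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, -1.0]])
d2 = np.array([-2.0, 1.0])            # the constants d and h

# A particular solution of N2 @ x = -d2, plus the 1-DOF null space, gives the line.
x_part, *_ = np.linalg.lstsq(N2, -d2, rcond=None)
direction = np.cross(N2[0], N2[1])    # line direction = cross product of the normals
print(x_part, direction)

# Three planes with linearly independent normals meet at a single 3D point,
# as in (2.31).
N3 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, -1.0],
               [1.0, 1.0, 1.0]])
d3 = np.array([-2.0, 1.0, -3.0])
point_3d = np.linalg.solve(N3, -d3)
print(point_3d)                       # [0. 1. 2.] for these coefficients
```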


2.7.2.3 4 Dimensions

The journey to the fourth dimension can be intimidating and shrouded in mystery due to the

limitations of our perception and understanding. If we can only perceive the world in 3 spatial

dimensions, how can we understand a fourth spatial dimension? In physics and mathematics,

4D geometry is often discussed in terms of a fourth spatial dimension. Consider that the axes x, y and z form a basis for R3; the fourth spatial dimension, along w, is orthogonal to the other three axes. Such discussions lead to speculation about the limits of human perception and about the 4D equivalent of a cube, known as a tesseract [Hinton, 1884]. Fortunately, in this thesis,

we are not concerned with a 4th spatial dimension, but rather 4 dimensions with respect to the

sampling of light, as per the plenoptic function in Section 2.5.1. Our 4 dimensions in the LF

s, t, u and v differ slightly from the four spatial dimensions x, y, z and w, in that s, t, u and v

are constrained via the 2PP. However, much of the geometry from 4 spatial dimensions carries

over to dealing with the 4D LF. In this section, we illustrate the geometric primitives in 4D with

respect to the 2PP for light fields. We illustrate these primitives as projections on a grid of 2D

images, similar to how they would appear in the LF.

A Point in 4D A point in 4D can be described by four equations as

x = a, y = b, z = c, w = d, (2.32)

where a, b, c, d ∈ R. A point in 4D requires a minimum of four parameters to be completely

described. Examples of two 4D points are shown in Fig. 2.17. In the 2PP, a point in 4D

describes a ray in 3D. However, not all rays can be represented by the 2PP, because the 2PP

cannot describe rays that are parallel to the two planes.

A Line in 4D Similar to the 3D case, a line in 4D can be written as a function of two 4D

points, which require eight parameters in total. There are an infinite number of pairs of 4D


Figure 2.17: The projection of two different points in 4D using the 2PP, shown in red. The 2PP

is illustrated as a grid of squares. Each square is considered to be a view. Each view has its own

set of coordinates that describe a location within the view. Both (a) and (b) show 4D points,

defined for specific values of s, t, u and v. Note that s and t values correspond to which view,

while u and v correspond to a specific view’s coordinates (similar to image coordinates). In our

case, a single 4D point must be defined by its view and its coordinates within the view; hence,

a minimum of 4 parameters to describe a 4D point.

points that can describe the 4D line. Each of these points can be “fixed” in the same manner as

a line in 3D (Section 2.7.2.2), reducing the minimum number of parameters to six to describe a

line in 4D. Several examples of 4D lines are shown in Fig. 2.18.


Figure 2.18: The projection of four different lines in 4D using the 2PP. A 4D line still has one

DOF. (a) t, u and v are held constant, while s is allowed to vary. (b) s, u, and v are held constant

while t is allowed to vary. (c) s, t and u are held constant, while v is allowed to vary. (d) s and

t are held constant, while u and v vary linearly.

A Hyperplane in 4D A hyperplane in 4D is given as

ax+ by + cz + dw + e = 0, (2.33)


where a, b, c, d, e ∈ R. In the 2PP, we can write as + bt + cu + dv + e = 0. Similar to

the 3D case, this equation can be divided by e, yielding a minimum of four parameters to

describe the hyperplane in 4D. Alternatively, the hyperplane in 4D can be described by its

normal n = [a, b, c, d] and a distance to the origin. Four examples of 4D hyperplanes are shown

in Fig. 2.19.


Figure 2.19: The projection of four different hyperplanes in 4D using the 2PP. A hyperplane

is only constrained along 1 dimension. (a) The hyperplane is constrained along u, such that

cu+e = 0. (b) The hyperplane is constrained along v, such that dv+e = 0. (c) The hyperplane

is constrained along u and v through a linear relation, such that cu + dv + e = 0. (d) The

hyperplane is constrained along s, such that as+ e = 0.

A Plane in 4D & Intersection of 4D Hyperplanes Just as a line in 3D can be represented by the intersection of two 3D hyperplanes, a plane in 4D can be represented by the intersection of two 4D hyperplanes with linearly independent normals. With independent normals, the hyperplanes are not parallel and neither hyperplane is entirely contained within the other. Mathematically, let us assume we have two 4D

hyperplanes, given as

ax+ by + cz + dw + e = 0, (2.34)

and

fx+ gy + hz + iw + j = 0, (2.35)


where a, b, c, d, e, f, g, h, i, j ∈ R. From (2.34), assuming d ≠ 0, we can isolate w as

w = −(1/d)(e + ax + by + cz),   (2.36)

and substitute this expression into (2.35) as

fx + gy + hz + i(−(1/d)(e + ax + by + cz)) + j = 0,   (2.37)

which can be simplified to

(f − ia/d)x + (g − ib/d)y + (h − ic/d)z + (j − ie/d) = 0.   (2.38)

This equation matches the standard form of a plane in 3D, given in (2.21). From this, it is clear

that the intersection of two hyperplanes in 4D forms a plane in 4D. Each hyperplane can be

described using four parameters, for a total of eight. In Section 5.2, we show equivalently that

the intersection of two 4D hyperplanes can be described with two equations in (5.5).

We note that (2.38) appears to imply that we can describe a plane in 4D with just four numbers; however, each coefficient in (2.38) is itself an expression in the original hyperplane parameters, and the representation carries two constraints: d ≠ 0, and at least two of the coefficients in front of x, y and z must be non-zero. We can further illustrate this relation by drawing two hyperplanes in the 2PP, as in Fig. 2.20. Two different hyperplanes are pictured in green and purple. Their intersection, highlighted in red, represents the plane in

the 4D LF.

Additionally, we can show that four hyperplanes in 4D intersect at a point. In 2D, the intersection of two 2D hyperplanes resulted in a 2 × 2 system of equations, which could be solved for a 2D point. In 3D, the intersection of three 3D hyperplanes resulted in a 3 × 3

system of equations, which could be solved for a 3D point. In 4D, the intersection of four 4D


Figure 2.20: The projection of two different planes in 4D using the 2PP. In both (a) and (b), the

two hyperplanes are shown in green and purple. Their intersection in red represents the plane

in the 4D LF.

hyperplanes results in a 4 × 4 system of equations, which can be solved for a 4D point,

[ a1  b1  c1  d1 ] [ x ]   [ −e1 ]
[ a2  b2  c2  d2 ] [ y ] = [ −e2 ]
[ a3  b3  c3  d3 ] [ z ]   [ −e3 ]
[ a4  b4  c4  d4 ] [ w ]   [ −e4 ] ,   (2.39)

where the four hyperplanes are represented by their subscripts.
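The sketch below illustrates both statements numerically; the hyperplane coefficients are arbitrary. Two independent hyperplane constraints leave 4 − 2 = 2 degrees of freedom (a plane in 4D), while four independent constraints pin down a single 4D point.

```python
import numpy as np

# Two 4D hyperplanes a s + b t + c u + d v + e = 0; rows hold [a, b, c, d].
H2 = np.array([[1.0, 0.0, 2.0, 0.0],
               [0.0, 1.0, 0.0, 2.0]])
# Degrees of freedom of the common solution set: 4 - rank = 2, i.e. a plane in 4D.
print(4 - np.linalg.matrix_rank(H2))   # 2

# Four hyperplanes with linearly independent normals intersect at a 4D point,
# as in (2.39); the right-hand side collects the constants -e_i.
H4 = np.array([[1.0, 0.0, 2.0, 0.0],
               [0.0, 1.0, 0.0, 2.0],
               [1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, -1.0]])
rhs = np.array([2.0, 2.0, 1.0, 1.0])
point_4d = np.linalg.solve(H4, rhs)
print(point_4d)                        # a single (s, t, u, v) point
```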

2.7.3 Point-Plane Correspondence

A particularly relevant question for robots striving to interact in a 3D world is how do observa-

tions in the LF translate to the 3D world? In this section, we will further discuss the intersections

of hyperplanes in 4D to show that a point in 3D manifests itself as a plane in the 4D LF. This

manifestation was coined the point-plane correspondence in [Dansereau and Bruton, 2007],

although a similar relationship was determined for translating monocular cameras in [Bolles

et al., 1987].

Recall the relative two-plane parameterisation (2PP) [Levoy and Hanrahan, 1996]. A ray with

coordinates φ = [s, t, u, v] is described by its two points of intersection with two parallel reference planes. The s, t plane is conventionally closest to the camera, and the u, v plane is conventionally closer to the scene, separated by an arbitrary distance D. The rays emanating from a Lambertian

point in 3D space, P = [Px, Py, Pz]T can be illustrated in the xz-plane, shown in Fig. 2.21. The

same ray can be shown in the su-plane in Fig. 2.22.

For the xz-plane, if we define θ as the angle between the intersecting ray and the z-axis direc-

tion, then by similar triangles, we have

tan θ = (Px − u − s)/(Pz − D) = (Px − s)/Pz .   (2.40)

Then

u = Px − ((Px − s)/Pz)(Pz − D) − s
  = (D/Pz)(Px − s).   (2.41)

We can also plot (2.41) to yield projections in the 2PP similar to Fig. 2.19a. Plotting (2.42)

yields projections similar to Fig. 2.19b.

We can follow a similar procedure for the yz-plane, resulting in

v = (D/Pz)(Py − t).   (2.42)

We can combine (2.41) and (2.42) into a single equation as

[ u ]          [ Px − s ]
[ v ] = (D/Pz) [ Py − t ] .   (2.43)

We can recognize (2.43) as two hyperplanes in 4D whose intersection describes a plane in 4D, which in turn corresponds to a point in 3D. Therefore, the light rays from a Lambertian point in 3D manifest as a plane in the 4D LF.
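A minimal numerical sketch of (2.43) is given below; the point P, the plane separation D and the viewpoint grid are illustrative values only. Sampling the same Lambertian point from a grid of viewpoints and stacking the resulting (s, t, u, v) coordinates confirms that the measurements span only two degrees of freedom, i.e. a plane in the 4D LF.

```python
import numpy as np

def project_point(P, s, t, D):
    """Image-plane coordinates (u, v) of the ray from a Lambertian point
    P = (Px, Py, Pz) passing through viewpoint (s, t), following (2.43)."""
    Px, Py, Pz = P
    return (D / Pz) * (Px - s), (D / Pz) * (Py - t)

P = np.array([0.1, -0.05, 2.0])   # illustrative 3D point (metres)
D = 0.5                           # illustrative plane separation

# Sample the point from a grid of viewpoints and stack the 4D measurements.
samples = []
for s in np.linspace(-0.1, 0.1, 5):
    for t in np.linspace(-0.1, 0.1, 5):
        u, v = project_point(P, s, t, D)
        samples.append([s, t, u, v])
samples = np.array(samples)

# After removing the mean, the stacked (s, t, u, v) coordinates have rank 2:
# the rays from a Lambertian point span a plane in the 4D LF.
print(np.linalg.matrix_rank(samples - samples.mean(axis=0)))   # 2
```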


Figure 2.21: Light-field geometry for a point in space for a single view (black), and other views

(grey), whereby u is defined relative to s and varies linearly with s for all rays originating from

P (Px, Pz).

We can re-write (2.43) into the form,

[ D/Pz    0    1   0 ] [ s ]   [ D Px / Pz ]
[   0   D/Pz   0   1 ] [ t ] = [ D Py / Pz ] .   (2.44)
                       [ u ]
                       [ v ]

From (2.44), we note that the hyperplane normals only depend on Pz, and not Px or Py. The

normals are similar for both hyperplanes for a Lambertian point in that their elements have the

same values but in different columns (such that the two normals remain linearly independent), acting in s, u and t, v, respectively. Equation (2.43), and thus (2.44), maps out the ray space (all rays) emanating from point P.

2.7.4 Light-Field Slope

In 2D, a line’s direction and steepness, i.e. its rate of change of one coordinate with respect

to the other coordinate, is referred to as the slope. In the 4D LF, if we consider two different

measurements from a Lambertian point P (Px, Py, Pz) as (s1, t1, u1, v1) and (s2, t2, u2, v2),


the difference between these two measurements for the xz-plane can be written as

u2 − u1 = (D/Pz)(Px − s2) − (D/Pz)(Px − s1),   (2.45)

which simplifies to

u2 − u1 = −(D/Pz)(s2 − s1).   (2.46)

We then refer to the rate of change of u with respect to s as the slope w, which is often visualized in a 2D EPI slice of the LF, as in Fig. 2.15, and is given as

w = (u2 − u1)/(s2 − s1) = −D/Pz .   (2.47)

We note that a similar procedure follows for the yz-plane and yields an identical expression,

w = (v2 − v1)/(t2 − t1) = −D/Pz .   (2.48)

The slope w relates the image plane coordinates for all rays emanating from a particular 3D point

in the scene. Fig. 2.21 shows the geometry of the LF for a single view of P . As the viewpoint

changes, that is, s and t change, the image plane coordinates vary linearly according to (2.43).

In Fig. 2.22, we show how u varies as a function of s, noting that v varies as a similar function

of t. The slope of this line, w, comes directly from (2.43) and is given by

w = −D/Pz .   (2.49)

By working with slope, akin to disparity from stereo algorithms, we deal more closely with the

structure of the light field.
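The slope-depth relationship can be sketched numerically as follows; D, Pz and Px are illustrative values. The slope computed from two viewpoint samples matches −D/Pz, and the depth is recovered from the slope as Pz = −D/w.

```python
import numpy as np

D = 0.5     # separation of the two reference planes (illustrative)
Pz = 2.0    # depth of a Lambertian point (illustrative)
Px = 0.1    # lateral position of the point (illustrative)

def u_of_s(s):
    # Projection in the s,u plane for a Lambertian point, from (2.41).
    return (D / Pz) * (Px - s)

# Two samples of the same point from two different viewpoints.
s1, s2 = -0.05, 0.05
w = (u_of_s(s2) - u_of_s(s1)) / (s2 - s1)   # slope, as in (2.47)

print(w, -D / Pz)    # both evaluate to -0.25
print(-D / w)        # depth recovered from the slope: 2.0
```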

In this section, we explored geometric primitives such as points, lines, planes and hyperplanes

from 2D to 4D. We explained the number of parameters that are required to describe the prim-

itives. By describing the underlying mathematics behind these geometric primitives, we gain


Figure 2.22: For the situation illustrated in Fig. 2.21, the corresponding line in the s, u plane

has a slope w.

insight into how the light rays emanating from a Lambertian point are represented in the 2PP of the LF. We showed that a light ray emanating from a Lambertian point can be described as a 4D point in the LF. A Lambertian point induces a plane in the 4D LF, and a plane in the 4D LF can be described by the intersection of two 4D hyperplanes. In future chapters, we will use these relations to propose light-field features for visual servoing, to detect refracted image features using an LF camera, and to servo towards refractive objects.


Chapter 3

Literature Review

In this chapter, we provide a review of the literature relevant to this thesis. First, we introduce

image features, from 2D to 4D. Then we review visual servoing in the context of LF cameras

and refractive objects. Third, we investigate the state of the art for how refractive objects are

handled in robotics. Finally, we summarize the review by identifying the research gaps that this

thesis seeks to address.

3.1 Image Features

Features are distinct aspects of the scene that can be reliably and repeatedly identified from dif-

ferent viewpoints and/or across different viewing conditions. Image features are those features

recorded in the image as a set of pixels by the camera that can then be automatically detected

and extracted as a vector of numbers, which is referred to as an image feature vector. Image fea-

ture vectors abstract raw, dense image information into a simpler, smaller and more relevant representation of the data. Much of the literature does not make a significant distinc-

tion between these three concepts. Good image features to track are those that can repeatedly

be detected and matched across multiple images [Shi and Tomasi, 1993]. There are typically


two aspects to finding an image feature vector: an image feature detector and an image feature descriptor. For brevity, we refer to these as a detector and a descriptor, respectively. The detector is a method of determining whether there is a suitable image feature at a given image location. The detected feature is usually represented by a pair of image coordinates, a set of curves, a connected region or an area [Corke, 2013]. The descriptor is a method of describing the image feature's neighbourhood, and typically takes the form of a vector that is used for correspondence. In

this section, we review geometric image features, as well as photometric image features in the

context of refractive objects and light fields from 2D to 4D. We then briefly discuss image fea-

ture correspondence and why refractive objects are particularly challenging for image feature

correspondence.

3.1.1 2D Geometric Image Features

Traditionally, the most common image features are geometric image features that represent a 2D

or 3D geometric shape in a 2D image. Most robotic vision methods use 2D geometric image

features, such as regions and lines [Andreff et al., 2002], line segments [Bista et al., 2016],

moments (such as the image area, the coordinates of the centre of mass and the orientation of an

image feature) [Mahony et al., 2002,Tahri and Chaumette, 2003,Chaumette, 2004], and interest

points (sometimes referred to as keypoints) [Chaumette and Hutchinson, 2006,McFadyen et al.,

2017]. For image points, Cartesian coordinates are normally used, though polar and cylindrical

coordinates have also been developed [Iwatsuki and Okiyama, 2005]. Interest points are better

suited to handle large changes in appearance, which may be caused by refractive objects. One

of the earliest and most popular interest point detectors is the Harris corner detector [Harris and

Stephens, 1988]; however, Harris corners do not distinguish interest points of different scale—

they operate at a single scale, determined by the internal parameters of the detector. In the

context of wide baseline matching and object recognition, there is an interest in features that

can cope with scale and viewpoint changes. Harris corners are computationally-cheap, but do


not provide accurate feature matches across different scales and viewpoints [Tuytelaars et al.,

2008, Le et al., 2011].

To achieve scale invariance, a straightforward approach is to extract points over a range of scales

and use all of these points together to represent the image, giving rise to multi-scaled features.

Of particular note, Lowe developed the scale invariant feature transform (SIFT) feature detector

based on finding the extrema of a multi-scaled pyramid of the Difference of Gaussian (DoG)

responses [Lowe, 2004]. Bay et al. further reduced the computational cost of SIFT features by

considering the Hessian of Gaussians and other numerically-efficient approximations to create

speeded-up robust feature (SURF) feature detectors [Bay et al., 2008].

SIFT and SURF features also include descriptors that are based on using histograms. These

histograms describe the distribution of gradients and orientations of the feature’s support re-

gion in the image for illumination and rotational invariance. Dalal et al. developed the more

advanced histogram of gradients (HoG) feature descriptor, which uses normalized weights

based on nearby image gradients for each sub-region, making HoG descriptors less sensitive

to changes in contrast than SIFT and SURF, and better at matching in cluttered scenes [Dalal

and Triggs, 2005]. While Lowe’s SIFT descriptor was limited to a single scale, Dong et al.

recently improved the SIFT descriptors by pooling (combining) the gradient histograms over

all the sampled scales, calling the new descriptor domain-size pooled SIFT (DSP-SIFT) [Dong

and Soatto, 2015], which represents the state of the art in terms of point feature descriptors for

SfM tasks [Schoenberger et al., 2017].

Features from accelerated segment test (FAST) were developed by Rosten et al. [Rosten et al., 2009] as an exceptionally cheap binary feature detector that exploits the relative relationship of nearby pixel values directly. Binary robust independent elementary features (BRIEF) descriptors select random pixel pairs within the neighbourhood of the feature and make binary comparisons in sequence to form a binary descriptor, which is computationally cheap and reliable except under in-plane rotation [Calonder et al., 2010]. Oriented FAST and rotated BRIEF


(ORB) features were developed as computationally cheaper alternatives to SIFT and SURF fea-

ture detectors for real-time robotics. The ORB features build on the FAST detector by using

Harris corner strength for ranking, and SIFT’s multi-scale pyramids for scale invariance [Rublee

et al., 2011]. ORB descriptors use the BRIEF descriptor and augment it with an intensity-weighted centroid. This assumes a small offset between the corner's intensity centroid and its geometric centre, which defines a measure of orientation that provides rotational invariance. Over-

all, the result is that ORB is a much more computationally efficient detector and descriptor

with comparable performance to SURF, which has proven to be very successful in the robotics

literature.
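As an illustration of how such features are used in practice, the following minimal sketch detects, describes and matches ORB features between two views using OpenCV. The image file names are placeholders, and pairing the binary descriptors with a Hamming-distance brute-force matcher is one common choice, not a prescribed one.

```python
import cv2

# Load two views of the same scene (file names are placeholders).
img1 = cv2.imread("view_a.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_b.png", cv2.IMREAD_GRAYSCALE)

# ORB: a FAST-based detector with Harris ranking and a rotation-aware,
# BRIEF-style binary descriptor.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared with the Hamming distance.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(len(matches), "putative matches")
```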

Recent work in machine learning and convolutional neural networks (CNNs) has also given

rise to learned 2D features. Verdie et al. [Verdie et al., 2015] developed a temporally invari-

ant learned detector (TILDE) that detects keypoints for outdoor scenes, despite lighting and

seasonal changes. They demonstrated better repeatability over 3 different datasets than hand-

crafted feature detectors, such as SIFT. Unfortunately, Verdie's approach was only evaluated at a single scale and without any viewpoint changes.

Yi et al. trained a deep network to learn the thresholds of the entire SIFT feature detection

and description pipeline in a unified manner [Yi et al., 2016]. They called this method the

Learned Invariant Feature Transform (LIFT). LIFT out-performed all other hand-crafted fea-

tures in terms of repeatability and the nearest neighbour mean average precision, a metric that

captures how discriminating the descriptor is by evaluating it at multiple descriptor distance

thresholds.

An extensive experimental evaluation of hand-crafted and learned local feature descriptors showed that learned descriptors often surpassed basic SIFT and other hand-crafted descriptors on all evaluation metrics in SfM tasks [Schoenberger et al., 2017]. However, more advanced hand-crafted descriptors such as DSP-SIFT performed on par with, or better than, the state-of-the-art learned feature descriptors, including LIFT, for SfM tasks; the learned descriptors also showed a high variance across different datasets and applications, unlike hand-crafted features.

Many robotic vision systems have used hand-crafted and learned features [Kragic and Chris-

tensen, 2002,Bourquardez et al., 2009,Low et al., 2007,Tsai et al., 2017,Lee et al., 2017,Pages

et al., 2006]. The majority of them assume a similar appearance for the support region during

correspondence. For non-Lambertian scenes where the support regions change significantly in

appearance with respect to viewing pose, incorrect matches can occur. Moreover, refractive ob-

jects can cause features to distort, rotate, scale and flip. Feature descriptors that only account for

scale and rotation will not reliably match refracted content because the additional distortion and

flips caused by refractive objects change the very neighbourhood that the descriptors attempt to

describe. Thus, these 2D image features may not perform well for scenes containing refractive

objects.

3.1.2 3D Geometric Image Features

The fundamental limitation with using 2D features to describe the 3D world is that significant

information is lost during the image formation process of conventional cameras. The perspec-

tive transformation is an irreversible process that projects the 3D world into a 2D image. Full

3D information can greatly improve robot vision algorithms to more reliably handle changes

due to viewing position and lighting conditions. We refer to incorporating 3D information into

image features as 3D geometric image features.

Measurements of 3D data can come from a variety of sensors, including stereo, RGB-D or

LIDAR. Sensor measurements are then turned into one of many different 3D feature represen-

tations. Most conventional 3D feature descriptors are based on histograms, similar to SIFT’s

2D gradient-based histogram descriptors. The most common is Johnson’s spin image [Johnson

and Hebert, 1999]. For a given point, a cylindrical support volume is divided into volumetric

ring slices. The number of points in each slice is counted and summed about the longitudinal


axis of the volume. This makes the spin image rotationally invariant about this axis. Finally,

the spin image is binned into a 2D histogram.

Tombari et al. built on spin images by using a spherical support volume and examining the

surface normals of all the points within the support, referred to as the Signature of Histograms

of OrienTations (SHOT) feature descriptor [Tombari et al., 2010]. All of these approaches use a similar strategy: geometric measurements are taken over a support volume and binned into a histogram, and the shape of the histogram is used to compare the similarity of given points. Salti et al.

extended the SHOT descriptors to include both surface geometry as well as colour texture [Salti

et al., 2014]. They demonstrated improved repeatability by including texture; however, their

method remains untested for refractive objects and we anticipate reduced performance since

colour texture may change with viewpoint for refractive objects.

Quadros recently developed 3D features from LIDAR, defined by ray-tracing a set of 3D line

segments in space [Quadros, 2014]. If these lines reach behind a surface or encounter a large

gap in the data, unobserved space is registered by the method. Unobserved space is assumed to correspond to occlusions in the 3D point cloud. The authors report that accounting for occlusions facilitates

more robust object recognition, although their method does not consider refractive objects.

Recently, Gupta et al. learned 3D features from RGB-D images for object detection and seg-

mentation [Gupta et al., 2014] and Gao et al. for SLAM [Gao and Zhang, 2015]. However, none

of these methods have been implemented for visual servoing and these methods rely on 3D data

from RGB-D and LIDAR sensors, which return erroneous measurements for refractive objects

and other view-dependent effects.

3.1.3 4D Geometric Image Features

All of the previous features have been developed for 2D images or 3D representations. LFs

are parameterised in 4D, which requires a re-evaluation of feature detectors and descriptors.


Most previous work using LFs has only used 2D image features [Johannsen et al., 2015, Smith et al., 2009], or has simply used the LF camera as an alternative 3D depth sensor in Structure from

Motion and SLAM-based applications [Dong et al., 2013, Marto et al., 2017]. These works do

not take advantage of all the information contained within the full 4D LF, which can capture

not only shape and texture, but also elements of occlusion, specular reflection and in particular,

refraction.

Ghasemi et al. proposed a global feature using a modified Hough transform to detect changes in the slopes of lines within an EPI [Ghasemi and Vetterli, 2014]. However, their method is

a global feature used to describe the entire scene, which is inappropriate for most SfM and

IBVS methods that require local features. More recently, Tosic et al. focused on developing

a SIFT-like feature detector for LFs by incorporating both scale-invariance and depth into a

combined feature space, called LISAD space [Tosic and Berkner, 2014]. Extrema of the first

derivative of the LISAD space was taken as 3D feature points, yielding a feature described

by image position (u, v), scale and slope (equivalently depth). However, we note that Tosic’s

work assumes no occlusions or specular reflections and does not discuss feature description to

facilitate correspondence over multiple light fields. Furthermore, Tosic’s choice of using an

edge-detector in the epipolar plane images (EPIs) amounts to a 3D edge detector in Cartesian

space, which is a poor choice when unique points are required by SfM and IBVS. Edge points

are not unique and are easily confused with their neighbours. Additionally, we anticipate these

LF features may not perform well for refractive objects, because the depth analysis assumes

Lambertian scenes.

Also pursuing more reliable LF features, Teixeira et al. found SIFT features in all sub-views of

the LF and projected them into their corresponding EPIs [Teixeira et al., 2017]. These projec-

tions were filtered and grouped into straight lines in their respective EPIs, and then counted.

Features with higher counts were observed in more views and thus considered more reliable.

In other words, Teixeira imposed 2D epipolar constraints on 2D SIFT features, which does not

take full advantage of the geometry of the 4D LF.


Similarly, Johannsen et al. considered 3D line features based on Plücker coordinates and im-

posed 4D light-field constraints in relation to LF-based SfM [Johannsen et al., 2015]. Zhang et

al. considered the geometry of 3D points and lines transforming under light field pose changes

[Zhang et al., 2017]. They derived line and plane-based correspondence methods between sub-

views of the LF and imposed these correspondences in LF-based SfM. Doing so resulted in im-

proved accuracy and reliability over conventional SfM, especially in challenging scenes where

image feature points were sparse, but lines and planes were still visible. These previous LF-

based works largely focused on matching between large differences in viewpoint. However,

incremental pose changes, such as those found in visual servoing and video applications, also

warrant consideration. How the LF changes with respect to these small pose changes is similar

in concept to the image Jacobian for IBVS, which has not yet been well-explored.

In considering LF cameras with respect to refractive objects, Maeno et al. proposed to model

an object’s refraction pattern as image distortion and developed the light-field distortion (LFD)

feature based on the differences in corresponding points in the 4D LF [Maeno et al., 2013]. The

authors used the LFD for transparent object recognition. However, their method did not impose

any LF geometry constraints, leading to poor performance with respect to changes in camera

position. Xu et al. built on Maeno’s LFD to develop a method for refractive object image

segmentation [Xu et al., 2015]. Each pixel was matched between each sub-view of the light

field and then fitted to a single normal of a 4D hyperplane that is characteristic for a Lambertian

point. A threshold was applied to this error to distinguish a refracted pixel. However, we will

show in Chapter 5 that a 3D point is not described by a single hyperplane in 4D. Rather a 3D

point manifests as a plane in 4D, which can be described as the intersection of two hyperplanes.

Both of these must be considered when considering a feature’s potentially refractive nature

when it passes through a refractive object.


3.1.4 Direct Methods

In contrast to geometric image features that represent a geometric primitive from the image(s),

direct methods establish some geometrical relationship between two images using pixel intensi-

ties directly. For this reason, they are also known as featureless, intensity-based, or photometric

methods [Collewet and Marchand, 2011]. These methods avoid image feature detection, extrac-

tion and correspondence entirely by directly using the image intensities by way of minimising

the error between the current and desired image to servo towards the goal pose. A common

measure of photometric image error is the sum of squared differences. Although this operation

involves many calculations over the image as a whole, it involves very few calculations per

pixel, each of which is relatively simple and easily computed in parallel. This allows many direct methods to potentially run faster than feature-based VS methods.
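A minimal sketch of the photometric cost that such direct methods minimise is given below; the image arrays are placeholders standing in for the current camera image and the image at the goal pose.

```python
import numpy as np

def photometric_error(I_current, I_goal):
    """Sum of squared pixel differences between two equally sized images,
    the cost that photometric (direct) visual servoing drives towards zero."""
    diff = I_current.astype(np.float64) - I_goal.astype(np.float64)
    return np.sum(diff ** 2)

# Placeholder images; in practice these come from the camera and the goal view.
I_goal = np.random.rand(240, 320)
I_current = I_goal + 0.01 * np.random.randn(240, 320)
print(photometric_error(I_current, I_goal))
```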

Despite these benefits, VS methods using photometric image features typically suffer from

small convergence domains compared to geometric feature-based methods [Collewet and Marc-

hand, 2011]. Recently, to improve the convergence domain, Bateux et al. projected the current

image to a several poses, which were tracked by a particle filter. The error between the pro-

jected and current images drove the robot towards the convergence area, whereupon the method

switched to conventional photometric VS [Bateux and Marchand, 2015]. Although the error

was minimised between the current and next images, the poses projected by the particle fil-

ter were random, resulting in a path towards the goal pose that was not necessarily smooth or

optimal with respect to the amount of physical motion required to reach the goal.

Furthermore, photometric image features typically assume that the scene’s appearance does not

change significantly with respect to viewpoint. Thus, they do not perform well for changes in

pose which result in large changes in scene appearance [Irani and Anandan, 1999]. Refractive

objects tend to have large changes in appearance with respect to viewing pose and therefore

photometric VS methods are ill-suited for scenes with refractive objects.


Collewet et al. recently extended photometric IBVS to scenes with specular reflections [Collewet

and Marchand, 2009]. This was accomplished by considering the Phong light reflection model,

which provides image intensity as a function of a diffuse, specular and ambient component,

given a point light source [Phong, 1975]. Collewet’s approach compared the derivative of

the light reflection model to the image Jacobian from photometric VS [Collewet and Marc-

hand, 2011] to arrive at an analytical description of the image Jacobian relating pixel values to

the light reflection from the Phong model. However, their approach requires a light reflection

model, which in other words requires complete knowledge of all the lighting sources and their

relative geometry. A similar strategy would likely only be viable for refractive objects if a 3D

geometric model of the object was available.

3.1.5 Image Feature Correspondence

The classical approach to many robotic vision algorithms involves detecting, extracting and then matching image features to compare the current and goal image feature sets. Often, the success of the algorithm depends significantly on accurate feature correspondence. Correspondence is a data association problem: finding the same set of image features in a pair of images. Image feature correspondence is typically divided into two categories: large-baseline and small-

baseline matching. Large-baseline matching aims to correspond features between two images

that were taken from relatively different viewpoints, large baselines, or different viewing condi-

tions. Small-baseline matching aims to correspond features between two images that were taken

from relatively similar viewpoints, or narrow baselines. While both approaches aim to match

image features between two images, the general assumptions and approaches differ. However,

large-baseline matching can also apply to small-baseline situations, and in the context of VS,

the image feature error that VS seeks to minimise, relies on corresponding image features be-

tween the current and goal images. The goal image may have been captured from a relatively

different viewpoint. Thus, in this thesis we focus on large-baseline image feature matching.


For matching, the nearest neighbour distance between two feature descriptor vectors is commonly used to propose putative matches; however, exhaustive search methods are inefficient for large feature databases. Advanced features like SIFT use search data structures, such as k-d trees, to more

efficiently find matches [Lowe, 2004]. Muja et al. proposed multiple randomized k-d trees to

approximately find the nearest neighbour with much faster speeds than linear search, with only

a minor loss in accuracy [Muja and Lowe, 2009]. However, the actual comparison of tradi-

tional image feature correspondence methods is often based on some abstraction of the image

feature’s appearance. This inherently assumes that the appearance of the image feature does not

change significantly between views. Refracted objects can significantly change the appearance

of a feature, which makes matching based on appearance particularly challenging.

To reduce the possibility of mismatches and remove outliers, putative matches are refined ac-

cording to some consistency measure (or model). For example, in two-view geometry, the epipolar error derived from the fundamental matrix is used. The standard approach is random sample consensus (RANSAC) [Bolles and Fischler, 1981], where candidate points are randomly chosen to form a hypothesis, which is tested for consistency against the remaining data. The hypothesis process is iteratively repeated until a thresholded number of inliers is reached. Building on RANSAC, Torr et al. proposed maximum likelihood estimation sample consensus (MLESAC), which maximises the likelihood that the data was generated from the hypothesis [Torr and Zisserman, 2000].
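The sketch below illustrates this refinement step using OpenCV's RANSAC-based fundamental matrix estimation; the simulated two-view geometry and the injected mismatches are stand-ins for real putative matches from a descriptor matcher.

```python
import numpy as np
import cv2

# Simulate a simple two-view geometry to stand in for real putative matches.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(200, 3))   # 3D scene points

def project(X, R, t):
    """Pinhole projection of 3D points X into a camera with pose (R, t)."""
    Xc = X @ R.T + t
    x = (K @ Xc.T).T
    return (x[:, :2] / x[:, 2:3]).astype(np.float32)

R2, _ = cv2.Rodrigues(np.array([0.0, 0.1, 0.0]))         # small rotation about y
pts1 = project(X, np.eye(3), np.zeros(3))
pts2 = project(X, R2, np.array([0.2, 0.0, 0.0]))
pts2[:20] += rng.uniform(-30, 30, size=(20, 2)).astype(np.float32)  # inject mismatches

# RANSAC keeps correspondences consistent with a single fundamental matrix
# (epipolar error under 1 px) and rejects the injected outliers.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
print(int(inlier_mask.sum()), "inliers of", len(pts1))
```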

Most outlier rejection methods, such as RANSAC and MLESAC, are based on two assump-

tions: first, that there are sufficient data to describe the model and second, that the data are

mostly inliers—there are few outliers. Most robotic vision algorithms do not account for re-

fraction and thus rely on these outlier rejection methods to remove these inconsistent features

(such as refracted features) from the inlier set. In a scene that has mostly Lambertian features

with only a small number of refracted features, outlier rejection methods work well. However,

for scenes that are mostly covered by a refractive object, such as when a robot or camera di-

rectly approaches a refractive object, outlier rejection methods are much less reliable because


the second assumption is broken [Kompella and Sturm, 2011]. Therefore, traditional feature

correspondence methods may not work reliably for features that pass through refractive objects.

3.2 Visual Servoing

Visual servoing (VS) is a form of closed-loop feedback control that uses a camera in the loop

to directly control robot motion. The term VS was introduced by Hill & Park in 1979 to distinguish their approach from the common “look-then-move”, or equivalently “sense-then-act”, approach to robotics [Hill, 1979], and has since covered a wide range of applications, from

controlling robot manipulators in manufacturing and agricultural fruit/vegetable picking [Mehta

and Burks, 2014,Baeten et al., 2008,Han et al., 2012], to flying quadrotors [Bourquardez et al.,

2009], and even docking of planetary rovers [Tsai et al., 2013]. VS is a promising technique

for robotics because it does not necessarily require a 3D geometric model of its target, the

accuracy of the operations does not entirely depend on accurate robot control and calibration,

and historically, the simplicity of the VS approach has led to faster interaction in docking,

manipulation and grasping tasks, as well as shorter time cycles in sensing the environment,

which have translated to more reliable robot performance.

Hutchinson et al. were some of the first researchers to clearly distinguish the different types of

VS systems in 1996 [Hutchinson et al., 1996]. This classification was based on how the visual

input was used and what computation was involved, grouping them into either position-based

visual servoing (PBVS) or image-based visual servoing (IBVS) systems. In this section, we

provide a comparison and review of PBVS and IBVS systems.


3.2.1 Position-based Visual Servoing

The purpose of PBVS is to minimise the relative pose error between the target pose (some desired pose) and the camera's current pose. Image features are extracted from the image and used with a

geometric model of the target and a known camera model to estimate the relative pose of the

target with respect to the camera, as shown in Fig. 3.1a. Feedback is then computed to reduce

the error in the estimated relative pose. PBVS is traditionally referred to as position-based

VS, although the approach may be more realistically referred to as pose-based VS. The main

advantage of PBVS is that it is straight-forward to incorporate physical constraints, spatial

knowledge and direct manoeuvres (such as obstacle avoidance).

PBVS requires an estimate of the target object pose in order to derive feedback control in the

task space. The approach can be computationally demanding, sensitive to noise and highly

dependent on camera calibration. Most research involving PBVS has focused on Lambertian

scenes, i.e. scenes that are predominantly Lambertian and so do not contain refractive objects,

specular reflections, or other surfaces or materials that cause non-Lambertian light transfer.

PBVS has been demonstrated in full 6 DOF control by Wilson et al. [Wilson et al., 1996] and

in real time using object models with a monocular camera by Drummond et al. [Drummond

and Cipolla, 1999]. More recently Tsai et al. implemented PBVS using a stereo camera for a

tether-assisted docking system [Tsai et al., 2013]. Teulière et al. demonstrated successful PBVS

using an RGB-D camera even when partial occlusions were present [Teulière and Marchand, 2014].

Figure 3.1: Architectures for (a) position-based visual servoing and (b) image-based visual servoing, which does not require explicit pose estimation. Image courtesy of [Corke, 2017].

PBVS towards refractive objects was recently considered by Mezouar et al. for transparent pro-

tein and crystal manipulation under a microscope [Mezouar and Allen, 2002]. However, the 2D

nature of the microscope workspace greatly simplified the visual servoing process. More im-

portantly, the microscope and the backlighting reduced the image processing to a thresholding

problem, making the objects’ refractive nature irrelevant.

Recently, Bergeles et al. used PBVS for controlling the pose of a microrobotic device inside

a transparent human eye for surgery by accounting for the visual distortions caused by the

eye [Bergeles et al., 2012]. Their method required extremely precise model calibration of both

the eye and the robot in order to avoid potential injury. In our application of servoing towards

refractive objects of more general shapes, models of the cameras are not always accurate, and

prior models of the objects are not necessarily available or can be difficult to obtain. Because they rely on such models, PBVS methods are sometimes referred to as model-based VS [Kragic and Christensen,

2002].

PBVS is not commonly used in practice because the visual features used for servoing are not

guaranteed to stay in the FOV during the approach, and more importantly, it requires estimation

of the target pose, which in turn requires a geometric model of the target object and model of

the camera. As we will discuss in Section 3.3.2, 3D information of refractive objects, and in

particular their 3D models and 3D pose information is extremely difficult to obtain. Experi-

mental setups that can obtain the required 3D measurements on the refractive objects are likely

too bulky for mobile robot applications such as VS. Additionally, monocular pose estimation

is poorly conditioned numerically [Kragic and Christensen, 2002]. Therefore, there is real in-

terest in compact IBVS systems that tend to keep the target in the FOV by the very nature of

the algorithm, that avoid the ill-conditioned pose estimation, and do not necessarily require 3D

geometric models of the refractive objects.

3.2.2 Image-based Visual Servoing

In IBVS, robot control values are directly computed based on image features, as shown in

Fig. 3.1b. Typically, image features from the current view of the robot are detected and ex-

tracted. These image feature vectors are matched to a set of goal image feature vectors. The

image feature error is computed as the difference between the two image feature sets. Then the

estimated camera velocity that attempts to drive the image feature error to zero is computed.

This cycle is repeated until the image feature error is sufficiently small. The negative feedback

helps to reduce system fluctuations and promotes settling to equilibrium, which makes IBVS

more robust to uncertainty, noise and camera/robot modelling and calibration errors that often

plague traditional open-loop sense-then-act approaches. IBVS works because the camera pose

is implicit in the image feature values. This eliminates the need for an explicit 3D geomet-

ric model of the goal object, as well as an explicit pose-based motion planner [Chaumette and

Hutchinson, 2006].

3.2.2.1 Image Jacobian for Monocular IBVS

At the core of IBVS systems is the interaction matrix, which is sometimes referred to as a visual-

motor model, but more commonly referred to as an image Jacobian [Kragic and Christensen,

2002]. The image Jacobian J represents a first-order partial derivative function that relates the

rate of change of image features to camera velocity. Consider

$$ \dot{p} = J(p, {}^{c}P; K)\, {}^{c}\nu, \qquad (3.1) $$

where cP ∈ R^3 is the coordinate of a world point in the camera reference frame, p ∈ R^2 is its image plane projection, K ∈ R^{3×3} is the camera intrinsic matrix, and cν = [v; ω] ∈ R^6 is the camera’s spatial velocity in the camera reference frame, which is the concatenation of the camera’s translational velocity v = [v_x, v_y, v_z]^T and rotational velocity ω = [ω_x, ω_y, ω_z]^T in the camera reference frame.

The control problem is defined by the initial (observed) and desired image coordinates, p^# and p^* respectively, from which the required optical flow

$$ \dot{p}^{*} = \lambda (p^{*} - p^{\#}) \qquad (3.2) $$

can be determined, where λ > 0 is a constant. This equation implies straight line motion in the image because the image feature error is only taken as the difference between initial and desired image coordinates. Combining both equations we can write

$$ J(p, {}^{c}P; K)\, \nu = \lambda (p^{*} - p^{\#}), \qquad (3.3) $$

which relates camera velocity to observed and desired image plane coordinates. It is important

to note that VS is a local method based on J , the linearisation of the perspective projection

equation. In practice it is found to have a wide basin of attraction.

The monocular image-based Jacobian for image point features p = (u, v) is given as [Chaumette and Hutchinson, 2006]

$$ J = \begin{bmatrix} -\dfrac{f_x}{P_z} & 0 & \dfrac{u}{P_z} & \dfrac{u v}{f_y} & -\dfrac{f_x^2 + u^2}{f_x} & \dfrac{f_y v}{f_x} \\[2ex] 0 & -\dfrac{f_y}{P_z} & \dfrac{v}{P_z} & \dfrac{f_y^2 + v^2}{f_y} & -\dfrac{u v}{f_x} & -\dfrac{f_x u}{f_y} \end{bmatrix}, \qquad (3.4) $$

where f_x, f_y are the x and y focal lengths¹, respectively, and P_z is the depth of the point. We

note that the first three columns of J depend on depth, implying that image feature velocity in

the image plane is inversely proportional to depth, while the feature velocity due to the angular

velocity of the camera is largely unaffected by depth.

¹ Typically, f_x and f_y are equal. These terms are in units of pixels, i.e. pixel size is included.

Equation (3.3) suggests we can solve for camera velocity ν, but for a single point the system is under-determined; it is not possible to uniquely determine the elements of ν from a single observation p. To address this issue, the typical approach is to stack (3.4) for each of N image features,

$$ \begin{bmatrix} J(p_1, {}^{c}P_1; K) \\ \vdots \\ J(p_N, {}^{c}P_N; K) \end{bmatrix} \nu = \lambda \begin{bmatrix} p_1^{*} - p_1^{\#} \\ \vdots \\ p_N^{*} - p_N^{\#} \end{bmatrix} \qquad (3.5) $$

and if N ≥ 3 we can solve uniquely for ν,

$$ \nu = -\lambda \begin{bmatrix} J_1 \\ \vdots \\ J_N \end{bmatrix}^{+} \begin{bmatrix} p_1 - p_1^{*} \\ \vdots \\ p_N - p_N^{*} \end{bmatrix}, \qquad (3.6) $$

where J^+ represents the left Moore-Penrose pseudo-inverse of J. Equation (3.6) is similar

to the classical proportional control law for VS [Hutchinson et al., 1996], except that we use

the pseudo-inverse because we may have noisy observations forming a non-square matrix; the

pseudo-inverse finds the least-squares solution that minimises the norm of the residual image feature error. The constant λ is the control loop’s gain, which scales the resulting control.
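For concreteness, a minimal sketch of this control law is given below (Python with NumPy; the function names, the fixed gain and the assumption that pixel coordinates are already expressed relative to the principal point are our own illustrative choices, not taken from the cited works). It stacks (3.4) for N point features and applies (3.6):

```python
import numpy as np

def image_jacobian(u, v, Pz, fx, fy):
    """Monocular point-feature image Jacobian of (3.4); (u, v) are pixel
    coordinates relative to the principal point, Pz is the point depth and
    fx, fy are the focal lengths in pixels."""
    return np.array([
        [-fx / Pz, 0.0, u / Pz, u * v / fy, -(fx**2 + u**2) / fx, fy * v / fx],
        [0.0, -fy / Pz, v / Pz, (fy**2 + v**2) / fy, -u * v / fx, -fx * u / fy],
    ])

def ibvs_step(p_obs, p_goal, depths, fx, fy, gain=0.5):
    """One iteration of (3.5)-(3.6): stack the Jacobians of N features and
    return the camera velocity [vx, vy, vz, wx, wy, wz]."""
    J = np.vstack([image_jacobian(u, v, Pz, fx, fy)
                   for (u, v), Pz in zip(p_obs, depths)])
    error = (np.asarray(p_obs, dtype=float) - np.asarray(p_goal, dtype=float)).ravel()
    return -gain * np.linalg.pinv(J) @ error
```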

There are two important issues with (3.4) and (3.6) with respect to lack of depth information

and stability. First, (3.4) depends on depth Pz. Any method that uses this form of Jacobian

must therefore estimate or approximate Pz. However monocular cameras do not measure depth

directly. A common assumption is to fix Pz, which then is a control gain for the translational

velocities [Chaumette and Hutchinson, 2006]. A variety of other approaches exist to estimate

depth online [Papanikolopoulos and Khosla, 1993, Jerian and Jain, 1991, De Luca et al., 2008];

however, monocular depth estimation techniques are often nonlinear and difficult to solve be-

cause they are typically ill-posed [Kragic and Christensen, 2002]. Moreover, visual servoing

stability issues can arise from these approaches because variable depth can lead to local minima

and ultimately unstable behaviour of the robot system [Chaumette, 1998].

Second, Chaumette showed that the IBVS system is only guaranteed to be locally stable, since

J is a linear approximation to the nonlinear robotic vision system [Chaumette and Hutchinson,

2006]. Local asymptotic stability is possible for IBVS, but global asymptotic stability cannot

be ensured. Determining the size of the neighbourhood where stability and convergence are

ensured is still an open issue, even though this neighbourhood is large in practice. Furthermore,

the stacking in (3.6) relies on stacking N image point feature Jacobians Ji, each of which may

have different Pz,i, depending on the scene geometry. Malis et al. showed that incorrect Pz,i can

cause the system to fail [Malis and Rives, 2003]. In other words, the depth distribution affects

IBVS convergence and stability, and in the case of unknown target geometry, accurate depth

estimates are actually needed.

One example of undesirable behaviours in IBVS is camera retreat, where the camera may move

backwards for large rotations [Chaumette, 1998]. Camera retreat is caused by the coupled

nature of the rotation and translation components in the image Jacobian. This poses a perfor-

mance issue because in real systems, such backwards manoeuvres may not be feasible. Corke

et al. showed that camera retreat was a consequence of requiring straight line motion on the

image plane with a rotating camera (as in (3.3)). This was then addressed by decoupling the

translation motion components from the z-axis rotation components into two separate image

Jacobians [Corke and Hutchinson, 2001]. Recently, Keshmiri et al. proposed to decouple all

six of the camera’s velocity screw elements [Keshmiri and Xie, 2017]. Their approach enables

better Cartesian trajectory control compared to traditional IBVS systems at the cost of more

computation.

Almost all IBVS methods rely on accurate image feature correspondence in order to accurately

compute image feature error. McFadyen et al. recently proposed an IBVS method that jointly

solves the image feature correspondence and motion control problem as an optimal control

framework [McFadyen et al., 2017]. Image feature error is computed for different feature cor-

respondence permutations. As the robot moves closer to the desired pose, the system converges

towards smaller error and the correct permutation. However, their approach focused on an ex-

haustive approach for the number of image features and thus does not scale well for a large

number of image features, such as those detected when using natural features typically found in

most robotic vision algorithms.

3.2.2.2 IBVS on Non-Lambertian Objects

An interesting approach to IBVS on featureless objects was proposed by Pages et al. whereby

coded, structured light was projected into the scene to create geometric visual features for fea-

ture correspondence [Pages et al., 2006]. By defining the projection pattern as a particular grid

of coloured dots, many point features were quickly and unambiguously detected and matched.

However, the structured light required that the ambient light did not overpower the projector,

limiting usage to indoor applications. Additionally, this method may not work reliably for re-

fractive objects, because the projected pattern would be severely distorted, scaled, or flipped,

which would greatly complicate the feature detection and correspondence problem.

Recently, Marchand and Chaumette used planar mirrored reflections to overcome the limited

FOV of a single camera in IBVS [Marchand and Chaumette, 2017]. They derived the image

Jacobian for servoing the mirror relative to the camera to track an object. However, only Lam-

bertian features were tracked through the mirror and it was assumed that the image features

were always within the mirror (thus all the reflected features always showed consistent motion).

Furthermore, image feature distortion that could arise from non-planar mirrors, somewhat simi-

lar to the distortion from refractive objects, was also not considered. In summary, this approach

may not be directly transferable to tracking image features through refractive objects because

nonlinear image feature motion—potentially caused by inconsistent feature/mirror motion or

non-planar mirrors—was not considered in their approach.

3.2.2.3 IBVS using Multiple Cameras

IBVS has been extended to stereo and multi-camera vision systems. Assuming the pose be-

tween both cameras is known, each camera’s Jacobian can be transformed into a common ref-

erence frame. Stacking the same type of image features from both cameras and solving the

system yields camera motion [Chaumette and Hutchinson, 2007]. Malis et al. [Malis et al.,

2000] extended this concept to multiple cameras with a similar stacking of image features;

more cameras yielded more features. Comport et al. derived an IBVS framework for gener-

alised cameras [Comport et al., 2011], though the focus was on non-overlapping FOV camera

configurations, rather than the overlapping FOV camera configurations of LF cameras. Ad-

ditional IBVS systems were discussed in Section 3.1.2. All of these previous works rely on

accurate feature correspondence. They assume Lambertian point correspondences, which do

not necessarily apply in the case of refractive objects. Therefore, we expect that none of these

systems would perform reliably in the presence of refractive objects.
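To make the frame transformation behind this stacking concrete, the following minimal sketch (Python with NumPy; the function names and the convention that (R, t) is the pose of camera 1 expressed in camera 2’s frame are our own assumptions, not taken from the cited works) maps a second camera’s Jacobian into the first camera’s velocity frame before stacking:

```python
import numpy as np

def skew(t):
    """3x3 skew-symmetric matrix such that skew(t) @ x equals cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def velocity_transform(R, t):
    """6x6 transform mapping a spatial velocity expressed in camera 1's frame
    to the same rigid-body motion expressed in camera 2's frame, where (R, t)
    is the pose of camera 1 in camera 2's frame."""
    V = np.zeros((6, 6))
    V[:3, :3] = R
    V[:3, 3:] = skew(t) @ R
    V[3:, 3:] = R
    return V

# Feature rates seen by camera 2 relate to camera 1's velocity via J2 @ V, so
# the stacked system becomes [J1; J2 @ V] @ nu1 = stacked image feature error.
```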

In the area of VS, Malis et al. first proposed to use 3D geometric image features, but referred

to this concept as 2.5D visual servoing [Malis et al., 1999]. Using 3D geometric image features

does not necessarily require any geometric 3D model of the target object and is less limited by

the relatively small convergence domain and depth issues that plague monocular image-based

visual servoing (M-IBVS). In a slightly different manner, Chaumette proposed that it may be

advantageous for robot systems to plan large steps using PBVS, while small intermediate steps

are maintained by IBVS [Chaumette and Hutchinson, 2007].

For stereo vision systems, Cervera et al. used the 3D coordinates of points [Cervera et al., 2003],

and Bernardes et al. used 3D lines [Bernardes and Borges, 2010] for visual servoing. Malis et

al. used homographies [Malis et al., 1999,Malis and Chaumette, 2000] and both Mariottini et al.

and Cai et al. used epipolar geometry [Mariottini et al., 2007,Cai et al., 2013] in visual servoing.

Both homography and epipolar-based approaches determine a geometric relationship between

the current and desired views to control robot motion. The geometric relationship is either

the homography matrix or the fundamental matrix, both of which can be determined using

corresponding feature points from different views. However, decomposing the homography

matrix only applies to planar scenes and stereo epipolar geometry becomes ill-conditioned for

short baselines as well as planar scenes [López-Nicolás et al., 2010].

Recently, Zhang et al. developed a trifocal tensor-based approach for visual servoing [Zhang

et al., 2018]. In simulation, a trinocular camera system was used to estimate the trifocal ten-

sor based on point feature correspondences, as in [Hartley and Zisserman, 2003]. Instead of

directly computing the camera pose via singular value decomposition (SVD), the authors chose

to use elements of the trifocal tensor, augmented with elements of scale and rotation, as visual

features. However, these methods relied on accurate feature correspondences, which are fun-

damentally based on Lambertian assumptions. Therefore, these approaches are not likely to

perform reliably in the presence of non-Lambertian scenes, such as those containing refractive

objects.

3.3 Refractive Objects in Robotic Vision

Refractive objects are particularly challenging in computer and robotic vision because these

objects do not have any obvious visible features of their own. Their appearance tends to be

largely dependent on the background, the object’s shape and the lighting conditions. Although

refractive objects have been largely ignored by the bulk of the robotics community, we review

the previous research in detecting and recognizing refractive objects, and in reconstructing their shape.

Although shape reconstruction is not an explicit goal of this thesis, observed structure and

camera motion are integrally linked, and it is important to review what information has been

extracted from refractive objects.

3.3.1 Detection & Recognition

There have been a variety of approaches to detecting and recognizing refractive objects. In

this review, we have divided the different approaches into model- and image-based approaches,

based on whether or not the method in question relies on a prior 3D geometric model of the

refractive objects.

3.3.1.1 Model-based Approaches

One of the earliest model-based approaches to refractive object detection was proposed by Choi

and Christensen, where a database of 2D edge templates of projected 3D refractive object mod-

els with known poses was used to match edge contours from 2D images [Choi and Chris-

tensen, 2012]. Image edges were extracted and matched using particle filters to provide coarse

pose estimates, which were refined via RANSAC. The authors achieved real-time refractive

object detection and tracking with 3D pose information. However, this approach required a

large database of edge templates for every conceivable model and pose, which does not scale

well for general purpose robots, although this is becoming less significant with the increasing

computational abilities of modern computers.

Most subsequent approaches adopted RGB-D cameras as a means of making putative refractive

object detections. While depth measurements of refractive objects from RGB-D cameras were

known to be inconsistent, partial depth around the refractive objects was usually observed in

the RGB-D images. Luo et al. applied a variety of morphological operations to identify 3D

regions of inconsistent depth, which were assumed to be refractive [Luo et al., 2015]. These 3D

regions were then compared to 3D object models for recognition. However, Luo obtained the

3D models of the refractive objects by first painting them so that the refractive objects became

Lambertian, which is not a practical approach for most robotic applications.

Recently, LF cameras have been considered for refractive object recognition with models. Wal-

ter et al. also used an RGB-D camera for object recognition, but combined their system with

an LF camera array to detect and replace the inconsistent depth estimates caused by specular

reflections on glass objects [Walter et al., 2015]. This was accomplished by comparing a known

3D model of the refractive object to the observed depth measurements in order to identify the in-

consistent depths. Given that LF cameras implicitly encode depth, it is possible that the RGB-D

camera was redundant in this approach.

In a particularly recent and interesting work, Zhou et al. developed an LF-based depth descriptor

for object recognition and grasping [Zhou et al., 2018]. For a Lambertian point, the light field

yields one highly redundant depth estimate, but for a refracted image feature, the light field

can yield a wide distribution of depths. Zhou proposed to use a 3D array of depth likelihoods

within a certain image region and depth range, creating a 3D descriptor for the refractive object.

By comparing this depth-based descriptor to a 3D geometric model, refractive object pose was

estimated using Monte Carlo localization. This method was sufficiently accurate for coarse

manipulation of glass objects in water and Lambertian objects behind a stained-glass window.

However, all of the previously-mentioned methods required prior accurate 3D geometric models

of the refractive objects. For a small set of simple objects this approach may be feasible, but

in general, models of refractive objects are challenging to acquire, potentially time-consuming

and expensive to obtain, or simply not available [Ihrke et al., 2010a]. Therefore, there is great

interest in methods that do not rely on 3D geometric models.

3.3.1.2 Image-based Approaches

Early work on detecting refractive objects in 2D images started with Adelson and Anandan

in 1990 by focusing on finding occluding edges caused by refractive objects [Adelson and

Anandan, 1990]. However their method was limited to 2D layered scenes with no visual texture

on planar refractive shapes, such as circles and triangles. Szeliski et al. extended this concept of

layered depth images to detect reflective and refractive objects in more general images [Szeliski

et al., 2000]; however, their approach was still limited to scenes that could be described as a

collection of planar layers. McHenry et al. noted that refracted objects tended to distort and blur

image edges, as well as appear slightly darker in the image [McHenry et al., 2005]. Thus their

method focused on finding image edges and then compared the image gradients and overall

intensity values on either side of the edge to detect refractive parts of the image. Snake contours

were then used to merge components of refractive object edges into overall refractive object

segments. However, their method assumed that the background was similar on all sides of the

glass edges, which was not true for very refractive elements or those containing bright surface

highlights.

Kompella et al. extended this work by finding regions in the image that contained even more

visual characteristics related to refracted objects [Kompella and Sturm, 2011]. In addition to the

reduced image intensity and blurred image gradients, their method also searched for an abun-

dance of highlights and caustics caused by the specular surface of most refractive objects and

lower saturation values as some light and colour is lost as it passes through refractive objects.

Characteristics were combined into a function to detect and avoid refractive objects during navi-

gation. However, their method only provided extremely coarse estimates of where the refractive

objects were located in the image and still assumed that the background was similar on all sides

of the glass edges. Therefore we anticipate that their approach would not perform well if the ob-

ject was not in front of a uniform background, which is not practical for mobile robots working

in cluttered scenes.

Recently, Klank et al. used an RGB-D camera for detection, but noted that most refractive

objects appeared much darker in the depth images when placed on a flat table [Klank et al.,

2011]. This was likely due to the different absorption properties of glass. They segmented dark

regions in the depth images as candidate refractive objects and then identified depth inconsis-

tencies within the dark regions as refractive. However, dark regions in depth images do not

necessarily correspond to glass objects. Depending on the type of RGB-D camera used, dark

regions in depth images can also appear at regions that are actually farther away (since intensity

is correlated to depth), as well as other material types, such as felt, and sometimes at occlu-

sion boundaries. Thus their algorithms may not perform well in cluttered and occluded scenes

containing refractive objects.

LF cameras have only recently been considered for image-based refractive object detection.

Maeno et al. proposed to model an object’s refraction pattern as image distortion, based on

differences in corresponding points in the 4D LF [Maeno et al., 2013]. However, the authors

noted poor performance due to changes in appearance from both the specular reflections on the

refractive objects and the camera viewing pose. Xu et al. built on Maeno’s work to complete a

transparent object image segmentation method from a single light field capture [Xu et al., 2015].

However, as we will discuss in more detail in Ch. 5, their method does not fully describe how

a 3D point manifests in the light field. We address this to improve detection and recognition

rates.

3.3.2 Shape Reconstruction

Although shape reconstruction is not an explicit goal of this thesis, observed structure and mo-

tion are intricately linked; thus it is important to understand what has been done in this area.

Shape reconstruction of refractive objects is a particularly challenging task. Ihrke et al. proposed

a taxonomy of objects according to their increasing complexity with respect to light transport

(reflections, refractions, sub-surface scattering, etc.) [Ihrke et al., 2010a]. Most techniques

have focused on opaque objects (Class 1) and have demonstrated good performance using a

sequence of images from a monocular camera relying on dense pixel correspondences [Engel

et al., 2014, Newcombe et al., 2011]. However, shiny and transparent objects are still diffi-

cult for the state-of-the-art because these methods assume Lambertian surfaces. Additionally,

traditional methods rely on rejecting inconsistent correspondences using RANSAC [Fischler

and Bolles, 1981], which can be robust to a few small specular highlights, but is insufficient

for dealing with more complex light transport phenomena (Class 3+), including refractive ob-

jects [Ihrke et al., 2010a, Tsai et al., 2019], as we will show in Ch. 5. In order to reliably deal with

shiny and transparent objects, researchers have developed a variety of methods to reconstruct

the shape of refractive objects.

3.3.2.1 Shape from Light Path Triangulation

Kutulakos et al. presented the seminal work on using light-ray correspondences to estimate the

shape of refractive objects [Kutulakos and Steger, 2007]. The shape of specular and transparent

objects, defined by depths and surface normals, can be estimated by mapping the light rays that

enter and exit the object. As shown in Fig. 3.2, we can consider a convex hull two-interface

refractive object and draw a ray originating from background point P (two parameters) in some

direction r (2)² for some distance d_PA (1). At A₀, the ray intersects with the refractive object and changes direction. We estimate this direction change using Snell’s Law, which requires an estimate of the surface normal N_i (2) and the ratio of refractive indices n₁/n₂ (1). Through the object, the light ray travels for distance d_AB (1) and changes direction at the exiting interface at B₀, which is defined by surface normal N (2). The light ray then travels for distance d_BL (1) to the camera. Altogether, a basic light path can be described by

a minimum of twelve characteristics of the scene. Alternatively, one can describe the light path

as three rays (four parameters each) linked in series, which also requires a minimum of twelve

parameters. As we will describe below, many approaches, such as shape from distortion [Ben-

Ezra and Nayar, 2003] and shape from reflection [Han et al., 2015], apply assumptions which

limit or define many of these parameters to simplify shape recovery.
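As a rough illustration of how these parameters chain together, the sketch below (Python with NumPy; the function names, the unit-vector convention and the requirement that each normal points towards its incoming ray are our own assumptions) refracts a ray through the two interfaces of Fig. 3.2 using the vector form of Snell’s law:

```python
import numpy as np

def refract(d, n, eta):
    """Vector form of Snell's law: unit incident direction d, unit normal n
    pointing towards the incoming ray, eta = n1 / n2."""
    cos_i = -np.dot(n, d)
    sin2_t = eta**2 * (1.0 - cos_i**2)
    if sin2_t > 1.0:
        return None                              # total internal reflection
    return eta * d + (eta * cos_i - np.sqrt(1.0 - sin2_t)) * n

def trace_two_interface(P, r, d_PA, N_entry, N_exit, d_AB, eta):
    """Chain the light-path parameters of Fig. 3.2: start at background point P,
    travel d_PA along r, refract at the entry surface (normal N_entry), travel
    d_AB inside the object, then refract at the exit surface (normal N_exit)."""
    A = P + d_PA * r
    r_inside = refract(r, N_entry, eta)          # entering the object (n1 -> n2)
    if r_inside is None:                         # cannot occur entering a denser medium
        return A, None, None
    B = A + d_AB * r_inside
    r_exit = refract(r_inside, N_exit, 1.0 / eta)  # leaving the object (n2 -> n1)
    return A, B, r_exit                          # entry point, exit point, exit direction
```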

Shape from distortion is an approach based on capturing multiple images from known poses,

finding visual features that correspond to the same 3D point from behind the refractive object,

and then examining how the light path has been distorted by the refractive object. For example,

² Recall that a ray can be described by four parameters.

Figure 3.2: Light paths can describe the behaviour of light as it passes through a refractive ob-

ject. Most methods rely on light path correspondence and triangulation to solve for the depths

and surface normals of the refractive object. In general for 2-interface refractive objects, light

paths are described by over twelve characteristics from the point of origin to the intersecting

lines at the refractive object boundaries, to the camera sensor. Many approaches apply assump-

tions or constraints to simplify the problem.

Kim et al. acquired the shape of axially-symmetric transparent objects, such as wine glasses,

by placing an LCD display monitor in the background and emitting several known lighting

patterns [Kim et al., 2017]. However, most methods rely on a bulky device to project a calibrated

pattern through the object [Murase, 1990, Hata et al., 1996, Kim et al., 2017] and so are not

immediately applicable to mobile robotics. Recently Ben-Ezra et al. tracked features over a

sequence of monocular camera images to capture the distortion pattern [Ben-Ezra and Nayar,

2003]. Starting with an unknown parametric model, shape and pose were simultaneously found

in an iterative, nonlinear, multi-parameter optimisation scheme. However their method could

only handle quadratic-shaped refractive objects and importantly, the features were manually

tagged because it was seen as a very hard problem to automatically detect and match image

points through a refractive medium from single images.

Alternatively shape from reflection or refraction approaches typically solve light ray correspon-

dences by controlling the background behind the refractive object. Han et al. used a single cam-

era fixed in position with a refractive object placed in front of a checkerboard background [Han

et al., 2015, Han et al., 2018]. The method only required two images with the background

pattern in different known positions; however, a change of refractive index was required, achieved by immersing the object in water, which is a major limitation for most robots.

In addition to background scene control, constraints on the refractive object itself can further

simplify the light path correspondence problem. For example, Tsai et al. imposed a planar

surface constraint to one side of a refractive object. With a monitor controlling the background

image, they were able to reconstruct a diamond’s shape with a single monocular image [Tsai

et al., 2015] without having to place the object in water.

Without explicit control of the background, shape can also be obtained by controlling the incom-

ing light rays using a mobile light source. Morris et al. used a static monocular camera with a

grid of known moving lights to map different reflectance values to the same surface point, from

which they reconstructed very challenging shiny and transparent structures [Morris and Kutu-

lakos, 2007]. Miyazaki and Ikeuchi used a rotating polariser in front of a monocular camera

to capture multiple images of different polarisation settings, but also required a known back-

ground surface and known lighting distribution to estimate the shape of the transparent object

[Miyazaki and Ikeuchi, 2005]. However, both Morris’ and Miyazaki’s methods require known

light sources with bulky configurations that are impractical for mobile robotic applications.

The majority of the state-of-the-art methods for refractive object shape reconstruction based

on light paths roughly rely on feature correspondence between multiple views to find common

features for triangulation. Because of the complexity and sheer number of unknowns of the

problem, most of these approaches apply assumptions and constraints to the problem to make

it more tractable. In doing so, the application window of these methods becomes too narrow, making them too fragile and unreliable for practical robot applications that must contend with many conditions or environments; alternatively, the methods require equipment too bulky to be considered for most mobile robot applications.

3.3.2.2 Shape from Learning

Recent work in robotics has seen an explosion in the area of learning features using CNNs.

CNNs use a large number of images to train several layers of parameters to minimise some

cost function. CNNs use the convolution operation on images that are input to the network to

approximate how neurons from the brain respond to visual stimulus in the receptive field of

the visual cortex [Krizhevsky et al., 2012]. By feeding it large training sets of images and an

objective function, the CNN is able to “learn” the visual stimulus relevant for a given task (such

as image classification or object detection).

Deep learning approaches use more layers than CNNs to handle more complex tasks and achieve more advanced recognition performance. Deep learning has achieved state-of-the-art performance for many classification and recognition tasks, but few works have explored its use for refractive objects.

Saxena demonstrated a data-driven method for recognizing grasping points on a variety of ob-

jects, including some refractive objects [Saxena et al., 2008]. However, recovering the shape

of such objects still remains a challenge due to the large amount of ground-truthed images re-

quired to train CNNs. For learning approaches on opaque objects, ground truth comes from

RGB-D cameras; however, RGB-D cameras are unable to provide reliable depth information

on refractive objects and 3D models of refractive objects are not always available.

3.3.2.3 Shape (Structure) from Motion

Shape estimation techniques based on multiple viewpoints are closely related to structure from

motion techniques [Wei et al., 2013]. For shape estimation, scene depth is usually determined

given the viewing pose for each viewpoint (although surface normals are also often computed).

On the other hand, for SfM, scene depth and viewing pose are simultaneously computed from

multiple 2D images³. SfM is generally considered to be a well understood problem in the-

³ SfM is also closely related to visual servoing, which we review in Section 3.2.

ory [Hartley and Zisserman, 2003]. The typical pipeline of SfM includes detecting image

features, establishing image correspondences, filtering outliers, estimating camera poses and

locations of 3D points, followed by optional refinement with bundle adjustment [Triggs et al.,

2000]. However, classical SfM does not produce reliable results for refractive objects because

of poor performance with feature correspondence [Ihrke et al., 2010b].

Ham et al. present a shape estimation method that may be loosely described as structure from

motion on occluding edges of refractive objects [Ham et al., 2017]. The authors use multiple

views with known pose to extract the position and orientation of occluding edge features. Oc-

cluding edge features are visible edges in an image that lie on the boundary of an occlusion or

depth discontinuity. They appear as edges in the image and unlike textural edges (flat patterns

on a surface), they are view-dependent and their surfaces are tangential to the camera view.

Ham’s method can handle very general object shapes and does not require pre-existing knowl-

edge of the object. Bulky equipment setups are not required. However, their method relies on

a monocular camera, which must be moved to different poses to acquire multiple views, which

may make dynamic scenes more challenging. An LF camera may be able to capture a sufficient

number of multiple views in a single shot from a single sensor position (i.e., without moving

the camera). Furthermore, Ham’s method is focused on reconstructing the scene. While Ham’s

method requires full pose information, our methods aim to detect refracted features and servo towards them; our approach is entirely image-based and thus does not require full pose information.

3.3.2.4 Shape from Light Fields

Using LF cameras for estimating the shape of objects is a relatively recent development. Most

research has been focused on reconstructing Lambertian objects. Tao et al. recently used cues

from both defocus and correspondence within a single LF exposure to obtain depth. The two

measures of depth were combined to provide more accurate dense depth maps than from either

method alone [Tao et al., 2013]. Luke et al. provided a framework to estimate depth by working

directly with the 4D LF in terms of gradients, as opposed to other methods that only exploited

2D epipolar slices of the 4D LF [Luke et al., 2014]. Wanner & Goldluecke formulated a struc-

ture tensor for each pixel to give local estimates of the slopes of lines in the epipolar plane

images. A global optimisation method was used to combine these local depth estimates in a

consistent manner [Wanner and Goldluecke, 2012]. Their approach yielded high quality, dense

depth maps, but required significant computation time, easily over four hours for a single light

field, which may not be practical for online robotics applications. Recently, Strecke et al. devel-

oped a method to jointly estimate depth and normal maps from a 4D light field on Lambertian

surfaces using focal stacks generated by a single light field [Strecke et al., 2017]. However,

none of these methods considered refractive objects.

Wanner et al. were the first to recover the shape of planar specular and transparent surfaces

from an LF [Wanner and Goldluecke, 2013]. They assumed that the observed light was a linear

combination of the real surface and the reflected or refracted image. The epipolar plane image can then be described as a superposition of two line slopes that are related to depth. Both

depths were determined and used to separate the scene into a layer closer to the camera and a

layer farther from the camera. However, Wanner’s method was limited to single reflection cases

and planar reflective or transparent surfaces. Our interest is in interacting with more general

object shapes.

Furthering the work on slightly more general shapes, Wetzstein et al. reconstructed the shape

of transparent surfaces based on the distortion of the light field’s background light paths [Wet-

zstein et al., 2011]. Their method relied on a light-field probe that consisted of a lenslet array in

front of a monitor to encode two dimensions in position and two dimensions in direction. Thus

a monocular camera could measure a 4D LF in a single 2D image. The thin refractive object

was placed between the probe and the monocular camera. Since the start of the light paths

were known by calibration, the difference between incoming and exiting angles θi and θo were

computed assuming known refractive indices of the two media. The surface normals were sub-

sequently determined. However, this approach relied on placing the light-field probe behind the

object, while photographing its front and only applied to thin objects. Thus the general place-

ment of refractive objects in cluttered scenes and mobile applications would be problematic for

this approach.

Recently, Ideguchi et al. proposed an interesting approach to transparent shape estimation based

on comparing the different disparities between sub-images for a given visual feature in the light

field, which they called light-field convergency [Ideguchi et al., 2017]. It is known that as a

visual feature approaches an occluding edge of a refractive object in a light field, it appears in-

creasingly Lambertian. A deeper analysis of their approach suggests that their method performs

an approximate Lambertian depth estimate similar to focus stacking and then fills in inconsis-

tent or missing depths using traditional hole-filling methods that assume smooth surfaces. This

approximation is only valid near occluding edges of refractive objects; thus, their method was

unable to handle thick and wide shapes, such as spheres.

Overall, the bulk of shape estimation techniques using LF cameras has been focused on Lam-

bertian cases, leaving the topic of refractive objects little explored. Those works that have

addressed refractive objects have been limited in terms of the types of objects they apply to, or

require bulky equipment that is not practical for mobile robots.

3.4 Summary

In summary, we have reviewed the topics that have been explored in the realm of features and visual

servoing in the context of LF cameras and refractive objects. Our motivation is to enable visual

control around refractive objects using LF cameras.

Most image features in robotic vision have been limited to 2D and 3D and rely heavily on the

Lambertian assumption. Recent 4D LF-specific features have been proposed, but still predomi-

nantly only consider Lambertian or occluded scenes. LF features in relation to refractive objects

are still not yet well-explored.

For visual servoing, PBVS methods appear to be impractical because they require a model of the

refractive object. Various IBVS methods have been developed, but the focus has been largely

on Lambertian scenes. To the best of our knowledge, IBVS in the context of refractive objects

or LF cameras remains unexplored.

Finally, model-based solutions for refractive object detection have been explored; however, 3D

geometric models of refractive objects are time-consuming and difficult to obtain accurately or

simply not available. Thus there is interest in approaches that do not require models. Image-

based detection methods are so far limited in their application, unreliable for changes in viewing

pose, or incomplete in describing a refracted feature’s behaviour in the light field. Additionally,

most solutions require bulky equipment that is impractical for mobile robotic platforms, while

others rely on assumptions that significantly narrow their application window. Clearly there is

a gap for methods that are compact and apply to a wide variety of object shapes.

Chapter 4

Light-Field Image-Based Visual

Servoing

In the background section, we introduced LF cameras and saw that they were good for capturing

scene texture, depth and view-dependent lighting effects, such as occlusion, specular reflection

and refraction. In the following chapters, we will elaborate on how we will use them to reliably

perceive refractive objects and servo towards them for grasping and manipulation. However,

the first practical issue that must be addressed is how to actually perform visual servoing (VS)

with an LF camera in Lambertian scenes. This chapter focuses on how to directly control

robot motion using observations from an LF camera via image-based visual servoing (IBVS)

for Lambertian scenes. This work was published in [Tsai et al., 2017].

4.1 Light-Field Cameras for Visual Servoing

VS is a robot control technique that makes direct use of visual information by placing the camera

in the control loop. VS is widely applicable and generally robust to errors in camera calibration,

robot calibration and image measurement [Hutchinson et al., 1996, Chaumette, 1998, Cai et al.,

2013]. Most VS techniques fall into one of two categories. Position-based visual servoing

(PBVS) uses observed image features and a geometric object model to estimate the camera-

object relative pose and adjust the camera pose accordingly; however, geometric object models

are not always available. In contrast, image-based visual servoing (IBVS) uses observed image

features and a reference image, from which a set of reference image features are extracted, to

directly estimate the required rate of change of camera pose, which does not necessarily require

a geometric model.

However, most IBVS algorithms are focused on conventional monocular cameras that inher-

ently suffer from lack of depth information, provide limited observations of small or distant

targets relative to the camera’s FOV, and struggle with occlusions, specular highlights and re-

fractive objects. LF cameras offer a potential solution to these problems. As a first step in

exploring LF for IBVS, this chapter considers the multiple views and depth information im-

plicit in the LF structure. To the best of our knowledge, light-field image-based visual servoing

(LF-IBVS) has not yet been proposed.

The main contributions of this chapter are as follows:

• We provide the first derivation, implementation and experimental validation of LF-IBVS.

• We derive image Jacobians for the LF.

• We define an appropriate compact representation for LF features that is close to the form

measured directly by LF cameras.

• In addition, we take a step towards truly 4D plenoptic feature extraction by enforcing LF

geometry in feature detection and correspondence.

We assume a Lambertian scene and sufficient scene texture for classical 2D image features,

such as SIFT and SURF. We validate our proposed method for LF-IBVS using both a simu-

lated camera array and a custom LF camera adapter, shown in Fig. 4.1, which we refer to as MirrorCam, mounted on a robot manipulator. We describe MirrorCam in detail in Appendix A.

Figure 4.1: (a) MirrorCam mounted on the Kinova MICO robot manipulator. Nine mirrors of different shape and orientation reflect the scene into the upwards-facing camera to create 9 virtual cameras, which provides video frame-rate LFs. (b) A whole image captured by the MirrorCam and (c) the same decoded into a light-field parameterisation of 9 sub-images, visualized as a 2D tiling of 2D images. The non-rectangular sub-images allow for greater FOV overlap.

Finally, we show that LF-IBVS outperforms conventional monocular and stereo IBVS for ob-

jects occupying the same FOV and in the presence of occlusions.

The remainder of this chapter is organized as follows. Section 4.2 discusses the related work,

formulates the VS problem and explains the LF parameterisation. Section 4.4 explains the

derivations for LF image Jacobians, features, correspondence and the control system. Sec-

tion 4.5 describes our experimental setup with the MirrorCam. Section 4.6 shows our results,

and provides a comparison to conventional monocular and stereo IBVS. Lastly, in Section 4.7,

we conclude the chapter and explore future work for LF-IBVS.

4.2 Related Work

LF cameras offer extra capabilities for robotic vision. Table 4.1 compares conventional and LF

camera systems for different capabilities and tolerances related to VS, given similar configu-

rations, such as sensor size and number of pixels. Notably, stereo provides depth for a single

baseline along a single direction (typically horizontally), but multi-camera and LF systems pro-

vide more detailed depth information. They can have both small and long baselines, and have

baselines in multiple directions (typically vertically and horizontally). LF cameras have an ad-

vantage over conventional multi-camera systems for tolerating occlusions and specular reflec-

tions (or more generally non-Lambertian surfaces). This is largely due to the regular sampling,

and because only LF cameras capture the refraction, transparency and specular reflections na-

tively. As such, LF cameras can benefit from methods that exploit these capabilities [Dansereau,

2014].

Table 4.1: Comparison of camera systems’ capabilities and tolerances for VS

System             Perspectives   Field of View   Baseline   Baseline Direction   Aperture Problem   Occlusion Tolerance   Specular Tolerance

Conventional Cameras
Mono               1              wide            zero       none                 significant        no                    no
Stereo             2              wide            wide       single               moderate           weak                  no
Trinocular         3              wide            wide       two                  moderate           moderate              no
Multiple cameras   n              wide            wide       multiple             minor              moderate              no

Light-Field Cameras
Array              n^2            wide            wide       multiple             minor              strong                yes
MLA (a)            n^2            wide            narrow     multiple             minor              moderate              yes
MirrorCam (b)      n^2            narrow          wide       multiple             minor              strong                yes

(a) Based on n^2 pixels per lenslet. (b) Based on n^2 mirrors.

Johannsen et al. recently applied LFs in structure from motion [Johannsen et al., 2015]. They

derived a linear relationship using the LF to solve the correspondence problem and compute a

3D point cloud. They achieved an increase in accuracy and robustness, although their 3D-3D

approach did not take full advantage of the 4D LF. Dong et al. focused on Simultaneous Lo-

calization and Mapping (SLAM), and demonstrated that an optimally-designed low-resolution

LF camera allowed them to develop a SLAM implementation that is more computationally

efficient, and more accurate than SLAM for a single high-resolution camera [Dong et al.,

2013]. Dansereau et al. derived “plenoptic flow” for closed-form, computationally efficient

visual odometry with a fixed operation time regardless of scene complexity [Dansereau et al.,

2011]. Zeller et al. extended Dansereau’s plenoptic flow to narrow FOV visual odometry and

showed how LF cameras can enable SLAM for narrow FOV systems, where monocular SLAM

normally fails [Zeller et al., 2015]. That work also showed that using LF cameras with their

visual odometry method improved the depth estimation error by an order of magnitude. Re-

cently, Walter et al. used LF cameras to analyse specular reflection and detect features specific

to specular reflections, which enabled robots to interact with glossy objects, and outperform

their stereo counterparts [Walter et al., 2015]. These motivate the application of LFs for robotics

and LF-IBVS.

4.3 Lambertian Light-Field Feature

Recall from Section 2.7 that the rays emitted from a point in space, cP = [P_x, P_y, P_z]^T, follow a pair of linear relationships [Bolles et al., 1987, Dansereau and Bruton, 2007], as shown in Figs. 2.21 and 2.22,

$$ \begin{bmatrix} u \\ v \end{bmatrix} = \left( \frac{D}{P_z} \right) \begin{bmatrix} P_x - s \\ P_y - t \end{bmatrix}, \qquad (4.1) $$

where each equation describes a hyperplane in 4D, F(s, t, u, v) ∈ R^3, and their intersection describes a plane L(s, t, u, v) ∈ R^2.

We define our LF feature with respect to the central view of the LF as W = [u_0, v_0, w]^T, where (u_0, v_0) is the direction of the ray entering the central view of the LF, i.e.

$$ \begin{bmatrix} u_0 \\ v_0 \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix}_{s,t=0} = \left( \frac{D}{P_z} \right) \begin{bmatrix} P_x \\ P_y \end{bmatrix}. \qquad (4.2) $$

As discussed in Section 2.7.4, the slope w relates the image plane coordinates for all rays emitted from a point in the scene. Fig. 2.21 shows the geometry of the LF for a single view of cP. As the viewpoint changes, that is, as s and t change, the image plane coordinates vary linearly according to (4.1), as in Fig. 2.22. The slope of this line, w, comes directly from (4.1),

and is given by

$$ w = -\frac{D}{P_z}, \qquad (4.3) $$

noting that this slope is identical in the s, u and t, v planes. We exploit this aspect of the LF in

the feature matching and correspondence process, described in Section 4.5.1. By working with

slope, akin to disparity from stereo algorithms, we deal more closely with the structure of the

LF.
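As a concreteness check on (4.1)–(4.3), the short sketch below (Python with NumPy; the function names are our own) computes the feature W = [u_0, v_0, w]^T for a Lambertian point and projects the point into an arbitrary sub-aperture view, showing that the image-plane coordinates vary linearly in (s, t) with the same slope w in both planes:

```python
import numpy as np

def lf_feature(P, D):
    """Lambertian light-field feature W = [u0, v0, w] of (4.2)-(4.3) for a point
    P = [Px, Py, Pz] in the central-view frame, where D is the separation
    between the (s, t) and (u, v) reference planes."""
    Px, Py, Pz = P
    u0, v0 = (D / Pz) * Px, (D / Pz) * Py   # central-view ray direction, (4.2)
    w = -D / Pz                             # common slope in the s,u and t,v planes, (4.3)
    return np.array([u0, v0, w])

def project_to_view(P, D, s, t):
    """Image-plane coordinates of P as seen from sub-aperture (s, t), using (4.1)."""
    u0, v0, w = lf_feature(P, D)
    return u0 + w * s, v0 + w * t           # linear in (s, t) with slope w
```

The point’s depth then follows from the slope as P_z = −D/w, which is why the slope can stand in for the depth that conventional IBVS requires.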

Our LF feature is similar to the Augmented Image Space of [Jang et al., 1991] for perspective

images where the image plane coordinates are augmented with Cartesian depth. Also similar are

the plenoptic disk features developed for the calibration of lenslet-based LF cameras in [O’Brien

et al., 2018]. In plenoptic disk features, image feature coordinates are augmented with the radius

of the plenoptic disk, which is related (by similar triangles) to a Lambertian point’s depth.

4.4 Light-Field Image-Based Visual Servoing

In this section, we derive the image Jacobians for our LF feature, which are used for image-

based visual servoing. Image Jacobians relate image feature velocity (in image space) to camera

velocity in translation and rotation. We first consider the continuous-domain, where s, t, u, v are

distances. Then we consider the discrete-domain, where i, j and k, l are discrete versions of s, t

and u, v, and typically correspond to different views and pixels, respectively.

4.4.1 Continuous-domain Image Jacobian

Following the derivation for conventional IBVS, we wish to relate the camera’s velocity to the

resulting change in an observed feature W through a continuous-domain image Jacobian JC,

\[ \dot{W} = J_C \, \nu, \quad (4.4) \]


where ν = [v;ω] ∈ R6 is the camera spatial velocity in the camera reference frame. ν is the

concatenation of the camera’s translational velocity v = [vx, vy, vz]T and rotational velocity

ω = [ωx, ωy, ωz]T in the camera reference frame.

Differentiation of (4.2) and (4.3) yields

\[ \dot{u}_0 = D(\dot{P}_x P_z - P_x \dot{P}_z)/P_z^2, \quad (4.5) \]
\[ \dot{v}_0 = D(\dot{P}_y P_z - P_y \dot{P}_z)/P_z^2, \quad (4.6) \]
\[ \dot{w} = D\dot{P}_z/P_z^2, \quad (4.7) \]

where u0, v0, w and their time derivatives are the feature positions and velocities with respect to the central camera frame.

We can write the apparent motion of a 3D point as

\[ {}^{c}\dot{P} = -(\omega \times {}^{c}P) - v, \quad (4.8) \]

yielding the three components of the velocity of cP expressed in terms of cP and ν. Substituting these expressions into (4.5)–(4.7) allows us to factor out the continuous-domain Jacobian

\[ J_C = \begin{bmatrix}
w & 0 & -\frac{w u_0}{D} & \frac{u_0 v_0}{D} & -\left(D + \frac{u_0^2}{D}\right) & v_0 \\
0 & w & -\frac{w v_0}{D} & D + \frac{v_0^2}{D} & -\frac{u_0 v_0}{D} & -u_0 \\
0 & 0 & -\frac{w^2}{D} & \frac{w v_0}{D} & -\frac{w u_0}{D} & 0
\end{bmatrix}. \quad (4.9) \]

While conventional image Jacobians require an estimate of depth, we note that JC instead has

slope w—an inverse measure of depth, which we can observe directly in the LF. The slope w

is explicit in all columns of (4.9) except the last one, because the LF camera array spans both

the x- and y-axes, and can therefore only observe motion parallax with respect to the camera’s

x- and y-axes. The optical flow for the final column is due to rotation about the optical axis,

and is therefore invariant to depth. In contrast, depth is not explicit in the monocular image


Jacobian for rotations about the x- and y-axes. Trinocular and multi-camera system image

Jacobians would have similar depth dependencies to JC . Multiple views make parallax, and

thus depth, observable in rotations about the x- and y-axes for the LF camera array. We note

that the derivation for JC is for the central view of the LF camera array. Jacobians derived for

the off-axis in-plane views would contain elements of slope in the last column. Additionally, JC

has a rank of three, which implies that the stacked image Jacobian (as in (3.5)) will be full rank

with a minimum of two points for LF-IBVS, in contrast to a minimum of three image points for

M-IBVS.
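For reference, a minimal MATLAB sketch of (4.9) is given below (a direct transcription of the matrix above; the stacking of two or more features into a full-rank system is indicated in the comments):

```matlab
function Jc = continuousJacobian(u0, v0, w, D)
% Continuous-domain image Jacobian of eq. (4.9) for one LF feature
% W = [u0; v0; w], given the reference-plane separation D.
Jc = [ w, 0, -w*u0/D,  u0*v0/D,     -(D + u0^2/D),  v0;
       0, w, -w*v0/D,  D + v0^2/D,  -u0*v0/D,      -u0;
       0, 0, -w^2/D,   w*v0/D,      -w*u0/D,        0 ];
% Stacking the 3x6 Jacobians of at least two features gives a full-rank
% system, e.g. Jstack = [continuousJacobian(u1,v1,w1,D);
%                        continuousJacobian(u2,v2,w2,D)];
% and the camera velocity follows from nu = pinv(Jstack)*featureVelocities.
end
```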

4.4.2 Discrete-domain Image Jacobian

In the discrete domain, we define i, j and k, l as the discrete versions of s, t and u, v, in units of “views” and pixels, respectively. We observe our discrete-domain feature M as the discrete

position and slope M = [k0, l0,mx,my]T, where [k0, l0] are observations taken from the central

view in i, j, and separate slopes mx in the i, k dimensions and my in j, l. The general plenoptic

camera is described by an intrinsic matrix H relating a homogeneous ray φ = [s, t, u, v, 1]T to

the corresponding sample in the LF n = [i, j, k, l, 1]T as in

\[ \phi = H n, \quad (4.10) \]

where in general H is of the form

\[ H = \begin{bmatrix}
h_{11} & 0 & h_{13} & 0 & h_{15} \\
0 & h_{22} & 0 & h_{24} & h_{25} \\
h_{31} & 0 & h_{33} & 0 & h_{35} \\
0 & h_{42} & 0 & h_{44} & h_{45} \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}, \quad (4.11) \]

and the matrix H is found through plenoptic camera calibration [Dansereau et al., 2013].


However, we limit our development to the case of a rectified camera array, for which only

diagonal entries and the final column are nonzero [Dansereau, 2014]. In this case h11 and h22

are the horizontal and vertical camera array spacing, in meters, and h33 and h44 are given by

D/fx and D/fy, i.e. the inverse of the horizontal and vertical focal lengths of the cameras,

expressed in pixels, scaled by the reference plane separation. The final column encodes the

centre of the LF, e.g. for Nk samples in k, h15 = -h11(Nk/2 + 1/2) and k = Nk/2 + 1/2 is the

centre sample in k. We also note that mx and my encode the same information following the

relationship

\[ m_x = \frac{h_{11} h_{44}}{h_{22} h_{33}} m_y. \quad (4.12) \]
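The following MATLAB sketch illustrates (4.10) for a rectified camera array. The array spacing, focal lengths, view and pixel counts are arbitrary example values, and the centring entries in the final column follow the per-dimension pattern described above, applied illustratively; in practice H comes from plenoptic calibration.

```matlab
% Example rectified-array intrinsics (illustrative values only)
Ni = 3;   Nj = 3;                 % number of views in i and j
Nk = 640; Nl = 480;               % pixels per view in k and l
h11 = 0.01;  h22 = 0.01;          % camera spacing [m]
D   = 0.5;                        % reference-plane separation [m]
fx  = 800;   fy  = 800;           % focal lengths [pix]
h33 = D/fx;  h44 = D/fy;

% Only the diagonal and the final column are nonzero for a rectified array
H = [ h11, 0,   0,   0,   -h11*(Ni/2 + 1/2);
      0,   h22, 0,   0,   -h22*(Nj/2 + 1/2);
      0,   0,   h33, 0,   -h33*(Nk/2 + 1/2);
      0,   0,   0,   h44, -h44*(Nl/2 + 1/2);
      0,   0,   0,   0,    1 ];

% Map a discrete LF sample n = [i; j; k; l; 1] to a ray phi = [s; t; u; v; 1]
n   = [2; 2; 320; 240; 1];
phi = H * n;
```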

We wish to express the image Jacobian of (4.4) in the discrete domain,

\[ \dot{M} = [\dot{k}_0, \dot{l}_0, \dot{m}_x]^T = J_D \, \nu, \quad (4.13) \]

where the observations are expressed relative to the LF centre by re-centring k0 as k0 + h35/h33 and l0 as l0 + h45/h44.

From (4.10), we can relate the discrete and continuous-domain observations as

\[ u_0 = h_{33} k_0, \qquad v_0 = h_{44} l_0, \qquad w = \frac{h_{33}}{h_{11}} m_x = \frac{h_{44}}{h_{22}} m_y, \quad (4.14) \]

from which it is trivial to express the derivatives of the discrete observation in terms of the continuous variables:

\[ \dot{k}_0 = h_{33}^{-1} \dot{u}_0, \qquad \dot{l}_0 = h_{44}^{-1} \dot{v}_0, \qquad \dot{m}_x = \frac{h_{11}}{h_{33}} \dot{w}, \qquad \dot{m}_y = \frac{h_{22}}{h_{44}} \dot{w}. \quad (4.15) \]


Substituting the continuous-domain derivatives of (4.4) and (4.9), and the discrete/continuous relationships of (4.14), into (4.15) allows us to factor out the discrete-domain Jacobian

\[ J_D = \begin{bmatrix}
\frac{m_x}{h_{11}} & 0 & -\frac{h_{33}}{h_{11}} \frac{k_0 m_x}{D} & \frac{h_{44} k_0 l_0}{D} & -\frac{h_{33} k_0^2}{D} - \frac{D}{h_{33}} & \frac{h_{44}}{h_{33}} l_0 \\
0 & \frac{m_y}{h_{22}} & -\frac{h_{44}}{h_{22}} \frac{l_0 m_y}{D} & \frac{h_{44} l_0^2}{D} + \frac{D}{h_{44}} & -\frac{h_{33} k_0 l_0}{D} & -\frac{h_{33}}{h_{44}} k_0 \\
0 & 0 & -\frac{h_{33}}{h_{11}} \frac{m_x^2}{D} & \frac{h_{44} l_0 m_x}{D} & -\frac{h_{33} k_0 m_x}{D} & 0
\end{bmatrix}. \quad (4.16) \]

4.5 Implementation & Experimental Setup

In this section, we discuss the implementation details of our LF-IBVS approach, including

how we exploit the LF structure for feature matching and correspondence. We then validate our

proposed derivation of LF-IBVS using closed-loop control and the experimental setup described

below.

4.5.1 Light-Field Features

To our knowledge, all prior work on LF features operates by applying 2D feature detectors to 2D slices in the u, v dimensions [Johannsen et al., 2015]. In this chapter, we do the same. Our

implementation employs Speeded-Up Robust Features (SURF) [Bay et al., 2008], though the

proposed method is agnostic to feature type. However, as a first step towards truly 4D features,

we augment the 2D feature location with the local light-field slope, implicitly encoding depth.

Operating on 2D slices of the LF, feature matches are found between the central view and

all other sub-images. Each pair of matched 2D features is treated as a potential 4D feature. A

single feature pair yields a slope estimate, which defines an expected feature location in all other

sub-images. We introduce a tunable constant that determines the maximum distance between

observed and expected feature locations, in pixels, and reject all matches exceeding this limit.

We also reject features that break the point-plane correspondence discussed in Section 4.3. By


selecting only features that adhere to the planar relationship (4.1), we can remove spurious and

inconsistent detections.

A second constant NMIN imposes the minimum number of sub-images in which feature matches

must be found. In the absence of occlusions, this can be set to require feature matches in all

sub-images. Any feature that is below the maximum distance criterion in at least NMIN images is

accepted as a 4D feature, and a mean slope estimate is formed based on all passing sub-images.

NMIN was set to 4 out of 8 sub-image matches for our experiments.
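A simplified MATLAB sketch of this consistency test is shown below. It uses a single least-squares slope over both baseline directions (i.e. it assumes square pixels and equal horizontal and vertical view spacing, so that mx and my coincide), whereas the implementation keeps mx and my separate; the function and variable names are illustrative.

```matlab
function [accepted, slope] = validateLFFeature(uvCentral, uvByView, ...
                                               dViewByView, maxDistPix, nMin)
% Accept a candidate 4D feature if its matches across sub-images follow the
% linear (Lambertian) parallax model of (4.1), i.e. a common slope, in at
% least nMin sub-images.
%   uvCentral   : 2x1 feature location in the central view [pix]
%   uvByView    : 2xN matched locations in the other sub-images [pix]
%   dViewByView : 2xN view-index offsets of those sub-images from the centre
nViews = size(uvByView, 2);

% Least-squares slope [pix/view] over all putative matches
d     = uvByView - repmat(uvCentral, 1, nViews);    % observed pixel shifts
slope = (dViewByView(:)' * d(:)) / (dViewByView(:)' * dViewByView(:));

% Expected locations under that slope, and per-view distances to them
expected   = repmat(uvCentral, 1, nViews) + slope * dViewByView;
distances  = sqrt(sum((uvByView - expected).^2, 1));
consistent = distances <= maxDistPix;

accepted = nnz(consistent) >= nMin;
if accepted
    % Re-estimate the slope using only the consistent sub-images
    dc = d(:, consistent);  vc = dViewByView(:, consistent);
    slope = (vc(:)' * dc(:)) / (vc(:)' * vc(:));
end
end
```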

Feature matching between two LFs again starts with conventional 2D methods. A conventional

2D feature match finds putative correspondences between the central sub-images of the two

LFs. Outlier rejection is performed using the M-estimator Sample Consensus algorithm [Torr

and Zisserman, 2000].

4.5.2 Mirror-Based Light-Field Camera Adapter

There is a scarcity of commercially available LF cameras appropriate for robotics applications.

Notably, no commercial camera delivers 4D LFs at video frame rates. Therefore, we constructed

our own LF video camera, the MirrorCam, by employing a mirror-based adapter based on pre-

vious work [Fuchs et al., 2013,Song et al., 2015]. The MirrorCam is depicted in Fig. 4.1a. The

MirrorCam design, optimisation, construction, calibration, and image decoding processes are

described in Appendix A [Tsai et al., 2016]. This approach splits the camera's field of view

into sub-images using an array of planar mirrors, as shown in Fig. 4.1c. By appropriately posi-

tioning the mirrors, a grid of virtual views with overlapping fields of view can be constructed,

effectively capturing an LF. We 3D-printed the mount based on our optimization, and populated

it with laser-cut acrylic mirrors. Note that the LF-IBVS method described in this chapter does

not rely on this particular LF camera design, and applies to 4D LFs in general.


4.5.3 Control Loop

The proposed LF-IBVS control loop is depicted in Fig. 4.2. Notably, this control loop is similar

to that of standard VS. Goal light-field features f ∗ ∈ R3 are compared to observed light-field

features f ∈ R3 to produce a light-field feature error. The camera spatial velocity ν can then be

calculated as in (3.6) by multiplying the light-field feature error with the pseudo-inverse of the

stacked image Jacobians and then multiplying it by a gain λ.

Velocity control is formulated in (3.6). We assume infinitesimal motion to convert ν into a

homogeneous transform cT that we use to update the camera’s pose. A motion controller moves

the robot arm. After finishing the motion, a new light field is taken and the feedback loop repeats

until the light-field feature error converges to zero.
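One iteration of this loop can be sketched as follows (a MATLAB-level sketch only; grabLightField, extractLFFeatures and moveRobot are placeholders for the camera, feature-extraction and robot interfaces, fStar holds the goal features, and the standard IBVS sign convention is assumed):

```matlab
lambda  = 0.1;                      % control gain
epsilon = 0.5;                      % RMS feature-error threshold

while true
    LF = grabLightField();                        % capture and decode an LF
    [f, Jstack] = extractLFFeatures(LF, fStar);   % features + stacked Jacobian
    e = f - fStar;                                % light-field feature error
    if sqrt(mean(e.^2)) < epsilon                 % terminal condition (RMS)
        break;
    end
    nu = -lambda * pinv(Jstack) * e;              % camera spatial velocity
    moveRobot(nu);                                % execute one motion step
end
```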

An important consideration in LF-IBVS is the feature representation, because the choice of

feature representation in IBVS influences the Cartesian motion of the camera [Mahony et al.,

2002]. We have the option of computing the 3D positions of the points obtained from the

LF; however, this would be very similar to PBVS. Instead, we chose to work more closely with the native LF representation, using the projected feature position augmented by slope. Doing so avoids unnecessary computation, and is more numerically stable, since computing depth involves inverting the slope.


Figure 4.2: The control loop for the VS system. Goal features f ∗ are given. Then f ∗ and f are

compared, the J+ is computed, and camera velocity ν is determined with gain λ and converted

into a motion cT . A motion controller moves the robot arm. After finishing the motion, a new

image is taken and the feedback loop repeats until image features match.


We define the terminal condition for LF-IBVS as a threshold on the root mean square (RMS)

error between all of the observed LF features and the goal LF features. We combine all elements of M,

and note that (u0, v0) are in meters, and (k0, l0) are in pixels, but the slope w is unit-less. This

issue can be addressed by weighting the components; however, for the discrete case, in practice

we found that mx and my had similar relative magnitudes. The relative magnitudes of the light-

field feature elements are important because they define the error term, which in turn drives

the system to minimise light-field feature error. Extremely large magnitudes for slope could

potentially place more emphasis on more z-axis or depth-related camera motion, compared to

x- or y-axis camera motion. Additionally, we typically use a small λ of 0.1 in order to generate

a smooth trajectory towards the goal view.

For the robotic manipulator, we found that the manufacturer’s built-in inverse kinematics soft-

ware became unresponsive for small pose adjustments1. Therefore we implemented a resolved-

rate motion control method using a manipulator Jacobian to map commanded camera spatial velocities to desired joint velocities [Corke, 2013]. We also changed the proportional, integral and deriva-

tive controller gains for all joints to KP = 2.0, KI = 4.8, and KD = 0.0, respectively. With

these implementations, we achieved sufficient positional accuracy and resolution to demonstrate

LF-IBVS.
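The resolved-rate step amounts to mapping the commanded camera spatial velocity through the manipulator Jacobian, roughly as in the sketch below (manipulatorJacobian is a placeholder for the arm's geometric Jacobian expressed in the camera frame; q and nu are the current joint angles and the commanded camera velocity):

```matlab
dt   = 0.1;                         % control period [s]
J    = manipulatorJacobian(q);      % 6xN manipulator Jacobian at joints q
qdot = pinv(J) * nu;                % joint velocities realising nu
q    = q + qdot * dt;               % commanded joint update for this step
```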

4.6 Experimental Results

In this section, we present our experimental results from camera array simulation and arm-

mounted experiments using a custom mirror-based light-field camera. First, we show LF camera

and light-field feature trajectories over the sequence of a typical visual servoing manoeuvre in

simulation. Second, we compare LF-IBVS to monocular and stereo IBVS in a typical unoc-

cluded scene. Finally, we compare the same three VS systems in an occluded scene.

1Limits were determined experimentally and confirmed by the manufacturer.


4.6.1 Camera Array Simulation

In order to verify our LF-IBVS algorithm, we first simulated a 3 × 3 array of pinhole camera

models from the Machine Vision Toolbox [Corke, 2013]. Four planar world points in 3D were

projected into the image planes of the 9 cameras. A typical example of LF-IBVS is shown in

Fig. 4.3. For this example, a small gain λ = 0.1 was used to enforce small steps and produce

smooth plots as shown in Fig. 4.3a. The Cartesian positions and orientations relative to the goal

pose converge smoothly to zero, as shown in Fig. 4.3b. Similarly, the camera velocity profiles in

Fig. 4.3c converge to zero. Fig. 4.3d shows the image Jacobian condition number first increases,

and then decreases to a constant lower value, indicating that the Jacobian becomes worse and

then better conditioned, as the features move closer and then further apart, respectively. To-

gether, these figures show the system converges, indicating that LF-IBVS was successful in

simulation. Similar to conventional IBVS, a large λ results in a faster convergence, but a less

smooth trajectory.

Fig. 4.4a shows the view of the central camera, and the image feature paths as the camera array

servos to the goal view. We see that the image feature paths are almost straight due to the

linearisation of the Jacobian. Fig. 4.4b shows the trajectories of the top-left corner of the target

relative to the goal features, which also converge to zero. We note the slope profile matches the

inverse of the z-position profile in the top red line of Fig. 4.3b, as it encodes depth.

For large initial angular displacements, we note that like regular IBVS, this formulation of

LF-IBVS exhibited camera retreat issues. Instead of taking the straight-forward screw motion

towards the goal, the camera retreats backwards, before moving forwards to reach the goal view.

In these situations, the Jacobian linearisation is no longer valid, since the image feature paths

are optimally curved, rather than linear. This poses a performance issue because in real systems,

such backwards manoeuvres may not be feasible; however, retreat can be addressed [Corke and

Hutchinson, 2001] by decoupling the translation components from the z−axis rotation into two

separate image Jacobians, and will be considered in future work.


[Figure 4.3 comprises four plots against time steps: (a) feature error [pix]; (b) position [m] and orientation [deg] relative to the goal; (c) Cartesian velocity [m/s, rad/s]; (d) image Jacobian condition number.]

Figure 4.3: Simulation of LF-IBVS, with (a) error (RMS of f − f ∗) decreasing over time,

(b) camera motion profiles relative to the goal pose, (c) Cartesian velocities, and (d) image

Jacobian’s condition number for λ = 0.1. Ideally, the condition number is low (or decreases

over time), which means the system is well-conditioned and therefore less sensitive to changes

or errors in the input. Error, relative pose and velocities all converge to zero.


[Figure 4.4 comprises two plots: (a) the central view with image feature paths in u–v pixel coordinates; (b) position error [pix] (k0, l0) and slope error [pix/pix] (mx) against time steps.]

Figure 4.4: Simulation of view (a) of the initial target points (blue), servoing along the image

plane feature paths (green) to the target goal (red), and (b) the feature trajectory profile of

M −M ∗, corresponding to the top left corner of the target, which converges to zero.

4.6.2 Arm-Mounted MirrorCam Experiments

We also validated LF-IBVS using the MirrorCam mounted to the end of a Kinova MICO arm

robot, shown in Fig. 4.1a. The robot arm and camera were controlled using the architecture

outlined in Fig. 4.2. We then performed two experiments for M-IBVS, stereo image-based

visual servoing (S-IBVS) and LF-IBVS. The first involved a typical approach manoeuvre in a

Lambertian scene to evaluate the nominal performance of our LF-IBVS system. The second

involved adding occlusions after the goal image/light field was captured in order to explore the

effect of occlusions on LF-IBVS.

4.6.2.1 Lambertian Scene Experiment

We first tested the MirrorCam on a scene similar to Fig. 4.1b, with complex motion involving

all 6 DOF from the initial pose in a Lambertian scene. In a typical VS sequence, we move the

robot to a goal pose, record the camera pose and goal features, then move the robot to an initial

pose and use the features to servo back to the goal.


Fig. 4.5 shows the performance of our LF-IBVS algorithm for the scene with λ = 0.15. Fig. 4.5a

shows the error decreasing over time as the camera approaches the goal view, and converges

after 20 time steps. We attribute the non-zero error to the arm’s limited performance, which we

address at the end of this section. Fig. 4.5b shows the relative pose of the camera to the goal in

the camera frame converging smoothly to zero. Note that the goal pose is never the objective of

LF-IBVS; rather, the image features captured at the goal pose drive LF-IBVS. Fig. 4.5c shows

the commanded camera velocities also converge to zero. Fig. 4.5d shows the condition number

for the image Jacobian, which decreases slightly as the system converges. We also note that

despite only an approximate camera-to-end-effector calibration, the system converged, which suggests that the system is robust to modelling errors.

We compared LF-IBVS against conventional M-IBVS and S-IBVS. Using the sub-images from

the MirrorCam in Fig. 4.1c, we used the view through the central mirror for M-IBVS, and the

two horizontally-adjacent views to the centre from the MirrorCam for S-IBVS. This maintained

the same FOV and pixel resolution. Implementations were based on [Corke, 2013, Chaumette

and Hutchinson, 2006]. The average scene depth was provided for M-IBVS and S-IBVS to

compute the Jacobian, although we note that depth, or disparity, can be measured directly from

stereo. All three IBVS methods were tested ten times on the same goal scene and initial pose.

A typical case for S-IBVS is shown in Fig. 4.6. The image feature error is not uniformly

decreasing at the start, but eventually converges after 25 time steps. The camera moves in an

erratic motion at the start in the x- and y-axes, but still manages to converge to the goal pose,

as seen in the relative pose trajectories and camera velocities in Fig. 4.6b and 4.6c. This is

probably not because λ was too high for S-IBVS; we tested smaller gains for S-IBVS, but they yielded the same poor performance.

Instead, we observe that the S-IBVS Jacobian condition number in Fig. 4.6d was an order of

magnitude higher than LF-IBVS, producing an almost rank-deficient Jacobian; such a Jacobian

becomes an inaccurate approximation of the spatial velocities, and yields erratic motion. We


[Figure 4.5 comprises four plots against time steps: (a) feature error [pix]; (b) position [m] and orientation [deg] relative to the goal; (c) Cartesian velocity [m/s, rad/s]; (d) image Jacobian condition number.]

Figure 4.5: Experimental results of LF-IBVS with MirrorCam on the robot arm, illustrating

(a) the error (RMS of M − M ∗) that converges after 20 time steps, (b) the camera motion

profiles relative to the goal, which converge to zero, (c) the camera velocity profiles, which

converge to zero, and (d) the image Jacobian condition number. Referring also to Fig. 4.6,

we note that LF-IBVS outperforms S-IBVS; the motion profiles are much smoother, and the

velocities and condition numbers are an order of magnitude smaller than those from S-IBVS.


[Figure 4.6 comprises four plots against time steps: (a) feature error [pix]; (b) position [m] and orientation [deg] relative to the goal; (c) Cartesian velocity [m/s, rad/s]; (d) image Jacobian condition number.]

Figure 4.6: Experimental results of S-IBVS with narrow FOV sub-images from the MirrorCam,

on the robot arm, illustrating the performance in (a) the error (RMS of p − p∗) that eventually

converges after 25 time steps; however, the scale is almost double compared to Fig. 4.5.a, (b)

the camera motion profiles relative to the goal that show an erratic trajectory at the start, (c)

the camera velocity profiles that also vary greatly, and (d) the extremely large image Jacobian

condition number, indicating a potentially unstable system (it can exhibit very large changes in

camera velocity output for very small changes in image feature error).


attribute this poor performance to the narrow FOV of the MirrorCam, which is approximately 20

degrees horizontally. The narrow FOV provides little of the perspective change required to differentiate rotation from translation, particularly about the x- and y-axes, which explains the poor S-IBVS performance.

During the experiments, M-IBVS exhibited much worse performance than S-IBVS, to the extent that its erratic motion caused the robot to completely lose view of the goal scene within two

or three time steps. Therefore, M-IBVS velocity profiles are not shown in the results.

Equivalently, the projected scale of the object being servoed against affects the performance

of IBVS; smaller or more distant objects yield poorly-conditioned image Jacobians. These ob-

servations are not new or surprising [Dong et al., 2013]. LF-IBVS outperformed both of our

constrained implementations of M-IBVS and S-IBVS, as LF-IBVS converged with a smooth

trajectory regardless of the narrow FOV constraints of the MirrorCam. These improvements

were likely due to a much lower Jacobian condition number in LF-IBVS, which we attributed

to the LF camera providing the perspective change required to differentiate rotation from trans-

lation, unlike the stereo and monocular systems. Therefore, the narrow FOV constraints of the

MirrorCam can generalize to other camera systems as small or distant targets relative to the

camera, where increasing the FOV would not help the system converge to the target.

4.6.2.2 Occluded Scene Experiment

Experiments with occlusions were also conducted using a series of black wires to partially

occlude the scene. The setup is illustrated in Fig. 4.7 and 4.8. The goal, or reference image,

was captured without the occlusions at a specified goal pose. An example image is shown in

Fig. 4.8a. Next, the robot was moved to an initial pose, where the occlusions did not obscure the

scene. Then the robot was allowed to servo towards the goal, along a path where the occlusions

gradually obscured the goal view. The final goal image was partially occluded, as shown in

Fig. 4.8b. M-IBVS, S-IBVS and LF-IBVS were run using the same setup. With the partially


occluded views, M-IBVS and S-IBVS failed; whereas the LF-IBVS method servoed to the

original goal pose.

Fig. 4.9 compares the number of features matched by LF-IBVS, M-IBVS, and S-IBVS in the

occlusion experiment. Without any occlusions, we note that all three methods have a similar

number of matched features at the goal view, although stereo and mono have slightly more

matches than LF-IBVS throughout the experiment. This is likely because all 3 methods used

similar 2D feature detection methods; however, our LF-IBVS approach also rejected those fea-

tures that were inconsistent with LF geometry. In our experiment with occlusions, M-IBVS

failed at time step 5, when it was unable to match sufficient features. Similarly, the perfor-

mance of S-IBVS in our experiment quickly degraded at time step 10, as the occlusions covered

most of the left view and significant portions of the right view.

On the other hand, in the presence of occlusions, LF-IBVS had fewer matches than the un-

occluded case, but still matched a consistent and sufficient number of features throughout its

trajectory to converge. It was therefore apparent that LF-IBVS could utilize the LF camera’s

multiple views and baseline directions to handle partial occlusions. To further illustrate this,

consider a scene where a 3D point is occluded from one of the LF camera's sub-views, but still visible in at least one other of the LF camera's sub-views. A single LF is captured; thus there

is no physical camera motion. Conventional image feature matching would fail for stereo vi-

sion systems in this situation, because the 3D point is occluded from one of two views, and

therefore not viable for image matching. However, our LF-camera-based method would still

able to perform matching using the other unoccluded views, provided a sufficient baseline. By

setting a minimum number of views that an image feature must be visible in (NMIN ), we have

made it harder for image features to be matched (thus there are fewer image feature matches).

Those that are matched are therefore more consistent for motion estimation applications, such

as visual servoing. Thus, our feature extension from 2D to 4D enables our method to better deal

with the presence of occlusions. Trinocular camera systems may also benefit from the occlusion

tolerance that we demonstrated in Fig. 4.9 (albeit far less tolerance due to significantly fewer



Figure 4.7: Occlusion experimental setup, showing the initial view of the scene (red) with no

occlusions, the camera trajectory that gradually becomes more occluded, and converging to the

goal view with partial occlusions (green).


Figure 4.8: Occlusion experiments showing (a) the goal view with no occlusions from the

MirrorCam, and (b) the goal view, partially occluded by a box of black wires. The arm was able

to reach the partially-occluded goal view using LF-IBVS, but not M-IBVS or S-IBVS. Images

shown are flipped vertically.


views—three compared to n × n views, where n is typically three or greater), but would lack

tolerance to specular highlights and other non-Lambertian surfaces as discussed in Table 4.1.

4.7 Conclusions

In this chapter, we have proposed the first derivation, implementation, and validation of light-

field image-based visual servoing. We have derived the image Jacobian for LF-IBVS based on a

LF feature representation that is augmented by the local light-field slope. We have exploited the

LF in our feature detection, correspondence, and matching processes. Using a basic VS control

loop, we have shown in simulation and on a robotic platform that LF-IBVS is viable for con-

trolling robot motion. Further research into alternative feature types and Jacobian decoupling

strategies may address camera retreat and improve the performance of LF-IBVS.

Our implementation, running as unoptimized MATLAB code, takes 5 seconds per frame. The

decoding and correspondence processes are the current bottlenecks. Through optimization,

real-time LF-IBVS should be possible.

Our experimental results demonstrate that LF-IBVS is more tolerant than monocular and stereo

methods to narrow FOV constraints and partially-occluded scenes. Robotic applications op-

erating in narrow, constrained and occluded environments, or those aimed at small or distant targets, such as household grasping, medical robotics and in-orbit satellite servicing, would benefit from LF-IBVS. In future work, we will investigate other LF camera systems, how to

further exploit the 4D nature of the light-field features, and explore LF-IBVS in the context of

refractive objects, where the method should benefit significantly from the light field.


[Figure 4.9 plots the number of matched features against time steps for LF, Stereo and Mono, each with and without occlusions.]

Figure 4.9: Experimental results for number of features matched over time with occlusions

(dashed), and without (solid), for LF-IBVS (red), S-IBVS (blue), and M-IBVS (black). Both

monocular and stereo methods fail at time steps 5 and 10, respectively, but LF-IBVS maintains

enough feature matches to converge to the goal pose, which demonstrates that LF-IBVS is more

robust to occlusions.


Chapter 5

Distinguishing Refracted Image Features with Application to Structure from Motion

Robots for the real world will inevitably have to perceive, grasp and manipulate refractive ob-

jects. However, refractive objects are particularly challenging for robots because these objects

are difficult to perceive—they are often transparent and their appearance is essentially a dis-

torted view of the background, which can change significantly with respect to small changes

in viewpoint. The amount of distortion depends on the scene geometry, as well as the shape

and refractive indices of the objects involved. As the robot approaches the refractive object, the

refracted background can move differently compared to the rest of the non-refracted scene. In-

tuitively, the key to detecting refractive objects is to understand and characterise the background

distortion caused by the refractive object.

This chapter is concerned with discriminating the appearance of features whose image features

have been distorted by a refractive object—refracted image features—from the surrounding

Lambertian features. This is because robots will need to reliably operate in scenes with re-


fractive objects in a variety of applications. Unfortunately, refractive objects can cause many

robotic vision algorithms, such as structure from motion (SfM), to become unreliable or even

fail. This is because these algorithms all assume a Lambertian world, and cannot identify and exclude refracted image features when estimating structure and motion.

Outlier rejection methods such as RANSAC have been used to remove refracted image features

(outliers compared to the perceived relative motion of the robot) when the number of refracted

image features is small relative to the number of Lambertian image features. However, there

is a trade-off between computation and robustness when dealing with outlier-rich image feature

sets1. More computation is required to deal with increasingly outlier-rich image feature sets.

With limited computation, outlier rejection may return a sub-optimal inlier set, potentially lead-

ing to failure of the robotic vision system. Therefore, starting with a higher-quality set of image

features for applications such as SfM are preferred to reduce computation, power consumption

and probability of failure.

In this chapter, we propose a novel method to distinguish between refracted and Lambertian im-

age features using a light-field camera. For light-field cameras with baselines that are large relative to the refractive object, to which previous refracted feature detection methods have been limited, our method achieves state-of-the-art performance. We extend these capabilities to light-field cameras with

much smaller baselines than previously considered, where we achieve up to 50% higher re-

fracted feature detection rates. Specifically, we propose to use textural cross-correlation to

characterise apparent feature motion in a single LF, and compare this motion to its Lambertian

equivalent based on 4D light-field geometry.

1Outliers are by definition samples that significantly differ from other observations; normally they appear with

low probability at the far end of distributions. Thus the term “outlier-rich” may appear contradictory as this would

imply another distribution, where many of the image feature points removed do not follow an assumed distribution.

However, by “outlier-rich”, we imply that there is a much higher concentration of outliers than normal. In our

context, we might obtain an outlier-rich image feature set when the cameras’ views are dominated by a refractive

object, such that a large number of image features are refracted, and only a few image features follow a consistent

(Lambertian) image motion within the light field, or due to the robot’s own motion.


We show the effectiveness of our discriminator in the application of structure from motion (SfM)

when reconstructing scenes containing a refractive object, such as Fig. 5.1. Structure from

motion is a technique to recover both scene structure and camera pose from 2D images, and

is widely applicable to many systems in computer and robotic vision [Hartley and Zisserman,

2003, Wei et al., 2013]. Many of these systems assume the scene is Lambertian, in that a

3D point’s appearance in an image does not change significantly with viewpoint. However,

non-Lambertian effects, including specular reflections, occlusions, and refraction, violate this

assumption, which can cause these systems to become unreliable or even fail. We demonstrate

that rejecting refracted features using our discriminator yields lower reprojection error, lower

failure rates, and more accurate pose estimates when the robot is approaching refractive objects.

Our method is a critical step towards allowing robots to operate in the presence of refractive

objects. This work has been published in [Tsai et al., 2019].

Figure 5.1: (Left) A LF camera mounted on a robot arm was used to distinguish refracted

features in a scene in SfM experiments. (Right) SIFT features that were distinguished as Lam-

bertian (blue) and refracted (red), revealing the presence of the refractive cylinder in the middle

of the scene.

In this chapter, our main contributions are the following.

• We extend previous work to develop a light-field feature discriminator for refractive

objects. In particular, we detect the differences between the apparent motion of non-

Lambertian and Lambertian features in the 4D LF to distinguish refractive objects more

reliably than previous work.


• We propose a novel approach to describe the apparent motion of a feature observed within

the 4D light-field based on textural cross-correlation.

• We extend refracted feature distinguishing capabilities to lenslet-based LF cameras that

are limited to much smaller baselines by considering non-Lambertian apparent motion in

the LF. All LFs captured for these experiments are available at

https://tinyurl.com/LFRefractive.

• We show that by distinguishing and rejecting refracted features with our discriminator,

SfM performs better in scenes that include refractive objects.

The main limitation of our method is that it requires background visual texture to be distorted by

the refractive object. Our method’s effectiveness depends on the extent to which the appearance

of the object is warped in the LF. This depends on the geometry, shape, and the refractive indices

of the object involved.

The remainder of this chapter is organized as follows. Section 5.1 describes the related work and

Section 5.2 provides background on LF geometry. In Section 5.3, we explain our method for

discriminating refracted features in the LF. We show our experimental results for detection with

different LF cameras, and validation in the context of monocular SfM in Section 5.4. Lastly

in Section 5.5, we conclude the chapter and explore future work for the detection of refracted

features.

5.1 Related Work

A variety of strategies for detecting and reconstructing refractive objects using vision have been

investigated [Ihrke et al., 2010a]. For example, reflectivity has been used to reconstruct refrac-

tive object shape. A single monocular camera with a light source moving to points in a square


grid has been used to densely reconstruct complex refractive objects by tracing the specular re-

flections from different, known lighting positions over multiple monocular images [Morris and

Kutulakos, 2007]. Additionally, light refracted by transparent objects tends to be polarised, and

thus a rotating polariser in front of a monocular camera has been used to reconstruct the front

surface of glass objects that faces the camera [Miyazaki and Ikeuchi, 2005], but their method

requires prior knowledge of the object’s refractive index, shape of the back surface, and illu-

mination distribution, which for a robot are not necessarily available. Refractive object shape

has also been obtained by measuring the distortion of the light field’s background light paths,

using a monocular camera image of the refractive object placed in front of a special optical

sheet and lighting system, known as a light-field probe [Wetzstein et al., 2011], but this method

also required knowledge of the initial direction of light rays emitting from a planar background.

Furthermore, many of these methods require known light sources with bulky configurations that

are impractical for robotic applications in everyday environments.

Recent work has been aimed at finding refractive objects within a single monocular image.

SIFT features and a learning-based approach have been used to detect refractive objects [Fritz

et al., 2009]. They trained a linear, binary support-vector machine to classify glasses versus a

Lambertian background. Their approach required many hand-labelled training images from a

variety of refractive objects under different lighting environments and backgrounds, and only

returned a bounding box, providing little to no insight into the nature of the refractive object

itself. Monocular image sequences from moving cameras have been used to recover refractive

object shape and pose [Ben-Ezra and Nayar, 2003]; however, image feature correspondence

was established manually throughout camera motion, emphasizing the difficulty of automati-

cally identifying and tracking refracted image features due to the severe magnification of the

background and image distortion from the object.

LFs have been used to obtain better depth maps for Lambertian and occluded scenes [Johannsen

et al., 2017]; however, their depth estimation performance suffers for refractive objects. Jachnik et al. considered using light fields to estimate scene lighting configurations and then remove


specular reflections from images of planar surfaces [Jachnik et al., 2012]. Tao et al. recently

applied a similar concept using LF cameras to simultaneously estimate depth and remove spec-

ular reflections from more general 3D surfaces (not limited to planar scenes) [Tao et al., 2016].

Wanner et al. recently considered planar refractive objects and reconstructed two different depth

layers [Wanner and Golduecke, 2013]. For example, their method provided the depth of a thin

sheet of frosted glass and the depth of the background Lambertian scene. In another example,

their method provided the depth of a reflective mirror and the apparent depth of the reflected

scene. However, this work was limited to thin planar surfaces and single reflections. Although

our work does not determine the dense structure of the refractive object, our approach can dis-

tinguish image features from objects that significantly distort the LF.

Refractive object recognition is the problem of finding or identifying a refractive object from vi-

sion. In this area, Maeno et al. proposed a light-field distortion feature (LFD), which models an

object’s refraction pattern as image distortion based on differences in the corresponding image

points between the multiple views encoded within the LF, captured by a large-baseline (relative

to the refractive object) LF camera array [Maeno et al., 2013]. Several LFDs were combined

in a bag-of-words representation for a single refractive object. However, the authors observed

significantly degraded recognition performance due to specular reflections, as well as changes in

camera pose. Xu et al. used the LFD as a basis for refractive object image segmentation [Xu

et al., 2015]. Corresponding image features from all views in the LF (s, t, u, v) were fitted to the

normal of a 4D hyperplane using singular value decomposition (SVD). The smallest singular

value was taken as a measure of error to the hyperplane of best fit, for which a threshold was

applied to distinguish refracted image features. However, we will show that a 3D point cannot

be described by a single hyperplane in 4D. Instead, it manifests as a plane in 4D that has two

orthogonal normal vectors. Our approach builds on Xu’s method and solves for both normals to

find the plane of best fit in 4D; thus allowing us to discriminate refractive objects more reliably.

A key difficulty in image feature-based approaches in the LF is obtaining the corresponding

image feature locations between multiple views. It is possible to use traditional multi-view ge-


ometry approaches for image feature correspondence, such as epipolar geometry, optical flow

and RANSAC. In fact, both Maeno and Xu used optical flow between two views for correspon-

dence. However, these approaches do not exploit the unique geometry of the LF, which can

lead to algorithmic simplifications or reduced computational complexity [Dansereau, 2014].

We propose a novel textural cross-correlation method to associate image features in the LF by

describing their apparent motion in the LF, which we refer to as image feature curves. This

method directly exploits LF geometry and provides insight on the 4D nature of image features

in the LF.

Our interest in LF cameras stems from robot applications that often have mass, power and

size constraints. Thus, we are interested in employing compact lenslet-based LF cameras to

deal with refractive objects. However, most previous works have utilized gantries [Wanner and

Golduecke, 2013], or large camera arrays [Maeno et al., 2013, Xu et al., 2015]; their results

do not reliably transfer to LF cameras with much smaller baselines, where distortion is less

apparent, as we show later. We demonstrate the performance of our method using two different

LF camera architectures with different baselines. Ours is the first method, to our knowledge,

capable of identifying refracted features (RFs) using lenslet-based LF cameras.

For LF cameras, LF-specific image features have been investigated. SIFT features augmented

with “slope”, an LF-based property related to depth, were proposed by the author of this thesis

for visual servoing using an LF camera [Tsai et al., 2017]; however, in Chapter 4, refractive ob-

jects were not considered. Ghasemi proposed a scale-invariant global image feature descriptor

based on a modified Hough transform [Ghasemi and Vetterli, 2014]; however, we are interested

in local image features whose positions encode the distortion observed in the refracted back-

ground. More recently, Tosic developed a scale-invariant, single-pixel-edge detector by finding

local extrema in a combined scale, depth, and image space [Tosic and Berkner, 2014]. However,

these LF image features did not differentiate between Lambertian and refracted image features,

nor were they designed for reliable matching between LFs captured from different viewpoints.


Recent work by Teixeira et al. projected SIFT features found in all views into their correspond-

ing epipolar plane images (EPIs). Example EPIs are shown in Fig. 2.15. These projections were

filtered and grouped onto straight lines in their respective EPIs and then counted. Features with

higher counts were observed in more views, and thus considered to be more reliable Lambertian

image features [Teixeira et al., 2017]. However, this approach only looked for SIFT features

that were consistently projected on top of lines in their respective EPIs and intentionally filtered

out any nonlinear image feature behaviour; thus, this approach did not consider any nonlinear

image feature behaviour, while our method aims to detect these non-Lambertian image features,

and is focused on characterising them. Clearly, there is a gap in the literature for identifying

and characterising refracted image features using LF cameras.

In this chapter, we detect unique image features that allow us to reject distorted content and

work well for SfM. This could be useful for many other common feature-based algorithms,

including recognition, segmentation, visual servoing, simultaneous localization and mapping,

visual odometry, and SfM, making these algorithms more robust to the presence of refractive

objects. We are interested in exploring the impact of our refracted image feature discriminator

in an SfM framework. While there has been significant development in SfM in recent years for

conventional monocular and stereo cameras [Wei et al., 2013], Johannsen et al. were the first to

consider LFs in the SfM framework [Johannsen et al., 2015]. Although our work does not yet

explore LF-based SfM, we investigate SfM’s performance with respect to RFs, which has not

yet been fully explored. We show that rejecting RFs reduces reprojection error and failure rate

near refractive objects, improving camera pose estimates.

5.2 Lambertian Points in the Light Field

In this section, we provide a brief reminder on the LF geometry background provided in Sec-

tion 2.7.3; however, we have re-written (2.34) and (2.35) in the context of our refracted im-

age feature discriminator. Using the two-plane parameterisation, a ray φ can be described by


φ = [s, t, u, v]T ∈ R4. A Lambertian point in space P = [Px, Py, Pz]T ∈ R3 emits rays in many directions, which follow a linear relationship

\[ \begin{bmatrix} u \\ v \end{bmatrix} = \frac{D}{P_z} \begin{bmatrix} P_x - s \\ P_y - t \end{bmatrix}, \quad (5.1) \]

where each row describes a hyperplane in 4D. A hyperplane in 4D is a 3D manifold and can be

described by a single equation

\[ n_1 s + n_2 t + n_3 u + n_4 v + n_5 = 0, \quad (5.2) \]

where n = [n1, n2, n3, n4]T is the normal of the hyperplane. A plane is defined as a 2D manifold

and can be spanned by two linearly-independent vectors. In 4D, a plane can be described by the

intersection of two 4D hyperplanes

\[ n_1 s + n_2 t + n_3 u + n_4 v + n_5 = 0, \quad (5.3) \]
\[ m_1 s + m_2 t + m_3 u + m_4 v + m_5 = 0, \quad (5.4) \]

where m is the normal of a second hyperplane in 4D. Equations (5.3) and (5.4) can be written

in matrix form,

\[ \begin{bmatrix} n_1 & n_2 & n_3 & n_4 \\ m_1 & m_2 & m_3 & m_4 \end{bmatrix} \begin{bmatrix} s \\ t \\ u \\ v \end{bmatrix} = \begin{bmatrix} -n_5 \\ -m_5 \end{bmatrix}. \quad (5.5) \]


Equation (5.1) can be written in a form similar to (5.5) as

\[ \underbrace{\begin{bmatrix} \frac{D}{P_z} & 0 & 1 & 0 \\ 0 & \frac{D}{P_z} & 0 & 1 \end{bmatrix}}_{N} \begin{bmatrix} s \\ t \\ u \\ v \end{bmatrix} = \begin{bmatrix} \frac{D P_x}{P_z} \\ \frac{D P_y}{P_z} \end{bmatrix}, \quad (5.6) \]

where N contains the two linearly-independent normals to the plane in 4D. The plane is defined

as the set of all s, t, u, v that follow (5.6). Therefore, a Lambertian point in 3D induces a plane

in 4D, which is characterised by two linearly-independent normal vectors that each define a

hyperplane in 4D. In the literature, this relationship is sometimes referred to as the point-plane

correspondence, as discussed in Section 2.7.3, because a point in 3D corresponds to a plane in

4D.

The light-field slope w relates the rate of change of image plane coordinates, with respect to

viewpoint position, for all rays emitting from a point in the scene. In the literature, slope is

sometimes referred to as “orientation” [Wanner and Golduecke, 2013], and other works com-

pute slope as an angle [Tosic and Berkner, 2014]. We recall that the slope comes directly from

(2.43) as

\[ w = -D/P_z, \quad (5.7) \]

and is clearly related to depth.
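As a sketch of how (5.1) and (5.6) can serve as a consistency test, the following MATLAB function fits a common slope to a set of ray observations of one feature and reports the RMS deviation from the Lambertian point-plane model (the input format and the least-squares formulation are illustrative; the method developed in Section 5.3 characterises the apparent motion differently, via textural cross-correlation):

```matlab
function [residual, slope] = pointPlaneResidual(rays)
% rays is an M x 4 matrix of [s t u v] samples of one feature across views.
s = rays(:,1);  t = rays(:,2);  u = rays(:,3);  v = rays(:,4);
M = size(rays, 1);

% For a Lambertian point, u = u0 + w*s and v = v0 + w*t with a common slope w
% (eq. 5.1). Fit [w; u0; v0] jointly by linear least squares.
A = [s, ones(M,1), zeros(M,1);
     t, zeros(M,1), ones(M,1)];
b = [u; v];
x = A \ b;                                   % x = [w; u0; v0]
slope    = x(1);
residual = norm(A*x - b) / sqrt(2*M);        % RMS deviation from the 4D plane

% Rays from refracted (or otherwise non-Lambertian) features violate (5.1)
% and therefore produce large residuals.
end
```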

5.3 Distinguishing Refracted Image Features

Epipolar Plane Images (EPIs) graphically illustrate the apparent motion of a feature across

multiple views. If the entire light field is given as L(s, t, u, v) ∈ R4, the central view is an

image I(u, v) = L(s0, t0, u, v), and is equivalent to what a monocular camera would provide

from the same camera viewpoint. EPIs represent a 2D slice of the 4D LF. A horizontal EPI


is given as L(s, t∗, u, v∗), and a vertical EPI is denoted as L(s∗, t, u∗, v), where ∗ indicates a

variable is fixed while others may vary.

In practice, we construct the EPI by plotting all u pixels for view s, as illustrated in Fig. 2.15.

Then we plot all u pixels for view s + 1, stacking the row of pixels on top of the previous

plot, and repeating for all s. As each view is horizontally shifted by some baseline, the scene

captured by the u pixels shifts accordingly. As shown in Fig. 5.2a, image features or rays from a

Lambertian point are linearly distributed with respect to viewpoint due to the uniform sampling

of the LF camera.
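In code, extracting an EPI is a simple slicing operation on the 4D array; the MATLAB sketch below assumes a grayscale light field already decoded into an array LF indexed as LF(s, t, u, v), with the fixed indices chosen arbitrarily at the centre:

```matlab
[Ns, Nt, Nu, Nv] = size(LF);
tFix = ceil(Nt/2);  vFix = ceil(Nv/2);
sFix = ceil(Ns/2);  uFix = ceil(Nu/2);

epiH = squeeze(LF(:, tFix, :, vFix));   % horizontal EPI L(s, t*, u, v*), Ns x Nu
epiV = squeeze(LF(sFix, :, uFix, :));   % vertical EPI   L(s*, t, u*, v), Nt x Nv

% In epiH each row is the u-scanline of one view s; Lambertian points trace
% straight lines whose slope encodes depth (Section 5.2).
```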

Points with similar depths yield lines with similar slopes in the EPI. Points with different depths

yield lines with different slopes. Similar behaviour is observed considering the vertical viewing

direction along t and v. Equivalently, linear parallax motion manifests itself as straight lines for

Lambertian image features. Image features viewed through highly-distorting refractive objects follow nonlinear curves,

as illustrated in Fig. 5.2b. We can thus compare this difference in apparent motion between

Lambertian and non-Lambertian features to distinguish refracted image features.

Fig. 5.3a shows the central view and an example EPI of a crystal ball LF (large baseline) from

the New Stanford Light-Field Archive, captured by a camera array. The physical size of cameras

often necessitates larger baselines for LF capture. A Lambertian point forms a straight line in

the EPI, shown in Fig. 5.3b. The relation between slope and depth is also apparent in this EPI.

Refracted image features appear as nonlinear curves in the EPI, as seen in Fig. 5.3b. Refracted

image feature detection in the LF simplifies to finding image features that violate (5.1) via iden-

tifying nonlinear feature curves in the EPIs and/or inconsistent slopes between two linearly-

independent EPI lines (ie, EPIs sampled from two linearly-independent motions), such as the

vertical (along t) and horizontal (along s) EPIs. We note that occlusions and specular reflections

also violate (5.1), and so can potentially cause many vision algorithms to fail as well. Occlu-

sions appear as straight lines, but have intersections in the EPI, indicated in green. Edges of the refractive objects, and objects with low distortion, also appear Lambertian. Specular reflections appear as a superposition of lines in the EPI, which may be addressed in future work.

Figure 5.2: A Lambertian point emits rays of light that are captured by the LF camera. (a) Projection of the linear behaviour of a Lambertian image feature (orange), and (b) the nonlinear behaviour of a refracted image feature with respect to linear motion along the viewpoints of an LF (blue).

5.3.1 Extracting Image Feature Curves

In this section, we discuss how we extract these 4D image feature curves and how we identify

refracted image features. For a given image feature from the central view (s0, t0) at coordinates

(u0, v0), we must determine the feature correspondences (u′, v′) from the other views, which

is equivalent to finding the feature’s apparent motion in the LF. In this chapter, we start by

detecting SIFT features [Lowe, 2004] in the central view, although the proposed method is

agnostic to any scale-based image feature type.

Next, we select a template surrounding the feature which is k-times the feature’s scale. We

determined k = 5 to yield the most consistent results. 2D Gaussian-weighted normalized cross-

correlation (WNCC) is used across views to yield correlation images, such as Fig. 5.4. To

reduce computation, we only apply WNCC along the central row and column of LF views.

For Lambertian image features, peaks in the correlation space for each view correspond to the

feature's image coordinates in that view. We create another EPI by plotting the image feature's correlation response with respect to the views, which we call the correlation EPI. As illustrated in Fig. 5.4, the ridge of the correlation EPI has the same shape as the image feature curve from the original EPI.

Figure 5.3: (a) The central view of the crystal ball LF from the New Stanford Light Field Archive. A vertical EPI (b) is sampled from a column of pixels (yellow), where the nonlinear apparent motion caused by the crystal ball is seen in the middle (blue). Straight lines correspond to Lambertian features (orange). Occlusions (green) appear as intersections of straight lines.
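As a concrete illustration of this extraction step, the following Python sketch tracks a central-view feature along the central row of views using Gaussian-weighted normalised cross-correlation and takes the per-view response maximum as the ridge of the correlation EPI. It is a minimal sketch only: it assumes a greyscale LF stored as an array lf indexed as (s, t, u, v) and an integer template half-size, and the helper names (gaussian_window, wncc, horizontal_feature_curve) are illustrative rather than the implementation used in this thesis.

    import numpy as np

    def gaussian_window(h, w, sigma):
        # 2D Gaussian weights that emphasise the centre of the template.
        y, x = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
        return np.exp(-(x**2 + y**2) / (2.0 * sigma**2))

    def wncc(image, template, weights):
        # Weighted, normalised cross-correlation of the template against every
        # valid position in the image; returns a correlation response map.
        th, tw = template.shape
        t = weights * (template - template.mean())
        t_norm = np.sqrt((t**2).sum()) + 1e-12
        out = np.empty((image.shape[0] - th + 1, image.shape[1] - tw + 1))
        for r in range(out.shape[0]):
            for c in range(out.shape[1]):
                p = image[r:r + th, c:c + tw]
                p = weights * (p - p.mean())
                out[r, c] = (p * t).sum() / (np.sqrt((p**2).sum()) * t_norm + 1e-12)
        return out

    def horizontal_feature_curve(lf, s0, t0, u0, v0, half):
        # Track the feature at (u0, v0) in the central view (s0, t0) along the
        # central row of views (t = t0), returning one u coordinate per view s.
        template = lf[s0, t0, u0 - half:u0 + half + 1, v0 - half:v0 + half + 1]
        weights = gaussian_window(*template.shape, sigma=half / 2.0)
        curve_u = []
        for s in range(lf.shape[0]):
            resp = wncc(lf[s, t0], template, weights)      # correlation "view"
            u_best, _ = np.unravel_index(np.argmax(resp), resp.shape)
            curve_u.append(u_best + half)                  # undo valid-region offset
        return np.array(curve_u)

The vertical feature curve is obtained analogously by fixing s = s0 and sweeping t; in practice, restricting the search to a window around the previous view's match makes the ridge extraction faster and more robust.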

For refracted image features, we hypothesize that the distortion of the feature’s appearance

between views will not be so strong as to make the correlation response unusable. Thus, the

correlation response will be sufficiently strong that the ridge of the correlation EPI will still

correspond to the desired feature curve. Our textural cross-correlation method allows us to

focus on the image structure, as opposed to the image intensities. Our method can be applied to

any LF camera, and directly exploits the geometry of the LF.

There are many other strategies to find and match image features across a sequence of views,

such as stereo matching and optical flow. Such approaches have previously been used in LF-

related work [Maeno et al., 2013, Xu et al., 2015]. However, both stereo matching and optical

flow typically rely on pair-wise image comparisons and must therefore be iterated across other

views of the LF. It is often more efficient and robust against noise to consider all views simultaneously when attempting to characterise a trend across an image sequence. Although we only consider 2D EPIs of the LF in this chapter, we are interested in full 4D approaches to image feature extraction in future work.

Figure 5.4: The image feature curve extraction process. (Top left) The simulated horizontal views of a yellow circle and (top right) the corresponding horizontal EPI taken along the middle of the views from the green pixel row. A feature template is taken and used for textural cross-correlation (bottom left). The resulting cross-correlation response is computed and shown as the cross-correlation views for a typical scene. Yellow indicates a high response, while blue indicates a low response. The resultant correlation EPI (bottom right) is created by stacking the red pixel row of adjacent views. The ridge (yellow) along this correlation EPI corresponds to the desired, extracted image feature curve (red). Note that only 3 views are shown, but the simulated LF actually contains 9 views.

5.3.2 Fitting 4D Planarity to Image Feature Curves

For Lambertian image features, the image feature disparities are linear with respect to linear

camera translation, as in (5.7). The disparities from refracted image features deviate from this

linear relation. In this section, we explain that fitting these disparities in the least squares sense

to (5.1) yields the plane of best fit in 4D. The plane in 4D can be estimated from the image

feature curves that we extracted in the previous section. The error of the 4D planar fit provides

a measure of whether or not our image feature is Lambertian.


Similar to [Xu et al., 2015], we consider the ray passing through the central view φ0(s0, t0, u0, v0).

The corresponding ray coordinates in the view (s, t) are defined as φ(s, t, u, v). The LFD is then

defined as the set of relative differences between φ0 and φ as in [Maeno et al., 2013]:

LFD(u_0, v_0) = \{(s, t, \Delta u, \Delta v) \;|\; \forall (s, t) \neq (s_0, t_0)\},    (5.8)

where Δu = u(s, t) − u_0 and Δv = v(s, t) − v_0 are image feature disparities. We note that the LFD uses φ from all other views, (s, t) ≠ (s_0, t_0). This differs from our proposed image feature curves extracted from EPIs, which only sample views along the two orthogonal viewing directions from which the EPIs are first created; this represents a minimal sampling of the LF for discriminating refracted image features.

As discussed in Section 5.2, our plane in 4D has two linearly-independent normals, n and m.

Then considering the LFD, we compare central view ray φ0 to ray φ. Recall that each ray is

represented by a point in 4D. Substituting each coordinate into (5.3), we can write

n_1 s_0 + n_2 t_0 + n_3 u_0 + n_4 v_0 = -n_5    (5.9)

n_1 s + n_2 t + n_3 u + n_4 v = -n_5.    (5.10)

Subtracting (5.9) from (5.10) yields

n_1 s + n_2 t + n_3 \Delta u + n_4 \Delta v = 0,    (5.11)

which is expressed in terms of the LFD. Recall that s_0 = 0 and t_0 = 0.


We can write this in matrix form as

\begin{bmatrix} n_1 & n_2 & n_3 & n_4 \end{bmatrix} \begin{bmatrix} s \\ t \\ \Delta u \\ \Delta v \end{bmatrix} = 0.    (5.12)

We can estimate n by fitting rays according to

\underbrace{\begin{bmatrix} s & t & \Delta u & \Delta v \end{bmatrix}}_{N} \underbrace{\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{n} = 0.    (5.13)

Note that the constants on the right-hand side of (5.6) cancel out in (5.13) because we consider the differences relative to u_0. We also note that N is a matrix of rank one. We require a minimum of four additional rays (equivalently, four views of the image feature) relative to φ_0 to estimate n by solving the system Nn = 0.

For an LF that can be represented by an M × N camera array, we can use all MN views to

estimate n. However, to reduce the required computation involved in the image feature curve

extraction process, we can use all N views from the horizontal image feature curve, which were

extracted from the horizontal EPI, u = f_h(s; t_j, v_l − v_0). This represents the set of all values of u that follow the horizontal image feature curve as a function of s, given the constants t_j, v_l and v_0. Similarly, the vertical image feature curve can be expressed as v = f_v(t; s_i, u_k − u_0), for constant s_i, u_k and u_0.


We can substitute the image feature curve f_h into N as a set of stacked rays,

\underbrace{\begin{bmatrix} s_1 & t_j & \Delta u_1 & v_l - v_0 \\ \vdots & \vdots & \vdots & \vdots \\ s_N & t_j & \Delta u_N & v_l - v_0 \end{bmatrix}}_{N} \begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix} = \mathbf{0}.    (5.14)

The matrix N is a singular N × 4 matrix, and (5.14) is an overdetermined system. We can solve this system using SVD to estimate n in the least-squares sense.

It is interesting to note that, for a Lambertian point, each row of N can be written as a function of the other rows. The columns containing t_j and v_l − v_0 are constant, and the column s_1 to s_N is simply the increment of the horizontal viewpoints, which are linearly spaced in the LF. Using (5.1), we can write

\Delta u = u - u_0 = \frac{D}{P_z}(P_x - s) - \frac{D}{P_z}(P_x - s_0) = -\frac{D}{P_z}(s - s_0) = -\frac{D}{P_z}\Delta s    (5.15)

\frac{\Delta u}{\Delta s} = -\frac{D}{P_z}.    (5.16)

The change in u is linear with respect to changes in s, which matches our expression for LF slope in (5.7). Therefore, N has a rank of one and can only yield a single hyperplane.

However, recall that a Lambertian point can be described by two hyperplanes; (5.11), and consequently (5.12), must hold for both hyperplanes for a Lambertian point in 3D. We are interested in estimating the hyperplane normals n and m from the LFD. Therefore, we can write (5.11) in


matrix form as the 4D plane containing φ_0 and φ,

\underbrace{\begin{bmatrix} n_1 & n_2 & n_3 & n_4 \\ m_1 & m_2 & m_3 & m_4 \end{bmatrix}}_{[n \; m]^T} \begin{bmatrix} s \\ t \\ \Delta u \\ \Delta v \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},    (5.17)

where (\cdot)^T denotes the transpose. The positions for s, t can be obtained by calibration [Dansereau et al.,

2013], although the nonlinear behaviour still holds when working with uncalibrated units of

“views”. We can then write

\underbrace{\begin{bmatrix} s & t_j & \Delta u & v_l - v_0 \\ s_i & t & u_k - u_0 & \Delta v \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{n} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},    (5.18)

where the first row (s, t_j, Δu, v_l − v_0) represents a single ray from f_h, and the second row (s_i, t, u_k − u_0, Δv) represents a single ray from f_v. A is a singular matrix and has a rank of at least two. We still require a minimum of five rays to solve (5.18) (four plus φ_0). As before, we can stack all MN rays over the entire LF; however, we use a smaller set of M + N rays from f_h and f_v. The system of equations can be written

\underbrace{\begin{bmatrix} s_1 & t_j & \Delta u_1 & v_l - v_0 \\ \vdots & \vdots & \vdots & \vdots \\ s_N & t_j & \Delta u_N & v_l - v_0 \\ s_i & t_1 & u_k - u_0 & \Delta v_1 \\ \vdots & \vdots & \vdots & \vdots \\ s_i & t_M & u_k - u_0 & \Delta v_M \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} n_1 \\ n_2 \\ n_3 \\ n_4 \end{bmatrix}}_{n} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}.    (5.19)


Equation (5.19) is of the form An = 0, which is a homogeneous system of equations. Since A is an (M + N) × 4 matrix, the system is overdetermined. We use SVD to solve the system in a least-squares sense, computing the four singular vectors ξ_i and corresponding singular values λ_i, i = 1 . . . 4, where the λ_i are sorted in ascending order of magnitude.

Additionally, A has a rank of two for a Lambertian point. We can show this by following a similar argument to that for the rank of N, applied to the rays from f_v. Thus we expect two non-zero singular values and two trivial solutions for a system with no noise. With image noise and noise from the image feature curve extraction process, it is possible to get four non-zero singular values; in this case, the magnitudes of λ_1 and λ_2 are much smaller than λ_3 and λ_4. Importantly,

distortion caused by a refractive object can also cause non-zero singular values, and it is this

effect that we are primarily interested in.

The two smallest singular values, λ1 and λ2 and their corresponding singular vectors are related

to the two normals n and m that best satisfy (5.19) in the least-squares sense. The magnitude of

these singular values provides a measure of error of the planar fit. Smaller errors imply stronger

linearity, while larger errors imply that the feature deviates from the 4D plane.
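Concretely, the fit reduces to a homogeneous least-squares problem. The Python sketch below stacks the rays from the horizontal and vertical feature curves into A as in (5.19) and estimates the two normals and the two smallest singular values via the SVD. It is a minimal sketch under the assumptions of this section: s_0 = t_0 = 0, the curves are already extracted, and the EPIs pass through the feature's own row and column (v_l = v_0 and u_k = u_0), so those constant entries are zero. The function name is illustrative.

    import numpy as np

    def fit_refracted_plane(s_coords, u_curve, t_coords, v_curve, u0, v0, t_j=0.0, s_i=0.0):
        # Rows from the horizontal curve u = f_h(s): (s, t_j, u - u0, v_l - v0).
        rows_h = np.column_stack([s_coords,
                                  np.full_like(s_coords, t_j, dtype=float),
                                  u_curve - u0,
                                  np.zeros(len(s_coords))])
        # Rows from the vertical curve v = f_v(t): (s_i, t, u_k - u0, v - v0).
        rows_v = np.column_stack([np.full_like(t_coords, s_i, dtype=float),
                                  t_coords,
                                  np.zeros(len(t_coords)),
                                  v_curve - v0])
        A = np.vstack([rows_h, rows_v]).astype(float)    # (M + N) x 4 matrix of (5.19)
        _, svals, Vt = np.linalg.svd(A)                  # singular values in descending order
        lam1, lam2 = svals[-1], svals[-2]                # the two smallest singular values
        n, m = Vt[-1], Vt[-2]                            # normals of the 4D plane of best fit
        return n, m, lam1, lam2

For a noise-free Lambertian point, lam1 and lam2 are numerically zero; larger values indicate that the extracted curves deviate from a plane in 4D.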

5.3.3 Measuring Planar Consistency

From the two smallest singular values λ_1 and λ_2, we have two measures of error for the planar fit. The Euclidean norm of λ_1 and λ_2, \sqrt{\lambda_1^2 + \lambda_2^2}, may be taken as a single measure of planarity; however, doing so masks the case where λ_1 ≫ λ_2, or λ_1 ≪ λ_2. This can occur when

narity; however, doing so masks the case where λ1 ≫ λ2, or λ1 ≪ λ2. This can occur when

observing a feature through a 1D refractive object (glass cylinder) that causes severe distortion

along one direction, but relatively little along the other. Therefore, we reject those features that

have large errors in either of the two hyperplanes in a manner similar to a logical OR gate. This

planar consistency, along with the slope consistency discussed in the following section, makes the proposed method more sensitive to distorted texture than prior work that considered only the smallest singular value, which we refer to as hyperplanar consistency [Xu et al., 2015].


5.3.4 Measuring Slope Consistency

Equation (5.1) shows that a Lambertian point has a single value of slope for both hyperplanes. However, the case for a refracted image feature can be locally approximated as

\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} w_1 (P_x - s) \\ w_2 (P_y - t) \end{bmatrix},    (5.20)

where w_1 and w_2 are the two slopes for the same image feature. Each row in (5.20) is still a hyperplane in 4D. The intersection of these two hyperplanes also represents a plane in 4D, as the normals are still linearly-independent.

We are interested in the horizontal and vertical hyperplanes, which are aligned to the hori-

zontal and vertical viewpoints along t0 and s0, respectively. We can compute the slopes for

each hyperplane given their normals. For the first hyperplane, we solve for the in-plane vector

q = [q_s, q_u]^T, by taking the inner products with the two vectors n and m from (5.17) in

\begin{bmatrix} n_1 & n_3 \\ m_1 & m_3 \end{bmatrix} \begin{bmatrix} q_s \\ q_u \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},    (5.21)

where q is constrained to the s, u plane, because we choose the first and third elements of n and

m. This system is solved using SVD, and the minimum singular vector yields q. The slope for the horizontal hyperplane, w_{su}, is then w_{su} = q_s/q_u. The slope for the vertical hyperplane, w_{tv}, is similarly computed from the second and fourth elements of n and m.

We define slope consistency c as a measure of how different the slopes are between the two

hyperplanes for a given image feature. It is possible to compute this difference as

c = (w_1 − w_2)^2.    (5.22)


However, in practice, we plot the EPIs as ds/du (rather than du/ds) because there are significantly fewer views in typical LF cameras than there are pixels per view. We measure the inverse of w_1 and w_2, which can approach infinity as the lines of slope become vertical. We therefore convert each slope to an angle

σ_1 = \tan^{-1}(w_1/1), \quad σ_2 = \tan^{-1}(w_2/1).    (5.23)

We compute c as the Euclidean norm of the two slope angles,

c = \sqrt{|σ_1 − σ_2|^2}.    (5.24)

Large values of c imply a large difference in slopes between the horizontal and vertical EPIs,

which in turn implies a refractive object. Overall, image features with large planar errors and

inconsistent slopes are identified as belonging to a highly-distorting refractive object.

Two thresholds for planar consistency tplanar and slope consistency tslope are used to determine

if an image feature has been distorted. If true, we refer to it as a refracted image feature,

\text{refracted image feature} = \begin{cases} 1, & \text{if } (\lambda_1 > t_{planar}) \vee (\lambda_2 > t_{planar}) \vee (c > t_{slope}) \\ 0, & \text{otherwise,} \end{cases}    (5.25)

where ∨ is the logical OR operator. Note that our method is not limited to detecting distortion

aligned with the horizontal and vertical axes of the LF. Although not implemented in this work,

we can further check for λ1, λ2 and c along other axes by rotating the LF’s s, t, u, v frame and

repeating the check. In future work, we aim to consider all of the LF, in order to estimate this

rotation.
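To make the decision rule concrete, the Python sketch below computes the two hyperplane slopes from the estimated normals n and m, converts them to angles, and applies the thresholds of (5.25). It is a sketch only: the arctangent conversion and the function names are illustrative, and degenerate cases (such as a near-vertical EPI line, where the denominator of the slope approaches zero) are not handled.

    import numpy as np

    def hyperplane_slopes(n, m):
        # In-plane direction q of each hyperplane: solve [[n_i, n_j], [m_i, m_j]] q = 0
        # via SVD, as in (5.21); the slope is the ratio of q's components.
        def slope(i, j):
            B = np.array([[n[i], n[j]], [m[i], m[j]]], dtype=float)
            q = np.linalg.svd(B)[2][-1]    # right singular vector of the minimum singular value
            return q[0] / q[1]
        w_su = slope(0, 2)                 # horizontal (s, u) hyperplane
        w_tv = slope(1, 3)                 # vertical (t, v) hyperplane
        return w_su, w_tv

    def is_refracted(lam1, lam2, n, m, t_planar, t_slope):
        w_su, w_tv = hyperplane_slopes(n, m)
        # Slope consistency c as the difference of slope angles, per (5.23)-(5.24).
        c = abs(np.arctan(w_su) - np.arctan(w_tv))
        # Decision rule of (5.25): a large planar error in either hyperplane, OR
        # inconsistent slopes, flags a refracted image feature.
        return (lam1 > t_planar) or (lam2 > t_planar) or (c > t_slope)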


5.4 Experimental Results

In this section, we present our experimental setup for refracted image feature detection and

show how our methods extend from large-baseline LF camera arrays to small-baseline lenslet-

based LF cameras. Finally, we use our method to reject refracted image features for monocular

SfM in the presence of refractive objects, and demonstrate improved reconstruction and pose

estimates.

5.4.1 Experimental Setup

To obtain LFs captured by a camera array, we used the Stanford New Light Field Archive (available at http://lightfield.stanford.edu/lfs.html), which provided LFs captured from a gantry with a 17 × 17 grid of rectified 1024 × 1024-pixel

images that were down-sampled to 256 × 256 pixels to reduce computation. We focused on

two LFs that captured the same scene of a crystal ball surrounded by textured tarot cards. The

first LF was captured with a large baseline (16.1 mm/view over 275 mm), which exhibited

significant distortions in the LF caused by the crystal ball. The second LF was captured with

a smaller baseline (3.7 mm/view over 64 mm). This allowed us to compare the effect of LF

camera baseline for refracted image feature discrimination.

Smaller baselines were considered using a lenslet-based LF camera. These LF cameras are

of interest in robotics due to their simultaneous capture of multiple views, and typically lower

size and mass compared to LF camera arrays and gantries. In this section, the Lytro Illum

was used to capture LFs with 15 × 15 views, each 433 × 625 pixels. Dansereau’s Light-Field

Toolbox [Dansereau et al., 2013] was used to decode and rectify the LFs from raw LF imagery

to the 2PP, thereby converting the Illum into an equivalent camera array with a baseline of 1.1

mm/view over 16.6 mm. To compensate for the extreme lens distortion of the Illum, we removed

the outer views, reducing our LF to 13 × 13 views. The LF camera was fixed at a 100 mm focal length. All LFs were captured in ambient indoor lighting conditions without the need for

specialized lighting equipment. The refractive objects were placed within a textured scene in

order to create textural details for SIFT features. For repeatability, the lenslet-based camera was

mounted to the end-effector of a 6-DOF Kinova Jaco robotic manipulator, shown in Fig. 5.1.

The arm was controlled using the Robot Operating System (ROS) framework.

It is important to remember that our results depend on a number of factors. First, the geometry and refractive index of a transparent object affect its appearance: higher curvature and thickness yield more distortion. Second, the distance between the LF camera and the refractive object, as well as the distance between the refractive object and the background, directly affect how much distortion can be observed. Similarly, a larger camera baseline captures more distortion. Where possible, these factors were held constant throughout the different experiments.

5.4.2 Refracted Image Feature Discrimination with Different LF Cameras

In this section, we provide a qualitative comparison of our discrimination methods for the large-

baseline and small-baseline LF camera setups. Then we provide quantitative results over a larger

variety of LFs for our refracted image feature discriminator.

5.4.2.1 Large-Baseline LF Camera Observations

The large-baseline crystal ball LF was captured by a camera array. Lambertian image features

were captured by our textural cross-correlation approach as straight lines, while refracted image

features were captured as nonlinear curves, as shown in Fig. 5.5. We observed that while the

refracted image feature’s WNCC response was weaker compared to the Lambertian case, local

maxima were observed near the image feature’s corresponding location in the central view.


Thus, taking the local maxima of the correlation EPI yielded the desired feature curves. Our

textural cross-correlation method enables us to extract image feature curves without focusing

on image intensities.

5.4.2.2 Small-Baseline LF Camera Observations

Fig. 5.6 shows the horizontal and vertical EPIs for a refracted image feature taken from the

small-baseline crystal ball LF. The image feature curves appear straight, despite being distorted

by the crystal ball. However, we observed that the slopes were inconsistent, which could still

be used to discriminate refracted image features.

5.4.2.3 Discrimination of Refracted Image Features

To discriminate refracted image features, thresholds for planarity and slope consistency were selected by exhaustive search over a set of training LFs and evaluated on a different set of LFs, with the exception of the crystal ball LFs, where only one was available for each baseline b from the New Stanford Light Field Archive. For comparison to the state of the art, the parameter search was performed for both Xu's method and our method independently, to allow for the best performance of each method.

The ground truth refracted image features were identified via hand-drawn masks in the central

view. It was assumed that all features visible and passing through the refractive object were

distorted. Detecting a refracted image feature was considered positive, while returning a Lam-

bertian image feature was negative. Thus, a true positive (TP) is a feature correctly identified as refracted, while a true negative (TN) is a feature correctly identified as Lambertian. A false positive (FP) is a feature incorrectly identified as refracted. A false negative (FN) is a feature incorrectly identified as Lambertian, as shown in Fig. 5.7.



Figure 5.5: Comparison of sample image feature curves extracted for a Lambertian (top) and

refracted (bottom) feature from the large-baseline LF. (a) Sample Lambertian SIFT feature with

template used for WNCC (red). (b) A 3D view of the vertical correlation EPI overlaid with the

straight Lambertian image feature curve (red). (c) The same straight Lambertian feature curve

(red) overlaid in the original vertical EPI. (d) Sample refracted SIFT feature with template used

for WNCC (red). (e) The refracted image feature curve (red) in the vertical correlation EPI

can still be extracted, despite more complex “terrain”, and still matches (f) the refracted image

feature curve, which exhibits nonlinear behaviour in the original vertical EPI. For reference, the

image feature location is shown at (t0, v0) by the red dot in the vertical EPIs.


Figure 5.6: Sample (a) horizontal and (b) vertical EPIs from the crystal ball LF with small

baseline. From the image feature’s location (u0, v0) in the central view (red), extracted image

feature curves (green) match the plane of best fit (dashed blue). In the small baseline LF,

refracted image features appear almost linear and are thus much more difficult to detect.

Figure 5.7: Illustrating true positive, true negative, false positive and false negative in the con-

text of refracted image feature discrimination.

From these definitions, we can compute precision and recall as performance measures. Precision is the fraction of features identified as refracted that are truly refracted,

Pr = \frac{TP}{TP + FP}.    (5.26)

Recall is the fraction of truly refracted image features that are correctly identified,

Re = \frac{TP}{TP + FN}.    (5.27)
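These quantities follow directly from a ground-truth mask and the discriminator's output. The short sketch below tallies TP, FP and FN for a set of feature locations and returns precision and recall; it is illustrative only, assuming a boolean ground-truth mask indexed as gt_mask[v, u] and one boolean prediction per feature.

    import numpy as np

    def precision_recall(features_uv, predicted_refracted, gt_mask):
        # features_uv: (K, 2) integer (u, v) feature locations in the central view.
        # predicted_refracted: (K,) boolean output of the discriminator.
        # gt_mask: boolean image, True inside the hand-drawn refractive-object mask.
        gt = np.array([gt_mask[v, u] for u, v in features_uv])
        pred = np.asarray(predicted_refracted, dtype=bool)
        tp = np.sum(pred & gt)
        fp = np.sum(pred & ~gt)
        fn = np.sum(~pred & gt)
        precision = tp / max(tp + fp, 1)   # (5.26)
        recall = tp / max(tp + fn, 1)      # (5.27)
        return precision, recall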


Table 5.1: Comparison of our method and the state of the art using two LF camera arrays and a lenslet-based camera for discriminating refracted image features

                              State of the Art [Xu et al., 2015]          Proposed
                    b [mm]   TPR   TNR   FPR   FNR   Pr    Re        TPR   TNR   FPR   FNR   Pr    Re
  array
    crystal ball    275      0.58  0.97  0.02  0.41  0.83  0.59      0.66  0.95  0.05  0.34  0.71  0.66
    crystal ball    68       0.42  0.91  0.08  0.89  0.35  0.42      0.63  0.94  0.05  0.37  0.55  0.63
  lenslet
    sphere          1.1      0.43  0.36  0.64  0.58  0.18  0.08      0.48  0.95  0.04  0.52  0.79  0.83
    cylinder        1.1      0.08  0.80  0.20  0.92  0.72  0.43      0.82  0.81  0.13  0.24  0.97  0.48

Two LF camera setups were used for the crystal ball LF, a 275 mm baseline and a 68 mm base-

line. For the lenslet-based camera, ten LFs from a variety of different backgrounds were used

for each object type. The discrimination results are shown in Table 5.1, which we discuss in the

following paragraphs. Fig. 5.8 shows sample views of refracted features (red) and Lambertian

features (blue).

Large-baseline LF Cameras For large-baseline LF cameras, such as the LF camera array

with a 275 mm baseline, our approach had comparable performance to the state of the art, shown by

only a 14% lower precision, but an 11% increase in recall. For large baselines, a significant

amount of apparent motion for many of the refracted image features was observed in the EPIs;

thus, refracted image features yielded nonlinear curves which strongly deviated from both 4D

hyperplanes. Therefore, a single threshold (that only accounted for a single hyperplane) was

sufficient to discriminate refracted image features.

The FPs included some occlusions, which appeared nonlinear in the EPI [Wanner and Goldluecke, 2014], but were not discriminated by our implementation. However, this may still be

beneficial as occlusions often cause unreliable depth estimates, and are thus undesirable for


most robotic vision feature-based algorithms. Sampling from all the views in the LF would

likely improve the results for both methods, as more data would improve the planar fit. Interest-

ingly, more accurate depth estimation near occlusions is a common motivation to use LF cameras over conventional vision sensors [Ham et al., 2017, Tao et al., 2013].

Small-baseline LF Cameras For small-baseline LF cameras, such as the LF camera array

with a 68 mm baseline, and the lenslet-based plenoptic camera, we observed improved perfor-

mance with our method over the state of the art. For the crystal ball LF, our method had up to a

50% higher TP rate (TPR), up to a 58% lower FN rate (FNR), similar FP rates (FPR) and TN

rates (TNR), and generally better precision and recall compared to Xu’s method for the camera

array. We attributed these improvements to more accurately fitting the plane in 4D, as opposed

to a single hyperplane.

For the lenslet-based LF camera, we investigated two different types of refractive objects: a

glass sphere and an acrylic cylinder, shown in the bottom two rows of Fig. 5.8. The sphere

exhibited significant distortion along both the horizontal and vertical viewing axes, while the

cylinder only exhibited significant distortion perpendicular to its longitudinal axis.

When using the small-baseline lenslet-based LF camera, we observed significant improvement

in performance over the state of the art for all object types. As shown in Table 5.1, Xu's method

was unable to detect the refractive cylinder (TPR of 0.08), while our method succeeded with a

TPR 10 times higher. Our method had a 3.4 times increase in precision and 9.4 times increase

in recall for the sphere. The higher precision and recall imply that our method provides fewer

incorrect detections and misses fewer correct refracted image features compared to previous

work. We attribute this to accounting for slope consistency, which Xu’s method did not address.

In shorter-baseline LFs, the nonlinear characteristics of refracted image feature curves were

much less apparent, as in Fig. 5.6, but could still be distinguished by their inconsistent slopes.


Figure 5.8: Comparison of the state of the art (Xu's method) (left) and our method (right) for discriminating between Lambertian (blue) and refracted (red) SIFT features. The top row shows the crystal ball captured with a large-baseline LF (cropped). Both methods detect refracted image features; however, our method outperforms Xu's. The second and third rows show a cylinder and sphere captured with a small-baseline lenslet-based LF camera. Our method successfully detects more refracted image features with fewer false positives and negatives.

We observed that features that were located close to the edge of the sphere appeared more linear,

and thus were not always detected. Other FPs were due to specular reflections that appeared

like well-behaved Lambertian points. Finally, there were some FNs near the middle of the

sphere, where there is identical apparent motion in the horizontal and vertical hyperplanes.

This is a degenerate case for the current method, due to the symmetry of the refractive object.

Principal rays that are directly aligned with the camera are not significantly refracted (their

hyperplanes would therefore appear linear and consistent with each other). However, the image of

these features appears flipped, and the scale of the object is also often changed. These indicators

may be considered in future work to address this issue.


5.4.3 Rejecting Refracted Image Features for Structure from Motion

Since too many refracted image features in a set of input image features can cause SfM to fail,

we examine the impact of rejecting refracted image features in an SfM pipeline. We captured 10 sequences of LFs in which the camera gradually approached a refractive object, using the same lenslet-based LF camera. These sequences were captured on a robot so that they were repeatable and the ground truth of the LF camera poses was known. An OptiTrack motion capture system was used for the ground truth camera pose. We used Colmap, a publicly-available

SfM implementation which included its own outlier rejection and bundle adjustment [Schoen-

berger and Frahm, 2016]. Incremental monocular SfM using the central view of the LF was

performed on the sequences of images. Each successive image had an increasing number of re-

fracted image features, making it increasingly difficult for SfM to converge. If SfM converged,

a sparse reconstruction was produced, and the estimated poses were further analysed. The scene

is shown in Fig. 5.1a with a textured, slanted background plane behind a refractive cylinder.

For each LF, SIFT features in the central view were detected, creating an unfiltered set of fea-

tures, some of which were refracted. Our discriminator was then used to remove refracted

image features, creating a filtered set of (ideally) only Lambertian features. Both sets were

imported separately into the SfM pipeline. This produced respective “unfiltered” and “filtered”

SfM results for comparison. The unfiltered case used all of the available image features, while

our method was applied to the filtered case to remove most of the refracted image features from

the SfM pipeline.
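As a sketch of how the filtered set can be produced, the snippet below applies the discriminator of Section 5.3 to every detected feature and keeps only those classified as Lambertian before exporting them to the SfM pipeline. The helper names (detect_sift, central_view, extract_feature_curves, fit_refracted_plane, is_refracted) are placeholders for the steps described in this chapter (for example, the earlier sketches), not a specific library API.

    def filter_lambertian_features(lf, t_planar, t_slope):
        # Detect SIFT features in the central view, classify each one with the
        # refracted-image-feature discriminator, and keep the Lambertian subset.
        keypoints = detect_sift(central_view(lf))             # placeholder detector
        lambertian = []
        for kp in keypoints:
            curves = extract_feature_curves(lf, kp)           # Section 5.3.1
            n, m, lam1, lam2 = fit_refracted_plane(*curves)   # Section 5.3.2
            if not is_refracted(lam1, lam2, n, m, t_planar, t_slope):   # (5.25)
                lambertian.append(kp)
        return lambertian    # the "filtered" feature set passed to SfM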

We note that outlier rejection schemes, such as RANSAC, are often used to reject inconsistent

features, including refracted image features. While RANSAC successfully rejected many re-

fracted image features, we observed that, in some unfiltered cases, more than 53% of the inlier features used for reconstruction were actually refracted image features. This suggested that in the pres-


ence of refractive objects, RANSAC is insufficient on its own for robust and accurate structure

and motion estimation.

We measured the ratio of refracted image features r = ir/it, where ir is the number of refracted

image features in the image, and it is the total number of features detected in the image. We

considered the reprojection error as it varied with r. As shown in Fig. 5.9, the error for the unfil-

tered case was consistently higher than the filtered case (up to 42.4% higher for r < 0.6 in the

red case). Additionally, the unfiltered case often failed to converge, while the filtered case was

successful, suggesting better convergence. Sample scenes that caused the unfiltered SfM to fail

are shown in Fig. 5.10a and 5.10b. These scenes could not be used for SfM without our method

to find consistent image features for reconstruction.

For the monocular SfM, scale was obtained by solving the absolute orientation problem between the estimated poses p_s and ground truth poses p_g using Horn's method, and applying only the resulting scale factor. Fig. 5.11a shows example pose trajectories reconstructed by SfM for a filtered and unfiltered LF sequence with the ground truth. The filtered trajectory had a more accurate absolute pose over the entire sequence of images. Figs. 5.11b and 5.11c show the relative instantaneous pose error e_i, computed as

e_i = (p_{s,i} - p_{s,i-1}) - (p_{g,i} - p_{g,i-1})    (5.28)

for image i, split into translation and rotation components. To do this, we considered the position of the camera origin at image i as h_i = [P_x, P_y, P_z]^T. We can then write the translation error e_{tr} for a sequence of images as the L2-norm of the instantaneous translation error

e_{tr} = \sqrt{\sum_{i=1}^{n_{LF}} \left| (h_i - h_{i-1}) - (h_{g,i} - h_{g,i-1}) \right|^2},    (5.29)

where n_{LF} is the number of LFs in the image sequence, and h_{g,i} is the ground truth position at image i. Similarly, we consider the orientation of the camera at image i as θ_i = [θ_r, θ_p, θ_y]^T for

roll, pitch and yaw in Euler angles (XYZ ordering). The rotation error e_{rot} for a sequence of images is then the L2-norm of the instantaneous rotation error

e_{rot} = \sqrt{\sum_{i=1}^{n_{LF}} \left| (\theta_i - \theta_{i-1}) - (\theta_{g,i} - \theta_{g,i-1}) \right|^2}.    (5.30)

Figure 5.9: Rejecting refracted image features with our method yielded lower reprojection errors and better convergence for the same image sequences. SfM reprojection error vs. refracted image feature ratio for the unfiltered case containing all of the features, including refracted image features (dashed), and the filtered case excluding refracted image features (solid). The spike in error at r = 0.6 for filtered sequence 2 was due to insufficient inlier matches for SfM to provide reliable results.

Although e_{rot} was ≈ 0.02°, e_{tr} for the unfiltered case had larger errors, up to 0.01 m higher than the filtered case. This suggested that filtering out refracted image features yielded more accurate pose estimates from SfM.
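For reference, the relative instantaneous pose errors of (5.28)-(5.30) can be computed from two aligned pose sequences with a few lines of NumPy. This is a minimal sketch assuming each pose is stored as [x, y, z, roll, pitch, yaw] with XYZ Euler ordering and that the estimated trajectory has already been scaled to the ground truth.

    import numpy as np

    def relative_pose_errors(p_est, p_gt):
        # p_est, p_gt: (n_LF, 6) arrays of [x, y, z, roll, pitch, yaw] per image.
        d_est = np.diff(p_est, axis=0)    # frame-to-frame estimated motion
        d_gt = np.diff(p_gt, axis=0)      # frame-to-frame ground-truth motion
        e = d_est - d_gt                  # instantaneous pose error, as in (5.28)
        e_tr = np.sqrt((np.linalg.norm(e[:, :3], axis=1) ** 2).sum())    # (5.29)
        e_rot = np.sqrt((np.linalg.norm(e[:, 3:], axis=1) ** 2).sum())   # (5.30)
        return e_tr, e_rot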

In Table 5.2, we show that filtering refracted image features leads to an average of 4.28 mm lower e_{tr} and 0.48° lower e_{rot} relative instantaneous pose errors over 5 LF sequences with different

objects, poses and backgrounds, excluding Seq. 6, where the number of inlier feature matches

for SfM dropped below 50. The number of LFs in each sequence varied, because the unfiltered

case could not converge with more images at the end of the sequence where r was higher. Seq. 7

and 8 are examples of where only our filtered case converged, so that SfM produced a trajectory


Figure 5.10: Both (a) and (b) show example images for the refractive cylinder and sphere,

respectively, where SfM could not converge without filtering out refracted image features using

our method.

Table 5.2: Comparison of mean relative instantaneous pose error for unfiltered and filtered SfM-reconstructed trajectories

                       Unfiltered                               Filtered
  Seq.  #LFs   etr [mm]   erot [deg]   #inliers       etr [mm]   erot [deg]   #inliers
  1     10     18.86      5.72         160            8.09       4.52         127
  2     10     10.45      4.66         285            7.10       4.29         140
  3     10     10.17      4.52         281            6.94       4.09         186
  4     9      11.13      4.70         296            7.50       4.37         224
  5     8      6.07       4.47         201            5.66       4.39         196
  6     10     6.52       0.74         207            15.21      1.58         50
  7     10     N/A        N/A          N/A            8.51       4.02         155
  8     10     N/A        N/A          N/A            6.95       4.16         230

for analysis. Thus, filtering refracted image features using our method yielded more consistent

(non-refracted) image features that improved the accuracy of the SfM pose estimates compared

to not filtering for refracted image features, and made SfM more robust in the presence of

refractive objects.

Figure 5.11: For cases where SfM converged, filtering out the refracted image features yielded more accurate pose estimates. (a) Sample pose trajectory, with the filtered case (red) closer to ground truth (blue) than the unfiltered case (green). Relative instantaneous pose errors for (b) translation and (c) rotation are shown over a sample LF sequence, where the filtered case was consistently lower than the unfiltered case. (d) With our method, the refracted feature ratio for the filtered case was lower than for the unfiltered case.

For the cases where SfM converged in the presence of refractive objects, we created a sparse reconstruction of the scene of Fig. 5.1, which was primarily the Lambertian background plane,

since we attempted to remove refracted image features distorted by the cylinder. Sample recon-

structions for both the unfiltered and filtered cases are shown in Fig. 5.12. Both point clouds

were centered about the origin and rotated into a common frame. For visualization, an overlay

of the scene geometry’s best fit to the background plane is provided. The unfiltered case had to

be re-scaled according to the scene geometry (as opposed to via the poses as done in Fig. 5.12)

for comparison. Scaling via scene geometry resulted in severely worse pose trajectories for the

unfiltered case, although similar observations were made: with our method, there were fewer

points placed within the empty space between the refracted object and the plane. This is an

important difference since the absence of information is treated very differently from incor-

rect information in robotics. For example, estimated refracted points might incorrectly fill an

occupancy map, preventing a robot from grasping refractive objects.

5.5 Conclusions

In this chapter, we proposed a method to discriminate refracted image features based on a

planar fit in 4D and slope consistency. To achieve this, we introduced a novel textural cross-

correlation technique to extract feature curves from the 4D LF. Our approach demonstrated

higher precision and recall than previous work for LF camera arrays, and extended the detection

capability to lenslet-based LF cameras. For these cameras, slope consistency proved to be a

much stronger indicator of distortion than planar consistency. This is appealing for mobile

robot applications, such as domestic robots that are limited in size and mass, but will have to

navigate and eventually interact with refractive objects. Future work will examine in more detail

the impact of thresholds on the discriminator through the use of precision-recall curves, as well

as relate image feature slopes to surface curvature to aid grasping.

It is important to note that while we have developed a set of criteria for refracted image features

in the LF, these criteria are not necessarily limited to refracted image features. Depending on

the surface, specular reflections may appear as nonlinear in the EPI. Such image features are

typically undesirable, and so we retain image features that are strongly Lambertian, and thus good candidates for matching, which ultimately leads to more robust robot performance in the presence of refractive objects.

Figure 5.12: For the scene shown in Fig. 5.1a, (a, c) the unfiltered case resulted in a sparse reconstruction where many points were generated between the refractive cylinder (red) and the background plane (blue). In contrast, (b, d) the filtered case resulted in a reconstruction with fewer such points, and the resulting camera pose estimates were more accurate. Panels: (a) side view, unfiltered; (b) side view, filtered; (c) top view, unfiltered; (d) top view, filtered. The cylinder and plane are shown to help with visualization only. The camera (green) represents the general viewpoint of the scene, not the actual position of the camera.

Our experiments have shown that we can exclude refracted image features in a scene containing

spherical and cylindrical refractive objects; however, it is likely that not all planar objects, such

as windows, would be detected by our method. Some types of glass with a homogeneous

refractive index may not be detected by our method because they do not significantly distort


the LF by design, such as a glass rectangular prism. However, features viewed through curved

surfaces or non-homogeneous refractive indices, such as those commonly seen through privacy

glass and stained glass windows, should be detected based on the nonlinearities created by the

distortions of the object.

In this chapter we have explored the effect of removing the refractive content from the scene.

We have demonstrated that rejecting refracted image features for monocular SfM yields lower

reprojection errors and more accurate pose estimates in scenes that contain refractive objects.

The ability to more reliably perceive refractive objects is a critical step towards enabling robots

to reliably recognize, grasp and manipulate refractive objects. In the next chapter, we exploit

the refractive content to control robot motion.


Chapter 6

Light-Field Features for Refractive Objects

For an eye-in-hand robot manipulator, and a refractive object surrounded by Lambertian scene

elements, we can use the Lambertian elements in the scene to approach the refractive object

using the LF-IBVS for Lambertian scenes developed in Chapter 4. The refractive object can be

partially detected via a variety of methods, such as the refracted image features as in Chapter 5,

or another different technique, such as using the occluding edges of the refractive object [Ham

et al., 2017]. However, as the camera’s FOV becomes increasingly dominated by the refractive

object, the Lambertian scene content becomes increasingly smaller to the point where it is

no longer available. In this situation, we must consider using the refractive object itself (and

thus the refracted image features) for positioning control tasks, such as visual servoing. In

this chapter, we combine the two previous chapters to develop a refracted light-field feature—a

light-field feature whose rays have been distorted by a refractive object—that will enable control

tasks, such as visual servoing towards refractive objects.


6.1 Refracted LF Features for Vision-based Control

If we consider the physics of a two-interface refractive object, the light paths tracing from the

point of origin, along the intersecting lines at the refractive object’s boundaries to the cam-

era sensor, can be described by over twelve characteristics (see Fig. 3.2). The problem of

completely reconstructing this light path is severely under-constrained for a single LF camera

observation. However, the problem is more constrained for the task of position control, where

only several DOFs need to be controlled with respect to the object (as opposed to recovering

the complete object/scene geometry). Therefore, we approximate the local surface curvature

in two orthogonal directions, which allows us to model that part of the refractive object as a

type of lens. With an LF camera, we can observe the background projections caused by this

lens. We can describe these observations with at least five parameters in the LF, which we use

as our refracted light-field feature for refractive objects. This local description of the refractive

object is much simpler than complete surface reconstructions of the refractive object. While it

may not be sufficient to fully reconstruct the shape of a refractive object, it will be sufficient for

vision-based position control tasks, such as visual servoing.

The main contributions of this chapter are as follows:

• We propose a compact representation for a refracted LF feature, which is based on the

local projections of the background through the refractive object. We assume that the

surface of the refractive object can be locally approximated as having two orthogonal

surface curvatures. We can then model the local part of the refractive object as a toric

lens. The properties of the local projections can then be observed and extracted from the

light field.

• We provide an analysis of our refracted LF feature’s behaviour in the LF in simulation. In

particular, we illustrate the feature’s continuity with respect to LF camera pose. Doing so


shows the potential for the feature’s use in vision-based control tasks towards refractive

objects.

The rest of this chapter is organised as follows. We discuss related work in Section 6.2. In

Section 6.3, we discuss the optics of the lens elements that can describe the behaviour of our

refracted LF feature. The formulation of our refracted LF feature and method of extraction

from observations captured by the LF camera is described in Section 6.4. In Section 6.5, we

describe our our implementation and discuss experimental results that illustrate the continuity

and suitability of our feature for a variety of refractive objects in simulation, for the purposes of

visual servoing. Lastly, in Section 6.7, we conclude the chapter and explore future work.

6.2 Related Work

Grasping and manipulation of refractive objects have been considered in previous work. Choi et

al. developed a method to localise refractive objects in real-time with a monocular camera [Choi

and Christensen, 2012]. Their method extracted contours from a given image, matched them to a database of refractive object contours with known poses, and searched the database efficiently.

Walter et al. did so with an LF camera combined with an RGB-D sensor [Walter et al., 2015].

Lysenkov et al. recognised and estimated the pose of rigid transparent objects using an RGB-D

(structured-light) sensor [Lysenkov, 2013]. Recently, Zhou et al. used an LF camera to recognise

and grasp a refractive object by developing a light-field descriptor based on the distribution of

depths observed by the LF camera [Zhou et al., 2018]. However, all of these previous works

rely on having a 3D model of the object a priori. Complete and accurate geometric models of

refractive objects are extremely difficult or time-consuming to acquire.

While the reconstruction of opaque surfaces with Lambertian reflectance is a well-studied prob-

lem in computer and robotic vision, reconstructing the shape of refractive objects poses chal-

lenging problems. Ihrke et al. provide an excellent survey on transparent and specular object


reconstruction [Ihrke et al., 2010a]. Kutulakos et al. developed light path theory on refractive

objects and performed refractive object reconstruction on complex inhomogeneous refractive

objects [Kutulakos and Steger, 2007, Morris and Kutulakos, 2007]. If the light paths can be

fully determined, the shape reconstruction is solved. However, from this work, it is clear that

for a two-interface object, there are many more parameters needed than can be measured di-

rectly by an LF camera. We are left with an underdetermined system of equations, which is

insufficient for shape reconstruction.

Taking a slightly different approach, Ben-Ezra et al. used multiple monocular images to re-

cover a parameterised refractive object shape and pose [Ben-Ezra and Nayar, 2003], while

Wanner et al. used LF cameras to reconstruct planar reflective and refractive surfaces [Wanner

and Goldluecke, 2013]. There are many other prior works that rely on controlling background

patterns [Kim et al., 2017, Kutulakos and Steger, 2007, Morris and Kutulakos, 2007, Wetzstein

et al., 2011], and shape assumptions [Kim et al., 2017, Tsai et al., 2015]. Many of these ap-

proaches rely on known lighting systems, large displays behind the refractive object in question

and other bulky setups that are impractical for real-world robots in general unstructured scenes.

We are interested in an approach that does not require large apparatus surrounding the refractive

object and does not require models of the entire refractive object. Our work is different from

these previous works in that we are not focused on the problem of reconstructing refractive ob-

ject surfaces. Rather, we aim to develop a refracted LF feature that will enable us to use visual

servoing to approach refractive objects.

In Chapter 4, we developed the first light-field image-based visual servoing algorithm by using

a feature based on central view image coordinates, augmented with slope [Tsai et al., 2017];

however, like many previous works, the implementation was limited to Lambertian scenes. We

revisit the Lambertian light-field feature and LF-IBVS in the context of refractive objects by

proposing a novel LF feature for refractive objects. To the best of our knowledge, a refracted light-field feature for image-based visual servoing towards refractive objects has not yet been proposed.


For LF features, Tosic et al. developed LF-edge features [Tosic and Berkner, 2014]; however,

our interest is in keypoint features, which tend to be more uniquely identifiable and are more commonly applied to visual servoing and structure-from-motion tasks. Teixeira et al. used EPIs to

detect reliable Lambertian image features [Teixeira et al., 2017]. Similarly, Dansereau recently

proposed the Light-Field Feature (LIFF) detector and descriptor [Dansereau et al., 2019], which

focuses on detecting and describing reliable Lambertian image features in a scale-invariant man-

ner. However, all of these LF features are designed for Lambertian scenes, and are not suitable

for describing refracted image features.

Maeno et al. proposed the light-field distortion feature (LFD) [Maeno et al., 2013]. Xu et al.

built on the LFD and used it for transparent object image segmentation, but only characterised

a refracted feature as a single hyperplane [Xu et al., 2015]. In Chapter 5, we then developed a

refracted feature classifier for refracted image features using an LF camera [Tsai et al., 2019].

A Lambertian point feature was identified as a planar structure in the 4D LF, which can be

described by the intersection of two 4D hyperplanes. The nature of this 4D planar structure

changes in the light field when distorted by a refractive object, and was used for discriminating

refracted image features. Previously, only a limited subset of views (the central

cross of the LF) was used to describe the 4D planar structure. In this chapter, we use feature

correspondences from all of the LF views and extend the theory of how we can observe, extract

and estimate the 4D planar structure of a refracted light-field feature in the LF for the purposes

of visual servoing.

6.3 Optics of a Lens

We first assume that a large, complex refractive object can be sufficiently approximated by sev-

eral smaller parts. These parts are smooth and we constrain the surface to directionally-varying

curvature by choosing two orthogonal directions on the surface. A surface defined in this man-

ner is similar to a type of astigmatic lens, known as a toric lens, which is commonly used by


optometrists to describe and correct astigmatisms [Hecht, 2002]. Thus, we can approximate

small local parts of the refractive object as a toric lens. In general, refractive objects can project

the background into space, and lenses do this in a predictable manner. In this section, we pro-

vide a brief background in the optics of a spherical, cylindrical and finally toric lens, in order to

better understand how the appearance of a feature may be distorted by such a lens, and how it

may be observed in the light field. We describe our reasons for choosing the toric lens for our

refracted LF feature in Section 6.4.

6.3.1 Spherical Lens

One of the most common and simple lenses is the spherical lens. A convex spherical lens surface

is derived from a slice of a sphere, such that it has equal focal lengths in all orientations (it has a

single focal length) and thus focuses collimated light to a single point. As in geometrical optics,

we assume the light acts as rays (no waves). We assume we are in air, such that the index of

refraction nair = 1. We assume the lens is thin and we assume paraxial rays. The lens formula

is then given as

$$\frac{1}{f} = (n - 1)\left[\frac{1}{R_1} - \frac{1}{R_2} + \frac{(n - 1)d}{n R_1 R_2}\right], \tag{6.1}$$

where n is the index of refraction of the lens material, R1 and R2 are the radii of curvature of the

front and back surfaces, and d is the thickness of the lens. For thin lenses, d is much smaller than

R1 and R2 and approaches zero. Equation (6.1) is useful because it relates surface curvature

to focal length, and can be used to derive the equation describing image formation, sometimes

called the lensmaker’s formula. As discussed in Section 2.2.2, the lensmaker’s formula is given

as

$$\frac{1}{f} = \frac{1}{z_o} + \frac{1}{z_i}, \tag{6.2}$$

where zo and zi describe the distance of the object and image, respectively, along the optical

axis of the lens. Therefore, given the focal length f and zo, we can determine the image distance zi formed by the lens.
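As a quick worked example of (6.2), the following minimal Python sketch computes the image distance zi from f and zo using the sign convention written in the equation; the numerical values are illustrative only and are not taken from the thesis experiments.

```python
# Minimal worked example of (6.2), with illustrative values only.

def image_distance(f: float, z_o: float) -> float:
    """Solve 1/f = 1/z_o + 1/z_i for z_i (all distances in metres)."""
    return 1.0 / (1.0 / f - 1.0 / z_o)

f = 0.1  # focal length [m], illustrative
for z_o in (0.5, 5.0, 1e6):  # 1e6 m approximates the collimated-light case
    print(f"z_o = {z_o:>10.1f} m  ->  z_i = {image_distance(f, z_o):.4f} m")
# As z_o grows large, z_i approaches f, consistent with the collimated-light
# assumption used later in Section 6.4.
```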


6.3.2 Cylindrical Lens

Cylindrical lenses are sliced from the shape of a cylinder. Cylindrical lenses also have a single

focal length, but focus collimated light into a line instead of a point. We refer to this line as

the focal line. The focal line is parallel to the longitudinal axis of the lens. Effectively, the lens

compresses the image of the background in the direction perpendicular to the focal line. The

background image is unaltered in the direction parallel to the focal line.

6.3.3 Toric Lens

A toric lens has two focal lengths in two orientations perpendicular to each other. As shown in

Fig. 6.1, the surface of a toric lens can be formed from a slice out of a torus. The surface of

a torus can be formed by revolving a circle of radius R2, about a circle of radius R1. A slice,

shown in dashed red, forms the surface of a toric lens. The radii of curvature are related to the

focal length, as in (6.1). An astigmatic lens is the more general form of the toric lens, where (for

the astigmatic lens) the axes of the two focal lengths are not constrained to be perpendicular to

each other.

The two focal lengths cause a toric lens to focus light at two different distances from the lens,

resulting in two focal lines. A toric lens has the same optical effect as two perpendicular cylin-

drical lenses combined. Visually, this is seen as a “flattening” of rays with respect to their

respective axes at these two distances [Freeman and Fincham, 1990]. The shape of the bundle

of rays passing through the astigmatic lens is known as an astigmatic pencil. Mathematician

Jacques Sturm (1838) investigated the properties of the astigmatic pencil, and thus the astig-

matic pencil is also known as Sturm’s conoid. The distance between the focal lines is known as

the interval of Sturm. The circular cross-section where the pencil has the smallest area is known

as the circle of least confusion. Fig. 6.2 shows a rendering of the visual effect of a toric lens on

a background circle.


(a) (b)

Figure 6.1: (a) A torus can be defined by two radii, R1 and R2. The surface of a toric lens can

be sliced (dashed red) from a torus. (b) The toric lens surface is defined by the two radii of

curvature, and therefore two focal lengths f1 and f2. The directions of these two curvatures are

perpendicular to each other.

6.4 Methodology

There are three reasons for choosing to use the toric lens for locally modeling a large, complex

refractive object. First, it is reasonable to assume local orthogonal surface curvatures as a first

order approximation to any Euclidean surface. Second, it is one of the simplest refractive objects

that we can unambiguously use to describe a feature in relation to camera pose. Third, the toric

lens is more descriptive than a spherical lens in terms of describing the location and orientation

of the image created by projecting a Lambertian point through the lens. In this case, a spherical

lens is ambiguous in its orientation. In this section, we propose our refracted LF feature that is

based on the background projections of a toric lens. We first define our refracted LF feature

and then describe our method for extracting it from the LF.

A Lambertian point P emits rays of light that pass through a toric lens and into the LF camera.

Toric lenses project the background into 3D space through two focal lines, located at two differ-

ent distances from the lens that depend on the local surface curvatures. We can recover where

these focal lines occur in 3D based on the ray observations captured by the LF camera. Fur-

thermore, we can show that these vary continuously with respect to LF camera viewing pose,


Figure 6.2: A rendering of the visual effect of a toric lens on a blue background circle. In this

scene, a toric lens is aligned with the principal axis of a camera. The camera is moved along

this axis towards the lens. The toric lens is the transparent circular disk in the middle of the

images (1-9). For reference, the background is a checkerboard with a blue circle in the centre.

Far away (1), the blue circle appears as a flattened ellipse. At (3), the image of the blue circle is

almost completely flattened, and appears as a line at one of the focal lengths of the lens. As the

camera progresses closer, the effect of the two focal lengths acting on orthogonal axes balances

out. Image (6) shows the blue dot as a circle at the circle of least confusion. Moving forwards,

the circle is stretched vertically at the second focal length of the toric lens at (9). Finally, the

image appears almost undistorted at (12) when the camera is directly in front of the toric lens.

which makes these measurements suitable for positioning control tasks, such as visual servo-

ing. In sum, we propose a refracted LF feature based on the projections produced by local toric

lenses, which will be suitable for vision-based position and control tasks in scenes dominated

by a refractive object.

For our approach, we assume that the local surface curvatures of the refractive object can be

described by a toric lens. The validity of this assumption, and thus the effectiveness of our

method, depends on how smooth the surface of the refractive object is compared to the base-

line of the LF camera. A high-frequency surface curvature may make the background image

unmatchable and not locally well approximated by a toric lens. We also assume a thin lens,

although thick lenses can be considered in future work for more general refractive objects. We

assume that the background is infinitely far from the refractive object, such that we are dealing

with collimated light. Lastly, we assume that there is sufficient background texture to facilitate


image feature correspondence within the LF (i.e., between sub-images of the LF), which applies

to most feature-based robotic vision methods.

6.4.1 Refracted Light-Field Feature Definition

As described in Section 2.7, a Lambertian point in 3D induces a plane in 4D. This plane can

be described by the intersection of two 4D hyperplanes. Mathematically, the relation between the 3D

point and the LF observations can be described by (4.1). Each hyperplane can be described by

a normal vector. In Chapter 5, we showed that these normal vectors are related to the light-field

slope, which is inversely proportional to the depth of the point. For a Lambertian point, the

apparent motion profiles of the feature in the LF are linear and the two slopes from the two

hyperplanes are consistent with each other—they are equal in magnitude.

However, for a refracted image feature, these two motion profiles can be nonlinear and/or the

slopes can be inconsistent with each other. The latter implies that they can have different magni-

tudes. We showed this to be sufficient to discriminate Lambertian image features from refracted

image features in Chapter 5. Astonishingly, a Lambertian point projected through a toric lens

also yields a plane in 4D. Although the normals are not necessarily equal in magnitude, as in

the Lambertian case, the apparent motion profiles are still linear. We can therefore describe the

projections from a toric lens using two slopes. We can also include a measure of orientation

of the toric lens with respect to the LF camera. In this section, we show how the 4D plane is

still formed through the projections of a toric lens, and how we can use this insight to develop a

refracted LF feature.

6.4.1.1 Two Slopes

As in Chapters 4 and 5, we parameterise the LF using the relative two-plane parameterisation

(2PP) [Levoy and Hanrahan, 1996]. A light ray φ emitted from a point P in the scene has


coordinates φ = [s, t, u, v], and is described by two points of intersection with two parallel

reference planes. An s, t plane is conventionally closest to the camera, and a u, v plane is

conventionally closer to the scene, separated by arbitrary distance D.

(a) (b)

Figure 6.3: (a) Light-field geometry for a point in space for a single view (black), and other

views (grey) in the xz-plane, whereby u varies linearly with s for all rays originating from

P (Px, Pz). (b) A 2D (xz-plane) illustration of a background feature P that gets projected

through a toric lens (blue). The lens is characterised by focal length f and converges at the

focal line C. Note that C appears as a point here because C is a line along y, into the page. C is created by the rays (red). From P to C, the image created by the lens is

upright, but from C to the LF camera, the image flips and an inverted image is observed by the

2PP of the LF camera (green). In relation to Fig. 6.3a, it is clear that the LF camera’s slope

measurements capture the depth of the toric lens’ formed image.

Considering the xz-plane, when a Lambertian point P is projected through a thin toric lens, it

forms a line at C, which is subsequently captured by the LF camera. Fig. 6.3 illustrates the rays

traced from P to the observations captured by the light-field camera. It is important to note that

in the xz-plane, C appears as a focal point; however, in 3D, C actually represents a focal line.

In relation to Fig. 6.3a, Fig. 6.3b shows that an LF camera captures the location of the toric lens’

image formation point C. The rays are arranged in such a way that the LF camera captures C’s

slope for both the xz- and yz-planes. Additionally, the position of C depends on the position of

P in the background behind the lens, as in (6.2).


Although much of this discussion has been focused on the positions of the two orthogonal

focal lines, we note that the light is focused on a continuum of distances from the toric lens

along Sturm’s conoid. However, the most salient aspects of Sturm’s conoid that can be directly

observed in the LF are its end points. Therefore, light rays emitted from P are refracted by the

toric lens and converge to two different and orthogonal focal lines. These focal lines occur at

two different depths from the LF camera’s perspective.

6.4.1.2 Orientation

We can describe the orientation of the focal lines with respect to the LF camera. In ophthalmol-

ogy, the optical axis of the toric lens is typically aligned with the principal axis of the eye (the

LF camera in our situation). The lens’ orientation is then described with a single angle θ as the

rotation about the principal axis from the x-axis of the LF camera to the xy-axes of the toric

lens. Fig. 6.4 illustrates the orientation of the toric lens θ with respect to the LF camera. If we

define f1 and f2 as the two focal lengths of the toric lens, we note that as the difference between

f1 and f2 becomes small, the interval of Sturm approaches a point and the lens approaches a

spherical lens. The focal lines then intersect at a focal point, and the orientation information

becomes poorly-defined and unusable.

Figure 6.4: The blue ellipse represents the toric lens. The lens orientation θ is defined as the angle

between the refractive object frame (xr, yr) and the camera frame (xc, yc), giving the orientation of the

axis-aligned focal lines relative to the camera frame. For notation, s, t are aligned with xc, yc.


6.4.1.3 Combined Slopes and Orientation

Our previous LF feature for Lambertian points was p = [u0, v0, w] in Chapter 4, where u0 and

v0 were the image coordinates of the feature in the central view of the LF (s = 0, t = 0), and

w was the slope. Accounting for both slopes and orientation of the toric lens, we can augment

our Lambertian LF feature as a refracted LF feature described by

RLF = [u0, v0, w1, w2, θ], (6.3)

where w1 and w2 are the two slopes related to the distances to the two focal lines of the toric

lens from the LF camera.

Notably, for the axis-aligned case, where the principal axis of the LF camera is aligned with

the toric lens’ optical axis, our refracted LF feature follows the chief ray1 from the centre of

the toric lens to the centre of the LF camera. For the off-axis case (where the two axes are not

necessarily aligned), the refracted LF feature follows the LF camera’s chief ray to the u0, v0 in

the image plane. Regardless, each focal line must intersect the optical axis of the toric lens. In

either case, we can determine the 3D location of each of the two points of intersection, C1 and

C2, using similar triangles. The rays passing through the focal lines and into the LF camera all

pass through the line segment C1C2, which is known as the interval of Sturm. The line segment

C1C2, illustrated in Fig. 6.5, may be sufficient for visual servoing because, as with many local

feature-based approaches, it is also possible to consider multiple refracted LF features at the

same time.

Additionally, our refracted LF feature is not limited to applications involving refractive objects.

For Lambertian points, the two slopes for the refracted LF feature are equal in magnitude.

The 3D line segment of the refracted LF feature therefore reduces to a 3D point. By ignoring

1 In optics, the chief ray, or principal ray, is the ray that passes through the centre of the aperture. Thus, chief rays are equivalent to rays observed by a pinhole camera.


Figure 6.5: A Lambertian point P emits a ray of light that passes through the toric lens (blue).

The ray reaches the central view of the LF camera at {L}. The refracted light field feature (red)

is shown as the 3D line segment created by the position of the two focal lines, rotated by an

orientation with respect to the LF camera’s xy-axes along the chief ray. The central view image

coordinates u0, v0, slopes w1 and w2, as well as the orientation θ, define our refracted LF feature.

the orientation, our refracted LF feature generalises the Lambertian LF feature developed in

Chapter 4.
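The following is a minimal, hypothetical Python sketch of how the refracted LF feature of (6.3) might be stored, together with a check for the Lambertian degenerate case in which the two slopes coincide; the class name and tolerance are illustrative assumptions and not part of the implementation described in this chapter.

```python
from dataclasses import dataclass
import math

@dataclass
class RefractedLFFeature:
    """Container for the refracted LF feature of (6.3): central-view image
    coordinates, two slopes (one per focal line) and the lens orientation."""
    u0: float
    v0: float
    w1: float
    w2: float
    theta: float  # radians, rotation about the camera's principal axis

    def is_lambertian(self, tol: float = 1e-3) -> bool:
        # When the two slopes coincide, the 3D line segment collapses to a
        # point, the feature reduces to the Lambertian feature of Chapter 4,
        # and the orientation becomes poorly defined and should be ignored.
        return math.isclose(self.w1, self.w2, abs_tol=tol)

feat = RefractedLFFeature(u0=0.0, v0=0.0, w1=-0.12, w2=-0.12, theta=0.0)
print(feat.is_lambertian())  # True: equal slopes, treat as a Lambertian point
```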

6.4.2 Refracted Light-Field Feature Extraction

In this section, we explain our method to extract the refracted LF feature from the LF. Using the

observations captured by the LF camera, we solve for the 4D plane as a 2D projection matrix.

We then decompose the projection matrix into scaling and rotation components, which allow us

to extract the slopes and orientation of the projections formed by the toric lens.

6.4.2.1 LF Observations through a Toric Lens

For the scenario outlined in Fig. 6.3b, a Lambertian point P in the background emits rays

of light that project through a toric lens to produce a plane in the continuous-domain LF. In


the discrete domain where we sample s, t in a uniform grid of points, projections appear as a

rectangular grid of points on the uv plane. As in Ch. 5, we consider the Light-Field Distortion

feature [Maeno et al., 2013] as a set of u, v relative to (u0, v0), the image coordinates of an

image feature in the central view (s0, t0). Then we can generally write the projection of P

through a toric lens as

$$\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = A \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix}, \tag{6.4}$$

where A is a 2 × 2 matrix. We note that if we have a spherical lens, or simply a Lambertian

point, then (6.4) simplifies to

$$\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix} = A_L \begin{bmatrix} s \\ t \end{bmatrix} = \begin{bmatrix} w & 0 \\ 0 & w \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix}, \tag{6.5}$$

where w = −D/Pz. For the case of P projecting through a toric lens, in (6.4), we can factorise

A into three components of SVD as

$$A = A_L \Sigma_A A_R^T, \tag{6.6}$$

where AL and AR are 2 × 2 orthogonal matrices, and ΣA is a diagonal matrix with non-negative real numbers on the

diagonal. The diagonal entries of ΣA are the singular values of

A and represent the two slopes of the projections of the toric lens, as seen by the LF camera.

The columns of AL and AR are the left-singular and right-singular vectors of A, respectively.

Intuitively, we can reason this factorisation as three geometrical transformations, a rotation or

reflection (AL), a scaling (ΣA), followed by another rotation or reflection (AR). The orientation

from AL should be the same as that from AR. We can later extract slopes and orientation from these

three matrices. Therefore, in order to extract the slopes and orientation of the toric lens, we

must first recover the projection matrix A.


6.4.2.2 Projection Matrix

We can write (6.4) in terms of the elements of A as

$$\underbrace{\begin{bmatrix} \Delta u \\ \Delta v \end{bmatrix}}_{b} = \underbrace{\begin{bmatrix} s & t & 0 & 0 \\ 0 & 0 & s & t \end{bmatrix}}_{F} \underbrace{\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}}_{x}. \tag{6.7}$$

For a single view, F has at most rank two, so a single observation cannot determine the four

unknowns of A. This equation has the common

form Fx = b. We can stack LF observations of s, t,∆u and ∆v for each corresponding point

in all n × n views of the LF and estimate a1, a2, a3, and a4 in the least-squares sense. We can

then form A and subsequently solve for the two slopes and the orientation.
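A minimal numpy sketch of this stacking and least-squares step is given below; it is not the thesis implementation. The arrays s, t, du, dv are assumed to hold one entry per sub-view, with du, dv already expressed relative to the central view, and the synthetic values in the usage example are invented for illustration.

```python
import numpy as np

def estimate_projection_matrix(s, t, du, dv):
    """Estimate the 2x2 matrix A of (6.4) from stacked LF observations.

    s, t   : sub-view coordinates (length-N arrays)
    du, dv : feature offsets relative to the central view (length-N arrays)
    Builds the stacked form of (6.7), F x = b, and solves it in the
    least-squares sense.
    """
    s, t, du, dv = map(np.asarray, (s, t, du, dv))
    N = s.size
    F = np.zeros((2 * N, 4))
    b = np.empty(2 * N)
    F[0::2, 0], F[0::2, 1] = s, t   # rows for delta-u: [s t 0 0]
    F[1::2, 2], F[1::2, 3] = s, t   # rows for delta-v: [0 0 s t]
    b[0::2], b[1::2] = du, dv
    x, *_ = np.linalg.lstsq(F, b, rcond=None)
    return x.reshape(2, 2)          # A = [[a1, a2], [a3, a4]]

# Illustrative usage on a synthetic 3x3 grid of views (made-up values):
s, t = np.meshgrid(np.arange(-1, 2), np.arange(-1, 2))
A_true = np.array([[-0.12, 0.02], [0.02, -0.25]])
du, dv = A_true @ np.vstack([s.ravel(), t.ravel()])
print(estimate_projection_matrix(s.ravel(), t.ravel(), du, dv))  # recovers A_true
```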

6.4.2.3 Slope Extraction

We can extract the slopes as the negated diagonal entries of ΣA. We note that singular

values are non-negative because the matrix A^T A has non-negative eigenvalues, and the singular values

are the square roots of these eigenvalues,

$$\Sigma_A = \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}, \tag{6.8}$$

where σ1 and σ2 are the singular values of A. However, we know that the slopes for a point in

front of the LF camera with a positive D should be negative, based on (5.7). Then the slopes of


the toric lens projections are given as

w1 = −σ1 (6.9)

w2 = −σ2. (6.10)

6.4.2.4 Orientation Extraction

In order to extract θ, we must first consider a 2D rotation and a 2D reflection. A 2D rotation

matrix has the form

$$\mathrm{Rot}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \tag{6.11}$$

For a 2D reflection, we can generally reflect vectors perpendicularly over a line that makes an

angle γ with the positive x-axis. The 2D reflection matrix then has the form

$$\mathrm{Ref}(\gamma) = \begin{bmatrix} \cos 2\gamma & \sin 2\gamma \\ \sin 2\gamma & -\cos 2\gamma \end{bmatrix}. \tag{6.12}$$

In our case, θ = γ, so the combined reflection and rotation matrix R is given as

$$R = \mathrm{Ref}(\theta)\,\mathrm{Rot}(\theta) = \mathrm{Ref}\!\left(\theta - \tfrac{1}{2}\theta\right). \tag{6.13}$$

This reduces to

$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ -\sin\theta & -\cos\theta \end{bmatrix}. \tag{6.14}$$

Applying (6.14) to AR and AL yields two angles. The first angle represents a rotation and

reflection to the principal axes of the LF observations on the uv-plane. The singular values

represent scaling along the principal axes of the LF observations. The last angle represents the


same rotation and reflection back to the original LF observations. Since we are dealing with 2D

rotations, these two angles should be equal. Thus, we only have to extract a single angle θ.
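As a minimal numpy sketch (not the thesis implementation), the snippet below decomposes an estimated A as in (6.6) and reads off the two slopes and a single orientation angle. The angle extraction is a simplified reading of (6.11)-(6.14) that treats AL as a rotation and ignores the reflection and axis-labelling ambiguities discussed later in Section 6.5.3; the synthetic matrix in the usage example is illustrative only.

```python
import numpy as np

def extract_slopes_and_orientation(A):
    """Decompose A = A_L * Sigma_A * A_R^T as in (6.6) and return
    (w1, w2, theta): slopes are the negated singular values, as in
    (6.9)-(6.10), and theta is read from the left singular vectors."""
    A_L, sigma, A_R_T = np.linalg.svd(A)
    w1, w2 = -sigma[0], -sigma[1]
    # Simplification: treat A_L as a pure rotation and take the angle of its
    # first column; reflections, as in (6.14), are not disambiguated here.
    theta = np.arctan2(A_L[1, 0], A_L[0, 0])
    return w1, w2, theta

# Illustrative check with a synthetic lens rotated by 20 degrees:
th = np.deg2rad(20.0)
R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
A = R @ np.diag([0.12, 0.25]) @ R.T   # symmetric toric-lens-style projection
print(extract_slopes_and_orientation(A))  # slopes about -0.25 and -0.12; the
# angle is recovered only up to the SVD sign and axis-ordering ambiguity.
```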

Unlike previous work, where we only considered the central cross (horizontal and vertical) of

all the sub-views in the LF, in this work, we consider all of the sub-views. This improvement

allows us to better characterise refractive objects of different orientations (which was also not

accounted for in previous work), and to use more of the information captured by the

LF, reducing uncertainty in the fit.

6.5 Experimental Results

For position control tasks, we are primarily interested in feature continuity. Continuity implies

that there are no abrupt breaks or jumps in a function. For our refracted LF feature, continuity

means that u0, v0, w1, w2 and θ all vary smoothly with respect to viewing pose, locally on

the surface of the refractive object. Methods such as visual servoing typically rely on feature

continuity to incrementally step towards the goal pose. In this section, we describe the two

implementations and preliminary experimental results for investigating the continuity of our

refracted LF feature with respect to a variety of viewing poses and different refractive object

types.

6.5.1 Implementations

We developed two implementations for investigating refracted LF feature continuity. First, we

developed a single-point ray simulation for a single Lambertian point through a toric lens. Note

that this is not a ray-tracing method in the classic rendering sense, in which rays are propagated from the source

to the camera sensor. The purpose of this setup was to provide a useful figure to illustrate the

nature of the toric lens and its focal lines, and to act as a proof of concept for the refracted LF feature.


Second, we performed a ray-tracing simulation of a background scene, refractive object and

LF camera using Blender, a popular and freely-available rendering tool. We used the Cycles

Renderer option, which performed physics-based ray tracing for accurate renderings through

refractive objects. Additionally, the Light-Field Blender Add-On [Honauer et al., 2016] was

used to capture a set of LF camera array views. Geometric models were rendered as refractive by

assigning Blender’s “glass BSDF” material property, which used an index of refraction of 1.450.

In our ray-tracing simulation, we attempted to assess the validity of the toric lens assumption

towards more general refractive object shapes in order to assess the limitations of the refracted

LF feature.

A rendered sample LF reduced to 3 × 3 views is shown in Fig. 6.6. In this environment, we

simulated and tested our method against a variety of different object types and poses, shown

later in Fig. 6.14. The background was kept at up to 100 times the distance of the

refractive object from the LF camera, in order to approximate collimated light from a

point source of light. We used a flat checkerboard background to provide a visual reference

of the amount of distortion caused by the refractive objects. However, our implementation is

agnostic to the background pattern because we rely on a uniquely-coloured, solid blue circle on

the top surface of the background plane in order to aid image feature correspondence between

different LFs captured from different poses. Future work will involve different backgrounds,

including more realistic, non-planar scenes.

We ensured that the tracked circle was visible in all the views of the LF through the refractive

object, as in Fig. 6.6. Segmentation for the refracted blue circle was accomplished by trans-

forming the red, green and blue (RGB) colour representation to the hue, saturation and value

(HSV) colour space. Thresholds were used to segment the angular value for the blue hue, which

ranged from approximately 240 to 300 degrees in the HSV colour space. The centre of mass of

the largest blue-coloured dot was used as the centre of the circle, which was taken as the same

background Lambertian point for feature correspondence.
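A minimal OpenCV-style sketch of this segmentation step is given below, assuming a single BGR sub-view image as input; it is an illustration rather than the exact thresholds used in the experiments. Note that OpenCV stores hue on a 0-179 scale, so the 240-300 degree range quoted above corresponds to roughly 120-150.

```python
import cv2
import numpy as np

def blue_circle_centroid(bgr_view):
    """Return the (u, v) centroid of the largest blue blob in one LF sub-view,
    or None if no blue pixels are found."""
    hsv = cv2.cvtColor(bgr_view, cv2.COLOR_BGR2HSV)
    # 240-300 degrees of hue maps to ~120-150 on OpenCV's 0-179 hue scale;
    # the saturation/value bounds reject the grey checkerboard background.
    mask = cv2.inRange(hsv, (120, 80, 80), (150, 255, 255))
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    if num < 2:                  # label 0 is the background
        return None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return tuple(centroids[largest])   # (u, v) in pixels

# Example (hypothetical file name):
# centroid = blue_circle_centroid(cv2.imread("view_s0_t0.png"))
```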


Figure 6.6: Ray tracing of a refractive object using Blender. Here, a toric lens is simulated for

5 × 5 views as an LF, although only 3 × 3 views are shown here. The nature of the toric lens

is visible—the square checkerboard background is elongated in the v-direction, indicating the

longer focal length along the vertical axis of the toric lens. The large circular blue dot was used

to aid feature correspondence. The blue circular dot appears as an ellipse due to the nature

of the toric lens.

We note that the centre of mass of a blob (for example, the ellipses in Fig. 6.6) that has been

distorted by a refractive object does not always reflect the precise centre of the circle. There

may be cases where extreme curvature and inhomogeneous structures in the refractive object

(such as bubbles or holes) can result in significant distortion, such that the circle’s centre no

longer matches the blob’s centroid in the rendered image. However, for homogeneous (no

holes or bubbles) and relatively smooth refractive objects, the centroid provides a reasonable

approximation to the coordinates of the centre of the circle in the rendered image.


6.5.2 Feature Continuity in Single-Point Ray Simulation

For the single-point ray simulation, we know the location of the Lambertian point, as well as

the pose and optical properties of the toric lens and LF camera. We can therefore determine

the location of the focal lines. The rays can then be projected from the st viewpoints of the LF

camera, through which we know the chief rays must pass. We assume paraxial rays. Fig. 6.7

illustrates rays of light emitted from a Lambertian point projected through a toric lens and into

an LF camera. The pencil-like shape of the rays is known as an astigmatic pencil. The colours

of the rays are coded with the two focal lengths of the toric lens. The 2D side-views clearly

indicate the rays pass through the two focal lines according to the two focal lengths of the toric

lens. Feature correspondence is known because we are tracing the rays individually through the

scene from the camera viewpoints.

Fig. 6.8 shows the estimated slopes for a pure translation along the z-axis, which changes the distance be-

tween the refractive object and the LF camera. In this motion sequence, the LF camera was

moved closer to the refractive object. The ground truth was calculated from the slope equations

in Fig. 6.5. As expected, both slopes increased in magnitude (decreasing in value due to the negative

sign) as the LF camera moved closer towards the focal lines, and matched the ground truth.

Orientation was correctly estimated as a constant and so is not shown. Translations in x and y

also yielded constant slopes, and are therefore not shown. Similarly, Fig. 6.9 shows the correct

estimated orientation for a pure rotation about the z-axis of the LF camera. The slopes were

also correctly estimated as constant and so are not shown. In all of these plots, the refracted

LF feature is continuous with camera pose. This experiment also demonstrated that we can

correctly extract the refracted LF feature from simulated LF observations.
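For reference, the ground-truth slopes used in this comparison follow the relation w = -D/Pz, where Pz is the depth of the relevant focal line from the camera. The short Python sketch below is a hypothetical illustration of that calculation, assuming collimated light so that each focal line forms one focal length in front of the lens on the camera side; the focal lengths, plane separation and camera distances are invented values.

```python
def ground_truth_slopes(z_lens, f1, f2, D):
    """Ground-truth slopes w_i = -D / P_z for the two focal lines of a toric
    lens, assuming collimated light so that each focal line forms one focal
    length in front of the lens (on the camera side).

    z_lens : camera-to-lens distance along z [m]
    f1, f2 : the two focal lengths of the toric lens [m]
    D      : separation of the st- and uv-planes of the 2PP [m]
    """
    Pz1, Pz2 = z_lens - f1, z_lens - f2   # focal-line depths from the camera
    return -D / Pz1, -D / Pz2

# Illustrative values only: as z_lens shrinks (the camera approaches the
# lens), both slope magnitudes grow, matching the trend described for Fig. 6.8.
for z_lens in (6.0, 5.0, 4.0):
    print(z_lens, ground_truth_slopes(z_lens, f1=1.0, f2=2.0, D=0.5))
```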


[Figure 6.7: three panels (a), (b), (c); axes x [m], y [m], z [m].]

Figure 6.7: Single point ray trace simulation. (a) 3D view of a Lambertian point (black) ema-

nating light rays through the toric lens (light blue, blue). The rays are refracted and pass through

the focal lines (red, magenta). The rays pass through the uv-plane (green) and into the LF cam-

era viewpoints (blue). (b) The xz-plane showing all the light rays passing through the magenta

focus line induced by fx. (c) The yz-plane, showing all the light rays passing through the red

focus line induced by fy.


[Figure 6.8: plot of slope versus LF camera displacement z [m]; curves w1,gt, w2,gt, w1,est, w2,est.]

Figure 6.8: Our method correctly estimates the two slopes, w1,est, w2,est of the refracted light-

field feature, compared to the ground truth w1,gt, w2,gt for changing z-translation of the LF

camera.

[Figure 6.9: plot of estimated θ1 [deg] versus ground-truth θgt [deg]; ground-truth and estimated curves.]

Figure 6.9: Our method correctly estimates the orientation θ1 of the refracted light-field feature

for changing z-rotation of the LF camera.

6.5.3 Feature Continuity in Ray Tracing Simulation

In the ray-tracing simulation experiments, we extracted our refracted LF feature from rendered

LFs. Similar to the plots from Section 6.5.2, we considered basic motion sequences of the LF

camera, and plotted the elements of the refracted LF feature with respect to camera motion to

show continuity. Fig. 6.10 depicts an LF camera starting from the left and a toric lens (blue) on

the right. The LF camera is approaching the lens in a straight line. The refracted LF feature is

shown in red. Out of the eight poses in this sequence, only three instances of the pose sequence


are shown for brevity. As the LF camera moves closer to the lens, the refracted LF feature

slopes decrease in magnitude accordingly; however, the feature’s position in 3D space remains

constant. This is because the decrease in slope (and thus decrease in distance of the feature

from the camera) is offset by the forwards motion of the position of the LF camera. Fig. 6.11

shows the corresponding two slopes as a function of LF camera displacement from the starting

position for the corresponding LF camera motion sequence. The trends in Fig. 6.11 matched

what we anticipated, based on Fig. 6.8. In this case, the refracted LF feature’s two slopes were

continuous with respect to forwards and backwards motion along the z-axis.

Figure 6.10: Refracted LF feature (red) for the approach of an LF camera (left, blue) towards

a toric lens (right, blue). For visualization, a straight line connecting the refracted LF feature

and the LF camera is shown (dashed green). As the LF camera moves closer (top to bottom),

the feature’s 3D line segment position remains constant, as we are measuring the same pencil

of light rays. Only three of the eight positions from the sequence are shown.

Similarly, Fig. 6.12 shows the recovered orientation estimates for rotating an ellipsoid about the

principal axis of the LF camera. The ellipsoid was aligned with the same axis and rotated from

-30 to 30 degrees. In this graph, we note that although the correct relative angles are recovered,

the entire line is centred about 90 degrees, instead of zero. This was likely due to the inherent

ambiguity from SVD, where 30 degrees rotation from one axis is equivalently 60 degrees from

the other axis of the toric lens. This ambiguity may be addressed by considering the heuristics


[Figure 6.11: plot of slope versus z [m]; curves w1,est and w2,est.]

Figure 6.11: Slope estimates for the entire approach towards the toric lens that was illustrated

in Fig. 6.10. Again, w1,est and w2,est represent the two estimated slopes for the toric lens. As

we approach the toric lens (decreasing z), we expect the slope to decrease in magnitude, which

we observe. We also note that the slopes appear continuous for z-translation.

of the problem, or by only considering small changes in orientation, and will be addressed in

future work.

Fig. 6.13 shows refracted LF features (red/orange/yellow) for a toric lens (light blue, right)

plotted in 3D from a grid of LF camera poses (blue squares, left). Note that a single blue square

represents an entire LF camera, as opposed to a single monocular camera. The regularity of the

grid of LF camera poses was an experimental design choice. The dashed lines (green) connect

the LF camera to its corresponding refracted LF feature. The refracted LF features are between

the LF camera poses and the toric lens, as expected. Interestingly, in traditional robotic vision,

Lambertian features do not move in 3D on their own. They are anchored in space (or attached

to some object), and are therefore clearly useful for localisation and image registration, among

other tasks. In Fig. 6.13, and in many of the refracted LF feature visualisations shown

in the following section, we note that our refracted LF features are not simply stationary. They

appear to move with the LF camera pose.


[Figure 6.12: panels (a) and (b); panel (b) plots θ [deg] versus z rotation [deg].]

Figure 6.12: (a) An elongated ellipsoid that was rotated about the principal axis of the LF

camera to capture orientation change. (b) Orientation estimate, which reflects the orientation

of the principal axis of the ellipsoid relative to the horizontal. Here, the z-rotation is rotation

about the principal axis of the camera. We note that even though the ellipsoid is not an ideal

toric lens, the orientation was still correctly recovered and it was also continuous with respect

to the camera rotation.

However, the feature’s movement due to camera pose was well-defined. The slopes define the

distance of the feature to the LF camera, and these appear to be consistently at -0.2 m and 0.4

m on the z-axis. We note that the layout of the cluster of refracted LF features closely mirrors

the ray patterns of the astigmatic pencil from Fig. 6.7. The uniform grid of LF camera poses

mimics the sampling pattern of an LF camera array. The direction of the refracted LF features is

clearly dictated by the toric lens’ two orthogonal focal lines and the LF camera pose. Although

one can think of the interval of Sturm as simply a line segment along the principal axis of the

toric lens, Fig. 6.13 reminds us that the interval of Sturm is actually a collection of rays along a

continuum defined by the two focal lines of the toric lens. We also note that the direction of the

refracted LF feature appears to change in a continuous manner with camera pose. Therefore, the

alignment of our refracted LF feature implies a corresponding alignment of the LF camera pose

to the toric lens. A position and alignment task in this case could take the form of line-segment

alignment.


[Figure 6.13: panels (a)-(d); axes x, y, z in metres.]

Figure 6.13: (a) Refracted LF feature (red/orange/yellow) for a toric lens (right, light blue)

from a grid of LF camera positions (left, dark blue). Note that each blue square represents an

entire LF camera, not a single monocular camera. (b) Central view of the central LF, showing

the view of the blue circle flattened by the lens. The centre of the blue circle (red star) was the

image feature that was tracked across the different LFs. (c) The top and (d) side views of the

refracted LF feature that clearly illustrate the focal lines of the toric lens at z of -0.2 m and 0.4 m.

Note that the scale of the z-axis is much larger than the x- and y-axes, in order to clearly show

the refracted LF feature.


6.5.3.1 Different Object Types

We considered a variety of refractive object types from a set of poses in order to visualise our

refracted LF feature in 3D along with the respective LF camera poses and the refractive object

itself. Fig. 6.14 shows several of the objects, a sphere, a cylinder, and a “tornado”, along with

their corresponding refracted LF features sampled by a sequence or grid of LF camera poses.

First, Fig. 6.14b shows the case of a refractive sphere. As expected, the sphere focused the

refracted LF feature into a single 3D point. Using a spherical lens model instead of a toric lens

model would also yield a viable refracted LF feature. A spherical refracted LF feature would be analogous to

a 3D Lambertian point for position and control tasks; however, a spherical model would not

be valid, or as accurate, for as many refractive object surfaces as the toric refracted LF feature.

Second, Fig. 6.14d shows the refracted LF features for a horizontal translation in x along a

cylinder. The features spread out in a fan at z = 12 m, which is the location of the cylinder’s

focal line. The single focal line is due to the curvature of the cylinder. The cylinder acts as a 1D

refractive element, and therefore the other slope is simply a shifted measure of the Lambertian

background. As we can see, the end points of the refracted LF features are approximately the

same for this reason.

Finally, a tornado-shaped refractive object was rendered in Blender to represent a more com-

plex, but still relatively smooth type of object, shown in Fig. 6.14f. The refracted LF features

were estimated to be in front of the tornado; however, the features did not appear to have com-

mon focal lines, unlike the refracted LF features of the toric lens in Fig. 6.13. Despite the

initial intention, we also noticed that the tornado model was surprisingly bumpy in its cur-

vature. This led to significant distortion caused by the refractive object. Several times, the

bumpiness of the refractive object separated the blue circle’s image (through the refractive ob-

ject) into two or more separate blobs, which greatly impacted the centroid measurements. LF

camera poses were selected so as to minimise and avoid this impact in our experiments. Mul-


[Figure 6.14: panels (a)-(f); axes x, y, z in metres.]

Figure 6.14: (a) For a sphere, the centroid of the blue circle (red star) was tracked throughout

the LF as a means of feature correspondence. (b) The sphere, with equal focal lengths in all

directions, forms an image of the background blue circle at a single point in space, which is

shown in the refracted LF features (red) that also encapsulate a point. Note that each blue LF

camera illustrated here represents a full LF camera, as opposed to a single monocular camera.

The dashed green lines indicate which refracted LF feature matches to which LF camera pose.

(c) For a cylinder, (d) the projections of the blue circle appear at the physical location for the

cylinder-aligned focal direction, as expected. (e) For a “tornado”, (f) the refracted LF features

from a grid of LF camera poses appear almost straight, as if the focal lines of the approximated

local toric lenses are far away. The tornado represented a complex refractive object, but still

yielded a continuous set of refracted LF features.


tiple projections of the same point, caused by internal reflection and total refraction, may also

need to be considered in future work on image feature correspondence through refractive objects.

It is important to note that although our refracted LF feature is based on the assumption of

local surface curvatures, we cannot solve for the surface curvatures themselves given only our

refracted LF feature. Considering the lensmaker’s equation in (6.2), our method yields the

distance of image formation zi from the lens. We know that focal length f is intrinsically linked

to the surface curvature r. Therefore, in order to recover f , we require zo, the distance of the

object to the lens along the lens’ optical axis. However, despite this lack of knowledge, our

refracted LF feature is sufficient for the purposes of position control with respect to refractive

objects.

6.6 Visual Servoing Towards Refractive Objects

To put the refracted LF feature into context, in this section, we provide an illustrative example

of visual servoing towards a refractive object, shown in Fig. 6.15. This system has not yet been

implemented, and is further discussed as future work. An LF camera is mounted at the end-

effector of a robotic manipulator. A refractive object is placed in the scene with sufficient

visual texture in the background. The LF camera is moved to the goal pose in order to capture

a (set of) goal refracted LF feature(s). Then the LF camera is moved to an initial pose that

is close to the goal pose, so that the relevant refracted LF features(s) can still be observed

within the camera’s FOV. The robotic system uses a control loop similar to Fig. 4.2 in order to

visual servo towards the goal pose. At each iteration, a refracted LF feature Jacobian, which

relates LF feature changes to camera spatial velocities, is computed and used to iteratively

step towards the goal pose until the difference(s) between the current and goal refracted LF

feature(s) is/are sufficiently small, thereby completing a visual servo towards a refractive object.

Some approaches for computing this refracted LF feature Jacobian are mentioned in

Section 6.7 as future work.
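Although the refracted LF feature Jacobian has not yet been derived, the loop described above would follow the standard IBVS control law. The Python sketch below is a hypothetical outline only; the jacobian argument is a placeholder for the yet-to-be-derived refracted LF feature Jacobian, and the gain value is arbitrary.

```python
import numpy as np

def ibvs_step(feature, goal_feature, jacobian, gain=0.5):
    """One iteration of the control loop sketched above.

    feature, goal_feature : stacked refracted LF feature vectors, e.g. one or
                            more [u0, v0, w1, w2, theta] blocks
    jacobian              : d(feature)/d(camera velocity); a placeholder to be
                            derived in future work
    Returns a 6-vector camera spatial velocity [vx, vy, vz, wx, wy, wz].
    """
    error = np.asarray(feature, dtype=float) - np.asarray(goal_feature, dtype=float)
    return -gain * np.linalg.pinv(jacobian) @ error

# The loop would repeat: capture an LF, extract the refracted LF feature(s),
# compute the camera velocity, move the end-effector, and stop once the
# feature error is sufficiently small.
```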


Figure 6.15: Concept for visual servoing towards a refractive object. At the start pose (red), a

starting refracted LF feature (red) is captured by observing the distorted images of the yellow

rubber duck in the LF. The LF camera moves (green) in order to align the current and the

goal (blue) refracted LF feature. Owing to the continuity of the refracted LF feature, feature

alignment corresponds with pose alignment, enabling the robot to reach the goal pose without

requiring a 3D geometric model of the refractive object.


6.7 Conclusions

Overall, we have developed a refracted light-field feature that may be used for positioning tasks

in robotic vision, such as visual servoing. Our feature approximates the surface of a refractive

object with two local orthogonal surface curvatures. We can describe this part of the refractive

object’s surface with a toric lens. The locations of the focal lines created by such a lens can be

measured by an LF camera. We have demonstrated that the location of these focal lines can then

be extracted from rendered light fields. By illustrating the continuity of our refracted light-field

feature from a variety of LF camera poses and for a variety of different refractive objects, this

feature can enable visual servoing and other positioning tasks without the need for a geometric

model of the refractive object.

For future work, we are interested in deriving Jacobians for our refracted LF feature. Doing

so would allow us to close the loop for visual servoing towards refractive objects.

Part of our feature extraction process relies on SVD, which potentially complicates

the Jacobian derivation. It may be possible to derive an analytical expression for w1, w2 and θ

via analytical expressions for the derivatives of singular values and singular vectors [Magnus,

1985]. Numerical methods could also be employed to estimate the Jacobian online [Jägersand,

1995]. An alternative approach is to simply derive Jacobians for the 3D line segments induced

by the refracted LF feature, as we illustrated in Fig. 6.10, 6.13, and 6.14. Deriving analytical

expressions for 3D points and line segments is likely more intuitive and straightforward.

Further investigation into denser LF camera pose sweeps to illustrate feature continuity in

graphical form on a larger variety of refractive objects and surface curvatures would be use-

ful to test the limitations of the toric lens assumption. It is also worth noting that the slopes

recovered in this chapter are related to the position of focal lines, and that these focal lines are

a function of surface curvature. Thus, it may be possible to use our refracted LF features to

augment techniques for refractive object surface reconstruction. Finally, it may be possible to


extend our refracted LF feature concept to include reflections, which also induce multiple depth

observations (multiple slopes) in the LF, and our orientation already provides a measure for

reflection.


Chapter 7

Conclusions and Future Work

7.1 Conclusions

At the start of this thesis, we identified an opportunity to advance robotic vision in the area of

perceiving refractive objects. Although many robotic vision algorithms have been successful

assuming a Lambertian world, the real world is far from Lambertian. Water, ice, glass and clear

plastic in a variety of shapes and forms are common throughout the environments that robots

must operate within. Our goal in this research was to help remove the Lambertian assumption

in order to broaden the range of operable scenes and perceivable objects for robots.

We considered light-field cameras as a technology unique in their ability to capture scene tex-

ture, depth and view-dependent phenomena, such as occlusion, specular reflection and refrac-

tion. Furthermore, image-based visual servoing was chosen as a particularly interesting robotic

vision technique for its wide range of applicability, robustness against modelling and calibra-

tion errors, and because it did not necessarily require a 3D geometric model of the target object

to perform positioning and control tasks. Thus, the overall aim of this thesis was to use LF

cameras to advance robotic vision in the area of visual servoing towards refractive objects.


We decomposed this broad goal into the more manageable and specific objectives of demon-

strating (1) image-based visual servoing using light-field cameras for Lambertian scenes; (2)

detecting refracted image features using LF cameras; and (3) developing refracted LF features

for visual servoing towards refractive objects.

In addressing these objectives, the key developments were a result of exploring the properties

of the LF and developing algorithms to exploit them. The first objective was accomplished in

Chapter 4. LF cameras were used for image-based visual servoing. Specifically, we proposed

a novel Lambertian light-field feature and used it to derive image Jacobians from the light field

that were then used to control robot motion. To deal with the lack of available real-time LF

cameras, we designed a custom mirror-based light-field camera adapter. To the best of our

knowledge, this was the first published light-field image-based visual servoing algorithm. Our

method enabled more reliable VS compared to monocular and stereo IBVS approaches for small

or distant targets that occupy a narrow part of the camera’s FOV and in the presence of occlu-

sions. Areas in robotics that may benefit from this contribution include vision-based grasping,

manipulation and docking problems in household, medical and in-orbit satellite servicing ap-

plications.

For the second objective, discrimination of refracted image features from Lambertian image

features was accomplished in Chapter 5. We developed a discriminator based on detecting the

differences between the apparent motion of non-Lambertian and Lambertian image features in

the LF using textural cross-correlation that was more reliable than previous work. We were

able to extend these distinguishing capabilities to lenslet-based LF cameras, which typically are

limited to much smaller baselines than conventional LF camera arrays. Using our method to

reject refracted image features, we also enabled monocular SfM in the presence of refractive

objects, where traditional methods would normally fail. Domestic robots that clean dishes or

serve glasses, as well as manufacturing robots attempting to interact with or near clear plastic

packaging or heavily distorting refractive objects, such as stained glass or bottles of water, may

benefit from this research.


Finally, for the third objective, development of a refracted light-field feature to enable visual

servoing towards a refracted object was accomplished in Chapter 6. In particular, we proposed

and extracted a novel refracted LF feature that could be described by the local projections of

the refractive object. We demonstrated that our feature’s characteristics were continuous with

respect to LF camera pose to show that our feature was suitable for visual servoing without

requiring a 3D geometric model of the target refractive object.

7.2 Future Work

Over the course of this thesis, we have only scratched the surface of the unknown, uncovering more questions along with ideas that might answer them. In this section, we propose directions for future research that might build upon and improve the current state of the art for the robotic vision community.

In Chapter 6, we demonstrated the viability of our refracted light-field image feature for visual

servoing towards refractive objects. Further research in this direction is needed to achieve

a complete visual servoing system. Following our development of LF-IBVS in Chapter 4,

derivations for the refracted light-field feature Jacobian need to be performed. LF-IBVS can

also be implemented on a lenslet-based LF camera for comparison. Together, these tasks will

finally close the loop on visual servoing towards refractive objects.

Additionally, we recognise that VS only addresses part of the problem in enabling robots to

work with refractive objects. VS does not touch upon the area of interaction—grasping and

manipulation. We consider recent works that have enabled grasping of refractive objects, such

as [Zhou et al., 2018], which describes a refractive object as a distribution of depths obtained

from an LF camera. Comparisons to a 3D geometric model are made for object localization

and grasping. Zhou’s method relies on 3D models of refractive objects, while our method does

not require such explicit 3D models. Thus, there is interest in combining our two contributions


to further the functionality of robotic perception for vision-based manipulation of refractive

objects.

The performance and behaviour of VS strongly depends on the choice of image feature. The

LF feature used in Chapter 4 for LF-IBVS constrains a Lambertian point in the scene to a plane

in the 4D LF with equal slopes in all directions. We showed in Chapter 5 that we can describe

a plane in 4D to discriminate refracted from Lambertian image features with nonlinear feature

curves and unequal slopes in the horizontal and vertical directions of the LF. In Chapter 6, we

extracted a more general 4D planar light-field feature from the entire LF as a point with multiple

depths and an orientation, and demonstrated the feature’s potential use for VS. However, it may

be possible to servo on the more general 4D planar structure within the LF directly for VS.

Specifically, servoing based on the parameters that describe the plane in 4D (such as the plane’s

two linearly independent normals) provides a larger structure to estimate and track, compared

to individual point features, which may make the approach more robust in low light (night time)

and low contrast (foggy) conditions. This may also lead to analytical expressions of image

Jacobians for visual servoing towards refractive objects. Furthermore, recent advances in LF-

specific features, such as the Light-Field Feature detector and descriptor (LiFF) [Dansereau

et al., 2019], may similarly lead to improved performance and accuracy in VS.
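To make the idea of servoing on the plane parameters concrete, the sketch below fits a 2D plane to 4D ray samples (s, t, u, v) of a single feature and returns the plane's two unit normals via SVD. It is a minimal numpy illustration of one possible parameterisation, not the formulation used in this thesis, and assumes the samples have already been gathered across the views.

```python
import numpy as np

def fit_plane_normals_4d(rays):
    """rays: (N, 4) array of (s, t, u, v) samples of one feature across the LF views.
    Returns (n1, n2, centroid): the two unit normals of the best-fit 2D plane in 4D,
    taken as the directions of least variance (the last two right singular vectors)."""
    centroid = rays.mean(axis=0)
    _, _, Vt = np.linalg.svd(rays - centroid, full_matrices=True)
    n1, n2 = Vt[2], Vt[3]
    return n1, n2, centroid

def plane_residual(rays, n1, n2, centroid):
    """RMS distance of the samples from the fitted plane; how the fitted (n1, n2)
    change with camera pose could serve as the tracked quantity for servoing."""
    d = rays - centroid
    return np.sqrt(((d @ n1) ** 2 + (d @ n2) ** 2).mean())
```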

An interesting research direction is refractive object shape reconstruction using LF cameras.

Previous work has shown that occlusion boundaries provide reliable depth information for re-

fractive objects; however, these approaches have relied on monocular cameras and camera motion to collect multiple views [Ham et al., 2017]. Occlusion boundaries of refractive objects may

provide areas in the LF where the depth can be estimated. Local surface curvatures may be

estimated by comparing the depths of the occlusion boundaries to the corresponding depths of

the refracted LF feature from Chapter 6. These local surface curvatures and occlusion boundary

depths may be combined to approximately reconstruct refractive object shape.


Alternatively, a deep learning approach might be considered to relate the characteristic image

feature curves from Chapter 5 to object depth and surface curvature. Deep learning techniques

might also be used to separate diffuse, specular and refracted image features. Such approaches

are typically reliant upon large amounts of ground truth data; however, ground truth for refractive objects is difficult and often very labour intensive to obtain. To address this issue, it

may be possible to rely on simulated ground truth data that use realistic ray-tracing to form the

bulk of the training data and then rely on only a small amount of real-world data for fine-tuning

the network. We may draw on the literature from the sim-to-real field, where this approach is

referred to as a domain adaptation technique.
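As a rough sketch of such a sim-to-real strategy, a small classifier could be pre-trained on ray-traced synthetic patches and then fine-tuned on a small real-world set by freezing most of the network. The architecture, class labels and layer choices below are illustrative assumptions, not a recipe from this thesis.

```python
import torch
import torch.nn as nn

# Hypothetical classifier labelling image patches as diffuse / specular / refracted.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 3),
)

# 1) Pre-train on abundant ray-traced synthetic patches (training loop omitted).
# 2) Fine-tune on a small real-world set: freeze the convolutional backbone and
#    update only the final linear layer, a simple form of domain adaptation.
for p in model[:-1].parameters():
    p.requires_grad = False

optimiser = torch.optim.Adam(model[-1].parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def fine_tune_step(real_patches, real_labels):
    """One gradient step on a batch of real patches (B, 3, H, W) with integer labels."""
    optimiser.zero_grad()
    loss = loss_fn(model(real_patches), real_labels)
    loss.backward()
    optimiser.step()
    return loss.item()
```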

Another interesting direction of research is to use the LF camera for virtual exploration. In

this thesis, image Jacobians were computed analytically for Lambertian scenes based on point

features that ultimately relied on an approximate model of the LF camera. In visual servoing,

there exist a variety of methods to compute the image Jacobian online without prior camera

models using a set of “test movements”, which are not part of the manipulation task [Jägersand,

1995, Piepmeier et al., 2004]. However, LF cameras capture a small amount of virtual motion

by virtue of their multiple views, similar to a local image-based derivative of robot motion.

Recently, a variety of deep learning approaches to monocular VS have also emerged [Lee et al.,

2017, Bateux et al., 2018]. Thus, an LF camera may be used to estimate the image Jacobian

by comparing these multiple views and the central view of the LF to some goal image. In a

related project, our recent work demonstrated that gradients from a multi-camera array could

be used to servo towards a target object in highly-occluded scenarios [Lehnert et al., 2019],

although a non-planar grid of cameras was used, as opposed to the planar grid of cameras found in a

traditional LF camera array. Further research into these avenues may result in faster and simpler

visual servoing algorithms that can still operate in cluttered and non-Lambertian environments,

possibly without the need for LF camera calibration.
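For reference, a generic finite-difference estimator of the image Jacobian via small test movements, in the spirit of [Jägersand, 1995], might look like the sketch below. The observe() and apply_twist() callbacks are hypothetical placeholders; with an LF camera, the columns corresponding to translations parallel to the view plane could instead be approximated from inter-view feature displacements rather than physical motion.

```python
import numpy as np

def estimate_image_jacobian(observe, apply_twist, step=1e-3):
    """Numerically estimate the image Jacobian by finite differences over test movements.

    observe()        -> (M,) current image-feature vector
    apply_twist(dx)  -> commands a small 6-DOF camera displacement dx, returns the
                        feature vector observed afterwards (and undoes the motion)
    Returns J with shape (M, 6) such that ds is approximately J @ dx.
    """
    s0 = observe()
    J = np.zeros((s0.size, 6))
    for k in range(6):
        dx = np.zeros(6)
        dx[k] = step
        J[:, k] = (apply_twist(dx) - s0) / step
    return J
```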

This work has been largely addressing problems in the context of robotic vision. Taking a

broader view outside of the field of robotics, this research will hopefully inspire research directions


in other fields, such as cinematography, virtual reality, mixed or augmented reality, video games

and consumer photography. In particular, augmented reality is an emerging technology that

may rely on light-field imaging technology. Augmented reality must address a problem similar to that faced by robots: how to perceive the real world using limited sensor technologies,

whilst still enabling safe and reliable interaction. However, as humans are an integral part of

augmented reality, these interactions must also be real-time and realistic in appearance. An

improved understanding of how refractive objects behave in the light field may lead to more

realistic and faster renderings of scenes with refractive objects, as well as safer and more reliable

interaction.


Bibliography

[Adelson and Anandan, 1990] Adelson, E. H. and Anandan, P. (1990). Ordinal characteristics

of transparency. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of

Technology.

[Adelson and Bergen, 1991] Adelson, E. H. and Bergen, J. R. (1991). The plenoptic function

and the elements of early vision. Computational models of visual processing, 91(1):3–20.

[Adelson and Wang, 1992] Adelson, E. H. and Wang, J. Y. A. (1992). Single lens stereo with

a plenoptic camera. IEEE Transactions on Pattern Analysis & Machine Intelligence, 14(2):99–106.

[Adelson and Wang, 2002] Adelson, E. H. and Wang, J. Y. A. (2002). Single lens stereo

with a plenoptic camera. IEEE Transactions on Pattern Analysis and Machine Intelligence

(TPAMI), 14(2):99–106.

[Andreff et al., 2002] Andreff, N., Espiau, B., and Horaud, R. (2002). Visual servoing from

lines. The International Journal of Robotics Research, 21(8):679–699.

[Baeten et al., 2008] Baeten, J., Donné, K., Boedrij, S., Beckers, W., and Claesen, E. (2008).

Autonomous fruit picking machine: A robotic apple harvester. In Field and Service Robotics,

pages 531–539. Springer.


[Bateux and Marchand, 2015] Bateux, Q. and Marchand, E. (2015). Direct visual servoing

based on multiple intensity histograms. In IEEE International Conference on Robotics and

Automation.

[Bateux et al., 2018] Bateux, Q., Marchand, E., Leitner, J., Chaumette, F., and Corke, P. (2018).

Training deep neural networks for visual servoing. In IEEE International Conference on

Robotics and Automation, pages 3307–3314.

[Bay et al., 2008] Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. (2008). Speeded-up robust

features (SURF). Computer Vision and image understanding, 110(3):346–359.

[Ben-Ezra and Nayar, 2003] Ben-Ezra, M. and Nayar, S. K. (2003). What does motion reveal

about transparency. In Intl. Conference on Computer Vision (ICCV). IEEE Computer Society.

[Bergeles et al., 2012] Bergeles, C., Kratochvil, B. E., and Nelson, B. J. (2012). Visually ser-

voing magnetic intraocular microdevices. IEEE Transactions on Robotics, 28(4):798–809.

[Bernardes and Borges, 2010] Bernardes, M. C. and Borges, G. A. (2010). 3D line estimation

for mobile robotics visual servoing. In Congresso Brasileiro de Automática (CBA).

[Bista et al., 2016] Bista, S. R., Giordano, P. R., and Chaumette, F. (2016). Appearance-based

indoor navigation by ibvs using line segments. IEEE Robotics and Automation Letters,

1(1):423–430.

[Bolles et al., 1987] Bolles, R., Baker, H., and Marimont, D. (1987). Epipolar-plane image

analysis: An approach to determining structure from motion. Intl. Journal of Computer

Vision (IJCV), 1(1):7–55.

[Bolles and Fischler, 1981] Bolles, R. C. and Fischler, M. A. (1981). A ransac-based approach

to model fitting and its application to finding cylinders in range data. In IJCAI, volume 1981,

pages 637–643.


[Bourquardez et al., 2009] Bourquardez, O., Mahony, R., Guenard, N., Chaumette, F., Hamel,

T., and Eck, L. (2009). Image-based visual servo control of the translation kinematics of a

quadrotor aerial vehicle. Trans. on Robotics, 25(3).

[Cai et al., 2013] Cai, C., Dean-Leon, E., Mendoza, D., Somani, N., and Knoll, A. (2013).

Uncalibrated 3D stereo image-based dynamic visual servoing for robot manipulators. In

Intl. Conference on Intelligent Robots and Systems (IROS), pages 63–70. IEEE.

[Calonder et al., 2010] Calonder, M., Lepetit, V., Strecha, C., and Fua, P. (2010). Brief: Binary

robust independent elementary features. In European conference on computer vision, pages

778–792. Springer.

[Cervera et al., 2003] Cervera, E., Del Pobil, A. P., Berry, F., and Martinet, P. (2003). Improv-

ing image-based visual servoing with three-dimensional features. The International Journal

of Robotics Research, 22(10-11):821–839.

[Chan, 2014] Chan, S. C. (2014). Light field. In Computer Vision A Reference Guide, pages

447–453. Springer Link.

[Chaumette, 1998] Chaumette, F. (1998). Potential problems of stability and convergence in

image-based and position-based visual servoing. Lecture Notes in Control and Information

Sciences, 237:66–78.

[Chaumette, 2004] Chaumette, F. (2004). Image moments: a general and useful set of features

for visual servoing. IEEE Transactions on Robotics, 20(4):713–723.

[Chaumette and Hutchinson, 2006] Chaumette, F. and Hutchinson, S. (2006). Visual servo

control part 1: Basic approaches. Robotics and Automation Magazine, 6:82–90.

[Chaumette and Hutchinson, 2007] Chaumette, F. and Hutchinson, S. (2007). Visual servo

control part 2: Advanced approaches. IEEE Robotics and Automation Magazine, pages

109–118.


[Choi and Christensen, 2012] Choi, C. and Christensen, H. (2012). 3D textureless object de-

tection and tracking: An edge-based approach.

[Christensen, 2016] Christensen, H. I. (2016). A roadmap for US robotics (2016) from internet

to robotics.

[Civera et al., 2008] Civera, J., Davison, A. J., and Montiel, J. M. (2008). Inverse depth

parametrization for monocular slam. IEEE transactions on robotics, 24(5):932–945.

[Collewet and Marchand, 2009] Collewet, C. and Marchand, E. (2009). Photometry-based

visual servoing using light reflexion models. In 2009 IEEE International Conference on

Robotics and Automation, pages 701–706. IEEE.

[Collewet and Marchand, 2011] Collewet, C. and Marchand, E. (2011). Photometric visual

servoing. Trans. on Robotics, 27(4).

[Comport et al., 2011] Comport, A. I., Mahony, R., and Spindler, F. (2011). A visual servoing

model for generalised cameras: Case study of non-overlapping cameras. In 2011 IEEE

International Conference on Robotics and Automation, pages 5683–5688. IEEE.

[Corke, 2013] Corke, P. (2013). Robotics, Vision and Control. Springer.

[Corke and Hutchinson, 2001] Corke, P. and Hutchinson, S. (2001). A new partitioned ap-

proach to image-based visual servo control. Transactions on Robotics and Automation,

17(4):507–515.

[Corke, 2017] Corke, P. I. (2017). Robotics, Vision and Control. Springer, 2 edition.

[Dalal and Triggs, 2005] Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for

human detection. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR).

[Dansereau, 2014] Dansereau, D. G. (2014). Plenoptic Signal Processing for Robust Vision in

Field Robotics. PhD thesis, University of Sydney.


[Dansereau and Bruton, 2007] Dansereau, D. G. and Bruton, L. T. (2007). A 4-D dual-fan

filter bank for depth filtering in light fields. IEEE Transactions on Signal Processing (TSP),

55(2):542–549.

[Dansereau et al., 2019] Dansereau, D. G., Girod, B., and Wetzstein, G. (2019). LiFF: Light

field features in scale and depth. In Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, pages 8042–8051.

[Dansereau et al., 2011] Dansereau, D. G., Mahon, I., Pizarro, O., and Williams, S. B. (2011).

Plenoptic flow: Closed-form visual odometry for light field cameras. In Intl. Conference on

Intelligent Robots and Systems (IROS), pages 4455–4462. IEEE.

[Dansereau et al., 2013] Dansereau, D. G., Pizarro, O., and Williams, S. B. (2013). Decoding,

calibration and rectification for lenselet-based plenoptic cameras. In Intl. Conference on

Computer Vision and Pattern Recognition (CVPR), pages 1027–1034. IEEE.

[De Luca et al., 2008] De Luca, A., Oriolo, G., and Robuffo Giordano, P. (2008). Feature depth

observation for image-based visual servoing: Theory and experiments. The International

Journal of Robotics Research, 27(10):1093–1116.

[Dong et al., 2013] Dong, F., Ieng, S.-H., Savatier, X., Etienne-Cummings, R., and Benosman,

R. (2013). Plenoptic cameras in real-time robotics. The Intl. Journal of Robotics Research,

32(2):206–217.

[Dong and Soatto, 2015] Dong, J. and Soatto, S. (2015). Domain-size pooling in local de-

scriptors: Dsp-sift. In Proceedings of the IEEE conference on computer vision and pattern

recognition, pages 5097–5106.

[Drummond and Cipolla, 1999] Drummond, T. and Cipolla, R. (1999). Visual tracking and

control using lie algebras. In Proceedings. 1999 IEEE Computer Society Conference on

Computer Vision and Pattern Recognition (Cat. No PR00149), volume 2, pages 652–657.

IEEE.


[Engel et al., 2014] Engel, J., Schoeps, T., and Cremers, D. (2014). LSD-SLAM: Large-scale

direct monocular SLAM. European Conference on Computer Vision (ECCV).

[Fischler and Bolles, 1981] Fischler, M. and Bolles, R. (1981). Random sample consensus: a

paradigm for model fitting with applications to image analysis and automated cartography.

[Freeman and Fincham, 1990] Freeman, M. H. and Fincham, W. H. A. (1990). Optics. Butter-

worths, London, 10th edition.

[Fritz et al., 2009] Fritz, M., Bradski, G., Karayev, S., Darrell, T., and Black, M. (2009). An

additive latent feature model for transparent object recognition.

[Fuchs et al., 2013] Fuchs, M., Kächele, M., and Rusinkiewicz, S. (2013). Design and fabrica-

tion of faceted mirror arrays for light field capture. In Computer Graphics Forum, volume 32,

pages 246–257. Wiley Online Library.

[Gao and Zhang, 2015] Gao, X. and Zhang, T. (2015). Robust rgb-d simultaneous localization

and mapping using planar point features. Robotics and Autonomous Systems, 72:1–14.

[Georgiev et al., 2011] Georgiev, T., Lumsdaine, A., and Chunev, G. (2011). Using focused

plenoptic cameras for rich image capture. IEEE Computer Graphics and Applications,

31(1):62–73.

[Gershun, 1936] Gershun, A. (1936). Fundamental ideas of the theory of a light field (vector

methods of photometric calculations). Journal of Mathematics and Physics, 18.

[Ghasemi and Vetterli, 2014] Ghasemi, A. and Vetterli, M. (2014). Scale-invariant represen-

tation of light field images for object recognition and tracking. In IS&T/SPIE Electronic

Imaging. International Society for Optics and Photonics.

[Godard et al., 2017] Godard, C., Mac Aodha, O., and Brostow, G. J. (2017). Unsupervised

monocular depth estimation with left-right consistency. In Proceedings of the IEEE Confer-

ence on Computer Vision and Pattern Recognition, pages 270–279.


[Gortler et al., 1996] Gortler, S., Grzeszczuk, R., Szeliski, R., and Cohen, M. (1996). The

lumigraph. In SIGGRAPH, pages 43–54. ACM.

[Grossmann, 1987] Grossmann, P. (1987). Depth from focus. Pattern recognition letters,

5(1):63–69.

[Gu et al., 1997] Gu, X., Gortler, S., and Cohen, M. (1997). Polyhedral geometry and the two-

plane parameterisation. In Proc. Eurographics Workshop on Rendering Techniques, pages

1–12. Springer.

[Gupta et al., 2014] Gupta, S., Girshick, R., Arbeláez, P., and Malik, J. (2014). Learning rich

features from rgb-d images for object detection and segmentation. In European Conference

on Computer Vision, pages 345–360. Springer.

[Ham et al., 2017] Ham, C., Singh, S., and Lucey, S. (2017). Occlusions are fleeting - texture

is forever: Moving past brightness constancy. In WACV.

[Han et al., 2015] Han, K., Wong, K.-Y. K., and Liu, M. (2015). A fixed viewpoint approach

for dense reconstruction of transparent objects. In Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, pages 4001–4008.

[Han et al., 2018] Han, K., Wong, K.-Y. K., and Liu, M. (2018). Dense reconstruction of trans-

parent objects by altering incident light paths through refraction. International Journal of

Computer Vision, 126(5):460–475.

[Han et al., 2012] Han, K.-S., Kim, S.-C., Lee, Y.-B., Kim, S.-C., Im, D.-H., Choi, H.-K.,

and Hwang, H. (2012). Strawberry harvesting robot for bench-type cultivation. Journal of

Biosystems Engineering, 37(1):65–74.

[Harris and Stephens, 1988] Harris, C. and Stephens, M. (1988). A combined corner and edge

detector. In Alvey vision conference, volume 15, page 50.

[Hartley and Zisserman, 2003] Hartley, R. and Zisserman, A. (2003). Multiple View Geometry

in Computer Vision. Cambridge.


[Hata et al., 1996] Hata, S., Saitoh, Y., Kumamura, S., and Kaida, K. (1996). Shape extraction

of transparent object using genetic algorithm. In Proceedings of 13th International Confer-

ence on Pattern Recognition, volume 4, pages 684–688. IEEE.

[Hecht, 2002] Hecht, E. (2002). Optics. Addison-Wesley, 4th edition.

[Hill, 1979] Hill, J. (1979). Real time control of a robot with a mobile camera. In 9th Int. Symp.

on Industrial Robots, 1979, pages 233–246.

[Hinton, 1884] Hinton, C. H. (1884). What is the fourth dimension? Scientific Romances,

1:1–22.

[Honauer et al., 2016] Honauer, K., Johannsen, O., Kondermann, D., and Goldluecke, B.

(2016). A dataset and evaluation methodology for depth estimation on 4D light fields. In

Asian Conference on Computer Vision, pages 19–34. Springer.

[Hutchinson et al., 1996] Hutchinson, S., Hager, G., and Corke, P. (1996). A tutorial on visual

servo control. Transactions on Robotics and Automation, 12(5):651–670.

[Ideguchi et al., 2017] Ideguchi, Y., Uranishi, Y., Yoshimoto, S., Kuroda, Y., and Oshiro, O.

(2017). Light field convergency: Implicit photometric consistency on transparent surface.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Work-

shops, pages 41–49.

[Ihrke et al., 2010a] Ihrke, I., Kutulakos, K., Lensch, H., Magnor, M., and Heidrich, W.

(2010a). Transparent and specular object reconstruction. Computer Graphics forum,

29:2400–2426.

[Ihrke et al., 2010b] Ihrke, I., Wetzstein, G., and Heidrich, W. (2010b). A theory of plenoptic

multiplexing. In Intl. Conference on Computer Vision and Pattern Recognition (CVPR),

pages 483–490. IEEE.

[Irani and Anandan, 1999] Irani, M. and Anandan, P. (1999). About direct methods. In Work-

shop on Vision Algorithms. Springer.


[Iwatsuki and Okiyama, 2005] Iwatsuki, M. and Okiyama, N. (2005). A new formulation of

visual servoing based on cylindrical coordinate system. IEEE Transactions on Robotics,

21(2):266–273.

[Jachnik et al., 2012] Jachnik, J., Newcombe, R. A., and Davison, A. J. (2012). Real-time

surface light field capture for augmentation of planar specular surfaces. In Mixed and Aug-

mented Reality (ISMAR), 2012 IEEE Intl. Symposium on, pages 91–97. IEEE.

[Jägersand, 1995] Jägersand, M. (1995). Visual servoing using trust region methods and esti-

mation of the full coupled visual-motor jacobian. image, 11:1.

[Jang et al., 1991] Jang, W., Kim, K., Chung, M., and Bien, Z. (1991). Concepts of augmented

image space and transformed feature space for efficient visual servoing of an “eye-in-hand

robot”. Robotica, 9:203–212.

[Jerian and Jain, 1991] Jerian, C. P. and Jain, R. (1991). Structure from motion-a critical anal-

ysis of methods. IEEE Transactions on systems, Man, and Cybernetics, 21(3):572–588.

[Johannsen et al., 2017] Johannsen, O. et al. (2017). A taxonomy and evaluation of dense light

field depth estimation algorithms. In CVPR Workshop.

[Johannsen et al., 2015] Johannsen, O., Sulc, A., and Goldluecke, B. (2015). On linear struc-

ture from motion for light field cameras. In Intl. Conference on Computer Vision (ICCV),

pages 720–728.

[Johnson and Hebert, 1999] Johnson, A. and Hebert, M. (1999). Using spin images for efficient

object recognition in cluttered 3D scenes.

[Kemp et al., 2007] Kemp, C. C., Edsinger, A., and Torres-Jara, E. (2007). Challenges for

robot manipulation in human environments.

[Keshmiri and Xie, 2017] Keshmiri, M. and Xie, W.-F. (2017). image-based visual servoing

using an optimized trajectory planning technique. IEEE Transactions on Mechatronics,

22(1):359–370.


[Kim et al., 2017] Kim, J., Reshetouski, I., and Ghosh, A. (2017). Acquiring axially-symmetric

transparent objects using single-view transmission imaging. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, pages 3559–3567.

[Klank et al., 2011] Klank, U., Carton, D., and Beetz, M. (2011). Transparent object detection

and reconstruction on a mobile platform. In 2011 IEEE International Conference on Robotics

and Automation, pages 5971–5978. IEEE.

[Kompella and Sturm, 2011] Kompella, V. R. and Sturm, P. (2011). Detection and avoidance

of semi-transparent obstacles using a collective-reward based approach. In 2011 IEEE Inter-

national Conference on Robotics and Automation, pages 3469–3474. IEEE.

[Kragic and Christensen, 2002] Kragic, D. and Christensen, H. (2002). Survey on visual ser-

voing for manipulation.

[Krizhhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet

classification with deep convolutional neural networks.

[Krotkov and Bajcsy, 1993] Krotkov, E. and Bajcsy, R. (1993). Active vision for reliable rang-

ing: Cooperating focus, stereo, and vergence. Int. Journal of Computer Vision, 11(2):187–

203.

[Kurt and Edwards, 2009] Kurt, M. and Edwards, D. (2009). A survey of brdf models for

computer graphics. ACM SIGGRAPH Computer Graphics, 43(2):4.

[Kutulakos and Steger, 2007] Kutulakos, K. N. and Steger, E. (2007). A theory of refractive

and specular 3D shape by light-path triangulation. 76(1).

[Le et al., 2011] Le, M.-H., Woo, B.-S., and Jo, K.-H. (2011). A comparison of SIFT and Harris corner features for correspondence points matching. In 2011 17th Korea-Japan Joint

Workshop on Frontiers of Computer Vision (FCV), pages 1–4. IEEE.

[Lee et al., 2017] Lee, A. X., Levine, S., and Abbeel, P. (2017). Learning visual servoing with

deep features and fitted q-iteration. arXiv preprint arXiv:1703.11000.


[Lee, 2005] Lee, H.-C. (2005). Introduction to Color Imaging Science. Cambridge University

Press.

[Lehnert et al., 2019] Lehnert, C., Tsai, D., Eriksson, A., and McCool, C. (2019). 3D Move to

See: Multi-perspective visual servoing for improving object views with semantic segmenta-

tion. In Intl. Conference on Intelligent Robots and Systems (IROS).

[Levin and Durand, 2010] Levin, A. and Durand, F. (2010). Linear view synthesis using a

dimensionality gap light field prior. In Intl. Conference on Computer Vision and Pattern

Recognition (CVPR), pages 1831–1838. IEEE.

[Levoy and Hanrahan, 1996] Levoy, M. and Hanrahan, P. (1996). Light field rendering. In

SIGGRAPH, pages 31–42. ACM.

[Levoy et al., 2000] Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L.,

Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., et al. (2000). The digital michelangelo

project: 3D scanning of large statues. In Proceedings of the 27th annual conference on

Computer graphics and interactive techniques, pages 131–144. ACM Press/Addison-Wesley

Publishing Co.

[Li et al., 2008] Li, H., Hartley, R., and Kim, J.-h. (2008). A linear approach to motion estima-

tion using generalized camera models. In 2008 IEEE Conference on Computer Vision and

Pattern Recognition, pages 1–8. IEEE.

[Lippmann, 1908] Lippmann, G. (1908). Epreuves reversibles. photographies integrals.

Comptes-Rendus Academie des Sciences, 146:446–451.

[López-Nicolás et al., 2010] López-Nicolás, G., Guerrero, J. J., and Sagüés, C. (2010). Vi-

sual control through the trifocal tensor for nonholonomic robots. Robotics and Autonomous

Systems, 58(2):216–226.


[Low et al., 2007] Low, E. M., Manchester, I. R., and Savkin, A. V. (2007). A biologically in-

spired method for vision-based docking of wheeled mobile robots. Robotics and Autonomous

Systems, 55(10):769–784.

[Lowe, 2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints.

Intl. Journal of Computer Vision (IJCV), 60(2):91–110.

[Luke et al., 2014] Luke, J., Rosa, F., Marichal, J., Sanluis, J., Dominguez Conde, C., and

Rodriguez-Ramos, J. (2014). Depth from light fields analyzing 4D local structure. Display

Technology, Journal of.

[Lumsdaine and Georgiev, 2008] Lumsdaine, A. and Georgiev, T. (2008). Full resolution light

field rendering. Technical report, Adobe Systems.

[Lumsdaine and Georgiev, 2009] Lumsdaine, A. and Georgiev, T. (2009). The focused plenop-

tic camera. In Computational Photography (ICCP), pages 1–8. IEEE.

[Luo et al., 2015] Luo, R., Lai, P.-J., and Ee, V. W. S. (2015). Transparent object recognition

and retrieval for robotic bio-laboratory automation applications. Intl. Conference on Intelli-

gent Robots and Systems (IROS).

[Lysenkov, 2013] Lysenkov, I. (2013). Recognition and pose estimation of rigid transparent

objects with a kinect sensor. Robotics: Science and Systems VIII, page 273.

[Lytro, 2015] Lytro (2015). Lytro Illum User Manual. Lytro Inc., Mountain View, CA.

[Maeno et al., 2013] Maeno, K., Nagahara, H., Shimada, A., and Taniguchi, R.-I. (2013). Light

field distortion feature for transparent object recognition. In Intl. Conference on Computer

Vision and Pattern Recognition (CVPR). IEEE.

[Magnus, 1985] Magnus, J. R. (1985). On differentiating eigenvalues and eigenvectors. Econo-

metric Theory, 1(2):179–191.


[Mahony et al., 2002] Mahony, R., Corke, P., and Chaumette, F. (2002). Choice of image fea-

tures for depth-axis control in image based visual servo control. In Intl. Conference on

Intelligent Robots and Systems (IROS), pages 390–395. IEEE.

[Malis and Chaumette, 2000] Malis, E. and Chaumette, F. (2000). 2 1/2 d visual servoing with

respect to unknown objects through a new estimation scheme of camera displacement. In-

ternational Journal of Computer Vision, 37(1):79–97.

[Malis et al., 1999] Malis, E., Chaumette, F., and Boudet, S. (1999). 2 1/2 d visual servoing.

IEEE Transactions on Robotics and Automation, 15(2):238–250.

[Malis et al., 2000] Malis, E., Chaumette, F., and Boudet, S. (2000). Multi-cameras visual

servoing. In Robotics and Automation (ICRA), pages 3183–3188. IEEE.

[Malis and Rives, 2003] Malis, E. and Rives, P. (2003). Robustness of image-based visual

servoing with respect to depth distribution errors. In 2003 IEEE International Conference

on Robotics and Automation (Cat. No. 03CH37422), volume 1, pages 1056–1061. IEEE.

[Marchand and Chaumette, 2017] Marchand, E. and Chaumette, F. (2017). Visual servoing

through mirror reflection. In 2017 IEEE International Conference on Robotics and Automa-

tion (ICRA), pages 3798–3804. IEEE.

[Mariottini et al., 2007] Mariottini, G. L., Oriolo, G., and Prattichizzo, D. (2007). Image-based

visual servoing for nonholonomic mobile robots using epipolar geometry. IEEE Transactions

on Robotics, 23(1):87–100.

[Marto et al., 2017] Marto, S. G., Monteiro, N. B., Barreto, J. P., and Gaspar, J. A. (2017).

Structure from plenoptic imaging. In 2017 Joint IEEE International Conference on Devel-

opment and Learning and Epigenetic Robotics (ICDL-EpiRob), pages 338–343. IEEE.

[McFadyen et al., 2017] McFadyen, A., Jabeur, M., and Corke, P. (2017). Image-based vi-

sual servoing with unknown point feature correspondence. IEEE Robotics and Automation

Letters, 2(2):601–607.


[McHenry et al., 2005] McHenry, K., Ponce, J., and Forsyth, D. (2005). Finding glass. In

2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition

(CVPR’05), volume 2, pages 973–979. IEEE.

[Mehta and Burks, 2014] Mehta, S. and Burks, T. (2014). Vision-based control of robotic ma-

nipulator for citrus harvesting. Computers and Electronics in Agriculture, 102:146–158.

[Mezouar and Allen, 2002] Mezouar, Y. and Allen, P. K. (2002). Visual servoed microposi-

tioning for protein manipulation tasks. In IEEE/RSJ International Conference on Intelligent

Robots and Systems, volume 2, pages 1766–1771. IEEE.

[Miyazaki and Ikeuchi, 2005] Miyazaki, D. and Ikeuchi, K. (2005). Inverse polarisation ray-

tracing: estimating surface shapes of transparent objects. Intl. Conference on Computer

Vision and Pattern Recognition (CVPR).

[Morris and Kutulakos, 2007] Morris, N. J. W. and Kutulakos, K. N. (2007). Reconstructing

the surface of inhomogeneous transparent scenes by scatter-trace photography. 76(1).

[Muja and Lowe, 2009] Muja, M. and Lowe, D. G. (2009). Fast approximate nearest neighbors

with automatic algorithm configuration. VISAPP (1), 2(331-340):2.

[Mukaigawa et al., 2010] Mukaigawa, Y., Tagawa, S., Kim, J., Raskar, R., Matsushita, Y., and

Yagi, Y. (2010). Hemispherical confocal imaging using turtleback reflector. In Computer

Vision–ACCV 2010, pages 336–349. Springer.

[Murase, 1990] Murase, H. (1990). Surface shape reconstruction of an undulating transparent

object. In [1990] Proceedings Third International Conference on Computer Vision, pages

313–317. IEEE.

[Neumann and Fermuller, 2003] Neumann, J. and Fermuller, C. (2003). Polydioptric camera

design and 3D motion estimation. Intl. Conference on Computer Vision and Pattern Recog-

nition (CVPR).


[Newcombe et al., 2011] Newcombe, R. A., Lovegrove, S., and Davison, A. J. (2011). DTAM:

dense tracking and mapping in real-time. In Intl. Conference on Computer Vision (ICCV),

pages 2320–2327.

[Ng et al., 2005] Ng, R., Levoy, M., Bredif, M., Duval, G., Horowitz, M., and Hanrahan, P.

(2005). Light field photography with a hand-held plenoptic camera. Technical report, Stan-

ford University Computer Science.

[O’Brien et al., 2018] O’Brien, S., Trumpf, J., Ila, V., and Mahony, R. (2018). Calibrating light

field cameras using plenoptic disc features. In 2018 International Conference on 3D Vision

(3DV), pages 286–294. IEEE.

[Pages et al., 2006] Pages, J., Collewet, C., Chaumette, F., and Salvi, J. (2006). An approach to

visual servoing based on coded light. In Proceedings 2006 IEEE International Conference

on Robotics and Automation, 2006. ICRA 2006., pages 4118–4123. IEEE.

[Papanikolopoulos and Khosla, 1993] Papanikolopoulos, N. P. and Khosla, P. K. (1993). Adap-

tive robotic visual tracking: Theory and experiments. IEEE Transactions on Automatic Con-

trol, 38(3):429–445.

[Pedrotti, 2008] Pedrotti, L. S. (2008). Fundamentals of Photonics.

[Perwass and Wietzke, 2012] Perwass, C. and Wietzke, L. (2012). Single lens 3D-camera with

extended depth-of-field. In IST/SPIE Electronic Imaging, pages 829108–829108. Interna-

tional Society for Optics and Photonics.

[Phong, 1975] Phong, B. T. (1975). Illumination for computer generated pictures. Communi-

cations of the ACM, 18(6):311–317.

[Piepmeier et al., 2004] Piepmeier, J. A., McMurray, G. V., and Lipkin, H. (2004). Uncali-

brated dynamic visual servoing. IEEE Transactions on Robotics and Automation, 20(1):143–

147.


[Quadros, 2014] Quadros, A. J. (2014). Representing 3D Shape in Sparse Range Images for

Urban Object Classification. Thesis, University of Sydney.

[Raytrix, 2015] Raytrix (2015). Raytrix light field sdk.

[Rosten et al., 2009] Rosten, E., Porter, R., and Drummond, T. (2009). Faster and better: A

machine learning approach to corner detection. IEEE Trans. Pattern Analysis and Machine

Intelligence (to appear).

[Rublee et al., 2011] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. (2011). Orb: an

efficient alternative to sift or surf. In Intl. Conference on Computer Vision (ICCV).

[Salti et al., 2014] Salti, S., Tombari, F., and Stefano, L. D. (2014). SHOT: Unique signatures

of histograms for surface and texture description. Computer Vision and Image Understand-

ing, 125:251–264.

[Saxena et al., 2006] Saxena, A., Chung, S. H., and Ng, A. Y. (2006). Learning depth from

single monocular images. In Advances in neural information processing systems, pages

1161–1168.

[Saxena et al., 2008] Saxena, A., Driemeyer, J., and Ng, A. (2008). Robotic grasping of novel

objects using vision. International Journal of Robotics Research.

[Schlick, 1994] Schlick, C. (1994). A survey of shading and reflectance models. In Computer

Graphics Forum, volume 13, pages 121–131. Wiley Online Library.

[Schoenberger and Frahm, 2016] Schoenberger, J. and Frahm, J.-M. (2016). Structure-from-

motion revisited. CVPR.

[Schoenberger et al., 2017] Schoenberger, J., Hardmeier, H., Sattler, T., and Pollefeys, M.

(2017). Comparative evaluation of hand-crafted and learned local features. Intl. Confer-

ence on Computer Vision and Pattern Recognition (CVPR).


[Shafer, 1985] Shafer, S. A. (1985). Using color to separate reflection components. Color

Research & Application, 10(4):210–218.

[Shi and Tomasi, 1993] Shi, J. and Tomasi, C. (1993). Good features to track. Technical report,

Cornell University.

[Siciliano and Khatib, 2016] Siciliano, B. and Khatib, O. (2016). Springer handbook of

robotics. Springer.

[Smith et al., 2009] Smith, B. M., Zhang, L., Jin, H., and Agarwala, A. (2009). Light field

video stabilization. In Intl. Conference on Computer Vision (ICCV).

[Song et al., 2015] Song, W., Liu, Y., Li, W., and Wang, Y. (2015). Light-field acquisition using

a planar catadioptric system. Optics Express, 23(24):31126–31135.

[Strecke et al., 2017] Strecke, M., Alperovich, A., and Goldluecke, B. (2017). Accurate depth

and normal maps from occlusion-aware focal stack symmetry. In Intl. Conference on Com-

puter Vision and Pattern Recognition (CVPR).

[Sturm et al., 2011] Sturm, P., Ramalingam, S., Tardif, J.-P., Gasparini, S., Barreto, J., et al.

(2011). Camera models and fundamental concepts used in geometric computer vision. Foun-

dations and Trends® in Computer Graphics and Vision, 6(1–2):1–183.

[Szeliski et al., 2000] Szeliski, R., Avidan, S., and Anandan, P. (2000). Layer extraction from

multiple images containing reflections and transparency. In Proceedings IEEE Conference

on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), volume 1,

pages 246–253. IEEE.

[Tahri and Chaumette, 2003] Tahri, O. and Chaumette, F. (2003). Application of moment in-

variants to visual servoing. In 2003 IEEE International Conference on Robotics and Au-

tomation (Cat. No. 03CH37422), volume 3, pages 4276–4281. IEEE.


[Tao et al., 2013] Tao, M. W., Hadap, S., Malik, J., and Ramamoorthi, R. (2013). Depth

from combining defocus and correspondence using light field cameras. In Computer Vision

(ICCV), 2013 IEEE International Conference on, pages 673–680. IEEE.

[Tao et al., 2016] Tao, M. W., Su, J.-C., Wang, T.-C., Malik, J., and Ramamoorthi, R. (2016).

Depth estimation and specular removal for glossy surfaces using point and line consistency

with light field cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence,

38(6):1155–1168.

[Teixeira et al., 2017] Teixeira, J. A., Brites, C., Pereira, F., and Ascenso, J. (2017). Epipolar

based light field key-location detector. In Multimedia Signal Processing.

[Teulière and Marchand, 2014] Teulière, C. and Marchand, E. (2014). A dense and direct ap-

proach to visual servoing using depth maps. IEEE Transactions on Robotics, 30(5):1242–

1249.

[Tombari et al., 2010] Tombari, F., Salti, S., and Stefano, L. D. (2010). Unique signatures of

histograms for local surface description. ECCV.

[Torr and Zisserman, 2000] Torr, P. H. and Zisserman, A. (2000). Mlesac: A new robust es-

timator with application to estimating image geometry. Computer vision and image under-

standing, 78(1):138–156.

[Tosic and Berkner, 2014] Tosic, I. and Berkner, K. (2014). 3D keypoint detection by light

field scale-depth space analysis. In Image Processing (ICIP). IEEE.

[Triggs et al., 2000] Triggs, B., McLauchlan, P., Hartley, R., and Fitzgibbon, A. (2000). Bundle

adjustment - a modern synthesis. Vision Algorithms, pages 298–372.

[Tsai et al., 2015] Tsai, C.-Y., Veeraraghavan, A., and Sankaranarayanan, A. C. (2015). What

does a single light-ray reveal about a transparent object? In 2015 IEEE International Con-

ference on Image Processing (ICIP), pages 606–610. IEEE.


[Tsai et al., 2016] Tsai, D., Dansereau, D., Martin, S., and Corke, P. (2016). Mirrored Light

Field Video Camera Adapter. Technical report, Queensland University of Technology.

[Tsai et al., 2017] Tsai, D., Dansereau, D. G., Peynot, T., and Corke, P. (2017). Image-based

visual servoing with light field cameras. IEEE Robotics and Automation Letters, 2(2):912–

919.

[Tsai et al., 2019] Tsai, D., Dansereau, D. G., Peynot, T., and Corke, P. (2019). Distinguishing

refracted features using light field cameras with application to structure from motion. IEEE

Robotics and Automation Letters, 4(2):177–184.

[Tsai et al., 2013] Tsai, D., Nesnas, I., and Zarzhitsky, D. (2013). Autonomous vision-based

tether-assisted rover docking. In Intl. Conference on Intelligent Robots and Systems (IROS).

IEEE.

[Tuytelaars et al., 2008] Tuytelaars, T., Mikolajczyk, K., et al. (2008). Local invariant feature

detectors: a survey. Foundations and trends in computer graphics and vision, 3(3):177–280.

[Vaish et al., 2006] Vaish, V., Levoy, M., Szeliski, R., Zitnick, C., and Kang, S. (2006). Re-

constructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures.

In Intl. Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages

2331–2338. IEEE.

[Verdie et al., 2015] Verdie, Y., Yi, K. M., Fua, P., and Lepetit, V. (2015). TILDE: A temporally

invariant learned DEtector. Intl. Conference on Computer Vision and Pattern Recognition

(CVPR).

[Walter et al., 2015] Walter, C., Penzlin, F., Schulenburg, E., and Elkmann, N. (2015). En-

abling multi-purpose mobile manipulators: Localization of glossy objects using a light field

camera. In Conference on Emerging Technologies & Factory Automation (ETFA), pages 1–8.

IEEE.


[Wanner and Goldeluecke, 2014] Wanner, S. and Goldeluecke, B. (2014). Variational light

field analysis for disparity estimation and super-resolution. IEEE Trans. on Pattern Analysis

and Machine Intelligence, 36(3).

[Wanner and Goldluecke, 2012] Wanner, S. and Goldluecke, B. (2012). Globally consistent

depth labeling of 4D light fields. In Intl. Conference on Computer Vision and Pattern Recog-

nition (CVPR).

[Wanner and Golduecke, 2013] Wanner, S. and Golduecke, B. (2013). Reconstructing reflec-

tive and transparent surfaces from epipolar plane images. Proc. 35th German Conf. Pattern

Recog.

[Wei et al., 2013] Wei, Y., Kang, L., Yang, B., and Wu, L. (2013). Applications of structure

from motion: a survey. Journal of Zhejiang University-SCIENCE C (Computers & Electron-

ics), 14(7).

[Weisstein, 2017] Weisstein, E. W. (2017). Hyperplane. http://mathworld.wolfram.com/Hyperplane.html. [Online; accessed 19-July-2017].

[Wetzstein et al., 2011] Wetzstein, G., Roodnick, D., Heidrich, W., and Raskar, R. (2011). Re-

fractive shape from light field distortion. In Intl. Conference on Computer Vision (ICCV),

pages 1180–1186. IEEE.

[Wilburn et al., 2004] Wilburn, B., Joshi, N., Vaish, V., Levoy, M., and Horowitz, M. (2004).

High-speed videography using a dense camera array. In Intl. Conference on Computer Vision

and Pattern Recognition (CVPR), volume 2, pages II–294. IEEE.

[Wilburn et al., 2005] Wilburn, B., Joshi, N., Vaish, V., Talvala, E., Antunez, E., Barth, A.,

Adams, A., Horowitz, M., and Levoy, M. (2005). High performance imaging using large

camera arrays. ACM Transactions on Graphics (TOG), 24(3):765–776.


[Wilson et al., 1996] Wilson, W. J., Hulls, C. W., and Bell, G. S. (1996). Relative end-effector

control using cartesian position based visual servoing. IEEE Transactions on Robotics and

Automation, 12(5):684–696.

[Xu et al., 2015] Xu, Y., Nagahara, H., Shimada, A., and ichiro Taniguchi, R. (2015). Transcut:

Transparent object segmentation from a light field image. Intl. Conference on Computer

Vision and Pattern Recognition (CVPR).

[Yamamoto, 1986] Yamamoto, M. (1986). Determining three-dimensional structure from im-

age sequences given by horizontal and vertical moving camera. Denshi Tsushin Gakkai

Ronbunshi (Transactions of the Institute of Electronics, Information and Communication

Engineers of Japan), pages 1631–1638.

[Yeasin and Sharma, 2005] Yeasin, M. and Sharma, R. (2005). Foveated vision sensor and im-

age processing–a review. In Machine Learning and Robot Perception, pages 57–98. Springer.

[Yi et al., 2016] Yi, K. M., Trulls, E., Lepetit, V., and Fua, P. (2016). LIFT: Learned invariant

feature transform. arXiv.

[Zeller et al., 2015] Zeller, N., Quint, F., and Stilla, U. (2015). Narrow field-of-view visual

odometry based on a focused plenoptic camera. In ISPRS Annals of the Photogrammetry,

Remote Sensing and Spatial Information Sciences.

[Zhang et al., 2018] Zhang, K., Chen, J., and Chaumette, F. (2018). Visual servoing with tri-

focal tensor. In 2018 IEEE Conference on Decision and Control (CDC), pages 2334–2340.

IEEE.

[Zhang et al., 2017] Zhang, Y., Yu, P., Yang, W., Ma, Y., and Yu, J. (2017). Ray space features

for plenoptic structure-from-motion. In Proceedings of the IEEE International Conference

on Computer Vision, pages 4631–4639.


[Zhou et al., 2018] Zhou, Z., Sui, Z., and Jenkins, O. C. (2018). Plenoptic Monte Carlo object

localization for robot grasping under layered translucency. In 2018 IEEE/RSJ International

Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE.


Appendix A

Mirrored Light-Field Video Camera Adapter

This appendix proposes the design of a custom mirror-based light-field camera adapter

that is cheap, simple in construction, and accessible. Mirrors of different shape and orientation

reflect the scene into an upwards-facing camera to create an array of virtual cameras with over-

lapping field of view at specified depths, and deliver video frame rate s. We describe the design,

construction, decoding and calibration processes of our mirror-based light-field camera adapter

in preparation for an open-source release to benefit the robotic vision community.

The latest report, computer-aided design models, diagrams and code can be obtained from the

following repository:

https://bitbucket.org/acrv/mirrorcam.


A.1 Introduction

Light-field cameras are a new paradigm in imaging technology that may greatly augment the

computer vision and robotics fields. Unlike conventional cameras that only capture spatial

information in 2D, light-field cameras capture both spatial and angular information in 4D using

multiple views of the same scene within a single shot [Ng et al., 2005]. Doing so implicitly

encodes geometry and texture, and allows for depth extraction. Capturing multiple views of

the same scene also allows light-field cameras to handle occlusions [Walter et al., 2015], and

non-Lambertian (glossy, shiny, reflective, transparent) surfaces that often break most modern

computer vision and robotic techniques [Vaish et al., 2006].

Robots must operate in continually changing environments on relatively constrained platforms.

As such, the robotics community is interested in low cost, computationally inexpensive, and

real-time camera performance. Unfortunately, there is a scarcity of commercially available

light-field cameras appropriate for robotics applications. Specifically, no commercial camera

delivers 4D light fields at video frame rates¹. Creating a full camera array comes with additional synchronization, bulk, input-output and bandwidth issues. However, the advantages of our approach are video frame-rate LF video allowing real-time performance, the ability to customize the design to optimize key performance metrics required for the application, and the ease of fabrication. The main disadvantages of our approach are a lower resolution, a lower FOV², and a more

complex decoding process.

Therefore, we constructed our own LF video camera by employing a mirror-based adapter. This

approach splits the camera’s field of view into sub-images using an array of planar mirrors. By

appropriately positioning the mirrors, a grid of virtual views with overlapping fields of view

can be constructed, effectively capturing a light field. We 3D-printed the mount based on our design, and

populated the mount with laser-cut acrylic mirrors.

¹ Though one manufacturer provides video, it does not provide a 4D LF, only 2D, RGBD or raw lenslet images with no method for decoding to 4D.
² A 3×3 array will have 1/3 the FOV of the base camera.



Figure A.1: (a) MirrorCam mounted on the Kinova MICO robot manipulator. Nine mirrors of different shape and orientation reflect the scene into the upwards-facing camera to create 9 virtual cameras, which provide video frame-rate light fields. (b) A whole image captured by the MirrorCam and (c) the same decoded into a light-field parameterisation of 9 sub-images, visualized as a 2D tiling of 2D images. The non-rectangular sub-images allow for greater FOV overlap [Tsai et al., 2017].


The main contribution of this appendix is the design and construction of a mirror-based adapter

like the one shown in Fig. A.1a, which we refer to as MirrorCam. We provide a novel optimiza-

tion routine for the design of the custom mirror-based camera that models each mirror using

a 3-Degree-of-Freedom (DOF) reflection matrix. The calibration step uses 3-DOF mirrors as

well; the design step allows non-rectangular projected images. We aim to make the design,

methodology and code open-source to benefit the robotic vision research community.
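A minimal sketch of the planar-mirror geometry behind the 3-DOF mirror model mentioned above is given below: reflecting the real camera pose about each mirror plane (unit normal n and offset d, so three degrees of freedom per facet) yields the corresponding virtual camera. The exact parameterisation in the released code may differ, and the numbers in the example are illustrative only.

```python
import numpy as np

def mirror_reflection_matrix(n, d):
    """4x4 homogeneous reflection about the plane n . x = d (n is a unit normal).
    The plane has 3 DOF: two for the normal direction and one for the offset d."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    D = np.eye(4)
    D[:3, :3] = np.eye(3) - 2.0 * np.outer(n, n)   # reflect point p: p' = p - 2(n.p - d)n
    D[:3, 3] = 2.0 * d * n
    return D

def virtual_camera_pose(T_cam, n, d):
    """Pose of the virtual camera created by one planar mirror facet.
    T_cam is the 4x4 pose of the real (upward-facing) camera in the rig frame.
    Note the reflection flips handedness, which the decoding step must account for."""
    return mirror_reflection_matrix(n, d) @ T_cam

# e.g. a facet tilted 45 degrees about x, 60 mm from the rig origin (illustrative numbers)
T_virtual = virtual_camera_pose(np.eye(4), n=[0.0, 0.7071, 0.7071], d=0.06)
```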

The remainder of this appendix is organized as follows. Section A.2 provides some background

on light-field cameras in relation to the MirrorCam. Section A.3 explains our methods for

designing, optimizing, constructing, decoding and calibrating the MirrorCam. And finally in

Section A.4, we conclude the appendix and explore future work.

A.2 Background

Light-field cameras measure the amount of light travelling along each ray that intersects the

sensor by acquiring multiple views of a single scene. Doing so allows these cameras to obtain

geometry, texture and depth information within a single light-field image/photograph. Some excellent references for light fields are [Adelson and Wang, 2002, Chan, 2014, Dansereau, 2014].

Table A.1 compares some of the most common LF camera architectures. The most prevalent

are the camera array [Wilburn et al., 2005], and the micro-lens array (MLA) [Ng et al., 2005].

However, the commercially-available light-field cameras are insufficient for providing light fields for real-time robotics. Notably, the Lytro Illum does not provide light fields at a video frame rate [Lytro, 2015]. The Raytrix R10 is a light-field camera that captures the light field at 7-30 frames-per-second

(FPS); however, the camera uses lenslets with different focal lengths, which makes decoding the

raw image extremely difficult, and only provides 3D depth maps [Raytrix, 2015]. Furthermore,

as commercial products, the light-field camera companies have not disclosed details on how

to access and decode the light-field camera images, forcing researchers to hack solutions with


limited success. All of these reasons motivate a customizable, easy-to-access, easy to construct,

and open-source video frame-rate light-field camera.

A.3 Methods

We constructed our own LF video camera by employing a mirror-based adapter based on previous works [Fuchs et al., 2013, Song et al., 2015, Mukaigawa et al., 2010]. This approach slices the original camera image into sub-images using an array of planar mirrors. Curved mirrors may offer better optics; however, they are difficult to manufacture, whereas planar mirrors are much more accessible and customizable. A grid of virtual views with overlapping fields of view can be constructed by carefully aligning the mirrors. These multiple views effectively capture a light field. Our approach differs from previous work by reducing the optimization routine to a single tunable parameter, and by identifying the fundamental trade-off between depth of field and field of view in the design of mirrored LF cameras. Additionally, we utilize non-square mirror shapes.

A.3.1 Design & Optimization

Because an array of mirrors has insufficient degrees of freedom to provide both perfectly overlapping FOVs and perfectly positioned projective centres, we employ an optimization algorithm to strike a balance between these factors, as in [Fuchs et al., 2013]. A tunable parameter determines the relative importance of closeness to a perfect grid of virtual poses versus field-of-view overlap, the latter evaluated at a set of user-defined depths. The grid of virtual poses is allowed to be rectangular, to better exploit rectangular camera FOVs.

Table A.1: Comparison of Accessibility for Different LF Camera Systems

LF Systems              Sync    FPS¹    Customizability    Open-Source
Camera Array            poor    7-30    significant        yes
MLA (Lytro Illum)       good    0.5     none               limited
MLA (Raytrix R8/R10)    good    7-30    minor              limited
MirrorCam               good    2-30    significant        yes

¹ Frames per second

The optimization routine begins with a faceted parabola at a user-defined scale and mirror count. Optimization is allowed to manipulate the positions and normals of the mirror planes, as well as their extents. Optimization constraints prevent mirrors from occluding their neighbours, and allow a minimum spacing between mirrors to be imposed for manufacturability.

Fig. A.2 shows an example 3x3 mirror array before and after optimization. The FOV overlap was evaluated at 0.3 and 0.5 m. Fig. A.1a shows an assembled model mounted on a robot arm, and Fig. A.1b shows an example image taken from the camera. Note that the optimized design does not yield rectangular sub-images, because permitting a general quadrilateral shape allows for greater FOV overlap. In future work, we will explore the use of non-quadrilateral sub-images.

A.3.2 Construction

For the construction of the MirrorCam, we aimed to use easily accessible materials and methods. We 3D-printed the mount based on our design, and populated it with laser-cut flat acrylic mirrors. Figure A.3 shows a computer rendering of the MirrorCam before 3D printing. The reflections in the 9 mirrors show the upwards-facing camera, which is secured at the base of the MirrorCam. This design was built for the commonly available Logitech C920 webcam. More detailed diagrams of the design are supplied at the end of this appendix (Figs. A.4-A.6).

Mirror thickness and quality proved to be an issue for the construction of the MirrorCam. Since the mirrors are quite close to the camera, their thickness occludes a significant portion of the image, which greatly reduces the resolution of each sub-image. Thus, we opted



Figure A.2: (a) A parabolic mirror array reflects images from the scene at right into a camera, shown in blue at bottom; each mirror yields a virtual view, shown in red (note that these are far from an ideal grid); (b) the FOV overlap evaluated at 0.5 m, with the region of full overlap highlighted in green; (c) and (d) the same after optimization, showing better virtual camera placement and FOV overlap.


Figure A.3: Rendered images of the MirrorCam version 0.4C: (a) a front view showing the single camera lens, visible in all nine mirrored surfaces, and (b) an isometric view showing how the camera is attached to the mirrors.

for thin mirrors, but encountered problems with warping and poor flatness in the cheap acrylic mirrors. By inspecting the mirrors before purchase, and by handling them very carefully (without flexing them) during cutting, adhesion and assembly, we were able to minimise the resulting image warping and distortion.

A.3.3 Decoding & Calibration

Our MirrorCam calibration has two steps: first, the base camera is calibrated following a conventional intrinsic calibration, e.g. using MATLAB's built-in camera calibration tool. Next, the camera is assembled with the mirrors and the mirror geometry is estimated using a Levenberg-Marquardt optimization of the error between expected and observed checkerboard corner locations. Initialization of the mirror geometry is based on the array design, and sub-image segmentation is manually specified.
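A greatly simplified sketch of this second calibration step is given below (Python/SciPy; the function names, the 3-DOF per-mirror parameterisation and the omission of the checkerboard pose are assumptions made for brevity, not the thesis implementation). It refines the mirror geometry by Levenberg-Marquardt minimisation of the checkerboard corner reprojection error:

```python
import numpy as np
from scipy.optimize import least_squares

def reflect_points(points, normal, dist):
    """Reflect 3D points in the mirror plane {p : normal . p = dist}."""
    n = normal / np.linalg.norm(normal)
    return points - 2.0 * ((points @ n) - dist)[:, None] * n

def corner_residuals(x, K, board_pts, observed_px):
    """Residuals between observed checkerboard corners and the corners predicted
    by reflecting the board in each mirror (3 DOF per mirror) and projecting
    through the fixed, pre-calibrated intrinsics K."""
    residuals = []
    for (az, el, d), obs in zip(x.reshape(-1, 3), observed_px):
        n = np.array([np.cos(el) * np.cos(az),
                      np.cos(el) * np.sin(az),
                      np.sin(el)])
        virtual = reflect_points(board_pts, n, d)     # board as seen via this mirror
        proj = (K @ virtual.T).T
        proj = proj[:, :2] / proj[:, 2:3]             # pinhole projection to pixels
        residuals.append((proj - obs).ravel())
    return np.concatenate(residuals)

# Levenberg-Marquardt refinement of the mirror geometry, initialized from the
# nominal array design x0 (3 values per mirror). board_pts is an (N, 3) array of
# checkerboard corners and observed_px a list of (N, 2) detections per mirror.
# result = least_squares(corner_residuals, x0,
#                        args=(K, board_pts, observed_px), method='lm')
```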


Figure A.4: MirrorCam v0.4c Kinova mount (engineering drawing of the mount for the Kinova MICO arm and the Logitech C920 camera; 3D-printed ABS).

One point of difference with prior work is that rather than employing a 6-DOF transformation for each virtual camera view, our calibration models each mirror using a 3-DOF reflection matrix. This reduces the DOF in the camera model and more closely matches the physical camera, speeding convergence and improving robustness. A limitation of our calibration technique is that the images taken without mirrors are only considered when initializing the camera intrinsics. A better solution, left as future work, would jointly consider all images, with and without mirrors.
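As a minimal sketch of this 3-DOF model (standard reflection geometry; the function name and angle parameterisation are illustrative assumptions), a planar mirror with unit normal n and offset d induces the homogeneous reflection below, so only two angles and one distance per mirror need to be estimated:

```python
import numpy as np

def reflection_matrix(azimuth, elevation, dist):
    """4x4 homogeneous reflection across the plane {p : n . p = dist}.

    Only 3 parameters are needed (two angles defining the unit normal n and the
    plane offset dist), versus 6 DOF for a general rigid transformation.
    """
    n = np.array([np.cos(elevation) * np.cos(azimuth),
                  np.cos(elevation) * np.sin(azimuth),
                  np.sin(elevation)])
    D = np.eye(4)
    D[:3, :3] = np.eye(3) - 2.0 * np.outer(n, n)   # Householder reflection part
    D[:3, 3] = 2.0 * dist * n                      # translation part of the reflection
    return D

# Reflecting the real camera centre (at the origin) gives the virtual camera
# centre associated with this mirror: 2 * dist * n.
D = reflection_matrix(0.0, np.pi / 4, 0.05)
virtual_centre = (D @ np.array([0.0, 0.0, 0.0, 1.0]))[:3]
```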


Based on the calibrated mirror geometry, the nearest grid of parallel cameras is estimated, and decoding proceeds as follows:

1. Remove 2D radial distortion,

2. Slice the 2D image into a 4D array, and

3. Reproject each 2D sub-image into the orientation of the central camera view.

Here, we assume the central camera view is aligned with the center mirror.

The final step corrects for rotational differences between the calibrated and desired virtual camera arrays using 2D projective transformations. There is no compensation for translational error, though in practice the cameras are very close to an ideal grid. An example input image and decoded light field are shown in Fig. A.1b and Fig. A.1c, respectively. Our calibration routine reported a 3D spatial reprojection RMS error of 1.80 mm. The spatial reprojection error is the 3D distance from the projected ray to the expected feature location during camera calibration, where pixel projections are traced through the camera model into space. This small error confirms that the camera design, manufacture and calibration have yielded observations close to an ideal light field.
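A rough Python/OpenCV sketch of these three decoding steps is given below (illustrative only; the thesis implementation is in MATLAB, and the function name, the per-mirror quadrilaterals and homographies, and the composition of the two projective transforms are assumptions):

```python
import numpy as np
import cv2

def decode_mirrorcam(raw, K, dist_coeffs, quads, homographies, sub_size=(160, 120)):
    """Decode one MirrorCam frame into a 3x3 grid of reprojected sub-images.

    raw           : raw image from the upward-facing camera
    K, dist_coeffs: intrinsics from the mirror-free calibration
    quads         : 3x3 nested list of 4-corner sub-image quadrilaterals (pixels,
                    ordered top-left, top-right, bottom-right, bottom-left)
    homographies  : 3x3 nested list of 3x3 matrices rotating each sub-image into
                    the central camera view's orientation
    """
    W, H = sub_size
    light_field = np.zeros((3, 3, H, W) + raw.shape[2:], dtype=raw.dtype)

    # 1. Remove 2D radial (lens) distortion.
    undistorted = cv2.undistort(raw, K, dist_coeffs)

    for s in range(3):
        for t in range(3):
            # 2. Slice the 2D image: map this mirror's quadrilateral to a
            #    rectangular sub-image.
            dst = np.float32([[0, 0], [W - 1, 0], [W - 1, H - 1], [0, H - 1]])
            H_slice = cv2.getPerspectiveTransform(np.float32(quads[s][t]), dst)

            # 3. Reproject into the central view's orientation using the
            #    calibrated per-mirror 2D projective transformation.
            H_total = homographies[s][t] @ H_slice
            light_field[s, t] = cv2.warpPerspective(undistorted, H_total, (W, H))

    return light_field
```

Consistent with the description above, the sketch applies only 2D projective corrections per sub-image and makes no compensation for translational error between the calibrated and ideal virtual cameras.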

It is important to note that our current calibration did not account for the manufacturing aspects

of the camera, such as the thickness of the acrylic mirrors, or the additional thickness of the

epoxy used to secure the mirrors to the mount. The acrylic mirrors we used also exhibited some

bending and rippling, causing image distortion unaccounted for in the calibration process.

A.4 Conclusions and Future Work

In this appendix, we have presented the design optimisation, construction, decoding and calibration of a mirror-based light-field camera. We have shown that our 3D-printed MirrorCam, optimized for overlapping FOV, reproduced a light field. This implies that the mirror-based LF


camera is a viable, low-cost and accessible alternative to commercially-available LF cameras. Our unoptimized MATLAB implementation takes 5 seconds per frame; the decoding and correspondence processes are the current bottlenecks. Through optimization, real-time light fields should be possible. This work pushes the envelope towards real-time light-field cameras for robotics. In future work, we will validate the MirrorCam in terms of image refocusing, depth estimation and perspective shift, in comparison to other commercially-available light-field cameras.


Figure A.5: MirrorCam v0.4c mirror holder (engineering drawing; mirrors mounted flush to the surface using epoxy, M6 bolts with washers used to mount the mirror holder to the main assembly; 3D-printed ABS).

Figure A.6: MirrorCam v0.4c camera clip (engineering drawing; 3D-printed ABS).