
Machine Vision and Applications (2018) 29:345–359
https://doi.org/10.1007/s00138-017-0905-8

ORIGINAL PAPER

Sparse-then-dense alignment-based 3D map reconstruction method for endoscopic capsule robots

Mehmet Turan1,5 · Yusuf Yigit Pilavci2 · Ipek Ganiyusufoglu3 · Helder Araujo4 · Ender Konukoglu5 · Metin Sitti1

Received: 10 December 2016 / Revised: 11 November 2017 / Accepted: 28 November 2017 / Published online: 27 December 2017
© The Author(s) 2017. This article is an open access publication

Abstract
Despite significant progress achieved in the last decade to convert passive capsule endoscopes to actively controllable robots, robotic capsule endoscopy still has some challenges. In particular, a fully dense three-dimensional (3D) map reconstruction of the explored organ remains an unsolved problem. Such a dense map would help doctors detect the locations and sizes of diseased areas more reliably, resulting in more accurate diagnoses. In this study, we propose a comprehensive medical 3D reconstruction method for endoscopic capsule robots, which is built in a modular fashion including preprocessing, keyframe selection, sparse-then-dense alignment-based pose estimation, bundle fusion, and shading-based 3D reconstruction. A detailed quantitative analysis is performed using a non-rigid esophagus gastroduodenoscopy simulator, four different endoscopic cameras, a magnetically activated soft capsule robot, a sub-millimeter precise optical motion tracker, and a fine-scale 3D optical scanner, whereas qualitative ex vivo experiments are performed on a porcine stomach. To the best of our knowledge, this study is the first complete endoscopic 3D map reconstruction approach containing all of the necessary functionalities for a therapeutically relevant 3D map reconstruction.

Keywords Endoscopic capsule robots · 3D map reconstruction · Sparse-then-dense feature tracking

1 Introduction

Many diseases necessitate access to the internal anatomy of the patient for diagnosis and treatment. Since direct access to most anatomic regions of interest is traumatic, and sometimes impossible, endoscopic cameras have become a common method for viewing the anatomical structure. In particular, capsule endoscopy has emerged as a promising new technology for minimally invasive diagnosis and treatment of gastrointestinal (GI) tract diseases. The low invasiveness

Corresponding author: Mehmet Turan, [email protected]

1 Physical Intelligence Department, Max-Planck Institute for Intelligent Systems, Stuttgart, Germany

2 Electrical and Electronics Engineering Department, Middle East Technical University, Ankara, Turkey

3 Computer Science and Engineering, Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey

4 Institute for Systems and Robotics, Universidade de Coimbra, Coimbra, Portugal

5 Computer Vision Laboratory, ETH Zurich, Zürich, Switzerland

and high potential of this technology have led to substantial investment in their development by both academic and industrial research groups, such that it may soon be feasible to produce a robotic capsule endoscope with most of the functionality of current flexible endoscopes.

Although robotic capsule endoscopy has high diagnostic and therapeutic potential, it continues to face many challenges. In particular, there is no broadly accepted approach for generating a comprehensive and therapeutically relevant 3D map of the organ being investigated. This problem is made more severe by the fact that such a map may require a precise localization method for the endoscope, and such a method will itself require a map of the organ, a classic chicken-and-egg problem [1]. The repetitive texture, lack of distinctive features, and specular reflections characteristic of the GI tract exacerbate this difficulty, and the non-rigid deformations introduced by peristaltic motions further complicate the reconstruction task [2]. Finally, the small size of endoscopic camera systems implies a number of limitations, such as restricted fields of view (FOV), low signal-to-noise ratio, and low frame rate, all of which degrade image quality [3]. These issues, to name a few, make accurate and precise


localization and reconstruction a difficult problem and can render navigation and control counterintuitive [4].

Despite these challenges, accurate and robust three-dimensional (3D) mapping of patient-specific anatomy remains a valuable goal. Such a map would provide doctors with a reliable measure of the size and location of a diseased area, thus allowing more intuitive and accurate diagnoses. In addition, should next-generation medical devices be actively controlled, a map would dramatically improve the doctor's control in diagnostic, prognostic, and therapeutic operations [5]. As such, considerable energy has been devoted to adapting computer vision techniques to the problem of in vivo 3D reconstruction of tissue surface geometry.

Two primary approaches have been pursued as workarounds for the challenges mentioned previously. First, tomographic intra-operative imaging modalities, such as ultrasound (US), intra-operative computed tomography (CT), and interventional magnetic resonance imaging (iMRI), have been investigated for capturing detailed information of patient-specific tissue geometry [5]. However, surgical and diagnostic operations pose significant technological challenges and costs for the use of such devices, due to the need to acquire a high signal-to-noise ratio (SNR) without impediment to the doctor. Another proposal has been to equip endoscopes with alternative sensor systems in the hope of providing additional information; however, these alternative systems have other restrictions that limit their use within the body.

This paper proposes a complete pipeline for 3D visual map reconstruction using only RGB camera images, with no additional sensor information. The pipeline is arranged in a modular form and includes a preprocessing module for removal of specular reflections, vignetting, and radial lens distortions; a keyframe selection module; a pose estimation and image stitching module for registration of images; and a shape-from-shading (SfS) module for reconstruction of 3D structures. We provide both qualitative and quantitative analysis of pose estimation and 3D map reconstruction accuracy using a pig stomach, an esophagus gastroduodenoscopy simulator, four different endoscopic camera models, an optical motion tracker, and a 3D optical scanner. In sum, our method makes a substantial contribution toward a more general, therapeutically relevant, and extensive use of the information that capsule endoscopes may provide.

2 Literature survey

Several studies in the literature have discussed 3D map reconstruction for standard hand-held and passive capsule endoscopes [6–13]. These methods may be broken into four major classes:

– stereoscopy
– shape from shading (SfS)
– structured light (SL)
– time of flight (ToF)

Structured light and time-of-flight methods require additional sensors, with a concomitant increase in cost and space; as such, they are not covered in this paper. Stereo-based methods use the parallax observed when viewing a scene from two distinct viewpoints to obtain an estimate of the distance from observer to object under observation. Typically, such algorithms have four stages in computing the disparity map [14]: cost computation, cost aggregation, disparity computation and optimization, and disparity refinement.

With multiple algorithms reported per year, computational stereo depth perception has become a heavily researched field. The first work reporting stereoscopic depth reconstruction in endoscopic images was [6], which implemented a dense computational stereo algorithm. Later, Hager et al. developed a semi-global optimization [7], which was used to register the depth map acquired during surgery to preoperative models [8]. Stoyanov et al. used local optimization to propagate disparity information around feature-matched seed points, which has also been reported to perform well for endoscopic images. This method was able to handle highlights, occlusions, and noisy regions. Similar to stereo vision, another method that employs epipolar geometry and feature extraction is proposed in [15]. This workflow starts with camera calibration and relies on SIFT extraction and feature description. Finally, the main algorithm calculates the 3D spatial point location using extrinsic parameters, which are calculated from matched features in consecutive frames. Although this system exploits the advantages of sparse 3D reconstruction, the strong dependency on feature extraction causes performance-related issues for endoscopic imaging. Despite the variety of algorithms and simplicity of implementation, computational stereo techniques suffer from several important disadvantages. To begin with, stereo reconstruction algorithms generally require two cameras, since the triangulation needs a known baseline between viewpoints. Further, the accuracy of triangulation decreases with distance from the cameras due to the shrinkage of the relative baseline between camera centers and reconstructed points. Most endoscopic capsule robots have only one camera, and in those that have more, the diameter of the endoscope inherently bounds the baseline. As such, stereo techniques have yet to find wide application in endoscopy.

Due to the difficulty in obtaining stereo-compatible hardware, efforts have been made to adapt passive monocular three-dimensional reconstruction techniques to endoscopic images. These techniques have been the focus of computer vision research for decades and have the distinct


advantage of not requiring extra hardware in addition to existing endoscopic devices. Two main methods have emerged as useful in the field of endoscopic images: shape from motion (SfM) and shape from shading (SfS). SfS, which has been studied since the 1970s [16], has demonstrated some suitability for endoscopic image reconstruction. Its primary assumption is that there is a single light source in the scene, of which the intensity and pose relative to the camera are known. Both assumptions are mostly fulfilled in endoscopy [11–13]. Furthermore, the transfer function of the camera can be included in the algorithm to additionally refine estimates [17]. Additional assumptions are that the object reflects light obeying the Lambertian model and that the object surface has a constant albedo. If these assumptions hold to a degree and the equation parameters are known, SfS can use the brightness of a pixel to estimate the angle between the camera's depth axis and the surface normal at that pixel. This has been demonstrated to be effective in recovering details, although global shape recovery often fails.

Both methods have been demonstrated to have disadvantages: SfS often fails in the presence of uncertain information, e.g., bleeding, reflections, noise artifacts, and occlusions; feature tracking-based SfM methods tend to fail in the presence of poorly textured areas and occlusions.

Therefore, many state-of-the-art works are based on the combination of these two techniques. In [18], a pipeline for 3D reconstruction of endoscopy imaging using SfS and SfM techniques is presented. In this work, the pipeline starts with basic preprocessing steps and focuses on 3D map reconstruction that is independent of light source position and illumination. Finally, the framework ends with frame-to-frame feature matching to solve the scaling issue of monocular images. This paper proposes interesting methods for the difficult task of reconstruction; however, enhanced preprocessing and especially less dependency on feature extraction and matching are still needed. In the recent work of [19], SfS and SfM are fused to reach a better 3D map accuracy. With SfM, a sparse point cloud is obtained, and a dense version of this cloud is generated by means of SfS. For better performance of SfS, they also propose a refined reflectance model. One notable idea based on SfS and SfM fusion is proposed in [20]. This methodology first reconstructs a sparse 3D map using SfM and iteratively refines the final reconstruction using SfS. The approach does not directly address the difficulties caused by the ill-posed illumination and specular reflectance, although the proposed geometric fusion tries to eliminate such issues, and the strong reliance on the establishment of feature correspondence remains unsolved. Attempts to solve the latter problem with template-matching techniques have had some success, but they tend to be computationally very complex, which makes them unsuitable for real-time performance. In [21], only SfS is used for reconstruction, and 2D features are

preferred for estimating the transformation. Similarly, [22] and [23] combine SfM and SfS for 3D reconstruction without any preprocessing and with the Lambertian surface assumption. In [24], machine learning algorithms are applied for 3D reconstruction: training is performed on an artificial dataset, and real endoscopy images are used as test data. Another state-of-the-art pipeline is proposed in [25], which presents a workflow combining an RGB camera and inertial measurement units (IMUs). Despite improved results, this hardware makes the overall flow more complex and costly. Moreover, IMU sensors occupy extra space and are not accurate enough. In addition, they interfere with magnetic actuation systems, which makes them unsuitable for the next generation of actively controllable endoscopic capsule robots. The main common issue remaining for 3D reconstruction of endoscopic-type datasets is the visual complexity of these images. The challenges mentioned in the abstract and introduction affect the performance of standard computer vision algorithms. In particular, the proposed method must be robust to specular view-dependent highlights, noise, peristaltic movements, and focus-dependent changes in calibration parameters. Unfortunately, a quantitative measure of algorithm robustness has not yet been suggested in the literature, despite its clear value for the evaluation of algorithmic dependability and precision. Moreover, all of the methods mentioned in this section were developed and evaluated on only one specific camera model, which makes it impossible to justify the robustness of a framework for different camera choices with limited specifications, such as lower resolution and image quality.

Our paper proposes a full pipeline consisting of camera calibration, reflection detection and suppression, radial undistortion, de-vignetting, keyframe selection, pose estimation, frame stitching, and SfS to reconstruct a therapeutically relevant 3D map of the organ under observation. Both synthetic and real pig stomachs are used for evaluation. Among other contributions, an extensive quantitative analysis has been proposed and performed to demonstrate the influence of the pipeline modules on the accuracy and robustness of the estimated camera pose and reconstructed 3D map. To our knowledge, this is the first such comprehensive quantitative analysis to be enacted in endoscopic image processing.

3 Method

This section presents the proposed framework in more depth. The preprocessing steps, keyframe selection, pose estimation, frame stitching, and SfS modules are discussed in detail.


3.1 Preprocessing

The proposed modular endoscopic 3D map reconstruction framework starts with a preprocessing module which performs intrinsic camera calibration, reflection detection and suppression, radial distortion correction, and de-vignetting. Specular reflections are a common problem causing inaccurate depth estimation and map reconstruction. Therefore, eliminating specular artifacts is a fundamental endoscopic image preprocessing step to ensure Lambertian surface properties and increase the quality of the 3D map. On the other hand, specularities can deliver useful information for pose estimation, especially orientation information. For the reflection detection task, we propose an original method which determines the reflection regions by making use of geometric and photometric information. To determine the locations of the reflection areas, the gradient map of the input gray-scale image is created and a morphological closing operation is applied to fill the gaps inside reflection-distorted areas (we used OpenCV's morphological closing, i.e., morphologyEx with MORPH_CLOSE). In parallel, a photometric method applies adaptive thresholding determined by the mean and standard deviation of the gray-scale image I to identify the specular regions:

$$\mathrm{Mask}_{\mathrm{Illu}} = \begin{cases} 0, & I < \mu_I + \sigma_I \\ 1, & \text{otherwise} \end{cases} \qquad (1)$$

where μI and σI are the mean and standard deviation of the intensity levels of the gray-scale image I. The pixel-wise combination of both detection strategies leads to a robust reflection detection approach. Once specular reflection pixels are detected, the inpainting method proposed by [26] is applied to suppress the saturated pixels by replacing the specularity with an intensity value derived from a combination of neighboring pixel values.
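To make the two-branch detection concrete, the following is a minimal sketch of this step using OpenCV; the helper name, kernel sizes, the Otsu threshold on the gradient map, and the AND combination are illustrative assumptions, while the μ + σ mask (Eq. 1) and the Telea inpainting [26] follow the text.

```python
import cv2
import numpy as np

def suppress_reflections(bgr):
    """Minimal sketch of the reflection detection/suppression step.

    Combines the photometric mask of Eq. (1) with a gradient-based,
    morphologically closed mask; detected specular pixels are then
    inpainted with Telea's fast-marching method [26]. Kernel sizes,
    the Otsu threshold, and the inpainting radius are illustrative.
    """
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

    # photometric branch: pixels brighter than mean + std, Eq. (1)
    mask_illu = (gray.astype(np.float32) >= gray.mean() + gray.std()).astype(np.uint8)

    # geometric branch: gradient map, closed to fill the gaps inside
    # reflection-distorted areas
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    grad = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT, kernel)
    _, mask_grad = cv2.threshold(grad, 0, 1, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    close_kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask_grad = cv2.morphologyEx(mask_grad, cv2.MORPH_CLOSE, close_kernel)

    # pixel-wise combination of both strategies (AND is one plausible choice)
    mask = cv2.bitwise_and(mask_illu, mask_grad)

    # replace saturated pixels from neighboring intensities [26]
    return cv2.inpaint(bgr, mask * 255, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
```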

As a next step, the Brown–Conrady [27] undistortion technique is applied to handle the radial distortions. Vignetting, an inhomogeneous illumination distribution relative to the image center primarily caused by camera lens imperfections and light source limitations, is handled by applying a radial gradient symmetry enforcement-based method (Fig. 1). Our framework applies the vignetting correction approach proposed by [28], which de-vignettes the image by enforcing the symmetry of the radial gradient from center to boundaries. An example input image and vignetting-corrected output image can be seen in Fig. 1. De-vignetting is demonstrated in Fig. 2, where it is clearly observable that the intensity levels of the de-vignetted image have a more homogeneous pattern.
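For reference, a compact sketch of these two corrections is shown below; the intrinsics, distortion coefficients, and the parametric radial gain are illustrative placeholders (the paper itself estimates vignetting via radial gradient symmetry [28] rather than a fixed falloff model).

```python
import cv2
import numpy as np

img = cv2.imread("frame.png")

# Brown-Conrady undistortion; in practice K and dist come from a prior
# intrinsic calibration (e.g., cv2.calibrateCamera on a checkerboard).
# The numbers below are placeholders.
K = np.array([[400.0, 0.0, 200.0],
              [0.0, 400.0, 200.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.35, 0.12, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3
undistorted = cv2.undistort(img, K, dist)

# Simplified de-vignetting stand-in: multiply by a smooth radial gain
# (the paper instead enforces radial gradient symmetry [28]).
h, w = undistorted.shape[:2]
yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
r2 = ((xx - w / 2) ** 2 + (yy - h / 2) ** 2) / ((w / 2) ** 2 + (h / 2) ** 2)
gain = 1.0 + 0.4 * r2  # illustrative vignetting strength
devignetted = np.clip(undistorted * gain[..., None], 0, 255).astype(np.uint8)
```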

3.2 Keyframe selection

Endoscopic videos generally contain thousands of highly overlapping frames (more than 75% overlap) due to slow endoscopic capsule movement during organ exploration. A subset of the most relevant keyframes has to be chosen automatically. The minimum number of keyframes required to recover the entire stomach surface with approximately 50% overlapping area between keyframes is around 300 frames. Thus, at least every tenth frame could be selected as a keyframe. However, since the endoscopic capsule robot motion is not constant during organ exploration, it is not good practice to blindly assign keyframes at a constant interval. We developed an adaptive keyframe selection method based on Farneback optical flow (OF) estimation between frame pairs. Farneback OF was chosen due to its

Fig. 1 Preprocessing pipeline: reflection removal, radial undistortion, de-vignetting

123

Page 5: Sparse-then-dense alignment-based 3D map reconstruction ......2 Electrical and Electronics Engineering Department, Middle East Technical University, Ankara, Turkey 3 Computer Science

Sparse-then-dense alignment-based 3D... 349

Fig. 2 Demonstration of the de-vignetting process

improved performance relative to other optical flow methods applied to our dataset. We add the magnitudes of the optical flow values for each frame pair and normalize the sum by the total image resolution. If the normalized sum does not exceed a predefined threshold τ = 30 pixels, the overlap between the reference keyframe and the keyframe candidate is accepted as being high (more than 70% overlap). In that case, the candidate frame is rejected and the algorithm moves to the next frame. The loop starts again and runs until a keyframe is found. The keyframe selection procedure and termination criteria are presented in Algorithm 1, and a minimal sketch of the flow-based test follows the algorithm:

Algorithm 1 Keyframe selection algorithm
1: Extract Farneback optical flow between the reference keyframe and the candidate keyframe.
2: Sum the magnitude values of the optical flow vectors over all pixel pairs.
3: Normalize the sum by the total pixel number.
4: If the normalized sum is less than the predefined threshold τ = 30 pixels, go to the next frame; else identify the frame as a keyframe and go to the first step.
5: If fifteen consecutive frames fail to fulfill the keyframe condition, i.e., τ = 30 pixels is never exceeded, assign the frame with the highest normalized flow sum among these fifteen frames as a keyframe and go to the first step.
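The flow-based test of steps 1–4 reduces to a few lines; in this sketch the Farneback parameter values are illustrative and the helper name is hypothetical.

```python
import cv2
import numpy as np

def is_keyframe(ref_gray, cand_gray, tau=30.0):
    """Return True when the normalized Farneback flow magnitude between
    the reference keyframe and the candidate exceeds tau (Algorithm 1,
    steps 1-4). Farneback parameters below are illustrative."""
    flow = cv2.calcOpticalFlowFarneback(
        ref_gray, cand_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag = np.linalg.norm(flow, axis=2)   # per-pixel flow magnitude
    return mag.sum() / mag.size > tau    # normalize by the pixel count
```

The fifteen-frame fallback of step 5 then amounts to tracking the normalized sums of rejected candidates and promoting the largest one.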

3.3 Keyframe stitching

A state-of-the-art image stitching pipeline contains several stages:

– Feature detection, which detects features in the input image pair.
– Feature matching, which matches features between input images.
– Homography estimation, which estimates extrinsic camera parameters between the image pairs.
– Bundle adjustment, which is a postprocessing step to correct drifts in a global manner.
– Image warping, which warps the images onto a compositing surface.
– Gain compensation, which normalizes the brightness and contrast of all images.
– Blending, which blends pixels along the stitch seam to reduce the visibility of seams.

Stitching algorithms fall broadly into two categories: direct alignment-based methods and feature-based methods. Direct alignment-based methods attempt to match every pixel between the frame pair using iterative optimization techniques. These methods have the benefit of using all the available data, which is advantageous for low-textured images such as endoscopic images. However, direct methods require a good initialization so that they do not converge to local minima. Moreover, they are very susceptible to varying brightness conditions. Feature-based methods, on the other hand, first find unique feature points such as corners and try to match them. These methods do not require an initialization, but the features are not easy to detect in low-textured images, and detected features can be susceptible to illumination changes, scale changes caused by zooming in and out, and viewpoint changes. Our keyframe stitching technique makes use of both alignment methods in a coarse-to-fine fashion, combining Farneback OF-based coarse alignment with patch-wise fine alignment. Farneback OF delivers the initial 2D motion estimation, whereas SSD-based energy minimization applied to circular regions of interest with a radius of 15 pixels around each inlier point refines this estimation. Patch-wise fine alignment estimates the parameters of an affine transformation by minimizing an intensity difference-based energy cost function. The affine transformation maps an image I1 onto the reference image I2, where x′, y′ represent the transformed and x, y the original pixel coordinates, and a1, a2, a3, a4, tx, ty the parameters of the affine


transformation matrix A, respectively. We define a cost function measuring the pixel intensity similarity between the image pair (Eq. 4), which is minimized over the affine transformation parameters.

$$\begin{pmatrix} x_2 \\ y_2 \\ 1 \end{pmatrix} = \begin{pmatrix} a_1 & a_2 & t_x \\ a_3 & a_4 & t_y \\ 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} x_1 \\ y_1 \\ 1 \end{pmatrix} \qquad (2)$$

Since the cost function has to ignore the pixels lying outside the circular patches defined around the inlier points, a weighting function ω(x, y) is defined:

$$\omega(x, y) = \begin{cases} 0, & \text{if } (x - x_c)^2 + (y - y_c)^2 \ge r^2 \\ 1, & \text{if } (x - x_c)^2 + (y - y_c)^2 < r^2 \end{cases} \qquad (3)$$

where xc and yc are the coordinates of the inlier point and r is the radius of the circular image region around this inlier point center. The resulting cost function has a bias toward smaller overlapping solutions; thus a normalization by the overlapping area is necessary, resulting in the mean squared pixel error (MSE):

$$e_{\mathrm{MSE}}(A) = \frac{\sum_i \omega(x_i, y_i)\,\omega(x'_i, y'_i)\,\bigl(I_2(x'_i, y'_i) - I_1(x_i, y_i)\bigr)^2}{\sum_i \omega(x_i, y_i)\,\omega(x'_i, y'_i)}. \qquad (4)$$

The affine transformation matrix A is iteratively determined as the image transformation that minimizes eMSE using Gauss–Newton optimization. The CUDA library was utilized to parallelize the Gauss–Newton optimization, reducing its execution time. The system architecture of the proposed frame stitching algorithm is shown in Fig. 3.

The termination criteria of the Gauss–Newton optimization were defined by a threshold τ = 1e−9: the optimization stops when eMSE drops below the threshold τ or when the maximum number of iterations has been reached. A sketch of this patch-wise refinement is given below. Once the optimization has converged and the affine transformation parameters are estimated, bundle adjustment is performed to correct drifts for all the camera parameters jointly and to minimize the accumulated errors. In the next step, all keyframes Ii are transformed into the coordinate system of the anchor keyframe IA.
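The following is a small sketch of the patch-wise refinement under stated assumptions: SciPy's least_squares stands in for the paper's CUDA Gauss–Newton solver, bilinear sampling replaces GPU texture lookups, and the helper name is hypothetical.

```python
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import least_squares

def refine_affine(I1, I2, inliers, a0, radius=15):
    """Refine affine parameters (a1, a2, a3, a4, tx, ty) by minimizing the
    masked intensity error of Eqs. (2)-(4) over circular patches around the
    inlier points; a0 is the Farneback-based initialization. I1 and I2 are
    float gray images, inliers is an (N, 2) array of (x, y) points in I1."""
    # coordinates inside the circular patches around the inliers, Eq. (3)
    pts = []
    for xc, yc in inliers:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                if dx * dx + dy * dy < radius * radius:
                    pts.append((xc + dx, yc + dy))
    pts = np.asarray(pts, dtype=np.float64)
    x, y = pts[:, 0], pts[:, 1]
    v1 = map_coordinates(I1, [y, x], order=1)  # fixed samples from I1

    def residuals(a):
        a1, a2, a3, a4, tx, ty = a
        xw = a1 * x + a2 * y + tx              # affine warp, Eq. (2)
        yw = a3 * x + a4 * y + ty
        v2 = map_coordinates(I2, [yw, xw], order=1)
        return v2 - v1                         # per-pixel residual of Eq. (4)

    return least_squares(residuals, a0).x      # Gauss-Newton-style solve
```

Summing the squared residuals and dividing by the number of valid samples reproduces eMSE of Eq. (4) up to the overlap normalization; the solver minimizes the same quantity.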

Fig. 3 Image stitching flowchart

Fig. 4 Multi-band blending flowchart


Fig. 5 Demonstration of the keyframe stitching process for the non-rigid esophagus gastroduodenoscopy simulator (left) and the real pig stomach (right)

In areas where several keyframes overlap, corresponding image pixels often do not have the same intensity due to illumination changes, scale changes, and intensity level variations. A multi-band blending method is applied to overcome these issues; a small pyramid-based sketch is given after Algorithm 2. The overview of the multi-band blending approach is shown in Fig. 4. For further details, the reader is referred to the original work of [29]. Algorithm 2 summarizes the steps of the keyframe stitching module. Results of the stitching process for the real pig stomach and the non-rigid simulator are shown in Fig. 5.

Algorithm 2 Proposed endoscopic keyframe stitching module
1: Identify the next keyframe.
2: Match pixels between the reference keyframe and the identified next keyframe using optical flow estimation.
3: Use RANSAC to detect inlier points.
4: Use optical flow vectors between inlier matches as initialization for the GN optimization.
5: Define circular regions around each inlier point.
6: Calculate the intensity difference-based energy cost function.
7: Execute iterative Gauss–Newton (GN) optimization to minimize the energy cost function.
8: Perform GPU-based multi-core bundle adjustment to globally optimize all of the camera poses jointly [30].
9: Perform frame warping.
10: Perform gain compensation [31].
11: Perform multi-band blending.

3.4 Deep learning and frame stitching

A major drawback of our frame stitching module is the need for extensive engineering and implementation effort. To overcome this issue, we investigated the applicability of deep learning techniques to endoscopic capsule robot pose estimation [2]. Deep learning (DL) has been drawing the attention of the machine learning research community over the last decade. Much of its success rests on models and technologies capable of achieving ground-breaking performance in a variety of traditional machine learning application fields, such as machine vision and natural language processing. Admittedly, some of the DL flagships, like NLP and image processing, have their implications in medical fields, e.g., in extracting information from the images in patients' records to find anomalous patterns and detect diseases. With that motivation, we are extending the application of DL technology to endoscopic capsule robot localization. The core idea of our DL-based method is the use of deep recurrent convolutional neural networks (RCNNs) for the pose estimation task, where convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are used for feature extraction and for inference of dynamics across the frames, respectively [2]. Using this pretrained neural network, we are able to achieve pose estimation accuracies comparable to our sparse-then-dense pose alignment [2]. Thus, as a future step, we may consider integrating DL-based pose estimation into our frame stitching module to decrease the complexity of our stitching method and relax the extensive engineering and implementation efforts required in this study. Since DL-based pose estimation is out of the scope of this paper, the reader is referred to the original paper [2] for further details.

3.5 Endo-VMFuseNet and frame stitching

Even though the proposed sparse-then-dense alignment-based visual pose estimation achieves very promising results for endoscopic capsule robot localization, it fails in the case of very fast frame-to-frame motions. This is a common issue of any vision-based odometry algorithm: if the overlap between consecutive frames becomes less than a certain percentage, any vision-based pose estimation approach fails. It can even occur that, due to drifts of the endoscopic capsule robot, the overlap area between frame pairs decreases drastically, even to zero in some cases. To overcome this issue, we developed a supervised sensor fusion approach based on an end-to-end trainable deep neural network consisting of multi-rate long short-term memories (LSTMs) for frequency adjustment between sensors and a core LSTM unit for fusion of the adjusted sensor information. Detailed evaluations indicate


that our pretrained DL-based sensor fusion network detects whether visual odometry fails and instantaneously makes use of magnetic localization until the visual odometry path recovers. The same applies if magnetic sensor-based localization fails. Additionally, monocular cameras suffer from the absence of real depth information, which causes any measurements made with them to be recoverable only up to scale. This condition is known as scale ambiguity. Another contribution of our DL-based sensor fusion approach is accurate scale estimation using the absolute position information obtained by the magnetic localization system. In that way, doctors will have a 3D map of exactly the same size as the explored inner organ, which will not only help the exact estimation of the diseased region size, but also enable biopsy-like treatments or local drug delivery onto the diseased region. Since it is out of the scope of this paper, the reader is referred to our paper [4] for further details of our DL-based sensor fusion approach.

3.6 Depth image creation

Once the final mosaic image is obtained, the next module creates its depth image using the SfS technique of Tsai and Shah [32]. The Tsai–Shah SfS method is based on the following assumptions:

– The object surface is Lambertian.
– The light comes from a single-point light source.
– The surface has no self-shaded areas.

The Lambertian surface assumption is not obeyed by raw endoscopic images due to the specular reflections inside the organs. We addressed this problem through the reflection suppression technique previously described. Subsequently, the above assumptions allow the image intensities to be modeled by

$$I(x, y) = \rho(x, y, z) \cdot \cos\theta_i, \qquad (5)$$

where I is the intensity value, ρ is the albedo (reflecting power of the surface), and θi is the angle between the surface normal N and the light source direction S. With this equation, the gray values of an image I are related only to the albedo and the angle θi. Using these assumptions, the above equation can be rewritten as follows:

$$I(x, y) = \rho \cdot (N \cdot S), \qquad (6)$$

where (·) is the dot product, N is the unit normal vector of the surface, and S is the incidence direction of the source light. These may be expressed respectively as

$$N = \frac{(-p(x, y),\, -q(x, y),\, 1)}{(p^2 + q^2 + 1)^{1/2}} \qquad (7)$$

$$S = (\cos\tau \cdot \sin\sigma,\; \sin\tau \cdot \sin\sigma,\; \cos\sigma) \qquad (8)$$

where τ and σ are the tilt and slant angles, respectively, and p and q are the x and y gradients of the surface Z:

$$p(x, y) = \frac{\partial Z(x, y)}{\partial x} \qquad (9)$$

$$q(x, y) = \frac{\partial Z(x, y)}{\partial y}. \qquad (10)$$

The final function then takes the form

$$I(x, y) = \rho \cdot \frac{\cos\sigma + p(x, y) \cdot \cos\tau \cdot \sin\sigma + q(x, y) \cdot \sin\tau \cdot \sin\sigma}{\bigl((p(x, y))^2 + (q(x, y))^2 + 1\bigr)^{1/2}} = R(p_{x,y}, q_{x,y}). \qquad (11)$$

Solving this equation for p and q essentially corresponds to the general problem of SfS. The approximations and solutions for p and q yield the reconstructed surface map Z. The necessary parameters are the tilt, slant, and albedo, which can be estimated as proposed in [33]. The unknown parameters of the 3D reconstruction are the horizontal and vertical gradients p and q of the surface Z. With discrete approximations, they can be written as follows:

p(x, y) = Z(x, y) − Z(x − 1, y) (12)

q(x, y) = Z(x, y) − Z(x, y − 1), (13)

where Z(x, y) is the depth value of each pixel. From these approximations, the reflectance function R(px,y, qx,y) can be expressed as

R(Z(x, y) − Z(x − 1, y), Z(x, y) − Z(x, y − 1)). (14)

Using Eqs. (12), (13), and (14), the reflectance equation may also be written as

$$f\bigl(Z(x, y), Z(x, y-1), Z(x-1, y), I(x, y)\bigr) = I(x, y) - R\bigl(Z(x, y) - Z(x-1, y),\; Z(x, y) - Z(x, y-1)\bigr) = 0. \qquad (15)$$

Tsai and Shah propose a linear approximation using a first-order Taylor series expansion of the function f about the depth map Z^{n−1}, where Z^{n−1} is the recovered depth map after n − 1 iterations. The final equation is

$$Z^{n}(x, y) = Z^{n-1}(x, y) - \frac{f\bigl(Z^{n-1}(x, y)\bigr)}{\dfrac{d f\bigl(Z^{n-1}(x, y)\bigr)}{d Z(x, y)}}, \qquad (16)$$

where the derivative of f with respect to the depth is

$$\frac{d f\bigl(Z^{n-1}(x, y)\bigr)}{d Z(x, y)} = -1 \cdot \left( \frac{i_x + i_y}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + i_x^2 + i_y^2}} - \frac{(p + q)\,(p\, i_x + q\, i_y + 1)}{\sqrt{(1 + p^2 + q^2)^3}\,\sqrt{1 + i_x^2 + i_y^2}} \right) \qquad (17)$$


and

$$i_x = \frac{\cos\tau \cdot \sin\sigma}{\cos\sigma} \qquad (18)$$

$$i_y = \frac{\sin\tau \cdot \sin\sigma}{\cos\sigma}. \qquad (19)$$

The nth depth map Z^n is calculated using the estimated slant, tilt, and albedo values.
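Putting Eqs. (11)–(19) together, the iteration can be written compactly as below; this is a minimal sketch assuming known slant, tilt, and albedo, with the function name and iteration count chosen for illustration. Note that Eq. (17) was reconstructed from the standard Tsai–Shah form, and the code follows that reconstruction.

```python
import numpy as np

def tsai_shah_sfs(I, slant, tilt, albedo=1.0, n_iter=50):
    """Minimal Tsai-Shah shape-from-shading sketch.

    I is a gray image normalized to [0, 1]; slant (sigma), tilt (tau) and
    albedo (rho) are assumed known, e.g., estimated as in [33]. The
    iteration count is illustrative. Returns the depth map Z."""
    ix = np.cos(tilt) * np.tan(slant)           # Eq. (18)
    iy = np.sin(tilt) * np.tan(slant)           # Eq. (19)
    Z = np.zeros_like(I, dtype=np.float64)
    for _ in range(n_iter):
        # discrete surface gradients, Eqs. (12)-(13)
        p = Z - np.roll(Z, 1, axis=1)           # Z(x,y) - Z(x-1,y)
        q = Z - np.roll(Z, 1, axis=0)           # Z(x,y) - Z(x,y-1)
        pq = np.sqrt(1.0 + p ** 2 + q ** 2)
        ipq = np.sqrt(1.0 + ix ** 2 + iy ** 2)  # equals 1 / cos(sigma)
        # reflectance of Eq. (11), rewritten in terms of ix and iy
        R = np.maximum(albedo * (1.0 + p * ix + q * iy) / (pq * ipq), 0.0)
        f = I - R                               # Eq. (15)
        # derivative of f w.r.t. Z(x,y), Eq. (17)
        df = -1.0 * ((ix + iy) / (pq * ipq)
                     - (p + q) * (p * ix + q * iy + 1.0) / (pq ** 3 * ipq))
        Z = Z - f / (df + 1e-8)                 # Newton update, Eq. (16)
    return Z
```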

4 Evaluation

We evaluate the performance of our system both quantitatively and qualitatively in terms of pose estimation and surface reconstruction. We also report the computational complexity of the proposed framework.

4.1 Dataset

We created our own dataset from a real pig stomach and from a non-rigid open GI tract model, the EGD (esophagus gastroduodenoscopy) surgical simulator LM-103 (Figs. 6, 7). The EGD surgical simulator was used for the quantitative analyses and the real pig stomach for the qualitative evaluations. Synthetic stomach fluid was applied to the surface of the EGD simulator to imitate the mucosa layer of the inner tissue. To ensure that our algorithm is not tuned to a specific camera model, four different commercially available endoscopic cameras were employed for the video capture, varying in resolution, pixel size, depth of focus, and image quality. A total of 17010 endoscopic frames were acquired by these four camera models, which were mounted on our robotic magnetically actuated soft capsule endoscope prototype (MASCE) (Fig. 8, [34,35]). The first sub-dataset, consisting of 4230 frames, was acquired with an Awaiba NanEye camera (Table 1).

Fig. 6 Non-rigid esophagus gastroduodenoscopy simulator dataset overview for different endoscopic cameras

Fig. 7 Real pig stomach dataset overview for different endoscopic cameras


Fig. 8 Robotic magnetically actuated soft capsule endoscope (MASCE) [34,35]

Table 1 Awaiba NanEye monocular endoscopic camera

Resolution   250 × 250 pixels
Footprint    2.2 × 1.0 × 1.7 mm
Pixel size   3 × 3 µm²
Pixel depth  10 bit
Frame rate   44 fps

The second sub-dataset, consisting of 4340 frames, was acquired by the Misumi V3506-2ES endoscopic camera with the specification shown in Table 2. The third sub-dataset of 4320 frames was obtained by the Misumi V5506-2ES endoscopic camera with the specification shown in Table 3. Finally, the fourth sub-dataset of 4120 frames was obtained by the Potensic mini camera with the specification shown in Table 4. We scanned the open stomach simulator using the 3D Artec Space Spider image scanner and used this 3D scan as the ground truth for the 3D map reconstruction framework (Fig. 9). Even though our focus and ultimate goal is an accurate and therapeutically relevant 3D map reconstruction, we also evaluated the pose estimation accuracy of the proposed framework quantitatively, since precise pose estimation is a prerequisite for accurate 3D mapping. Thus, an OptiTrack motion-tracking system consisting of eight Prime 13 cameras and tracking software was utilized to obtain 6-DoF localization ground truth data of the endoscopic capsule motion with sub-millimeter precision (Fig. 9).

4.2 Trajectory estimation

To evaluate the pose estimation performance, we tested our system on different trajectories of various difficulty levels. The absolute trajectory error (ATE) root-mean-square error (RMSE) metric is used for quantitative pose accuracy evaluations.

Table 2 Misumi V3506-2ES monocular camera

Resolution   400 × 400 pixels
Diameter     8.2 mm
Pixel size   5.55 × 5.55 µm²
Pixel depth  10 bit
Frame rate   30 fps

Table 3 Misumi V5506-2ES monocular camera

Resolution   640 × 480 pixels
Diameter     8.6 mm
Pixel size   6.0 × 6.0 µm²
Pixel depth  10 bit
Frame rate   30 fps

Table 4 Potensic monocular mini camera

Resolution   1280 × 720 pixels
Diameter     8.8 mm
Pixel size   10.0 × 10.0 µm²
Pixel depth  10 bit
Frame rate   30 fps

This metric measures the root-mean-square of the Euclidean distances between the estimated endoscopic capsule robot poses and the ground truth poses provided by the motion capture system. Table 5 shows the results of the trajectory estimation for six different trajectories. Trajectory 1 is an uncomplicated path with very slow incremental translations and rotations. Trajectory 2 follows a comprehensive scan of the stomach with many local loop closures. Trajectory 3 contains an extensive scan of the stomach with more complicated local loop closures. Trajectory 4 consists of more challenging motions, including fast rotational and translational frame-to-frame motions. Trajectory 5 is the same as trajectory 4, but with synthetic noise added to evaluate the robustness of the system against noise effects. Before capturing trajectory 6, we added more synthetic stomach oil to the simulator tissue to create heavier reflection conditions; similar to trajectory 5, trajectory 6 consists of very loopy and complex motions. As seen in Table 5, the system performs very robustly and accurately in terms of trajectory tracking in all of the challenge datasets. Tracking accuracy decreases only for very fast frame-to-frame movements, motion blur, noise, or heavy specular reflections, which occur frequently in the last trajectories especially.
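For reference, the metric itself reduces to a few lines; the helper below assumes the estimated and ground-truth positions are already time-synchronized and rigidly aligned in a common frame.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """ATE RMSE: root-mean-square of the Euclidean distances between
    estimated and ground-truth positions (N x 3 arrays, same frame)."""
    d = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return np.sqrt(np.mean(d ** 2))
```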

RMSE results for pose estimation before and after the application of reflection suppression, de-vignetting, and radial undistortion were evaluated and compared to quantitatively


Fig. 9 Schematics of the experimental setup for 3D visual map reconstruction: a real pig stomach, an esophagus gastroduodenoscopy simulator for surgical training, 3D image scanner, OptiTrack system, endoscopic camera, and active robotic capsule endoscope

Table 5 Comparison of ATE RMSE for different trajectories and cameras

        Length in cm   Potensic   Misumi-I   Misumi-II   Awaiba
Traj 1  123.5          4.10       4.23       4.17        6.93
Traj 2  132.4          4.14       4.45       4.32        7.12
Traj 3  124.6          5.23       5.54       5.43        7.42
Traj 4  128.2          5.53       5.67       5.47        7.51
Traj 5  128.2          6.32       5.45       5.32        8.32
Traj 6  123.1          7.73       6.72       6.51        8.73

analyze their effects on pose estimation accuracy. The results shown in Table 6 for the Misumi-II camera indicate that reflection suppression leads to a decrease in pose estimation performance. This decrease might be related to the fact that such saturated peak values contain orientation information. Thus, for pose estimation, reflection suppression should be avoided. On the other hand, the radial undistortion and de-vignetting operations both increase the pose estimation accuracy of the framework, as expected.

4.3 Surface reconstruction

We evaluated the surface reconstruction accuracy of our system on the same dataset that we used for the

Table 6 Comparison of ATE RMSE for the Misumi-II camera and different combinations of preprocessing operations

        RS     NPR    RS+RUD   RS+RUD+DV
Traj 1  5.45   4.12   4.01     4.03
Traj 2  6.44   4.23   4.07     4.04
Traj 3  6.57   5.13   4.97     4.98
Traj 4  7.55   5.34   5.16     5.08
Traj 5  8.43   5.43   5.14     5.02
Traj 6  8.69   5.64   5.25     5.12

NPR no preprocessing applied, RS reflection suppression applied, RUD radial undistortion applied, DV de-vignetting applied

Table 7 Comparison of surface reconstruction accuracy results on the evaluated datasets

        Depth   Potensic   Misumi-I   Misumi-II   Awaiba
Traj 1  63.42   2.82       2.32       2.14        3.42
Traj 2  63.45   2.56       2.45       2.16        4.14
Traj 3  63.41   3.16       2.76       2.45        4.45

Quantities shown are the mean distances from each point to the nearest surface in the ground truth 3D model, in cm

trajectory estimation experiments. We scanned the open non-rigid esophago-gastroduodenoscopy (EGD) simulator to obtain the ground truth 3D data using a highly accurate


Fig. 10 Qualitative 3D reconstructed map results for different cameras (real pig stomach (left), synthetic human stomach (right))

Table 8 Comparison of ATE RMSE for different trajectories and combinations of preprocessing operations on the evaluated dataset

        NPR    RSM    RSPM   RSPM+RUD   RSPM+RUD+DV
Traj 1  5.45   3.65   3.42   2.02       2.14
Traj 2  6.44   3.91   3.71   2.08       2.16
Traj 3  6.54   4.23   3.94   2.27       2.45
Traj 4  7.25   4.53   4.14   3.02       3.14
Traj 5  8.35   4.95   4.63   3.34       3.52
Traj 6  8.95   5.55   5.14   3.55       3.82

Quantities shown are the mean distances from each point to the nearest surface in the ground truth 3D model, in cm. NPR no preprocessing applied, RSPM reflection suppression applied for both pose estimation and map reconstruction, RSM reflection suppression applied only for map reconstruction, RUD radial undistortion applied, DV de-vignetting applied. The Misumi-II camera was used

commercial 3D scanner (Artec 3D Space Spider). The final 3D map of the stomach model obtained by the proposed framework and the ground truth scan were aligned using the iterative closest point (ICP) algorithm. The absolute depth error (ADE) RMSE was used to evaluate the performance of the map reconstruction approach; it measures the root-mean-square of the Euclidean distances between the estimated depth values and the corresponding ground truth depth values. The lowest RMSE of 2.14 cm (Table 7) shows that our system can achieve very high map accuracies. Even in more challenging trajectories such as trajectory 3, our system is still capable of providing an acceptable 3D map of the explored inner organ tissue. Three-dimensional reconstructed maps of the real pig stomach and the synthetic human stomach are shown in Fig. 10 for visual reference.
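A hedged sketch of this evaluation step using Open3D as a stand-in is shown below; the file names, correspondence distance, and the point-to-point ICP variant are illustrative assumptions rather than the exact tooling used in the paper.

```python
import numpy as np
import open3d as o3d

# Load the reconstructed map and the ground-truth scan (file names illustrative).
recon = o3d.io.read_point_cloud("reconstruction.ply")
gt = o3d.io.read_point_cloud("groundtruth_scan.ply")

# Rigid ICP alignment of the reconstruction to the ground truth.
reg = o3d.pipelines.registration.registration_icp(
    recon, gt,
    max_correspondence_distance=5.0,  # in map units, e.g., cm
    init=np.eye(4),
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
recon.transform(reg.transformation)

# Distance from each reconstructed point to the nearest ground-truth point,
# summarized as mean and RMSE (cf. Tables 7 and 8).
d = np.asarray(recon.compute_point_cloud_distance(gt))
print("mean distance:", d.mean(), "RMSE:", np.sqrt(np.mean(d ** 2)))
```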

To evaluate the contribution of each preprocessing module to the map reconstruction accuracy, we tested the approach with a leave-one-out strategy, disabling one module at a time. As shown in Table 8, each preprocessing operation has a certain influence on the RMSE results. One important observation is that even though pose accuracy increases in the presence of reflection points, these saturated pixels have a negative influence on the map accuracy, as expected. Therefore, disabling reflection suppression during pose estimation and enabling it for map reconstruction is the best option to follow.

4.4 Computational performance

To analyze the computational performance of the proposed framework, we determined the average frame pair processing time across the trajectory sequences. The test platform was a desktop PC with an Intel Xeon E5-1660 v3 CPU at 3.00 GHz, 8 cores, 32 GB of RAM, and an NVIDIA Quadro K1200 GPU with 4 GB of memory. Three-dimensional reconstruction of 100 frames took 80.54 s, whereas processing of 200 frames took 180.83 s and processing of 300 frames took 290.12 s. This indicates an average frame pair processing time of 919.15 ms, implying that our pipeline needs to be accelerated using more effective parallel computing and GPU power in order to reach real-time performance. To achieve this, we developed an RGB-Depth SLAM


method, which is capable of capturing comprehensive and globally dense surfel-based maps of the inner organs in real time by using joint photometric–volumetric pose alignment, dense frame-to-model camera tracking, and frequent model refinement through non-rigid surface deformations [1]. The execution time of the RGB-Depth SLAM depends on the number of surfels in the map, with an overall average of 48 ms per frame scaling to a peak average of 53 ms, implying a worst-case processing frequency of 18 Hz. Even though the RGB-Depth SLAM is much faster than our sparse-then-dense alignment-based 3D reconstruction method, the map quality decreases due to the use of surfel elements. Moreover, the joint photometric–volumetric pose alignment is prone to converging to local minima in low-textured areas. For further details of our RGB-Depth SLAM method, the reader is referred to our paper [1].

4.5 Conclusion

In this study, we proposed a therapeutically relevant and very detailed 3D map reconstruction approach for endoscopic capsule robots consisting of preprocessing, keyframe selection, sparse-then-dense pose estimation, frame stitching, and shading-based 3D reconstruction. Detailed quantitative and qualitative evaluations show that the proposed system achieves high precision in both 3D map reconstruction and pose estimation. In the future, we aim to achieve real-time operation of the proposed framework so that it can be used for active navigation of the robot during endoscopic operations as well. Moreover, we plan to incorporate a magnetic localization and scale estimation module into our method to develop even more robust endoscopic reconstruction tools.

Acknowledgements Open access funding provided by Max Planck Society.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Turan, M., Almalioglu, Y., Araujo, H., Konukoglu, E., Sitti, M.: A non-rigid map fusion-based direct SLAM method for endoscopic capsule robots. Int. J. Intell. Robot. Appl. (2017)
2. Turan, M., Almalioglu, Y., Araujo, H., Konukoglu, E., Sitti, M.: Deep EndoVO: a recurrent convolutional neural network (RCNN) based visual odometry approach for endoscopic capsule robots. Neurocomputing (2017)
3. Sitti, M., Ceylan, H., Hu, W., Giltinan, J., Turan, M., Yim, S., Diller, E.: Biomedical applications of untethered mobile milli/microrobots. Proceedings of the IEEE 103(2), 205–224 (2015)
4. Turan, M., Almalioglu, Y., Gilbert, H., Sari, A.E., Soylu, U., Sitti, M.: Endo-VMFuseNet: deep visual-magnetic sensor fusion approach for uncalibrated, unsynchronized and asymmetric endoscopic capsule robot localization data. arXiv:1709.06041 [cs.RO] (2017)
5. Turan, M., Shabbir, J., Araujo, H., Konukoglu, E., Sitti, M.: A deep learning based fusion of RGB camera information and magnetic localization information for endoscopic capsule robots. Int. J. Intell. Robot. Appl. (2017). https://doi.org/10.1007/s41315-017-0039-1
6. Devernay, F., Mourgues, F., Coste-Manière, É.: Towards endoscopic augmented reality for robotically assisted minimally invasive cardiac surgery. In: International Workshop on Medical Imaging and Augmented Reality (MIAR), pp. 16–20 (2001)
7. Hager, G., Vagvolgyi, B., Yuh, D.: Stereoscopic video overlay with deformable registration. In: Medicine Meets Virtual Reality (MMVR) (2007)
8. Su, L.M., Vagvolgyi, B.P., Agarwal, R., Reiley, C.E., Taylor, R.H., Hager, G.D.: Augmented reality during robot-assisted laparoscopic partial nephrectomy: toward real-time 3D-CT to stereoscopic video registration. Urology 73, 896–900 (2009)
9. Stoyanov, D., Scarzanella, M., Pratt, P., Yang, G.: Real-time stereo reconstruction in robotically assisted minimally invasive surgery. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 275–282 (2010)
10. Stoyanov, D., Mylonas, G., Deligianni, F., Darzi, A., Yang, G.: Soft-tissue motion tracking and structure estimation for robotic assisted MIS procedures. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), vol. 3759, pp. 114–121 (2005)
11. Wu, C., Narasimhan, S.G., Jaramaz, B.: A multi-image shape-from-shading framework for near-lighting perspective endoscopes. Int. J. Comput. Vis. 86, 211–228 (2010)
12. Yeung, S., Tsui, H., Yim, A.: Global shape from shading for an endoscope image. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 318–327 (1999)
13. Okatani, T., Deguchi, K.: Shape reconstruction from an endoscope image by shape from shading technique for a point light source at the projection center. Comput. Vis. Image Underst. 66, 119–131 (1997)
14. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47, 7–42 (2002)
15. Fan, Y., Meng, M.Q.-H., Li, B.: 3D reconstruction of wireless capsule endoscopy images. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology (2010)
16. Horn, B.: Shape from shading. Massachusetts Institute of Technology, Cambridge (1970)
17. Rai, L., Higgins, W.E.: Method for radiometric calibration of an endoscope's camera and light source. In: SPIE Medical Imaging: Visualization, Image-Guided Procedures, and Modeling, pp. 691–813 (2008)
18. Visentini-Scarzanella, M., Stoyanov, D., Yang, G.-Z.: Metric depth recovery from monocular images using shape-from-shading and specularities. In: IEEE International Conference on Image Processing (ICIP), Orlando, FL (2012)
19. Wang, R., et al.: Improving 3D surface reconstruction from endoscopic video via fusion and refined reflectance modeling (2017)
20. Zhao, Q., Price, T., Pizer, S., Niethammer, M., Alterovitz, R., Rosenman, J.: The endoscopogram: a 3D model reconstructed from endoscopic video frames. In: Ourselin, S., Joskowicz, L., Sabuncu, M., Unal, G., Wells, W. (eds.) Medical Image Computing and Computer-Assisted Intervention (MICCAI 2016), Lecture Notes in Computer Science, vol. 9900. Springer, Cham (2016)
21. Kaufman, A., Wang, J.: 3D surface reconstruction from endoscopic videos. In: Visualization in Medicine and Life Sciences, pp. 61–74. Springer, Berlin (2008)
22. Malti, A., Bartoli, A.: Combining conformal deformation and Cook–Torrance shading for 3-D reconstruction in laparoscopy. IEEE Trans. Biomed. Eng. 61(6), 1684–1692 (2014)
23. Malti, A., Bartoli, A., Collins, T.: Template-based conformal shape-from-motion-and-shading for laparoscopy. In: International Conference on Information Processing in Computer-Assisted Interventions. Springer, Berlin (2012)
24. Nadeem, S., Kaufman, A.: Depth reconstruction and computer-aided polyp detection in optical colonoscopy video frames. arXiv:1609.01329 (2016)
25. Abu-Kheil, Y., Ciuti, G., Mura, M., Dias, J., Dario, P., Seneviratne, L.: Vision and inertial-based image mapping for capsule endoscopy. In: 2015 International Conference on Information and Communication Technology Research (ICTRC) (2015)
26. Telea, A.: An image inpainting technique based on the fast marching method. J. Graph. GPU Game Tools 9, 23–34 (2004)
27. Conrady, A.: Decentering lens systems. Mon. Not. R. Astron. Soc. 79, 384–390 (1919)
28. Zheng, Y., Yu, J., Kang, S.B., Lin, S., Kambhamettu, C.: Single-image vignetting correction using radial gradient symmetry. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 562–576 (2008)
29. Burt, P.J., Adelson, E.H.: A multiresolution spline with application to image mosaics. ACM Trans. Graph. 2(4), 217–236 (1983)
30. Wu, C., Agarwal, S., Curless, B., Seitz, S.M.: Multicore bundle adjustment. In: CVPR (2011)
31. Brown, M., Lowe, D.: Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis. 74(1), 59–73 (2007)
32. Tsai, P.-S., Shah, M.: Shape from shading using linear approximation. Image Vis. Comput. 12(8), 487–498 (1994)
33. Elhabian, S.Y.: Hands on shape from shading. Technical report (2008)
34. Yim, S., Sitti, M.: Design and rolling locomotion of a magnetically actuated soft capsule endoscope. IEEE Trans. Robot. 28, 183–194 (2012)
35. Yim, S., Goyal, K., Sitti, M.: Magnetically actuated soft capsule with multi-modal drug release function. IEEE/ASME Trans. Mechatron. 18, 1413–1418 (2013)

Mehmet Turan received his Diploma Degree from the Information Technology and Electronics Engineering Department of RWTH Aachen, Germany in 2012. He was a research scientist at UCLA (University of California Los Angeles) between 2013 and 2014 and has been a research scientist at the Max Planck Institute for Intelligent Systems since 2014. He is currently enrolled as a Ph.D. student at ETH Zurich, Switzerland. He is also affiliated with the Max Planck-ETH Center for Learning Systems, the first joint research center of ETH Zurich and the Max Planck Society. His research interests include SLAM (simultaneous localization and mapping) techniques for milli-scale medical robots and deep learning techniques for medical robot localization and mapping. He received a DAAD fellowship between 2005 and 2011 and has held a Max Planck Fellowship since 2014 and an MPI-ETH Center fellowship since 2016.

Yusuf Yigit Pilavci received his Bachelor Degree from Electrical and Electronics Engineering of Middle East Technical University, Ankara in 2017. He worked as an undergraduate researcher focusing on image processing, computer vision, machine learning and artificial intelligence. Currently, he pursues his master degree in Computer Science and Engineering at Politecnico di Milano, Italy. Additionally, he is working on graph signal processing and domain adaptation problems.

Ipek Ganiyusufoglu pursues a B.Sc. degree in the Department of Computer Science, Sabanci University, Turkey. Besides smaller projects, her current interests include computer graphics, interaction and vision, which she plans to focus on further when doing her masters.

Helder Araujo is a Professor at the Department of Electrical and Computer Engineering of the University of Coimbra. His research interests include computer vision applied to robotics, robot navigation and visual servoing. In the last few years he has been working on non-central camera models, including aspects related to pose estimation, and their applications. He has also developed work in active vision and on control of active vision systems. Recently he has started work on the development of vision systems applied to medical endoscopy.


Ender Konukoglu finished his Ph.D. at INRIA Sophia Antipolis in 2009. From 2009 till 2012 he was a post-doctoral researcher at Microsoft Research Cambridge. From 2012 till 2016 he was a junior faculty member at the Athinoula A. Martinos Center affiliated to Massachusetts General Hospital and Harvard Medical School. Since 2016 he is an Assistant Professor of Biomedical Image Computing at ETH Zurich. He is interested in developing computational tools and mathematical methods for analysing medical images with the aim to build decision support systems. He develops algorithms that can automatically extract quantitative image-based measurements, statistical methods that can perform population comparisons and biophysical models that can describe physiology and pathology.

Dr. Metin Sitti received the B.Sc. and M.Sc. degrees in electrical and electronics engineering from Bogazici University, Istanbul, Turkey, in 1992 and 1994, respectively, and the Ph.D. degree in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1999. He was a research scientist at UC Berkeley during 1999–2002. He has been a professor in the Department of Mechanical Engineering and Robotics Institute at Carnegie Mellon University, Pittsburgh, USA since 2002. He is currently a director at the Max Planck Institute for Intelligent Systems in Stuttgart. His research interests include small-scale physical intelligence, mobile microrobotics, bio-inspired materials and miniature robots, soft robotics, and micro-/nanomanipulation. He is an IEEE Fellow. He received the SPIE Nanoengineering Pioneer Award in 2011 and the NSF CAREER Award in 2005. He has received many best paper, video and poster awards at major robotics and adhesion conferences. He is the editor-in-chief of the Journal of Micro-Bio Robotics.
