
DeepTIO: A Deep Thermal-Inertial Odometry with Visual Hallucination

Muhamad Risqi U. Saputra, Pedro P. B. de Gusmao, Chris Xiaoxuan Lu, Yasin Almalioglu, Stefano Rosa, Changhao Chen, Johan Wahlström, Wei Wang, Andrew Markham, and Niki Trigoni

Abstract— Visual odometry shows excellent performance in a wide range of environments. However, in visually-denied scenarios (e.g. heavy smoke or darkness), pose estimates degrade or even fail. Thermal imaging cameras are commonly used for perception and inspection when the environment has low visibility. However, their use in odometry estimation is hampered by the lack of robust visual features. In part, this is a result of the sensor measuring the ambient temperature profile rather than scene appearance and geometry. To overcome these issues, we propose a Deep Neural Network model for thermal-inertial odometry (DeepTIO) by incorporating a visual hallucination network to provide the thermal network with complementary information. The hallucination network is taught to predict fake visual features from thermal images by using the robust Huber loss. We also employ selective fusion to attentively fuse the features from three different modalities, i.e. thermal, hallucination, and inertial features. Extensive experiments are performed on our large-scale hand-held dataset in benign and smoke-filled environments, showing the efficacy of the proposed model.

I. INTRODUCTION

Camera pose estimation is a key enabler for a wide range of applications in robotics and computer vision. Primary examples include position tracking of mobile robots, autonomous vehicles, pedestrians, or mobile devices for augmented reality applications. Visual Odometry (VO) is the de facto solution for estimating camera pose. Many VO techniques have been proposed, ranging from traditional feature-based approaches [1], [2], [3] to the more recently developed Deep Neural Network (DNN) based approaches [4], [5], [6], [7]. While VO is useful in a number of scenarios, its application is limited to those with sufficient illumination. For instance, VO systems fail to localize aerial robots in dim underground tunnels [8] or to track a firefighter in emergency response scenarios in the presence of airborne particulates (e.g., smoke and soot). In contrast, thermal cameras are not affected by illumination conditions or airborne particulates, making them a viable sensing alternative to RGB cameras.

Although thermal cameras have been commonly used in visually-denied environments, their use cases are largely limited to perception and inspection [9], [10]. The main hindrance preventing their usage in odometry estimation is the lack of visual features (e.g. edges and textures) in the imaging system. Thermal cameras capture the radiation emitted from objects in the Long-Wave Infrared (LWIR) portion of the spectrum. These raw radiometric data are then converted to a temperature profile represented in a visible format (e.g. grayscale) to ease human interpretation [11]. As the camera

The authors are with the Department of Computer Science, University of Oxford, UK. [email protected]

Fig. 1. The architecture of DeepTIO at test time. DeepTIO not only extracts thermal features but also hallucinates visual features (fake RGB latent features) to provide additional information for accurate odometry.

captures the environmental temperature rather than the scene appearance and geometry, it is difficult to extract sufficient hand-engineered features to accurately estimate pose. Moreover, even for the same scene, the extracted features are dependent on the temperature gradient. This issue is further compounded by the fact that every thermal camera is plagued with fixed-pattern noise and requires frequent re-calibration during operation through Non-Uniformity Correction (NUC), which periodically freezes the images for about half to one second [12] every 30-150 seconds.

The last decade has witnessed a rapid development in the use of deep learning for automatically extracting salient features by directly learning a non-linear mapping function from data. We believe that, with sufficient training data, a DNN can also learn to infer 6 Degree-of-Freedom (DoF) poses from a sequence of thermal images. However, despite the DNN's ability to model this complexity, as stated before, thermal images are largely textureless and inherently lack sufficient features for accurate odometry estimation. Our novel intuition to alleviate this issue is to force our network not only to extract features from thermal images, but additionally to learn to hallucinate visual features similar to the ones extracted by a DNN-based VO, which have been proven to work well [4], [5], [6], [13]. Given sufficient training data, we hypothesize that hallucinating visual features is possible and can provide the thermal network with auxiliary information for accurate odometry estimation.

In this paper, we propose a DNN-based thermal-inertial odometry which is able to estimate accurate camera pose by not only extracting features from thermal images, but also hallucinating the visual features given thermal images as



input. We also fuse the thermal image stream with Inertial Measurement Unit (IMU) data to improve pose estimation robustness, exploiting the environment-agnostic characteristic of inertial sensing. To this end, we employ selective fusion [14] to adaptively fuse the different modalities conditioned on the input data. In summary, our key contributions are as follows:

• We propose the first end-to-end trainable Deep Thermal-Inertial Odometry (DeepTIO) model.

• We present a novel deep neural odometry architecture incorporating a hallucination network.

• We present a new application of selective fusion with input from three feature channels, i.e. thermal, IMU, and hallucinated visual features.

• We perform extensive experiments and analysis on our self-collected hand-held dataset in both benign and smoke-filled environments.

II. RELATED WORK

A. Thermal Odometry

Accurately estimating camera ego-motion from a thermal imaging system remains a challenging problem. Some efforts have been made towards thermal odometry systems, although these are limited to relatively short distances and yield sub-optimal performance compared to visible camera systems. Existing works rely on either sparse feature-based or direct approaches. Mouats et al. [15] employed a Fast-Hessian feature detector for UAV tracking using a stereo thermal camera. Khattak et al. [16] developed a keyframe-based direct approach which minimizes the radiometric error (raw thermal data) between consecutive frames. Borges and Vidas [17] designed a practical thermal odometry system with an automatic mechanism to determine when to perform the NUC operation based on the current and the predicted poses. To improve robustness during NUC operation, most works combine thermal imaging with other modalities such as visual [18], [19] or inertial [20], [16].

B. DNN-based Odometry

Due to the advancement of DNNs, learning-based odometry has recently gained favor. Wang et al. [4] started this trend by introducing an end-to-end trainable deep visual odometry method (DeepVO), composing the feature extraction capabilities of Convolutional Neural Networks (CNN) with the ability to model long-term camera pose dependencies using a Recurrent Neural Network (RNN). This was then followed by other improvements such as enforcing consistency among multiple poses [5] or introducing additional learning signals through multi-task learning such as global pose localization [21] or semantic segmentation [22]. Other works improved the robustness of VO by fusing visual and inertial streams [23] and performing selective fusion between visual and inertial features [14]. In parallel to these supervised approaches, many self-supervised DNN-based VO approaches were also developed by leveraging the view reconstruction paradigm, started by Zhou et al. [24], followed by adding stereo information [25], [7], [26] or generative networks [6], [27]. While many works exist for

DNN-based odometry, to the best of our knowledge none of them uses a thermal camera as input.

C. Learning with Side Information

Related to our work is the concept of learning with side information. Hoffman et al. [28] introduced this concept by incorporating a depth hallucination network to increase the accuracy of object detection in RGB images. This concept was then adopted in other applications such as learning hand articulations [29] or face recognition [30]. Our work introduces this concept to odometry regression and trains the whole network with the non-trivial Huber loss. We are the first to hallucinate visual features from thermal images for odometry regression.

III. NETWORK ARCHITECTURE

In this section we describe our proposed DeepTIO model for estimating thermal-inertial odometry. Fig. 1 illustrates the general architecture of the DeepTIO model at inference time. It is composed of a feature encoder, a selective fusion module, and a pose regressor. The feature encoder extracts salient features from each modality. We use a CNN for encoding thermal data and hallucinating visual features from thermal images. To extract features from the IMU data stream we employ an RNN, as RNNs are better suited to modelling the temporal dependencies of time-series data [31]. The feature vectors generated by the IMU, thermal, and hallucination encoder networks are input to the selective fusion module, which attentively selects the features that are necessary for pose regression. The reweighted features are then fed into the pose regression module to infer 6-DoF relative camera poses. The details of each module are described below.

A. Feature Encoder

Given a pair of consecutive thermal images xT ∈ IR^(2×(w×h×c)), the purpose of the thermal encoder network is to extract geometrically meaningful features for movement estimation (e.g. optical flow captured between moving edges). To this end, both the thermal encoder ΨT and the hallucination encoder ΨH are implemented and pre-initialized with the FlowNetSimple structure [32]. As the observed temperature profile (in grayscale) fluctuates when the camera captures hotter objects, we directly use the 16-bit raw radiometric data to obtain more stable inputs. Since raw radiometric data are represented by only one channel, we duplicate them into three channels before feeding them into the FlowNet structure. We use the last output activation from both ΨT and ΨH as our thermal features aT and visual hallucinated features aH:

aT = ΨT(xT),   aH = ΨH(xT).    (1)

We employ a single LSTM layer with 256 hidden states as the IMU encoder ΨI. The 6-dimensional inertial data, in sequences of 20 frames xI ∈ IR^(6×20), are fed into the IMU encoder ΨI to produce IMU features

aI = ΨI(xI). (2)
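For concreteness, below is a minimal sketch of the two image encoders and the IMU encoder, assuming a PyTorch-style implementation (the paper does not name its framework). The convolutional stack only stands in for the FlowNetSimple encoder [32]; apart from the details stated in the text (three-channel duplication of the radiometric frame, a single LSTM layer with 256 hidden states, a 20-frame IMU window), all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Stand-in for the FlowNetSimple encoder used for both the
    thermal encoder Psi_T and the hallucination encoder Psi_H."""
    def __init__(self, in_channels=6):  # two stacked 3-channel frames
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(512, 1024, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)  # last output activation, Eq. (1)

class ImuEncoder(nn.Module):
    """Single-layer LSTM with 256 hidden states for the IMU stream."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=6, hidden_size=256, batch_first=True)

    def forward(self, x_i):                  # x_i: (batch, 20, 6)
        out, _ = self.lstm(x_i)
        return out.reshape(out.size(0), -1)  # flatten to a_I, Eq. (2)

# Radiometric frames have one channel; duplicate each to three channels
# and stack the two consecutive frames along the channel axis.
x_t  = torch.randn(1, 1, 348, 464)           # frame at t
x_t1 = torch.randn(1, 1, 348, 464)           # frame at t-1
pair = torch.cat([x_t1.repeat(1, 3, 1, 1), x_t.repeat(1, 3, 1, 1)], dim=1)

psi_T, psi_H, psi_I = ConvEncoder(), ConvEncoder(), ImuEncoder()
a_T, a_H = psi_T(pair), psi_H(pair)          # thermal and hallucinated features
a_I = psi_I(torch.randn(1, 20, 6))           # a_I has 20 * 256 = 5120 elements
```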


Fig. 2. The architecture of DeepTIO at training time. A deep visual-inertial odometry model (VINet) with an RGB network processes RGB images and IMU data, while DeepTIO processes the (channel-copied) thermal radiometric data and IMU data; the hallucination loss ties the hallucination network to the RGB network, and both models are trained with a regression loss. Note how RGB images are used to guide the visual hallucination.

To balance the number of features, we perform average pooling for aT and aH, such that the final dimensions of the features are aT ∈ IR^2048, aH ∈ IR^2048, and aI ∈ IR^5120.
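A minimal illustration of this dimension balancing, assuming PyTorch's adaptive average pooling; only the target sizes 2048, 2048, and 5120 come from the text, and the pooled spatial shape is illustrative.

```python
import torch
import torch.nn as nn

# a_T, a_H: (batch, 1024, H, W) activation maps from the encoders above.
a_T = torch.randn(1, 1024, 11, 15)
a_H = torch.randn(1, 1024, 11, 15)
a_I = torch.randn(1, 5120)              # flattened IMU features

# Average-pool the spatial maps so each visual-style feature vector
# has 2048 elements (1024 channels x 2 x 1 here, purely illustrative).
pool = nn.AdaptiveAvgPool2d((2, 1))
a_T = pool(a_T).flatten(1)              # (1, 2048)
a_H = pool(a_H).flatten(1)              # (1, 2048)
assert a_T.shape[-1] == 2048 and a_I.shape[-1] == 5120
```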

B. Selective Fusion

In deep learning-based VIO, the standard way to fuse feature vectors coming from different modalities is concatenation. However, directly fusing all feature modalities by concatenation results in sub-optimal performance, as not all features are useful and necessary [14]. The situation is exacerbated by the intrinsic noise distribution of each modality. In our case, thermal data are plagued by fixed-pattern noise, while IMU data are affected by white random noise and sensor bias. Moreover, the hallucination network might produce erroneous visual features. Finally, in real applications there is a high chance that the different modalities, as well as the ground truth poses, will not be tightly synchronized.

To this end, we employ selective fusion [14] to let the network automatically learn the most suitable feature combination given the feature inputs. Specifically, deterministic soft fusion is employed to attentively fuse features from the three sources, with compensation for possible misalignment between inputs and ground truth. The fusion module learns to re-weight each feature by conditioning on all channels. The corresponding masks for the thermal (mT), hallucination (mH), and inertial (mI) features are learnt via:

mT = σ(WT [aT; aH; aI])
mH = σ(WH [aT; aH; aI])
mI = σ(WI [aT; aH; aI]),    (3)

where [aT; aH; aI] denotes the concatenation of the features from all channels, σ(x) = 1/(1 + e^−x) is the sigmoid function, and WT, WH, and WI are the learnable weights for each feature modality. These masks are used to weight the relative importance of the feature modalities by multiplying them, via the element-wise operation ⊙, with their corresponding features:

afused = [aT ⊙ mT; aH ⊙ mH; aI ⊙ mI].    (4)

Finally, the merged features afused are fed to the pose regressor network to estimate the 6-DoF poses.
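The deterministic soft fusion of Eqs. (3)-(4) can be written compactly as below, assuming a PyTorch-style implementation in which each W is a learnable linear layer over the concatenated features; only the feature dimensions come from the text.

```python
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Deterministic soft fusion: one sigmoid mask per modality,
    each conditioned on the concatenation of all three features."""
    def __init__(self, dim_t=2048, dim_h=2048, dim_i=5120):
        super().__init__()
        total = dim_t + dim_h + dim_i
        self.w_t = nn.Linear(total, dim_t)   # W_T in Eq. (3)
        self.w_h = nn.Linear(total, dim_h)   # W_H
        self.w_i = nn.Linear(total, dim_i)   # W_I

    def forward(self, a_t, a_h, a_i):
        joint = torch.cat([a_t, a_h, a_i], dim=-1)
        m_t = torch.sigmoid(self.w_t(joint))
        m_h = torch.sigmoid(self.w_h(joint))
        m_i = torch.sigmoid(self.w_i(joint))
        # Eq. (4): element-wise re-weighting, then concatenation.
        return torch.cat([a_t * m_t, a_h * m_h, a_i * m_i], dim=-1)

fusion = SelectiveFusion()
a_fused = fusion(torch.randn(1, 2048), torch.randn(1, 2048), torch.randn(1, 5120))
```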

C. Pose Regressor

The pose regressor consists of LSTM layers followed by two parallel Fully Connected (FC) branches that estimate relative translation and rotation respectively. We use an LSTM to model the long-term temporal dependencies of camera ego-motion as in [4], [5]. Each LSTM has 512 hidden states and takes the reweighted features afused as input. The output latent vectors from the LSTMs are then fed into three FC layers with 128, 64, and 3 units respectively. We decouple the FC layers for translation and rotation as this has been shown to work better than a joint branch [33]. We also use a dropout [34] rate of 0.25 between the FC layers to help regularization.
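A sketch of the pose regressor under the stated hyper-parameters (512 LSTM hidden states, FC branches of 128, 64, and 3 units, dropout 0.25), assuming PyTorch; the two-layer LSTM depth and the placement of the activations are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    def __init__(self, in_dim=2048 + 2048 + 5120):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, 512, num_layers=2, batch_first=True)

        def branch():                      # 128 -> 64 -> 3 units, dropout 0.25
            return nn.Sequential(
                nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.25),
                nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.25),
                nn.Linear(64, 3))

        self.trans = branch()              # relative translation branch
        self.rot = branch()                # relative rotation branch (Euler angles)

    def forward(self, a_fused):            # a_fused: (batch, seq_len, in_dim)
        out, _ = self.lstm(a_fused)
        return self.trans(out), self.rot(out)

regressor = PoseRegressor()
t_hat, r_hat = regressor(torch.randn(1, 8, 2048 + 2048 + 5120))
```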

IV. LEARNING MECHANISM

This section introduces the mechanism used to train the hallucination network and to learn odometry regression.

A. Learning Visual Hallucination

The visual hallucination network ΨH is intended to provide additional information alongside the thermal encoder ΨT. Given the original thermal images xT as input, this module produces visual hallucination vectors aH that imitate the visual features aV obtained from a real RGB image encoded by a visual encoder ΨV. In order to acquire a pseudo ground truth of visual features, we employ a modified deep Visual-Inertial Odometry (VIO) model, namely VINet [23]. The only difference is that we utilize FlowNetSimple as the feature extractor instead of the FlowNetCorr [32] used in the original VINet. This modification allows the hallucination features aH and the visual features aV to have the same dimension, simplifying the training process. After training the VINet model, the weights WV of the visual encoder ΨV are frozen during the training of the hallucination network, while the hallucination encoder ΨH's weights WH remain trainable.

Fig. 2 illustrates the architecture of our visual hallucination model during training. We train the hallucination network by minimizing the discrepancy ξ between the output activations of ΨH and ΨV. The standard L2 norm is generally used to minimize ξ in benign cases [28], [35]. However, the thermal camera requires periodic NUC calibration, during which the same image is output for between half a second and one second. NUC thus forces several identical thermal features to be matched with different visual features during network training. This process can produce an erroneous mapping between aH and aV and contaminate ξ with outliers. Since the L2 loss is very sensitive to outliers, encountering some during training will impact gradient back-propagation, as the outliers will dominate the loss and impair convergence.
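To make this two-stage feature-matching setup concrete, here is a minimal sketch assuming a PyTorch-style implementation; the tiny stand-in encoders, tensor sizes, and the helper name `hallucination_step` are illustrative, and the loss argument shown is the plain L2 baseline discussed above (the outlier-robust Huber variant of Eq. (5) is sketched in the next subsection).

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the FlowNetSimple-based encoders sketched earlier;
# both map an image pair to a 2048-dimensional feature vector.
def make_encoder():
    return nn.Sequential(nn.Conv2d(6, 64, 7, stride=2, padding=3),
                         nn.AdaptiveAvgPool2d((4, 8)), nn.Flatten())

psi_V, psi_H = make_encoder(), make_encoder()
for p in psi_V.parameters():          # W_V is frozen after VINet training
    p.requires_grad = False

optimizer = torch.optim.Adam(psi_H.parameters(), lr=1e-4)

def hallucination_step(x_thermal, x_rgb, loss_fn):
    """One training step: make fake RGB features imitate real RGB features."""
    a_H = psi_H(x_thermal)            # hallucinated features from thermal input
    with torch.no_grad():
        a_V = psi_V(x_rgb)            # pseudo ground truth from the frozen Psi_V
    loss = loss_fn(a_H, a_V)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# L2 baseline; Eq. (5) replaces this with the outlier-robust Huber loss.
loss = hallucination_step(torch.randn(2, 6, 348, 464),
                          torch.randn(2, 6, 480, 848), nn.MSELoss())
```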


To improve robustness against outliers, we instead propose to use the Huber loss H [36] to minimize ξ. Our hallucination loss Lhallucinate is then formally defined as follows:

Lhallucinate = (1/n) Σ_{i=1}^{n} H_i(ξ),
with ξ = ΨH(xT; WH) − ΨV(xV; WV),
and H(ξ) = { (1/2)‖ξ‖²          for ‖ξ‖ ≤ δ,
             δ(‖ξ‖ − (1/2)δ)    otherwise,    (5)

where δ is a threshold and n is the batch size during training. With the Huber loss, values of ξ larger than δ have a linear effect instead of a quadratic one, making the loss less sensitive to outliers. Loss values below δ are still minimized with the quadratic term to enable fast convergence. During training, we use δ = 1.0.
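A direct transcription of Eq. (5), assuming the residual is flattened per sample before taking its norm; the function names `huber` and `hallucination_loss` are illustrative, with δ = 1.0 as in the text.

```python
import torch

def huber(residual: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """H(xi) of Eq. (5), applied to the per-sample residual norm."""
    norm = residual.flatten(1).norm(dim=1)          # ||xi|| for each sample
    quadratic = 0.5 * norm ** 2
    linear = delta * (norm - 0.5 * delta)
    return torch.where(norm <= delta, quadratic, linear)

def hallucination_loss(a_H: torch.Tensor, a_V: torch.Tensor) -> torch.Tensor:
    """L_hallucinate: mean of H(xi) over the batch, xi = Psi_H(x_T) - Psi_V(x_V)."""
    return huber(a_H - a_V).mean()

# Example with dummy 2048-dimensional features for a batch of n = 8.
loss = hallucination_loss(torch.randn(8, 2048), torch.randn(8, 2048))
```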

B. Learning Odometry Regression

We train the network to estimate odometry by minimizing the loss between the predicted pose and the ground truth pose. This task is essentially learning a mapping function from the input to the output {(xT; xI)_{1:N}} → {(IR^6)_{1:N}}, where N is the number of training samples. The pose regressor network, together with all other networks except the hallucination part, is trained using the following regression loss:

Lregress = (1/n) Σ_{i=1}^{n} H_i(t̂ − t) + α H_i(r̂ − r),    (6)

where H is the Huber loss as in (5), and [t̂, r̂] and [t, r] are the translation and rotation components of the predicted poses and the ground truth poses respectively. We use Euler angles to represent rotation since they converge faster, being free from the constraints of other representations (e.g. rotation matrix, quaternion). We use α = 0.001 to balance the loss between translation and rotation.
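A sketch of the regression loss of Eq. (6) with α = 0.001; the per-sample Huber on the residual norm is repeated from the previous sketch, and treating the Euler-angle residual as a plain difference follows the text.

```python
import torch

def huber(residual: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    # Same H(.) as in the hallucination-loss sketch above.
    norm = residual.flatten(1).norm(dim=1)
    return torch.where(norm <= delta,
                       0.5 * norm ** 2,
                       delta * (norm - 0.5 * delta))

def regression_loss(t_hat, r_hat, t_gt, r_gt, alpha: float = 0.001):
    """L_regress of Eq. (6): Huber on translation plus alpha-weighted
    Huber on rotation (Euler angles), averaged over the batch."""
    return (huber(t_hat - t_gt) + alpha * huber(r_hat - r_gt)).mean()

# Example with a batch of n = 8 relative poses (3 translation + 3 Euler angles).
loss = regression_loss(torch.randn(8, 3), torch.randn(8, 3),
                       torch.randn(8, 3), torch.randn(8, 3))
```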

C. Training Details

The architecture is trained in two stages. In the first stage we train the hallucination network, while in the second stage we train the remaining networks. Note that, in the second stage, we freeze the hallucination network so that the unstable learning process at the beginning of training the other networks does not alter the learnt hallucination weights from the first stage. We train the losses separately as this empirically shows better results. We use the Adam optimizer with a 0.0001 learning rate to train the hallucination network for 200 epochs. For training the remaining networks in the second stage we employ RMSProp with a 0.001 initial learning rate, dropping it by 25% every 25 epochs for a total of 200 epochs. We normalize the input radiometric data by subtracting the mean over the dataset. We randomly cut the training sequences into small batches of consecutive pairs (n = 8) to obtain better generalization. We also sub-sample the input such that the frame rate is around 4-5 fps, to provide sufficient parallax between two consecutive frames. To further fine-tune the network, we alternately freeze and train the selective fusion and the pose regressor.
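A sketch of the second-stage optimisation schedule as stated above (RMSProp, 0.001 initial learning rate, dropped by 25% every 25 epochs, 200 epochs), assuming PyTorch; the placeholder model and the commented-out data loop are not part of the paper.

```python
import torch
import torch.nn as nn

# Placeholder for the networks trained in the second stage (thermal encoder,
# IMU encoder, selective fusion, pose regressor); the hallucination encoder
# is excluded because its weights are frozen at this point.
model = nn.Linear(10, 6)

optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
# Multiply the learning rate by 0.75 every 25 epochs ("dropping by 25%").
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.75)

for epoch in range(200):
    # for batch in loader:          # batches of n = 8 consecutive pairs,
    #     train_step(batch)         # inputs sub-sampled to roughly 4-5 fps
    scheduler.step()
```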

Fig. 3. The hallucination network is validated by feeding the fake RGB latent features to the original VINet (its recurrent and regression networks, together with the IMU stream) and measuring the pose estimation discrepancy.

V. EXPERIMENT

A. Dataset

The thermal data was collected using a hand-held FLIR E95 camera at 60 fps with 464×348 image resolution, while inertial measurements were captured using an XSens MTI-1 Series IMU. As we gathered the data mostly in public spaces, real ground truth poses are not available. Instead, we utilize VINS-Mono (a visual-inertial SLAM system) [37] to provide pseudo ground truth. For this purpose, we also collected RGB-D data at 30 fps using an Intel RealSense D435 depth camera at 848×480 image resolution.

We collected data from five different buildings, including a real firefighter training facility. In total, we collected 19 sequences in different places including a library, an open office, a large room, an apartment, a corridor, an underground storage area, and an actual smoke-filled environment. The sequences are divided into two groups, one with good time alignment and another with slight misalignment among the sensor modalities. The misaligned sequences can be used to test the robustness of the model against unsynchronized inputs. We use 13 sequences for training (about half with bad and half with good alignment) and use the remaining data for testing.

B. Evaluation Metrics

To evaluate the proposed model, we utilize the Mean Square (MS) of the Relative Pose Error (RPE) and the Absolute Trajectory Error (ATE), since they have been widely used for measuring VO or visual SLAM accuracy [38]. Since VINS-Mono generates trajectories through a key-frame selection process, its estimated poses are unaligned in time with the poses produced by our model. To this end, we align the predicted poses with VINS-Mono using the closed-form Horn approach and evaluate only the poses that are closest in time. We use the evaluation tools from the TUM RGB-D dataset for this purpose1.
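For reference, a minimal sketch of this evaluation protocol: predicted poses are matched to the temporally closest pseudo-ground-truth keyframes and rigidly aligned with a closed-form (Horn-style) fit before computing the ATE RMSE. This mirrors what the TUM RGB-D tools do but is an independent illustration, not the tool itself; the function names and the 0.02 s association threshold are assumptions.

```python
import numpy as np

def associate(t_pred, t_ref, max_dt=0.02):
    """Match each reference timestamp to the closest predicted timestamp."""
    idx = np.abs(t_pred[None, :] - t_ref[:, None]).argmin(axis=1)
    keep = np.abs(t_pred[idx] - t_ref) < max_dt
    return idx[keep], np.nonzero(keep)[0]     # indices into pred and ref

def align_ate(pred_xyz, ref_xyz):
    """Closed-form rigid alignment (Horn/Kabsch style), then ATE RMSE."""
    mu_p, mu_r = pred_xyz.mean(0), ref_xyz.mean(0)
    P, Q = pred_xyz - mu_p, ref_xyz - mu_r
    U, _, Vt = np.linalg.svd(Q.T @ P)         # cross-covariance SVD
    S = np.diag([1, 1, np.sign(np.linalg.det(U @ Vt))])
    R = U @ S @ Vt                            # optimal rotation
    t = mu_r - R @ mu_p                       # optimal translation
    err = ref_xyz - (pred_xyz @ R.T + t)
    return np.sqrt((err ** 2).sum(1).mean())

rmse = align_ate(np.random.randn(100, 3), np.random.randn(100, 3))
```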

C. Sensitivity Analysis

To understand the influence of the hallucination network, we perform a sensitivity analysis in this section using the test sequence in Corridor 2.

1https://vision.in.tum.de/data/datasets/rgbd-dataset


Fig. 4. Relative Pose Error (RPE) distribution for VINet and Fake VINet, for both translation (RPE translation, m) and rotation (RPE rotation, degrees).

TABLE I
RPE BETWEEN VINET AND FAKE VINET

Model               | t (m)  | r (°)
VINet               | 0.1124 | 5.1954
Fake VINet (L2)     | 0.1197 | 5.1926
Fake VINet (Huber)  | 0.1128 | 5.0739

1) Validating the Hallucination Network: To validate the hallucination network, we replace the visual encoder network of VINet with the hallucination network from DeepTIO, as seen in Fig. 3. By feeding the hallucinated visual features (fake RGB features) to the original VINet, we can measure how accurate the learnt representation produced by the hallucination network is. Fig. 4 shows the distribution of the RPE for VINet and Fake VINet (VINet with fake RGB features as input). It can be seen that the error distributions for both translation and rotation are very similar, showing the success of training the hallucination network. Table I shows how close the average RPE of VINet and Fake VINet are. Surprisingly, the Fake VINet obtains a slightly better result for rotation estimation, showing the efficacy of training with the Huber loss. Fig. 5 visualizes the output features from VINet and from the hallucination network on the test sequence (Corridor 2). It can be seen that the network can hallucinate visual features accurately (top), although there are also cases where the hallucination network produces erroneous features (bottom) due to blurriness or a lack of thermal edges. In such cases, selective fusion plays an important role in selecting only relevant information from the hallucination network. It can be seen that in the erroneous example, DeepTIO's selective fusion produces less dense fusion masks, indicating that fewer features are being used.

2) The Influence of Each Feature Modality: To understand the influence of each feature modality, we decouple each modality, train it separately, and test the result. The results can be seen in Table II and show that thermal alone yields the worst accuracy, implying the difficulty of estimating odometry solely from temperature profile information. IMU alone clearly shows much stronger performance, although the optimal solution may be to produce only 3-DoF poses (instead of 6-DoF), as seen in [39], since there is not enough information in the IMU data to produce accurate 6-DoF poses. Incorporating IMU with thermal features or fake RGB features improves the accuracy (ATE), as the thermal or visual features constrain the IMU

TABLE II
THE IMPACT OF EACH FEATURE MODALITY AND SELECTIVE FUSION

Features                | SF† | t (m)  | r (°)  | ATE (m)
Thermal                 | -   | 0.1497 | 6.5839 | 6.8347
IMU                     | -   | 0.1204 | 5.0151 | 1.7779
IMU+Thermal*            | -   | 0.1133 | 5.3112 | 1.4731
IMU+Thermal+            | ✓   | 0.1192 | 5.0461 | 0.7122
IMU+Fake RGB*           | -   | 0.1153 | 5.2378 | 1.5021
IMU+Fake RGB+           | ✓   | 0.1090 | 5.2300 | 1.0280
IMU+Thermal+Fake RGB*   | -   | 0.1080 | 5.2127 | 1.2824
IMU+Thermal+Fake RGB+   | ✓   | 0.1074 | 4.8826 | 0.5267

* has 52 M weights, while + has 136 M weights.
† Whether Selective Fusion (SF) is employed or not.

error growth. Adding fake RGB features to the IMU+Thermal model further reduces the ATE, indicating that the hallucinated visual features help generate more accurate poses. Note that all feature fusions (with the same mark in Table II) have the same network capacity, indicating that the improved accuracy is due to more useful information rather than increased network capacity.

3) The Influence of Selective Fusion: As seen in Table II, incorporating selective fusion into the combined features consistently reduces the ATE compared to the network without selective fusion. This shows that selective fusion plays an important role in producing accurate results, as each feature modality comes with intrinsic noise, the hallucination network may produce erroneous visual features, and there is time misalignment between the sensors and the ground truth. Finally, putting together all feature modalities with selective fusion yields the strongest performance for both RPE and ATE.

D. Experimental Evaluation

1) Test in a Benign Environment: We test our model across different buildings and compare it with state-of-the-art VIO frameworks to show that our DeepTIO solution is comparable, even though the thermal image representation has fewer channels and useful features. For the traditional approach we employ ROVIO [38], which tightly fuses IMU and visual data with an iterated extended Kalman filter. For deep learning based approaches, we use VINet [23], which fuses IMU and visual features in an intermediate layer. We also compare with Vanilla DeepTIO, a version of DeepTIO without the visual hallucination network. Fig. 6 (a)-(e) depicts the qualitative results in this scenario.

Table III shows the numerical evaluation results in terms of RPE and ATE. ROVIO provides good accuracy in Corridor 2, although it suffers from a large scaling problem and loses tracking in Corridor 1 due to a lack of visual features when the camera faces white, flat walls. In the misaligned sequences, ROVIO completely fails to initialize, since it requires tightly synchronized inputs. VINet also performs well when good alignment is available but suffers from large drift in the presence of time misalignment. This shows that directly concatenating features may lead to sub-optimal performance. Nevertheless, VINet can still produce odometry where ROVIO completely fails, showing that deep learning approaches are more robust


Fig. 5. Comparison between RGB features produced by VINet and fake RGB features produced by DeepTIO's hallucination network in Corridor 2. Top: example of accurate hallucination. Bottom: example of erroneous hallucination. From left to right: RGB image, thermal image, original RGB features, hallucinated RGB features, and fusion mask for the hallucination features generated by DeepTIO.

Fig. 6. Qualitative evaluation on the test sequences: (a) Corridor 1, (b) Corridor 2, (c) Large Office, (d) Library 1, (e) Library 2, (f) Firefighter training facility. (a)-(b) are tests with good sensor alignment. (c)-(e) are tests with bad sensor alignment, where ROVIO fails to initialize and VINet produces inaccurate estimates; our model with selective fusion tackles this issue to some extent. (f) is a test in a real emergency scenario in a smoke-filled environment. We compare DeepTIO (with and without scale adjustment) against a ZUPT-aided INS, as VINS-Mono and even Lidar odometry do not work in this visually-denied scenario.

against sensor alignment issues. However, the best results are achieved by Vanilla DeepTIO and DeepTIO, as they employ selective fusion, which is proven to be robust to time synchronization issues [14]. Note that Vanilla DeepTIO and DeepTIO use a smaller thermal image resolution (464×348) compared to the RGB images used by ROVIO and VINet (848×480).

Vanilla DeepTIO achieves excellent results in Corridor 2, Large Office, and Library 2, but suffers from drift in Corridor 1 and Library 1. DeepTIO, on the other hand, produces better results due to the additional information provided by the hallucination network. Nonetheless, estimating an accurate scale remains a problem in some sequences. As seen in the Large Office sequence, both Vanilla DeepTIO and DeepTIO give inaccurate scale, possibly due to a large variation in walking speeds. This scaling problem is very common in VO and VIO (as seen in the ROVIO test in Corridor 1) and remains an open problem. Overall, DeepTIO yields the best ATE against the competing approaches, with an average ATE of 1.67 m.

2) Test in a Smoke-filled Environment: In the smoke-filled environment, none of the VIO frameworks can work, as the camera only captures black frames. Even Lidar odometry does not work well, as near-visible light is blocked by the smoke [40]. In this case, we cannot provide a quantitative


TABLE III
COMPARISON BETWEEN DEEPTIO AND STATE-OF-THE-ART VISUAL-INERTIAL ODOMETRY IN BENIGN ENVIRONMENTS

              | ROVIO (VIO)               | VINet (VIO)               | Vanilla DeepTIO (ours)    | DeepTIO (ours)
Sequence      | t (m)   r (°)   ATE (m)   | t (m)   r (°)   ATE (m)   | t (m)   r (°)   ATE (m)   | t (m)   r (°)   ATE (m)
Corridor 1    | 0.1928  3.5154  6.2496    | 0.0905  3.8111  1.8825    | 0.1198  3.9007  2.1975    | 0.1065  3.9294  1.9333
Corridor 2    | 0.0398  1.7796  0.3343    | 0.1124  5.1954  1.1036    | 0.1192  5.0461  0.7122    | 0.1074  4.8826  0.5267
Large Office* | failed to initialize      | 0.1202  3.0351  4.4359    | 0.1325  2.9469  3.3088    | 0.1348  3.1266  3.2648
Library 1*    | failed to initialize      | 0.2041  3.9357  5.2647    | 0.2046  3.3424  2.5698    | 0.2027  3.2498  2.0532
Library 2*    | failed to initialize      | 0.1290  3.1720  1.6812    | 0.1307  3.1546  1.5741    | 0.1286  3.0403  0.5735
Mean          | 0.1163  2.6475  3.2920    | 0.1312  3.8299  2.8736    | 0.1414  3.6781  2.0725    | 0.1360  3.6457  1.6703

* In these sequences, there is time misalignment among the sensor modalities.

evaluation against any (pseudo) ground truth. We instead provide a qualitative comparison with a zero-velocity-aided Inertial Navigation System (INS) [41], which is not impacted by visibility. This navigation system utilizes foot-mounted inertial sensors to detect Zero Velocity Updates (ZUPT) and thereby mitigates the fast error growth of stand-alone inertial navigation. Fig. 6 (f) shows the result of the test. It can be seen that DeepTIO yields a similar trajectory shape to the ZUPT-aided INS. This shows that our model, despite being trained in a benign environment, can generalize to a smoke-filled environment, as the thermal camera is not affected by the smoke. However, there is a scaling issue, probably due to a different camera speed (an effect of different walking speeds) or a different temperature profile compared to the ones observed in the training data. If we adjust the scale of DeepTIO, the prediction is very close to the ZUPT-aided INS. This shows that our model is promising for odometry estimation in smoke-filled environments.

3) Memory and Execution Time: The network was trained on an NVIDIA TITAN V GPU and required around 18 hours to train the hallucination network and 20 hours to train the remaining networks. The network contains around 136 million weights, requiring 847 MB of space. Neglecting the time to load and normalize the input, the model can run at 40 fps on a TITAN V and 5 fps on a standard CPU.

Fig. 7. Sensitivity to the sampling rate (Mean Square ATE (m) vs. sampling distance between two frames, for DeepTIO in Corridor 2). Using a sampling distance of 14 frames (the optimum) keeps the prediction rate at around 4.2 fps (for 60 fps thermal data), which is still within the 4-5 fps range used during training.

E. Challenges and Limitations

Despite the fact that DeepTIO can work well in our test scenarios, there are some limitations:

1) Sensitivity to the sampling rate. As we trained DeepTIO with a frame rate of 4-5 fps, the network only performs well at that frame rate. When inferring with a lower or higher fps, the accuracy degrades, as seen in Fig. 7. Training with multiple frame rates at the same time might provide robustness against different sampling rates. This may also alleviate the scaling problem, as the network would be trained with more variation in parallax. However, this would require a (pseudo) ground truth with a constant frame rate, i.e. one not irregularly sampled by a key-frame selection process as in VINS-Mono.

2) Robustness against distributional shift. DNNs are usually vulnerable to distributional (covariate) shift, which occurs when the test data are sampled from a different distribution than the training data. As we train our model in a benign environment and test it in a smoke-filled environment, some covariate shift is to be expected, since the temperature profile will differ. Developing an odometry approach that is robust against this distributional shift may be necessary to enable practical odometry in adverse environments.

VI. CONCLUSIONS

We presented a novel DNN-based method for thermal-inertial odometry (DeepTIO) using a hallucination network. We demonstrated that the hallucination network can provide side information to the thermal network to produce accurate odometry estimates. Future work includes incorporating other sensor modalities for more accurate scale estimation in diverse scenarios and developing robust techniques to cope with distributional (covariate) shifts in the test data.

ACKNOWLEDGMENT

This work is funded by the US National Institute of Standards and Technology (NIST) grant "Pervasive, Accurate, and Reliable Location-Based Services for Emergency Responders" (No. 70NANB17H185). M. R. U. Saputra was supported by the Indonesia Endowment Fund for Education (LPDP).


REFERENCES

[1] D. Nister, O. Naroditsky, and J. Bergen, "Visual Odometry," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004, pp. 652–659.
[2] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast Semi-Direct Monocular Visual Odometry," in IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 15–22.
[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A Versatile and Accurate Monocular SLAM System," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[4] S. Wang, R. Clark, H. Wen, and N. Trigoni, "DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks," in IEEE International Conference on Robotics and Automation (ICRA), 2017.
[5] M. R. U. Saputra, P. P. B. de Gusmao, S. Wang, A. Markham, and N. Trigoni, "Learning monocular visual odometry through geometry-aware curriculum learning," in IEEE International Conference on Robotics and Automation (ICRA), 2019.
[6] Y. Almalioglu, M. R. U. Saputra, P. P. de Gusmao, A. Markham, and N. Trigoni, "GANVO: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks," in IEEE International Conference on Robotics and Automation (ICRA), 2019.
[7] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid, "Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 340–349.
[8] C. Kanellakis and G. Nikolakopoulos, "Evaluation of visual localization systems in underground mining," in 2016 24th Mediterranean Conference on Control and Automation (MED). IEEE, 2016, pp. 539–544.
[9] T. Peynot, J. Underwood, and S. Scheding, "Towards reliable perception for unmanned ground vehicles in challenging conditions," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2009, pp. 1170–1176.
[10] P. B. Quater, F. Grimaccia, S. Leva, M. Mussetta, and M. Aghaei, "Light unmanned aerial vehicles (UAVs) for cooperative inspection of PV plants," IEEE Journal of Photovoltaics, vol. 4, no. 4, pp. 1107–1113, 2014.
[11] Z. Yu, S. Lincheng, Z. Dianle, Z. Daibing, and Y. Chengping, "Camera calibration of thermal-infrared stereo vision system," in 2013 Fourth International Conference on Intelligent Systems Design and Engineering Applications. IEEE, 2013, pp. 197–201.
[12] A. Averbuch, G. Liron, and B. Z. Bobrovsky, "Scene based non-uniformity correction in thermal images using Kalman filter," Image and Vision Computing, vol. 25, no. 6, pp. 833–851, 2007.
[13] M. R. U. Saputra, A. Markham, and N. Trigoni, "Visual SLAM and structure from motion in dynamic environments: A survey," ACM Computing Surveys (CSUR), vol. 51, no. 2, p. 37, 2018.
[14] C. Chen, S. Rosa, Y. Miao, C. X. Lu, W. Wu, A. Markham, and N. Trigoni, "Selective sensor fusion for neural visual-inertial odometry," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10542–10551.
[15] T. Mouats, N. Aouf, L. Chermak, and M. A. Richardson, "Thermal stereo odometry for UAVs," IEEE Sensors Journal, vol. 15, no. 11, pp. 6335–6347, 2015.
[16] S. Khattak, C. Papachristos, and K. Alexis, "Keyframe-based direct thermal-inertial odometry," in IEEE International Conference on Robotics and Automation (ICRA), 2019.
[17] P. V. K. Borges and S. Vidas, "Practical infrared visual odometry," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 8, pp. 2205–2213, 2016.
[18] J. Poujol, C. A. Aguilera, E. Danos, B. X. Vintimilla, R. Toledo, and A. D. Sappa, "A visible-thermal fusion based monocular visual odometry," in Robot 2015: Second Iberian Robotics Conference. Springer, 2016, pp. 517–528.
[19] S. Khattak, C. Papachristos, and K. Alexis, "Visual-thermal landmarks and inertial fusion for navigation in degraded visual environments," in 2019 IEEE Aerospace Conference. IEEE, 2019, pp. 1–9.
[20] C. Papachristos, F. Mascarich, and K. Alexis, "Thermal-inertial localization for autonomous navigation of aerial robots through obscurants," in 2018 International Conference on Unmanned Aircraft Systems (ICUAS). IEEE, 2018, pp. 394–399.
[21] A. Valada, N. Radwan, and W. Burgard, "Deep auxiliary learning for visual localization and odometry," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 6939–6946.
[22] N. Radwan, A. Valada, and W. Burgard, "VLocNet++: Deep multitask learning for semantic visual localization and odometry," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4407–4414, 2018.
[23] R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, "VINet: Visual-inertial odometry as a sequence-to-sequence learning problem," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[24] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised Learning of Depth and Ego-Motion from Video," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[25] R. Li, S. Wang, Z. Long, and D. Gu, "UnDeepVO: Monocular visual odometry through unsupervised deep learning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7286–7291.
[26] N. Yang, R. Wang, J. Stuckler, and D. Cremers, "Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry," in European Conference on Computer Vision (ECCV), 2018, pp. 817–833.
[27] T. Feng and D. Gu, "SGANVO: Unsupervised deep visual odometry and depth estimation with stacked generative adversarial networks," arXiv preprint arXiv:1906.08889, 2019.
[28] J. Hoffman, S. Gupta, and T. Darrell, "Learning with side information through modality hallucination," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 826–834.
[29] C. Choi, S. Kim, and K. Ramani, "Learning hand articulations by hallucinating heat distribution," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3104–3113.
[30] J. Lezama, Q. Qiu, and G. Sapiro, "Not afraid of the dark: NIR-VIS face recognition via cross-spectral hallucination and low-rank embedding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6628–6637.
[31] J. T. Connor, R. D. Martin, and L. E. Atlas, "Recurrent neural networks and robust time series prediction," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 240–254, 1994.
[32] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox, "FlowNet: Learning Optical Flow with Convolutional Networks," in IEEE International Conference on Computer Vision (ICCV), 2016, pp. 2758–2766.
[33] W. Wang, M. R. U. Saputra, P. Zhao, P. Gusmao, B. Yang, C. Chen, A. Markham, and N. Trigoni, "DeepPCO: End-to-end point cloud odometry through deep parallel neural network," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.
[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[35] M. R. U. Saputra, P. P. de Gusmao, Y. Almalioglu, A. Markham, and N. Trigoni, "Distilling knowledge from a deep pose regressor network," arXiv preprint arXiv:1908.00858, 2019.
[36] P. J. Huber, "Robust estimation of a location parameter," in Breakthroughs in Statistics. Springer, 1992, pp. 492–518.
[37] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[38] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 573–580.
[39] C. Chen, X. Lu, A. Markham, and N. Trigoni, "IONet: Learning to cure the curse of drift in inertial odometry," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[40] M. Bijelic, F. Mannan, T. Gruber, W. Ritter, K. Dietmayer, and F. Heide, "Seeing through fog without seeing fog: Deep sensor fusion in the absence of labeled training data," arXiv preprint arXiv:1902.08913, 2019.
[41] J. Wahlström, I. Skog, F. Gustafsson, A. Markham, and N. Trigoni, "Zero-velocity detection – A Bayesian approach to adaptive thresholding," IEEE Sensors Letters, vol. 3, no. 6, 2019.