VersaVIS: An Open Versatile Multi-Camera Visual-Inertial Sensor Suite

Florian Tschopp1, Michael Riner1, Marius Fehr1,2, Lukas Bernreiter1, Fadri Furrer1, Tonci Novkovic1, Andreas Pfrunder3, Cesar Cadena1, Roland Siegwart1, and Juan Nieto1

Abstract—Robust and accurate pose estimation is crucial for many applications in mobile robotics. Extending visual Simultaneous Localization and Mapping (SLAM) with other modalities such as an inertial measurement unit (IMU) can boost robustness and accuracy. However, for a tight sensor fusion, accurate time synchronization of the sensors is often crucial. Changing exposure times, internal sensor filtering, multiple clock sources and unpredictable delays from operating system scheduling and data transfer can make sensor synchronization challenging. In this paper, we present VersaVIS, an Open Versatile Multi-Camera Visual-Inertial Sensor Suite aimed to be an efficient research platform for easy deployment, integration and extension for many mobile robotic applications. VersaVIS provides a complete, open-source hardware, firmware and software bundle to perform time synchronization of multiple cameras with an IMU featuring exposure compensation, host clock translation and independent and stereo camera triggering. The sensor suite supports a wide range of cameras and IMUs to match the requirements of the application. The synchronization accuracy of the framework is evaluated in multiple experiments achieving timing accuracy of less than 1 ms. Furthermore, the applicability and versatility of the sensor suite is demonstrated in multiple applications including visual-inertial SLAM, multi-camera applications, multi-modal mapping, reconstruction and object based mapping.

    I. INTRODUCTION

Autonomous mobile robots are well established in controlled environments such as factories where they rely on external infrastructure such as magnetic tape on the floor or beacons. However, in unstructured and changing environments, robots need to be able to plan their way and interact with their environment which, as a first step, requires accurate positioning [1].

In mobile robotic applications, visual sensors can provide solutions for odometry and Simultaneous Localization and Mapping (SLAM), achieving good accuracy and robustness. Using additional sensor modalities such as inertial measurement units (IMUs) [2]–[4] can additionally improve robustness and accuracy for a wide range of applications. For many frameworks, a time offset or delay between modalities can lead to bad results or even render the whole approach unusable. There are frameworks such as VINS-Mono [4] that can estimate a time offset during estimation, however,

1 Authors are members of the Autonomous Systems Lab, ETH Zurich, Switzerland; {firstname.lastname}@mavt.ethz.ch

2 Marius Fehr is with Voliro Airborne Robotics, Switzerland; [email protected]

3 Andreas Pfrunder is with Sevensense Robotics, Switzerland; [email protected]

[Figure 1 — panels: (a) Lidarstick (LED, IMU, LiDAR, thermal camera, Camera 0, Camera 1), (b) RGB-D-I sensor (ToF camera, Camera 0, IMU), (c) Stereo VI Sensor [5] (Camera 0, Camera 1, IMU), (d) VersaVIS triggering board]

Fig. 1: The VersaVIS sensor in different configurations. VersaVIS is able to synchronize multiple sensor modalities such as IMUs and cameras (e.g. monochrome, color, ToF and thermal) but can also be used in conjunction with additional sensors such as LiDARs.

convergence of the estimation can be improved by accurate timestamping, i.e. assigning timestamps to the sensor data that precisely correspond to the measurement time from a common clock across all sensors.

For a reliable and accurate sensor fusion, all sensors need to provide timestamps matchable between sensor modalities. Typically, the readout of camera image and IMU measurement are not on the same device, resulting in a clock offset between the two timestamps which is hard to predict due to universal serial bus (USB) buffer delay, operating system (OS) scheduling, changing exposure times and internal sensor filtering. Therefore, measurement correspondences between modalities are ambiguous and assignment on the host is not trivial.

Currently, there is no reference sensor synchronization framework for data collection, which makes it hard to compare results obtained in various publications that



deal with visual-inertial (VI) SLAM. Commercially and academically established sensors such as the Skybotix VI-Sensor [6], Intel RealSense [7], the SkyAware sensor based on work from Honegger et al. [8] or PIRVS [9] are either unavailable and/or are limited in hardware configuration regarding image sensor, lens, camera baseline and IMU. Furthermore, it is often impossible to extend those frameworks to enable fusion with other modalities such as LiDAR sensors or external illumination.

In this paper, we introduce VersaVIS1, shown in Figure 1, the first open-source framework that is able to accurately synchronize a large range of camera and IMU sensors. The sensor suite is aimed at the research community to enable rapid prototyping of affordable sensor setups in different fields of mobile robotics where visual navigation is important. Special emphasis is put on easy integration for different applications and easy extensibility by building on well-known open-source frameworks such as Arduino [10] and the Robot Operating System (ROS) [11].

The remainder of the paper is organized as follows: In Section II, the sensor suite is described in detail including all of its features. Section III provides evaluations of the synchronization accuracy of the proposed framework. Finally, Section IV showcases the use of the Open Versatile Multi-Camera Visual-Inertial Sensor Suite (VersaVIS) in multiple applications while Section V provides a conclusion with an outlook on future work.

    II. THE VISUAL-INERTIAL SENSOR SUITE

The proposed sensor suite consists of three different parts: (i) the firmware which runs on the microcontroller unit (MCU), (ii) the host driver running on a ROS enabled machine, and (iii) the hardware trigger printed circuit board (PCB). An overview of the framework is provided in Figure 2. Here, the procedure is described for a reference setup consisting of two cameras and a Serial Peripheral Interface (SPI) enabled IMU.

The core component of VersaVIS is the MCU. First of all, it is used to periodically trigger the IMU readout together with setting the timestamps and sending the data to the host. Furthermore, the MCU sends triggering pulses to the cameras to start image exposure. This holds for both independent cameras and stereo cameras (see Section II-A). After successful exposure, the MCU reads the exposure time by listening to the cameras' exposure signal in order to perform the exposure compensation described in Section II-A1 and to set mid-exposure timestamps. The timestamps are sent to the host together with a strictly increasing sequence number. The image data is hereby sent directly from the camera to the host computer to avoid routing massive amounts of data through the MCU. This enables the use of high-resolution cameras even with a low-performance MCU.

Finally, the host computer merges image timestamps from the MCU with the corresponding image messages based on a sequence number (see Section II-B1).

1 VersaVIS is available at: www.github.com/ethz-asl/versavis

[Figure 2 — block diagram: the MCU triggers the IMU (SPI) and the independent and stereo (master/slave) cameras (trigger pulses), reads back their exposure-time signals, receives IMU data (SPI) and exchanges image timestamps and time-sync messages with the host computer; image data (USB) flows directly from the cameras to the host]

Fig. 2: Design overview of VersaVIS. The MCU sends triggers to both IMU and the connected cameras. Image data is directly transferred to the host computer where it is combined with the timestamps from the MCU.

    A. Firmware

The MCU is responsible for triggering the devices at the correct time and for capturing timestamps of the triggered sensor measurements. This is based on the usage of hardware timers and external signal interrupts.

1) Standard cameras: In the scope of the MCU, standard cameras are considered sensors that are triggerable with signal pulses and with non-zero data measurement time, i.e. image exposure time. Furthermore, the sensors need to provide an exposure signal (often called strobe signal) which indicates the exposure state of the sensor. While the trigger pulse and the timestamp are both created on the MCU based on its internal clock, the data is transferred via USB or Ethernet directly to the host computer. To enable correct association of the image data and timestamp on the host computer (see Section II-B1), both the timestamp and the image data are assigned an independent sequence number n_VersaVIS and n_image, respectively. The mapping between these sequence numbers is determined during initialization, as a simultaneous start of the cameras and the trigger board cannot be guaranteed.

• Initialization procedure: After startup of the camera and trigger board, corresponding sequence numbers are found by triggering the camera very slowly without exposure time compensation. Corresponding sequence numbers are then determined by closest timestamps. This holds true if the triggering period is significantly longer than the expected delay and jitter on the host computer. As soon as the sequence number offset o_cam is determined, exposure compensation mode at full frame rate can be used.

• Exposure time compensation: When performing auto-exposure (AE), the camera adapts its exposure time to the current illumination, resulting in a non-constant exposure time. Furgale et al. [12] showed that mid-exposure timestamping is beneficial for image based state estimation, especially when using global shutter cameras. Instead of periodically triggering the camera, a scheme proposed by Nikolic et al. [6] is employed. The idea is to trigger the camera for a periodic mid-exposure timestamp by starting the exposure half the exposure time earlier than its mid-exposure timestamp, as shown in Figure 3 for cam0, cam1 and cam2. The exposure time return signal is used to time the current exposure time and calculate the offset to the mid-exposure timestamp of the next image. Using this approach, corresponding measurements can be obtained even if multiple cameras do not share the same exposure time (e.g. cam0 and cam2 in Figure 3). A minimal scheduling sketch is given after this list.

[Figure 3 — timeline: IMU measurements, periodic triggering, exposure-compensated cam0 and cam2, stereo slave cam1]

Fig. 3: Exposure time compensation for a multi-camera VI sensor setup (adapted from [6]). The blue lines indicate corresponding measurements.

• Master-slave mode: Using two cameras in a stereo setup instead of a monocular camera allows metric scale to be recovered by stereo matching. This can enable certain applications where the IMU excitation is not high enough and therefore biases are not fully observable without this scale input, e.g. for the rail vehicles described in Section IV-B. Furthermore, it can also provide more robustness. To perform accurate and efficient stereo matching, it is highly beneficial if keypoints from the same spot have a similar appearance. This can be achieved by using the exact same exposure time on both cameras. Thereby, one camera serves as the master performing AE while the other adapts its exposure time. This is achieved by routing the exposure signal from cam0 directly to the trigger of cam1 and also using it to determine the exposure time for compensation.
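The exposure compensation scheme above can be summarized as: send the next trigger half of the expected exposure time before the desired periodic mid-exposure timestamp, using the previously measured exposure as a prediction. The following is a minimal sketch of this scheduling under that assumption; the structure and function names are illustrative and not the actual VersaVIS firmware API.

    // Sketch of mid-exposure-aligned triggering. Assumption: the exposure
    // time measured for the previous frame (via the camera's exposure/strobe
    // signal) approximates the exposure time of the next frame.
    #include <cstdint>

    struct CameraTriggerState {
      uint32_t frame_period_us;       // e.g. 100000 us for 10 Hz
      uint32_t last_exposure_us;      // measured from the exposure signal
      uint32_t next_mid_exposure_us;  // desired periodic mid-exposure time
    };

    // Time at which the trigger pulse has to be sent so that the middle of
    // the expected exposure coincides with next_mid_exposure_us.
    uint32_t computeTriggerTime(const CameraTriggerState& s) {
      return s.next_mid_exposure_us - s.last_exposure_us / 2;
    }

    // After each frame, store the measured exposure and advance the schedule
    // by one fixed period so that the mid-exposure timestamps stay periodic
    // even when the exposure time changes.
    void onExposureMeasured(CameraTriggerState* s, uint32_t measured_exposure_us) {
      s->last_exposure_us = measured_exposure_us;
      s->next_mid_exposure_us += s->frame_period_us;
    }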

2) Other triggerable sensors: Some sensors enable measurement triggering but do not require or offer the possibility to do exposure compensation, e.g. thermal cameras or ToF cameras. These typically do not allow for adaptive exposure compensation but rather have a fixed exposure/integration time. They can be treated the same as a standard camera, but with a fixed exposure time.

Sensors that provide immediate measurements (such as external IMUs) do not need exposure time compensation and can just use a standard timer. Note that the timestamps for those sensors are still captured on the MCU and therefore correspond to the other sensor modalities.

3) Other non-triggerable sensors: In robotics, it is often useful to perform sensor fusion with multiple available sensor modalities such as wheel odometers or LiDAR sensors [13]. Most such sensors do not allow triggering or low-level sensor readout but send their data continuously to a host computer. In order to enable precise and accurate sensor fusion, having corresponding timestamps of all sensor modalities is often crucial.

As most such additional sensors produce their timestamps based on the host clock or synchronized to the host clock, VersaVIS performs time translation2 to the host. An on-board extended Kalman filter (EKF) is deployed to estimate the clock skew η and clock offset δ using

$t_{host} = t_{slave} + dt \cdot \eta_k + \delta_k + \Delta$,   (1)

where $t_{host}$ and $t_{slave}$ are the timestamps on the host and slave (in this case the VersaVIS MCU), respectively, $k$ is the update step, $dt = t_{slave} - t^k_{slave}$ is the time since the last EKF update and $\Delta$ refers to the initial clock offset set at the initial connection between host and slave.

Periodically, VersaVIS performs a filter update by requesting the current time of the host

$t^k_{host} = t^a_{host} - \frac{1}{2}\left(t^a_{slave} - t^r_{slave}\right)$,   (2)

where $t^r$ is the time when the request is sent and $t^a$ is the time when the answer from the host is received, assuming that the communication delay between host and slave is symmetric. The filter update is then performed using the standard EKF equations:

$\hat{\delta}_k = \delta_{k-1} + dt \cdot \eta_k, \quad \hat{\eta}_k = \eta_{k-1},$
$\hat{P}_k = \begin{bmatrix} 1 & dt \\ 0 & 1 \end{bmatrix} \cdot P_{k-1} \cdot \begin{bmatrix} 1 & dt \\ 0 & 1 \end{bmatrix}^\top + \begin{bmatrix} Q_\delta & 0 \\ 0 & Q_\eta \end{bmatrix}$   (3)

where $\hat{\cdot}$ denotes the prediction, $P$ is the covariance matrix and $Q_\delta, Q_\eta$ are the noise parameters of the clock offset and clock skew, respectively. The measurement residual $\epsilon$ can be written as

$\epsilon_k = t^k_{host} - \Delta - t^k_{slave} - \hat{\delta}_k$,   (4)

where the Kalman gain K and the measurement update can be derived using the standard EKF equations.
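To make Eqs. (1)–(4) concrete, the following is a minimal sketch of such a two-state (offset/skew) clock-translation EKF. The class layout, noise values and variable names are illustrative assumptions and not the actual VersaVIS firmware; the host time measurement is assumed to already be corrected for the round-trip delay as in Eq. (2).

    // Minimal sketch of the clock-translation EKF of Eqs. (1)-(4); names and
    // noise values are illustrative, not taken from the VersaVIS firmware.
    struct ClockTranslationEkf {
      double delta = 0.0;            // clock offset estimate [s]
      double eta = 0.0;              // clock skew estimate [s/s]
      double P[2][2] = {{1.0, 0.0}, {0.0, 1.0}};  // state covariance
      double Q_delta = 1e-8, Q_eta = 1e-12;       // process noise (assumed)
      double R = 2.5e-5;                          // measurement noise (assumed)
      double Delta = 0.0;            // initial offset set at connection time
      double t_slave_prev = 0.0;     // slave time of the last update

      // One update with a host time t_host_k (already corrected for the
      // round-trip delay as in Eq. (2)) received at slave time t_slave_k.
      void update(double t_host_k, double t_slave_k) {
        const double dt = t_slave_k - t_slave_prev;
        // Prediction, Eq. (3).
        const double delta_pred = delta + dt * eta;
        const double eta_pred = eta;
        const double P00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + Q_delta;
        const double P01 = P[0][1] + dt * P[1][1];
        const double P10 = P[1][0] + dt * P[1][1];
        const double P11 = P[1][1] + Q_eta;
        // Residual, Eq. (4).
        const double residual = t_host_k - Delta - t_slave_k - delta_pred;
        // Standard EKF measurement update with measurement matrix H = [1 0].
        const double S = P00 + R;
        const double K0 = P00 / S, K1 = P10 / S;
        delta = delta_pred + K0 * residual;
        eta = eta_pred + K1 * residual;
        P[0][0] = (1.0 - K0) * P00;  P[0][1] = (1.0 - K0) * P01;
        P[1][0] = P10 - K1 * P00;    P[1][1] = P11 - K1 * P01;
        t_slave_prev = t_slave_k;
      }

      // Translate a slave timestamp to host time, Eq. (1).
      double toHostTime(double t_slave) const {
        return t_slave + (t_slave - t_slave_prev) * eta + delta + Delta;
      }
    };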

    B. Host computer driver

The host computer needs to run a lightweight application in order to make sure the data from VersaVIS can be correctly used in the ROS environment.

1) Synchronizer: The host computer needs to take care of merging the image data arriving directly from the camera sensors with the image timestamps from the VersaVIS triggering board.

During initialization, timestamp and image data are assigned based on the minimal time difference within a threshold for each connected camera separately, as

$o_{cam} = n_{image}(t_{image} \approx t_{VersaVIS}) - n_{VersaVIS}(t_{image} \approx t_{VersaVIS})$,   (5)

2 In this context, clock synchronization refers to modifying the slave clock speed to align to the master clock while clock translation refers to translating the timestamp of the slave clock to the time of the master clock [14].


TABLE I: Hardware characteristics for the VersaVIS triggering board.

MCU: ARM M0+
Hardware interface: SPI, I2C, UART
Host interface: Serial-USB 2.0
Weight: 15.2 g
Size: 62 × 40 × 13.4 mm
Price: < 100 $

where $o_{cam}$ is the sequence number offset. As the images are triggered very slowly (e.g. 1 Hz), the USB buffer and OS scheduling jitter is assumed to be negligible.

As soon as $o_{cam}$ is constant and the time offsets are small, the trigger board is notified about the initialization status of the camera. With all cameras initialized, normal triggering mode (e.g. high frequency) can be activated.

During normal mode, image data (directly from the camera) and image timestamps (from the VersaVIS triggering board) are associated based on the sequence number as

$t_{image} = t_{VersaVIS}(n_{image} + o_{cam})$.   (6)
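A minimal sketch of this association logic is given below; the data structures and names are illustrative and not the actual VersaVIS driver, and the sign convention of the sequence number offset follows Eq. (5).

    // Sketch of the image/timestamp association of Eqs. (5) and (6); the
    // container layout and names are illustrative, not the actual driver.
    #include <cmath>
    #include <cstdint>
    #include <map>

    struct Synchronizer {
      int64_t o_cam = 0;        // sequence number offset as defined in Eq. (5)
      bool initialized = false;
      // Timestamps from the triggering board, keyed by VersaVIS sequence number.
      std::map<uint32_t, double> versavis_stamps;

      // Initialization: the camera is triggered slowly, so the image whose
      // arrival time lies within max_diff_s of a VersaVIS timestamp defines
      // the sequence number offset.
      void initialize(uint32_t n_image, double t_image, double max_diff_s) {
        for (const auto& [n_versavis, t_versavis] : versavis_stamps) {
          if (std::abs(t_image - t_versavis) < max_diff_s) {
            o_cam = static_cast<int64_t>(n_image) - static_cast<int64_t>(n_versavis);
            initialized = true;
            return;
          }
        }
      }

      // Normal mode: look up the MCU timestamp for an incoming image by its
      // sequence number, corrected by the offset from Eq. (5).
      bool lookupStamp(uint32_t n_image, double* t_image_out) const {
        const auto it = versavis_stamps.find(static_cast<uint32_t>(n_image - o_cam));
        if (it == versavis_stamps.end()) return false;
        *t_image_out = it->second;
        return true;
      }
    };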

2) IMU receiver: In addition to the camera data, in a setup where the IMU is triggered and read out by the VersaVIS triggering board, the IMU message should only hold minimal information to minimize bandwidth requirements and therefore needs to be reassembled into a full IMU message on the host computer.
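As a hedged illustration of this reassembly step, the sketch below converts a minimal IMU packet into a full ROS sensor_msgs/Imu message; the packet layout and scale factors are assumptions for illustration and do not reflect the actual VersaVIS message definition.

    // Sketch of reassembling a bandwidth-saving IMU packet into a full ROS
    // sensor_msgs/Imu message on the host; packet layout and scales assumed.
    #include <cstdint>
    #include <ros/ros.h>
    #include <sensor_msgs/Imu.h>

    struct MinimalImuPacket {
      ros::Time stamp;   // timestamp captured on the VersaVIS MCU
      int16_t gyro[3];   // raw gyroscope readings
      int16_t accel[3];  // raw accelerometer readings
    };

    sensor_msgs::Imu reassembleImu(const MinimalImuPacket& p,
                                   double gyro_scale, double accel_scale) {
      sensor_msgs::Imu msg;
      msg.header.stamp = p.stamp;  // MCU timestamp, consistent with the cameras
      msg.header.frame_id = "imu";
      msg.angular_velocity.x = p.gyro[0] * gyro_scale;
      msg.angular_velocity.y = p.gyro[1] * gyro_scale;
      msg.angular_velocity.z = p.gyro[2] * gyro_scale;
      msg.linear_acceleration.x = p.accel[0] * accel_scale;
      msg.linear_acceleration.y = p.accel[1] * accel_scale;
      msg.linear_acceleration.z = p.accel[2] * accel_scale;
      msg.orientation_covariance[0] = -1.0;  // no orientation provided
      return msg;
    }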

    C. VersaVIS triggering board

One main part of VersaVIS is the triggering board, an MCU-based custom PCB shown in Figure 1d that is used to connect all sensors and performs the sensor synchronization. For easy extensibility and integration, the board is compatible with the Arduino environment [10]. In the reference design, the board supports up to three independently triggered cameras with a four-pin connector. Furthermore, SPI, Inter-Integrated Circuit (I2C) or a Universal Asynchronous Receiver Transmitter (UART) can be used to interface with an IMU or other sensors. Table I shows the specifications of the triggering board. The board is connected to the host computer using USB and communicates with ROS using rosserial [15].
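For illustration, a stripped-down Arduino/rosserial sketch of the trigger-and-timestamp idea is given below. The pin number, topic name, message type and the simple polling loop are assumptions made for brevity; the actual firmware relies on hardware timers and interrupts as described in Section II-A.

    // Minimal Arduino/rosserial sketch of periodic triggering with MCU-side
    // timestamping; pin, topic and message choice are illustrative only.
    #include <ros.h>
    #include <std_msgs/Header.h>

    ros::NodeHandle nh;
    std_msgs::Header img_time_msg;
    ros::Publisher img_time_pub("/versavis/cam0/image_time", &img_time_msg);

    const int kTriggerPin = 14;              // camera trigger line (assumed pin)
    const unsigned long kPeriodUs = 100000;  // 10 Hz triggering
    unsigned long next_trigger_us = 0;
    uint32_t seq = 0;

    void setup() {
      pinMode(kTriggerPin, OUTPUT);
      nh.initNode();
      nh.advertise(img_time_pub);
      next_trigger_us = micros() + kPeriodUs;
    }

    void loop() {
      if ((long)(micros() - next_trigger_us) >= 0) {
        // Send a short trigger pulse and record its timestamp.
        digitalWrite(kTriggerPin, HIGH);
        delayMicroseconds(10);
        digitalWrite(kTriggerPin, LOW);
        img_time_msg.seq = seq++;       // matched to the image on the host
        img_time_msg.stamp = nh.now();  // MCU-side timestamp
        img_time_pub.publish(&img_time_msg);
        next_trigger_us += kPeriodUs;
      }
      nh.spinOnce();
    }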

    III. EVALUATIONS

In this section, several evaluations are carried out that show the synchronization accuracy of the different modules of the VersaVIS framework.

    A. Camera-Camera

The first important characteristic of a good multi-camera time synchronization is that corresponding camera images capture the same information. This is especially important when multiple cameras are used for state estimation (see Section IV-B).

For the purpose of evaluating the synchronization accuracy of multi-camera triggering, we captured a stream of images of a light emitting diode (LED) timing board, shown in Figure 4, with two independently triggered but synchronized

Fig. 4: LED timing board with indicated state numbers. The left- and right-most LEDs are used for position reference. The four right counting LEDs are striding while the four counting LEDs on the left side are binary encoded.

and exposure-compensated cameras. The board features eight counting LEDs (the left- and right-most LEDs are always on and used for position reference). The board changes state whenever a trigger is received. The LEDs are organized in two groups. The right group of four indicates one count each, while the left group is binary encoded, resulting in a counter overflow at 64. The board is triggered with $f_l = 1$ kHz aligned with the mid-exposure timestamps of the images. Furthermore, both cameras are operated at a rate of $f_c = 10$ Hz and with a fixed exposure time of $t_c = 1$ ms.

For a successful synchronization of the sensors, images captured with both cameras should show the same bit count c with at most two of the striding LEDs on (since the board changes state at mid-exposure) and also the correct increment between images of $k = \frac{f_l}{f_c} = 100$.

Figure 5 shows results of three consecutive image pairs. All image pairs (left and right) show the same LED count while consecutive images show the correct increment of 100. An image stream containing 400 image pairs was inspected without any case of a wrong increment or a non-matching pair. We can therefore conclude that the time synchronization of the two cameras has an accuracy better than $\frac{1}{2 f_l} = 0.5$ ms, showing an accurate time synchronization.
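The expected counter values can be checked with a few lines of arithmetic; the snippet below is purely illustrative and only reproduces the numbers quoted above and in Figure 5.

    // Worked check of the expected LED counter values: the board counts at
    // f_l = 1 kHz, the cameras run at f_c = 10 Hz, and the binary counter
    // overflows at 64.
    #include <cstdio>

    int main() {
      const int f_l = 1000, f_c = 10;
      const int k = f_l / f_c;     // expected increment per frame: 100
      int count = 16;              // LED count of the first image pair
      for (int i = 0; i < 3; ++i) {
        std::printf("pair %d: LED count %d\n", i, count);
        count = (count + k) % 64;  // 16 -> 52 -> 24, matching Fig. 5
      }
      return 0;
    }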

    B. Camera-IMU

For many visual-inertial odometry (VIO) algorithms such as ROVIO [2] or OKVIS [3], accurate time synchronization of camera image and IMU measurement is crucial for robust and accurate operation.

Typically, time offsets between camera and IMU can be the result of data transfer delays, OS scheduling, clock offsets of measurement devices, changing exposure times or internal filtering of IMU measurements. Thereby, only offsets that are not constant are critical, as constant offsets can be calibrated. Namely, offsets that typically change, such as those caused by OS scheduling, clocks on different measurement devices or an uncompensated changing exposure time, should be avoided.

Using the camera-IMU calibration framework Kalibr [12], a time offset between camera and IMU measurements can be


[Figure 5 — panels: (a) LED count c1 = 16, (b) LED count c2 = c1 + k = 116 mod 64 = 52, (c) LED count c3 = c2 + k = 216 mod 64 = 24]

Fig. 5: Measurements of two independently triggered synchronized cameras with exposure compensation (left cam0, right cam2). Both cameras show strictly the same image with at most two of the striding LEDs on. The increment between consecutive measurements adds up correctly and overflows at a count of 64.

determined by optimizing the extrinsic calibration together with a constant offset between both modalities. To test the consistency of the time offset, multiple datasets were recorded with VersaVIS using different configurations for the IMU filtering and exposure times. The window width B of the deployed Bartlett window finite impulse response (FIR) filter on the IMU [16] was set to B = {0, 2, 4} while the exposure time t_e was set to AE or fixed to t_e = {1, 3, 5} ms. Furthermore, the Skybotix VI-Sensor [6] and the Intel RealSense T265 [7] were tested for reference.

Table II and Figure 6 show important indicators of the calibration quality for multiple datasets. The reprojection error represents how well the lens and distortion model fit the actual lens and how well the movement of the camera agrees with the movement of the IMU after calibration. The reprojection errors of VersaVIS and the VI-Sensor are both low and consistent, meaning that the calibration converged to a consistent extrinsic transformation between camera and IMU and both sensor measurements agree well. Furthermore, the reprojection errors are independent of the filter and exposure time configuration, showing that the exposure compensation is working as expected. On the other hand, the RealSense shows high reprojection errors because of the sensor's fisheye lenses, which turn out to be hard to calibrate even with the available fisheye lens models [17] in Kalibr.

This also becomes visible in the acceleration and gyroscope errors, which are very low as a result of the poorly fitting lens model: Kalibr estimates the body spline to be purely represented by the IMU and neglects the image measurements. For VersaVIS, both errors are highly dependent on the IMU filter, showing a decrease of the error with more aggressive filtering due to minimized noise. However, here too, VersaVIS achieves similar or lower errors compared to the VI-Sensor.

Finally, the time offset between camera and IMU measurements shows that VersaVIS and the VI-Sensor possess a similar accuracy in time synchronization, as the standard deviations are low and the time offsets consistent, indicating a synchronization accuracy below 0.05 ms. However, the time offset is highly dependent on the IMU filter configuration. Therefore, the delay should be compensated either on the driver side or on the estimator side when more aggressive filtering is used (e.g. to reduce the influence of vibrations). Furthermore, the time offset is independent of exposure time and camera, indicating again that the exposure compensation is working as intended. The RealSense shows inconsistent time offset estimates with a bi-modal distribution delimited by half the inter-frame time of ≈ 15 ms, indicating that for some datasets the image measurements might be shifted by one frame.

    C. VersaVIS-Host

As mentioned in Section II-A3, not all sensors are directly compatible with VersaVIS. The better the clock translation between different measurement devices, the better the sensor fusion. Thanks to the bi-directional connection between VersaVIS and the host computer, clock translation requests can be sent from VersaVIS and the response from the host can be analyzed. Such requests are sent every second. Figure 7 shows the evolution of the EKF states introduced in Section II-A3. In this experiment, there is a clock skew between the host computer and VersaVIS, resulting in a constantly decreasing offset after EKF convergence. This highlights the importance of estimating the skew when using time translation.

Figure 8 shows the residual ε and the innovation terms of the clock offset δ and skew η after startup and in convergence. After approximately 60 s, the residual drops below 5 ms and keeps oscillating at ±5 ms due to USB jitter. However, thanks to the EKF, this error is smoothed to a zero-mean innovation of the offset of ±0.2 ms, resulting in a clock translation accuracy of ±0.2 ms. The influence of the skew innovation can be neglected with the short update time of 1 s.

    IV. APPLICATIONS

This section validates the flexibility, robustness and accuracy of our system with several different sensor setups built around VersaVIS and used in different applications.

    A. Visual-inertial SLAM

The main purpose of a VI sensor is to perform odometry estimation and mapping. For that purpose, we collected a dataset walking around in our lab with different sensor setups including VersaVIS equipped with a FLIR BFS-U3-04S2M-CS camera and the Analog Devices ADIS16448 IMU shown in Figure 1a, a Skybotix VI-Sensor [6] and an Intel RealSense T265, all attached to the same rigid body.


TABLE II: Mean values of camera to IMU calibration results for different sensors and sensor configurations for VersaVIS.

Sensor      B  t_e    N   Reproj. error [px]    Gyroscope error [rad/s]   Accel. error [m/s^2]    Time offset [ms]
                          Mean      Std         Mean      Std             Mean      Std           Mean       Std
VersaVIS    0  AE     10  0.100078  0.064657    0.009890  0.006353        0.143162  0.191270      1.552360   0.034126
VersaVIS    2  AE     6   0.098548  0.063781    0.007532  0.005617        0.083576  0.130018      5.260927   0.035812
VersaVIS    4  AE     6   0.101866  0.067196    0.007509  0.005676        0.030261  0.018168      19.951467  0.049712
VersaVIS    4  1 ms   6   0.121552  0.075939    0.006756  0.004681        0.026765  0.014890      20.007137  0.047525
VersaVIS    4  3 ms   6   0.108760  0.062456    0.006483  0.004360        0.024309  0.014367      20.002966  0.035952
VersaVIS    4  5 ms   6   0.114614  0.074536    0.006578  0.004282        0.027468  0.016553      19.962924  0.029872
VI-Sensor   ×  AE     40  0.106839  0.083605    0.008915  0.007425        0.044845  0.042446      1.173100   0.046410
RealSense   ×  AE     40  0.436630  0.355895    0.000000  0.000005        0.000000  0.000002      9.884808   5.977421

[Figure 6 — plots of mean and standard deviation for the reprojection error [px], gyroscope error [rad/s], acceleration error [m/s^2] and time offset [ms]; legend: VersaVIS B=0 AE, B=2 AE, B=4 AE, B=4 t_e=1 ms, B=4 t_e=3 ms, B=4 t_e=5 ms, VI-Sensor AE, RealSense AE, cam0/cam1]

Fig. 6: Distribution of camera to IMU calibration results using the different sensors and different sensor configurations for VersaVIS.

[Figure 7 — EKF clock offset [s] and skew [ms/s] over time [s]]

Fig. 7: EKF filter states after startup and in convergence. The offset is constantly decreasing after convergence due to a clock skew difference between VersaVIS and the host computer.

For reference, we also evaluated the use of the FLIR camera together with the IMU of the VI-Sensor as a non-synchronized3 sensor setup. Figure 9 shows an example of the feature tracking window of ROVIO on the dataset. Both VersaVIS and the VI-Sensor show many well tracked features even during fast motions depicted in the image.

3 Since both sensors do some time translation to the host, software time translation is available.

[Figure 8 — residual [ms], offset innovation [ms] and skew innovation [us/s] over time [s]]

Fig. 8: EKF residual and innovation terms after startup and in convergence. The jitter in the serial-USB interface is directly influencing the residual, resulting in oscillating errors. However, as this jitter has zero mean, the achieved clock synchronization/translation has a much higher accuracy, visible in the innovation term of the clock offset.


Fig. 9: Feature tracking of ROVIO [2] on a dataset recorded in our lab. Inliers of the feature tracking are shown in green while outliers are shown in red.

Due to the high field of view (FOV) of the RealSense cameras and the not perfectly fitting lens model, some of the keypoints do not reproject correctly onto the image plane, resulting in falsely warped keypoints. With a non-synchronized sensor (VersaVIS non-synced), many of the keypoints cannot be correctly tracked as the IMU measurements and the image measurements do not agree well.

Figure 10 shows trajectories obtained with the procedure described above. While all sensors provide useful output depending on the application, VersaVIS shows the lowest drift. The VI-Sensor and RealSense suffer from their specific camera hardware: the VI-Sensor has inferior lenses and camera chips (visible as motion blur) and the RealSense has camera lenses for which no well-fitting lens model is available in the used frameworks. Without synchronization, the trajectory becomes more jittery, resulting in a potentially unstable estimator, which is also visible in the large scale offset, and shows higher drift. Using batch optimization and loop closure [18], the trajectory of VersaVIS can be further optimized (VersaVIS opt). However, the discrepancy between VersaVIS and VersaVIS opt is small, indicating an already good odometry performance.

    B. Stereo Visual-Inertial Odometry on Rail vehicle

The need for public transportation is heavily increasing while the current infrastructure is reaching its limits. Using on-board sensors with reliable and accurate positioning, the efficiency of the infrastructure could be highly increased [5]. However, this requires the fusion of multiple independent positioning modalities, one of which could be visual-aided odometry.

For this purpose, VersaVIS was combined with two global-shutter cameras arranged in a fronto-parallel stereo setup using master-slave triggering (see Section II-A) and a compact, precision six degrees of freedom IMU shown in Figure 1c. In comparison to many commercial sensors, the camera has to provide a high frame rate to be able to get a reasonable number of frames per displacement, even at higher speeds, and feature a high dynamic range to deal with the challenging lighting conditions. Furthermore, due to the constrained motion of the vehicle and the low signal to noise ratio (SNR), the IMU should be of high quality and also temperature calibrated to deal with temperature changes due to direct sunlight.

[Figure 10 — trajectory plots: VersaVIS, VersaVIS non synced, VI-Sensor, RealSense, VersaVIS opt, floor plan]

Fig. 10: Trajectories of different sensor setups using ROVIO [2] on a dataset recorded in our lab. While all of the sensors provide a useful output, VersaVIS shows the lowest amount of drift while the non-synchronized sensor shows a lot of jitter in the estimation.

TABLE III: Sensor specifications deployed for data collection on rail vehicles [5].

Device   Type                   Specification
Camera   Basler acA1920-155uc   Frame rate 20 fps4, resolution 1920 × 1200, dynamic range 73 dB
Lens     Edmund Optics          Focal length 8 mm (≈ 70 deg opening angle), aperture f/5.6
IMU      ADIS16445              Temperature calibrated, 300 Hz, ±250 deg/s, ±49 m/s^2

The sensor specifications are summarized in Table III.

The results obtained with VersaVIS (details can be found in our previous work [5]) show that by using stereo cameras and tightly synchronized IMU measurements we can improve robustness and provide accurate odometry with up to 1.11 % error of the travelled distance in railway scenarios at speeds up to 52.4 km/h.

    C. Multi-modal mapping and reconstruction

VI mapping as described in Section IV-A can provide reliable pose estimates in many different environments. However, for applications that require precise mapping of structures in GPS-denied, visually degraded environments, such as mines and caves, additional sensors are required. The multi-modal sensor setup seen in Figure 1a was specifically developed for mapping research in these challenging conditions. The fact that VersaVIS provides time synchronization to the host computer greatly facilitates the addition of other sensors.

4 The hardware is able to capture up to 155 fps.


[Figure 11 — panels: (a) accumulated point clouds using only VI SLAM poses, (b) dense reconstruction example based on [20], (c) visible details of wheels lying on the ground using dense reconstruction based on [20]]

Fig. 11: 3D reconstruction of an underground mine in Gonzen CH based on LiDAR data with poses provided by VI SLAM [18].

The prerequisite is that these additional sensors are time synchronized with the host computer as well, which in our case, an Ouster OS-1 64-beam LiDAR, is done over the Precision Time Protocol (PTP). The absence of light in these underground environments also required the addition of an artificial lighting source that fulfills very specific requirements to support the camera system for pose estimation and mapping. The main challenge was to achieve the maximum amount of light, equally distributed across the environment (i.e. ambient light), while at the same time being bound by power and cooling limitations. To that end, a pair of high-powered, camera-shutter-synchronized LEDs, similar to the system shown by Nikolic et al. [19], is employed. VersaVIS provides a trigger signal to the LED control board which represents the union of all camera exposure signals, ensuring that all images are fully illuminated while at the same time minimizing the power consumption and heat generation. This allows operating the LEDs at a significantly higher brightness level than during continuous operation.

Figure 11 shows an example of the processed multi-modal sensor data. VI odometry [2] and mapping [18] including local refinements based on LiDAR data and Truncated Signed Distance Function (TSDF)-based surface reconstruction [20] were used to precisely map the 3D structure of parts of an abandoned iron mine in Switzerland.

    D. Object Based Mapping

Robots that operate in changing environments benefit from using maps that are based on physical objects instead of abstract points. Such object based maps are more consistent if individual objects move, as only the movement of an object must be detected instead of the movement of each point on the moved object that is part of the map.

[Figure 12 — panels: (a) scene reconstruction using the RGB-D-I sensor, (b) segmented objects in a parking garage using [21], (c) merging of objects from the database [21]]

Fig. 12: Object segmentation and reconstruction based on [21] and [22] using data from the RGB-D-I sensor and camera poses obtained with maplab [18].

Object based maps are also a better representation for manipulation tasks, as the elements in the map are typically directly the objects of interest in such tasks.

For the object based mapping application, a sensor setup with a depth camera, an RGB camera and an IMU was assembled, see Figure 1b. A Pico Monstar, a ToF camera that provides 352 × 287 resolution depth images at up to 60 Hz, was combined with a FLIR Blackfly 1.6 MP color camera and an ADIS16448 IMU. To obtain accurate pose estimates, the IMU and color camera were used for odometry and localization [18].

With these poses, together with the depth measurements, the approach from [21] was used to reconstruct the scene and extract object instances. An example of such a segmentation map is shown in Figure 12a, and objects that were extracted and inserted into a database in Figures 12b and 12c.

    V. CONCLUSIONS

We presented a hardware synchronization suite for multi-camera VI sensors consisting of the full hardware design, firmware and host driver software. The sensor suite supports multiple beneficial features such as exposure time compensation and host time translation and can be used in both independent and master-slave multi-camera mode.

The time synchronization performance is analyzed separately for camera-camera synchronization, camera-IMU synchronization and VersaVIS-host clock translation. All modules achieve a time synchronization accuracy of < 1 ms, which is expected to be accurate enough for most mobile robotic applications.

The benefits and great versatility of the sensor suite are demonstrated in multiple applications including hand-held VIO, multi-camera VI applications on rail vehicles as well as large scale environment reconstruction and object based mapping.


For the benefit of the community, all hardware and software components are completely open-source with a permissive license and based on easily available hardware and development software. This paper and the accompanying framework can also serve as a freely available reference design for research and industry as it summarizes solution approaches to multiple challenges of developing a synchronized multi-modal sensor setup.

The research community can easily adopt, adapt and extend this sensor setup and rapid-prototype custom sensor setups for a variety of robotic applications. Many experimental features showcase the easy extensibility of the framework:

• Illumination module: The VersaVIS triggering board can be paired with the LEDs shown in Figure 1a, which are triggered corresponding to the longest exposure time of the connected cameras. Thanks to the synchronization, the LEDs can be operated at a higher brightness than would be possible in continuous operation.

• IMU output: The SPI output of the board enables using the same IMU that is used in the VI setup for a low-level controller such as the PixHawk [23] used in micro aerial vehicle (MAV) control.

• Pulse per second (PPS) sync: Some sensors such as specific LiDARs allow synchronization to a PPS signal provided by e.g. a Global Positioning System (GPS) receiver or a real-time clock. Using the external clock input on the triggering board, VersaVIS can be extended to synchronize to the PPS source.

• LiDAR synchronization: The available auxiliary interface on VersaVIS could be used to tightly integrate LiDAR measurements by getting digital pulses from the LiDAR corresponding to taken measurements. The merging procedure would then be similar to the one described in Section II-B1 for cameras with a fixed exposure time.

    ACKNOWLEDGEMENT

This work was partly supported by Siemens Mobility, Germany and by the National Center of Competence in Research (NCCR) Robotics through the Swiss National Science Foundation. The authors would like to thank Gabriel Waibel for his work adapting the firmware for a newer MCU.

    REFERENCES

[1] Roland Siegwart, Illah R. Nourbakhsh, and Davide Scaramuzza, Introduction to Autonomous Mobile Robots, 2nd ed., Ronald C. Arkin, Ed. Cambridge: MIT Press, 2011.

[2] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, "Robust visual inertial odometry using a direct EKF-based approach," in IEEE International Conference on Intelligent Robots and Systems, Seattle, WA, USA, 2015, pp. 298–304.

[3] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.

[4] T. Qin, P. Li, and S. Shen, "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.

[5] F. Tschopp, T. Schneider, A. W. Palmer, N. Nourani-Vatani, C. Cadena Lerma, R. Siegwart, and J. Nieto, "Experimental Comparison of Visual-Aided Odometry Methods for Rail Vehicles," IEEE Robotics and Automation Letters, 2019.

[6] J. Nikolic, J. Rehder, M. Burri, P. Gohl, S. Leutenegger, P. T. Furgale, and R. Siegwart, "A synchronized visual-inertial sensor system with FPGA pre-processing for accurate real-time SLAM," in IEEE International Conference on Robotics and Automation, Hong Kong, China, 2014, pp. 431–437.

[7] Intel Corporation, "Intel RealSense Tracking Camera T265," 2019. [Online]. Available: https://www.intelrealsense.com/tracking-camera-t265/

[8] D. Honegger, T. Sattler, and M. Pollefeys, "Embedded Real-Time Multi-Baseline Stereo," in IEEE International Conference on Robotics and Automation, Singapore, 2017.

[9] Zhe Zhang, Shaoshan Liu, Grace Tsai, Hongbing Hu, Chen-Chi Chu, and Feng Zheng, "PIRVS: An Advanced Visual-Inertial SLAM System with Flexible Sensor Fusion and Hardware Co-Design," in ICRA, Brisbane, 2018.

[10] Arduino Inc., "Arduino Zero," 2019. [Online]. Available: https://store.arduino.cc/arduino-zero

[11] Stanford Artificial Intelligence Laboratory et al., "Robotic Operating System," 2018. [Online]. Available: https://www.ros.org

[12] P. Furgale, J. Rehder, and R. Siegwart, "Unified temporal and spatial calibration for multi-sensor systems," in IEEE International Conference on Intelligent Robots and Systems, Tokyo, Japan, 2013, pp. 1280–1286.

[13] J. Zhang and S. Singh, "Visual-lidar Odometry and Mapping: Low-drift, Robust, and Fast," in 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, Washington, 2015, pp. 2174–2181.

[14] H. Sommer, R. Khanna, I. Gilitschenski, Z. Taylor, R. Siegwart, and J. Nieto, "A low-cost system for high-rate, high-accuracy temporal calibration for LIDARs and cameras," in IEEE International Conference on Intelligent Robots and Systems, 2017.

[15] M. Ferguson, P. Bouchier, and M. Purvis, "rosserial," 2019.

[16] Analog Devices, "ADIS16448BLMZ: Compact, Precision Ten Degrees of Freedom Inertial Sensor," Analog Devices, Tech. Rep., 2019. [Online]. Available: https://www.analog.com/media/en/technical-documentation/data-sheets/ADIS16448.pdf

[17] V. Usenko, N. Demmel, and D. Cremers, "The double sphere camera model," in 2018 International Conference on 3D Vision (3DV), 2018, pp. 552–560.

[18] T. Schneider, M. Dymczyk, M. Fehr, K. Egger, S. Lynen, I. Gilitschenski, and R. Siegwart, "maplab: An Open Framework for Research in Visual-inertial Mapping and Localization," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1418–1425, 2018.

[19] J. Nikolic, M. Burri, J. Rehder, S. Leutenegger, C. Huerzeler, and R. Siegwart, "A UAV system for inspection of industrial facilities," in IEEE Aerospace Conference, Big Sky, MT, 2013.

[20] H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, "Voxblox: Incremental 3D Euclidean Signed Distance Fields for on-board MAV planning," in IEEE International Conference on Intelligent Robots and Systems, 2017, pp. 1366–1373.

[21] F. Furrer, T. Novkovic, M. Fehr, M. Grinvald, C. Cadena, J. Nieto, and R. Siegwart, "Modelify: An Approach to Incrementally Build 3D Object Models for Map Completion," Manuscript submitted for publication, 2019.

[22] F. Furrer, T. Novkovic, M. Fehr, A. Gawel, M. Grinvald, T. Sattler, R. Siegwart, and J. Nieto, "Incremental Object Database: Building 3D Models from Multiple Partial Observations," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018.

[23] L. Meier and Auterion, "Pixhawk 4," 2019. [Online]. Available: https://docs.px4.io/v1.9.0/en/flight_controller/pixhawk4.html

