
4D Crop Monitoring: Spatio-Temporal Reconstruction for Agriculture

Jing Dong, John Gary Burnham, Byron Boots, Glen C. Rains, Frank Dellaert

Abstract— Autonomous crop monitoring at high spatial and temporal resolution is a critical problem in precision agriculture. While Structure from Motion and Multi-View Stereo algorithms can finely reconstruct the 3D structure of a field with low-cost image sensors, these algorithms fail to capture the dynamic nature of continuously growing crops. In this paper we propose a 4D reconstruction approach to crop monitoring, which employs a spatio-temporal model of dynamic scenes that is useful for precision agriculture applications. Additionally, we provide a robust data association algorithm to address the problem of large appearance changes due to scenes being viewed from different angles at different points in time, which is critical to achieving 4D reconstruction. Finally, we collected a high quality dataset with ground truth statistics to evaluate the performance of our method. We demonstrate that our 4D reconstruction approach provides models that are qualitatively correct with respect to visual appearance and quantitatively accurate when measured against the ground truth geometric properties of the monitored crops.

I. INTRODUCTION & RELATED WORK

Automated crop monitoring is a key problem in precision agriculture, used to maximize crop yield while minimizing cost and environmental impact. Traditional crop monitoring techniques are based on measurements by human operators, which is both expensive and labor intensive. Early work on automated crop monitoring mainly relies on satellite imagery [1], which is expensive and lacks sufficient resolution in both space and time. Recently, crop monitoring with Unmanned Aerial Vehicles (UAVs) [2], [3], [4], [5], [6] and ground vehicles [7], [8], [9], [10] has garnered interest from both the agricultural and robotics communities due to the abilities of these systems to gather large quantities of data with high spatial and temporal resolution.

Computer vision is a powerful tool for monitoring crops and estimating yields with low-cost image sensors [7], [10], [11], [12]. However, the majority of this work only utilizes 2D information in individual images, failing to recover 3D geometric information from sequences of images. Structure from Motion (SfM) [13] is a mature discipline within the computer vision community that enables the recovery of 3D geometric information from images. When combined with Multi-View Stereo (MVS) approaches [14], these methods can be used to obtain dense, fine-grained 3D reconstructions. The major barrier to the direct use of these methods for crop monitoring is that traditional SfM and MVS methods only work for static scenes, and thus cannot solve 3D reconstruction problems with dynamically growing crops.

J. Dong, B. Boots and F. Dellaert are with the College of Computing, Georgia Institute of Technology, USA. [email protected], {bboots,frank.dellaert}@cc.gatech.edu. J. G. Burnham and G. C. Rains are with the Department of Entomology, University of Georgia, USA. {burnhamj,grains}@uga.edu

Fig. 1: Reconstructed 4D model of a peanut field by our approach. Each time slice of the 4D model is a dense reconstructed point cloud.

Change detection in 2D images [15], [16], 3D point clouds [17], and volumetric reconstructions [18], [19] has been extensively studied. Taneja et al. [20] generate a 3D reconstruction of changed parts from 2D images, and have applied the approach to city-scale change detection and 3D reconstruction [21], [22]. 3D change detection has also been applied to urban construction site monitoring [23]. However, this type of scene change detection, which focuses on local changes, does not apply to precision agriculture applications where crops are constantly changing over time.

Related previous work includes data association approaches that build visual correspondences between scenes with appearance variations. Data association is a key element of 3D reconstruction: by identifying geometric correspondences between scenes with appearance changes (e.g. due to changing viewpoint), the scene can be reconstructed while minimizing misalignment. Recently, data association has been applied to even more difficult problems. For example, Griffith et al. studied the alignment of natural scene images with significant seasonal changes [24], [25].


Localization and place recognition are applications that require data association approaches that are highly robust to illumination and seasonal appearance changes [26], [27]. A more comprehensive survey of visual place recognition and localization can be found in [28]. In particular, Beall et al. build a spatio-temporal map, which helps to improve the robustness of localization across different seasons [29]. The difficulty with applying these existing methods to field reconstruction is that they are usually designed for autonomous driving applications, in which the camera has few view angle variations. As a result, these approaches cannot solve the large-baseline data association problems that are prevalent in field reconstruction due to the large view angle variations typically encountered by vehicles traversing a field.

The computer vision community has worked on time-lapse reconstruction of dynamic scenes for years, but most existing approaches do not obviously apply to crop monitoring applications. Martin-Brualla et al. [30], [31] synthesize smooth time-lapse videos from images collected on the Internet, but this work is limited to 2D results, without any 3D geometric information. Early work on 4D reconstruction includes [32], [33], which build city-scale 3D reconstructions via temporal inference from historical images. Further work includes [34], which offers better scalability and granularity. A probabilistic volumetric 4D representation of the environment was proposed by [35], and then used in actual 4D reconstructions enabled by 3D change detection [19]. The major issue with most existing approaches is that they assume each geometric entity keeps a nearly-constant appearance over the temporal duration, which is not the case in crop monitoring since crops are changing continuously.

In this paper we address the problem of time-lapse 3D reconstruction with dynamic scenes to model continuously growing crops. We call the 3D reconstruction problem with temporal information 4D reconstruction. The output of 4D reconstruction is a set of 3D entities (points, meshes, etc.), each associated with a particular time or range of times. An example is shown in Fig. 1. A 4D model contains all of the information of a 3D model, e.g. canopy size, height, leaf color, etc., but also contains additional temporal information, e.g. growth rate and leaf color transition.

We also collected a field dataset using a ground vehicle equipped with various sensors, which we will make publicly available. To our knowledge, this will be the first freely available dataset that contains large quantities of spatio-temporal data for robotics applications targeting precision agriculture.

Our paper contains three main contributions:

• We propose an approach for 4D reconstruction for fields with continuously changing scenes, mainly targeting crop monitoring applications.

• We propose a robust data association algorithm for images with highly duplicated structures and significant appearance changes.

• We collect a ground vehicle field dataset with ground truth crop statistics for evaluating 4D reconstruction and crop monitoring algorithms.

Fig. 2: The field 4D model. The field contains multiple rows, and there are multiple time sessions of the field.

II. METHOD

We begin by stating several assumptions related to crop monitoring, before specifying the details of our 4D reconstruction algorithm.

• The scene is static during each data collection session.
• The field may contain multiple rows.

The first assumption is acceptable because we only focus on modelling crops and ignore other dynamic objects like humans, and crop growth is too slow to be noticeable during a single collection session. The second assumption is based on the geometric structure of most fields. The 4D field model reflecting these two assumptions is illustrated in Fig. 2.

Our proposed system has three parts.

• A multi-sensor Simultaneous Localization and Mapping (SLAM) pipeline, used to compute camera poses and field structure for a single row in a single session.
• A data association approach to build visual correspondences between different rows and sessions.
• An optimization-based approach to build the full 4D reconstruction across all rows and all sessions.

To generate the 4D reconstruction of the entire field, we first compute 3D reconstruction results for each row at each time session by running multi-sensor SLAM independently. Next, we use the data association approach to match images from different rows and sessions, building a joint factor graph that connects the individual SLAM results. Finally, we optimize the resultant joint factor graph to generate the full 4D results.

To clarify notation, in the remainder of this article the superscript of each symbol denotes the row index and time session index, unless otherwise mentioned. The superscript $\langle t_i, r_j \rangle$ indicates that the variable is associated with row $r_j$ of the field at time session $t_i$.

A. Multi-Sensor SLAM

The SLAM pipeline used in this work has two parts, illustrated in Fig. 3.

The first part of the SLAM system is a front-end module that processes images to extract visual landmarks. SIFT [36] features are extracted from each image, and SIFT descriptor pairs in nearby image pairs are matched by the approximate nearest neighbor library FLANN [37]. The matches are further filtered by 8-point RANSAC [38] to reject outliers. Finally, a visual landmark is accepted only if more than 6 images have corresponding features matched to the same landmark.
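This front-end can be sketched with standard OpenCV building blocks. The SIFT features, FLANN matching, and 8-point RANSAC follow the description above; the ratio-test value and RANSAC threshold below are illustrative assumptions, not the paper's settings.

```python
# Sketch of the SLAM front-end: SIFT + FLANN matching + RANSAC outlier rejection.
import cv2
import numpy as np

def match_image_pair(img1, img2, ratio=0.7, ransac_thresh=1.0):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Approximate nearest neighbor matching (FLANN) with Lowe's ratio test.
    flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
    knn = flann.knnMatch(des1, des2, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    if len(good) < 8:
        return []

    # RANSAC on the fundamental matrix (cf. the 8-point RANSAC step above).
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    _, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, ransac_thresh, 0.99)
    return [m for m, keep in zip(good, mask.ravel()) if keep]
```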


Fig. 3: Overview of the multi-sensor SLAM system: images pass through SIFT feature extraction/matching and, together with GPS and IMU, through factor graph optimization with iSAM2, producing a 3D map and camera poses.

Fig. 4: Factor graph of multi-sensor SLAM, containing GPS factors, GP prior factors, IMU factors, and vision (landmark) factors.

The second part of the SLAM system is a back-end module for estimating camera states and landmarks using visual landmark information from the front-end and other sensor inputs. Since the goal of the multi-sensor SLAM system is to reconstruct a single row during a single data collection session, the back-end module of the SLAM system estimates a set of $N$ camera states $X^{\langle t_i, r_j \rangle} = \{x^{\langle t_i, r_j \rangle}_0, \ldots, x^{\langle t_i, r_j \rangle}_{N-1}\}$ at row $r_j$ and time $t_i$, given visual landmark measurements from the front-end module and other sensor measurements, including an Inertial Measurement Unit (IMU) and GPS. For each camera state, we estimate $x_j = \{R_j, t_j, v_j, \omega_j, b_j\}$, which includes the camera rotation $R_j$, camera translation $t_j$, translational velocity $v_j$, angular rotation rate $\omega_j$, and IMU sensor bias $b_j$.
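For concreteness, the per-camera state could be held in a small container such as the one below; the names and array shapes (e.g. a combined 6-vector accelerometer/gyroscope bias) are assumptions for illustration, not taken from the paper.

```python
# Illustrative container for one camera state x_j = {R_j, t_j, v_j, w_j, b_j}.
from dataclasses import dataclass
import numpy as np

@dataclass
class CameraState:
    R: np.ndarray       # 3x3 rotation matrix R_j (camera orientation)
    t: np.ndarray       # 3-vector camera translation t_j
    v: np.ndarray       # 3-vector translational velocity v_j
    omega: np.ndarray   # 3-vector angular rotation rate omega_j
    b: np.ndarray       # IMU bias b_j (assumed 6-vector: accelerometer + gyroscope)
```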

The SLAM problem is formulated on a factor graph [39], where the joint probability distribution of the estimated variables $X^{\langle t_i, r_j \rangle}$ given measurements $Z^{\langle t_i, r_j \rangle}$ is factorized and represented as the product

$$p(X^{\langle t_i, r_j \rangle} \mid Z^{\langle t_i, r_j \rangle}) \propto \prod_{k=1}^{K} \phi(X^{\langle t_i, r_j \rangle}_k), \qquad (1)$$

where $K$ is the total number of factors, $X^{\langle t_i, r_j \rangle}_k$ is the set of variables involved in the $k$th factor, and $\phi$ is the factor in the graph, which is proportional to the measurement likelihood $l(X^{\langle t_i, r_j \rangle}_k; z^{\langle t_i, r_j \rangle}_k)$ given the $k$th measurement $z^{\langle t_i, r_j \rangle}_k \in Z^{\langle t_i, r_j \rangle}$. The states can then be computed by Maximum a Posteriori (MAP) estimation

$$\hat{X}^{\langle t_i, r_j \rangle} = \operatorname*{argmax}_{X^{\langle t_i, r_j \rangle}} \; p(X^{\langle t_i, r_j \rangle} \mid Z^{\langle t_i, r_j \rangle}). \qquad (2)$$

The details of the factor graph are shown in Fig. 4. We use smart factors [40] for visual landmarks to reduce memory storage by avoiding the explicit estimation of landmark variables. Outlier rejection is used to reject landmarks with a re-projection error larger than 10 pixels. IMU measurements are incorporated into the factor graph by preintegrated IMU factors [41]. GPS measurements are not synchronized with images (detailed in Sec. III-A), so we use a continuous-time SLAM approach which formulates the SLAM problem as a Gaussian Process (GP) [42]; the GPS measurements are then easily incorporated into the factor graph as interpolated binary factors (magenta factors in Fig. 4), with GP prior factors (red factors in Fig. 4). Details about continuous-time SLAM as a GP can be found in [43], [42], [44].

Fig. 5: Diagram of robust data association; the search for a match is limited to a bounding box around the back-projected point.

We optimize the factor graph with iSAM2 [45]. Once camera states are estimated, the $M$ landmarks $L^{\langle t_i, r_j \rangle} = \{l^{\langle t_i, r_j \rangle}_0, \ldots, l^{\langle t_i, r_j \rangle}_{M-1}\}$ are triangulated from the known camera poses.
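As a small illustration of this triangulation step, the sketch below recovers one landmark from two estimated camera poses with OpenCV; it assumes a world-to-camera pose convention and a shared calibration K, which are illustrative choices rather than details from the paper.

```python
# Triangulate a landmark observed in two images, given SLAM-estimated poses.
import cv2
import numpy as np

def triangulate(K, R1, t1, R2, t2, pt1, pt2):
    """K: 3x3 intrinsics; (R_i, t_i): world-to-camera poses; pt_i: pixel (u, v)."""
    P1 = K @ np.hstack([R1, t1.reshape(3, 1)])   # projection matrix of camera 1
    P2 = K @ np.hstack([R2, t2.reshape(3, 1)])   # projection matrix of camera 2
    X_h = cv2.triangulatePoints(P1, P2,
                                np.float32(pt1).reshape(2, 1),
                                np.float32(pt2).reshape(2, 1))
    return (X_h[:3] / X_h[3]).ravel()            # inhomogeneous 3D point
```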

B. Robust Data Association over Time and Large Baseline

The second key element of our approach is robust data association. Data association is a key technique for obtaining reconstruction results covering more than a single row at a single time; however, the data association problem between different rows or times is difficult, since there are significant appearance changes due to illumination, weather, or viewpoint changes. The problem is even more difficult in crop monitoring due to measurement aliasing [46]: fields contain highly periodic structures with little visual difference between plants (see Fig. 1). As a result, data association between different rows and times is nearly impossible to solve by image-only approaches.

Rather than trying to build an image-only approach, we use the single-row reconstruction results output by SLAM as a starting point for data association across rows and time. The SLAM results provide camera pose and field structure information from all of the sensors (not just images), which helps us to improve the robustness of data association.

Specifically, the data association problem involves finding visual correspondences (matches between SIFT feature points) between two images, I1 and I2, taken by cameras C1 and C2 respectively, as shown in Fig. 5. Cameras C1 and C2 may come from the same or different rows, during the same or different time sessions. Each camera C = {K, R, t} contains the intrinsic calibration K, which is known, and the camera pose {R, t} estimated by SLAM.

We combine two methods to build a data association using prior information from SLAM: back projection bounded search and homography image warping. The two methods are detailed below.


Fig. 6: (a) original I1, (b) warped image I'1, and (c) original I2. The patch center marked with a green cross is feature point f1,i in (a), f'1,i in (b), and the back-projected point p2,i in (c).

Back Projection Bounded Search: The basic idea of back projection bounded search is to reduce the number of possible outliers by limiting the search range while seeking visual correspondences. Assume L1 is the set of all estimated landmarks visible in C1, and each landmark in L1 has corresponding feature points in I1. For each li ∈ L1, the linked feature point f1,i ∈ I1 might have a corresponding matched point at p2,i ∈ I2, which is the back-projected point of li on I2, if C1, C2 and li are accurately estimated and li does not change its appearance. To account for estimation errors, we define a relaxed search area as a bounding box centered at p2,i on I2, as shown in Fig. 5, in which we search for the corresponding feature point for f1,i. This significantly limits the search area used to match f1,i and rejects many possible outliers, compared with searching the whole of I2.
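A minimal sketch of the bounded search is given below, assuming world-to-camera poses and a fixed pixel half-width for the bounding box; both are illustrative assumptions rather than the paper's settings.

```python
# Back projection bounded search: project landmark l_i into I2 and keep only
# the I2 feature points that fall inside a bounding box around the projection.
import numpy as np

def back_project(K2, R2, t2, l_i):
    """Project the 3D landmark l_i (world frame) into image I2 -> p_{2,i}."""
    p_cam = R2 @ l_i + t2              # world -> camera frame of C2 (assumed convention)
    p_img = K2 @ p_cam
    return p_img[:2] / p_img[2]        # pixel coordinates of the back projection

def candidates_in_box(p2i, feature_points, half_width=40):
    """Indices of I2 feature points inside the bounding box centered at p2i."""
    pts = np.asarray(feature_points)   # N x 2 array of SIFT keypoint locations in I2
    keep = np.all(np.abs(pts - p2i) <= half_width, axis=1)
    return np.nonzero(keep)[0]
```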

Homography Image Warping: Although the back projection bounded search rejects the majority of outliers, data association is still difficult when the viewing angle changes: an object's appearance may change significantly under large baselines, which makes matching from I1 to I2 based on SIFT descriptors unreliable, since the descriptors themselves change with appearance. This is the major challenge for data association across images collected in different rows.

To combat this problem, we use a homography based method to eliminate appearance variations of li due to viewpoint changes. We assume that li lies on a local plane πi. If this assumption is satisfied, πi induces a homography H1,2 from I1 to I2 [38, p. 327]

$$H_{1,2} = K_2 \left( R_{1,2} - \frac{t_{1,2}\, n_i^{\top}}{d} \right) K_1^{-1}, \qquad (3)$$

where $\{R_{1,2}, t_{1,2}\}$ is the relative pose from C1 to C2, $n_i$ is the normal vector of πi, and $d$ is the distance from C1 to πi. We use $H_{1,2}$ to warp I1 into I'1, which has the same viewpoint as I2, and thus a similar appearance. We then extract a SIFT descriptor f'1,i on I'1 for the bounded search, rather than using the original f1,i. Two example patches are provided in Fig. 6: although I1 and I2 have significant appearance variation, since they are taken from different rows, the warped I'1 has a very similar appearance to I2, which makes SIFT descriptor matching possible.
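The warping step can be sketched with NumPy and OpenCV as below; the relative-pose computation assumes world-to-camera poses, and the fixed keypoint size used when sampling the SIFT descriptor on the warped image is an arbitrary illustrative value.

```python
# Homography image warping per Eq. 3: warp I1 into the viewpoint of I2 using
# the local plane (n_i, d) around landmark l_i, then sample a SIFT descriptor
# at the back-projected location on the warped image.
import cv2
import numpy as np

def plane_induced_homography(K1, K2, R1, t1, R2, t2, n_i, d):
    # Relative pose from C1 to C2 (world-to-camera convention assumed).
    R12 = R2 @ R1.T
    t12 = t2 - R12 @ t1
    # Eq. 3: H_12 = K2 (R_12 - t_12 n_i^T / d) K1^{-1}.
    return K2 @ (R12 - np.outer(t12, n_i) / d) @ np.linalg.inv(K1)

def warped_descriptor(I1, H12, I2_shape, p2i, patch_size=16.0):
    """Warp I1 to I2's viewpoint and compute a SIFT descriptor at p2i."""
    h2, w2 = I2_shape[:2]
    I1_warped = cv2.warpPerspective(I1, H12, (w2, h2))
    sift = cv2.SIFT_create()
    kp = [cv2.KeyPoint(float(p2i[0]), float(p2i[1]), patch_size)]
    _, des = sift.compute(I1_warped, kp)
    return des[0]
```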

Algorithm 1: Robust Data Association
Input: Images I1, I2; Cameras C1, C2; Landmarks L1
Output: Set of matched feature point pairs P1,2

set match set P1,2 = ∅
foreach li ∈ L1 do
    back project li to C2 → p2,i
    if baseline length between C1 and C2 < threshold then
        f'1,i = f1,i
    else
        calculate homography H1,2 using Eq. 3
        use H1,2 to warp I1 → I'1
        calculate SIFT descriptor at p2,i on I'1 → f'1,i
    end
    set li's match set Pi = ∅
    foreach feature point f2,j ∈ I2 do
        if f2,j is in the bounding box of p2,i then
            insert [f'1,i, f2,j] → Pi
        end
    end
    find the pair in Pi with minimum L2 distance between SIFT descriptors → [f'1,i, f2,k]
    insert [f1,i, f2,k] into P1,2
end
reject outliers in P1,2 with 8-point RANSAC

(a) data association by FLANN and 8-point RANSAC

(b) data association by the proposed approach

Fig. 7: Data association results for an image pair between the 1st and 3rd rows of June 9. Best viewed in digital.

Fig. 8: Data association results for an image pair between the 1st row of June 9 and June 20. Best viewed in digital.


Fig. 9: Overview of the 4D reconstruction pipeline. Single-row SLAM (GPS, IMU, images) produces camera poses and landmarks; data association is performed over rows and over time; the joint factor graph is optimized with LM. The dashed box indicates that the PMVS dense reconstruction step is optional.

The full data association pipeline is summarized in Algo. 1. We estimate the normal vector ni from a local landmark point cloud around li in L1. Homography image warping is only enabled when the baseline between C1 and C2 is longer than a threshold, which is set to 0.5 m in our system. After obtaining all nearest neighbour feature matches by back projection bounded search, a final outlier rejection is performed by 8-point RANSAC.

Experiments validate the performance and robustness of the proposed approach. A cross-row (1st vs. 3rd row) data association result is shown in Fig. 7. The naive FLANN+RANSAC approach can only recover feature matches at the top of the image, where far away objects do not change their appearance, and it fails to register any crops correctly in the field. In contrast, the proposed approach can register feature points in the field despite significant appearance changes. A successful cross-session matching result is also shown in Fig. 8.

C. 4D Reconstruction

The third and last part of our pipeline is a 4D reconstruction module. The complete 4D reconstruction pipeline is illustrated in Fig. 9. We define the goal of 4D optimization as jointly estimating all camera states $X = \bigcup_{t_i \in T, r_j \in R} X^{\langle t_i, r_j \rangle}$ and all landmarks $L = \bigcup_{t_i \in T, r_j \in R} L^{\langle t_i, r_j \rangle}$, where $R$ and $T$ are the sets of rows and sessions, respectively. The measurements $Z = \bigcup_{t_i \in T, r_j \in R} Z^{\langle t_i, r_j \rangle} \cup Z_{cr}$ include all single-row information as well as the data association measurements $Z_{cr}$ that connect rows across space and time.

Similar to the multi-sensor SLAM, we formulate the joint probability of all camera states as

$$p(X \mid Z) \propto \prod_{t_i \in T} \prod_{r_j \in R} \phi(X^{\langle t_i, r_j \rangle}) \prod_{h=1}^{H} \phi(X_{cr,h}), \qquad (4)$$

where $H$ is the size of $Z_{cr}$, and $X_{cr,h}$ is the set of states involved in the $h$th measurement of $Z_{cr}$. This joint probability can be expressed as a factor graph, shown in Fig. 10(a). The first part of the joint probability consists of the factor graphs from all single rows, and the second part is the cross-row and cross-session measurements $Z_{cr}$, which are vision factors generated from cross-row and cross-session data association. We call the added factors shared landmarks, since they are shared by two (or possibly more) rows, and they have two (or more) sessions associated with them.

Fig. 10: (a) Factor graph of two rows with data association; the connected vision factors are shared (matched) landmarks in the two rows. (b) Data association pattern of 4D reconstruction.

Solving the MAP estimation problem results in the estimated camera states

$$\hat{X} = \operatorname*{argmax}_{X} \; p(X \mid Z). \qquad (5)$$

We use the Levenberg-Marquardt algorithm to solve the optimization problem, with initialization from the result of multi-sensor SLAM. Outlier rejection of vision factors [40] is also enabled during optimization, to reject possible false positive feature matches from cross-row and cross-session data association. The landmarks $L$ are estimated by triangulation given the estimated camera poses.
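Since the implementation described in Sec. III uses the GTSAM library, a stripped-down sketch of this joint optimization using GTSAM's Python bindings is shown below. The pose priors standing in for the per-row factor graphs, the generic projection factors used for the shared landmarks, and all numeric values are simplifying assumptions for illustration; the paper's actual graph uses smart factors, preintegrated IMU, GP, and GPS factors.

```python
# Sketch: joint factor graph over all rows/sessions, optimized with
# Levenberg-Marquardt and initialized from the single-row SLAM results.
import gtsam
import numpy as np

def optimize_joint_graph(single_row_poses, shared_landmarks):
    """single_row_poses: list of gtsam.Pose3 from multi-sensor SLAM.
    shared_landmarks: list of (camera_indices, pixel_measurements, initial_point)
    from cross-row / cross-session data association (Z_cr)."""
    graph = gtsam.NonlinearFactorGraph()
    initial = gtsam.Values()

    K = gtsam.Cal3_S2(800.0, 800.0, 0.0, 640.0, 480.0)          # assumed intrinsics
    pose_noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 0.1))
    pix_noise = gtsam.noiseModel.Isotropic.Sigma(2, 2.0)

    # Anchor each camera at its single-row estimate (stand-in for the per-row graphs).
    for i, pose in enumerate(single_row_poses):
        key = gtsam.symbol('x', i)
        graph.add(gtsam.PriorFactorPose3(key, pose, pose_noise))
        initial.insert(key, pose)

    # Shared-landmark vision factors from cross-row / cross-session matches.
    for j, (cam_ids, pixels, point) in enumerate(shared_landmarks):
        lkey = gtsam.symbol('l', j)
        initial.insert(lkey, gtsam.Point3(*point))
        for cam_i, uv in zip(cam_ids, pixels):
            graph.add(gtsam.GenericProjectionFactorCal3_S2(
                gtsam.Point2(*uv), pix_noise, gtsam.symbol('x', cam_i), lkey, K))

    # Levenberg-Marquardt optimization, initialized from the SLAM results.
    return gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
```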

Data association is performed across different rows and times to obtain $Z_{cr}$. Exhaustive search between all row pairs is not necessary, since rows are not visible from each other in the images, and large timespans make matches between images difficult to compute. In our approach we only match rows that are adjacent to each other in either the space domain (nearby rows in the field) or the time domain (nearby dates), as shown in Fig. 10(b).

The point cloud $L$ is relatively sparse, since it comes from a feature-based SLAM pipeline, where only points with distinct appearance are accepted as landmarks (in our system SIFT key points are accepted). An optional solution is to use PMVS [14], which takes the estimated camera states $X$ to reconstruct dense point clouds.

III. EVALUATION

A. Dataset

To evaluate the performance of our approach with real-world data, we collected a field dataset with large spatial and temporal scales. Existing datasets with both large-scale spatial and temporal information include the CMU dataset [47], the MIT dataset [48], and the UMich dataset [49]. However, all of these datasets are collected in urban environments, and are not suitable for precision agriculture applications.

The dataset was collected from a field located in Tifton, GA, USA. The size of the field is about 150m × 120m, and it contains a total of 21 rows of peanut plants. The map of the field is shown in Fig. 11.

We use a ground vehicle (tractor) equipped with multiple sensors, shown in Fig. 11, to collect all of the sensor data.


Fig. 11: Top left: the tractor that collected the dataset. Bottom left: the sensors and computer (RTK-GPS not shown). Right: a sample RTK-GPS trajectory and the sites where manual measurements were taken, overlaid on Google Maps.

Fig. 12: Eight sample images taken at approximately the same location in the field; the dates taken are marked on the images.

The equipped sensors include: (1) a Point Grey monocular global shutter camera, streaming 1280×960 color images at 7.5 Hz, (2) a 9DoF IMU with compass; acceleration and angular rate are streamed at 167 Hz, and magnetic field data is streamed at 110 Hz, (3) a high accuracy RTK-GPS and a low accuracy GPS, both of which stream latitude and longitude data at 5 Hz. No hardware synchronization is used. All data are stored on an SSD by an on-board computer.

We recorded a complete season of peanut growth, which started May 25, 2016 and completed Aug 22, 2016, right before harvest. The data collection had a total of 23 sessions over 89 days, approximately two per week, with a few exceptions due to severe weather. Example images from different dates are shown in Fig. 12. Each session lasted about 40 minutes, and consisted of the tractor driving about 3.8 km in the field.

In addition to sensor data, ground truth crop properties (height and leaf chlorophyll) at multiple sampling sites in the field were measured weekly by a human operator. There were a total of 47 measuring sites, as shown in Fig. 11.

B. Results

We ran the proposed 4D reconstruction approach on the peanut field dataset. We implemented the proposed approach with the GTSAM C++ library (https://bitbucket.org/gtborg/gtsam). We used RTK-GPS data from the dataset as GPS input, and ignored the lower accuracy GPS data. Since the peanut field contains two sub-fields with little overlap (see Fig. 11), the two sub-fields were reconstructed independently and aligned by GPS. Since the tractor runs back and forth in the field, we only use rows in which the tractor was driving south (odd rows), to avoid misalignment with reconstruction results from even rows. Example densely reconstructed 4D results are shown in Fig. 1.

Fig. 13: Cross section of part of the sparse 4D reconstruction results at the 3rd row, showing the ground and crop points for the May 25, June 9, June 23, and July 28 sessions. Only 4 sessions are shown to keep the figure clear. Best viewed in digital.

Although Fig. 1 shows that the 3D reconstruction results for each single session qualitatively appear accurate, to make these results useful to precision agriculture applications, we are interested in evaluating the approach quantitatively. In particular, we wanted to answer the following questions:

• Are these 3D results correctly aligned in space?
• Are these 3D results useful for measuring geometric properties of plants relevant to crop monitoring (height, width, etc.)?

To answer the first question, we visualize the 4D model by showing all 3D point clouds together. We visualize part of the 4D sparse reconstruction result in Fig. 13. Point clouds from different dates are marked in different colors. We can see from the cross section that the ground surface point clouds from different sessions are well aligned, which shows that all of the 3D point clouds from different dates are well registered into a single coordinate frame. This suggests that we are building a true 4D result. We can also see the growth of the peanut plants, as the point cloud shows a 'Matryoshka doll'-like structure: earlier crop point clouds are contained within the point clouds of later sessions.

To answer the second question, we show some preliminary crop analysis results using the reconstructed 4D point clouds, compared with ground truth manual measurements. We set up a simple pipeline to estimate the height of peanut plants from the sparse reconstructed 4D point clouds at multiple sites: we first estimate the local ground plane by RANSAC from the May 25 point cloud (when the peanuts are small and the ground plane is well reconstructed), then separate the peanut point clouds from the ground by color (using RGB values), and finally estimate the distance from the top of the peanut canopy to the ground plane.
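A minimal sketch of this pipeline is given below; the RANSAC parameters, the green-dominance test used to separate plant points from ground, and the percentile used for the canopy top are illustrative assumptions rather than the paper's settings.

```python
# Sketch: estimate plant height as the distance from the canopy top to a
# RANSAC-fitted local ground plane (fit on the May 25 point cloud).
import numpy as np

def fit_plane_ransac(points, iters=500, thresh=0.02, seed=0):
    """RANSAC plane fit; returns (unit normal n, offset c) with n.x + c = 0."""
    rng = np.random.default_rng(seed)
    best, best_inliers = None, -1
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(n) < 1e-9:
            continue                              # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        c = -n @ sample[0]
        inliers = np.sum(np.abs(points @ n + c) < thresh)
        if inliers > best_inliers:
            best, best_inliers = (n, c), inliers
    return best

def plant_height(site_points, site_colors, ground_plane):
    """Canopy height above the local ground plane at one measurement site."""
    n, c = ground_plane
    r, g, b = site_colors[:, 0], site_colors[:, 1], site_colors[:, 2]
    is_plant = g.astype(float) > 1.1 * np.maximum(r, b)   # green-dominant points
    dist = np.abs(site_points[is_plant] @ n + c)          # distance to ground plane
    return np.percentile(dist, 95)                        # robust canopy-top height
```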

Preliminary height estimates at twelve sampling sites are shown in Fig. 14. With the exception of sites 22 and 25, which have slightly biased estimated heights due to poor RANSAC ground plane estimates, the estimated heights at the remaining sites match the ground truth measurements well. This shows that we can compute reasonable height estimates even with a simple method, and demonstrates that the 4D reconstruction results contain correct geometric statistics.

Fig. 14: Estimated peanut heights at 12 sampling sites (rows 1, 2, 4, 6, 15, 17 and 19; sites 20-25, 34-35 and 41-44) in blue, with ground truth manual measurements in black lines.

IV. CONCLUSION AND FUTURE WORK

In this paper we address the 4D reconstruction problem for crop monitoring applications. The outcome of the proposed 4D approach is a set of 3D point clouds with pleasing visual appearance and correct geometric properties. A robust data association algorithm is also developed to address the problems inherent in matching image features from different dates and different viewpoints while performing 4D reconstruction. A side product of this paper is a high quality field dataset for testing crop monitoring applications, which will be released to the public.

Although we show some preliminary results of crop height analysis, this paper mainly solves the problem of reconstructing the 4D point clouds, not point cloud analysis. We leave the use of existing methods for analyzing point clouds [50], [51], or the development of more sophisticated point cloud analysis, as future work.

One possible extension to the proposed approach is further improvement of the data association process. Although our data association method outperforms existing approaches, it still needs assistance from high accuracy RTK-GPS as prior information, and we experienced a few failure cases due to thunderstorms that washed out most features on the ground between sessions.

Although the system is designed for precision agriculture applications, any task satisfying the assumptions mentioned in Sec. II can utilize the proposed system. For example, we could use this system to build maps for autonomous driving, if we are not aiming to reconstruct dynamic objects, since every street can be treated as a single 'row'.

ACKNOWLEDGEMENT

This work is supported by the National Institute of Food and Agriculture, U.S. Department of Agriculture, under award number 2014-67021-22556.

REFERENCES

[1] F. Rembold, C. Atzberger, I. Savin, and O. Rojas, "Using low resolution satellite imagery for yield prediction and yield anomaly detection," Remote Sensing, vol. 5, no. 4, pp. 1704-1733, 2013.

[2] M. Bryson, A. Reid, F. Ramos, and S. Sukkarieh, "Airborne vision-based mapping and classification of large farmland environments," J. of Field Robotics, vol. 27, no. 5, pp. 632-655, 2010.

[3] D. Anthony, S. Elbaum, A. Lorenz, and C. Detweiler, "On crop height estimation with UAVs," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2014.

[4] J. Das, G. Cross, C. Qu, A. Makineni, P. Tokekar, Y. Mulgaonkar, and V. Kumar, "Devices, systems, and methods for automated monitoring enabling precision agriculture," in IEEE Intl. Conf. on Automation Science and Engineering (CASE), 2015.

[5] K. Zainuddin, M. Jaffri, M. Zainal, N. Ghazali, and A. Samad, "Verification test on ability to use low-cost UAV for quantifying tree height," in IEEE Intl. Colloquium on Signal Processing & Its Applications (CSPA), 2016.

[6] S. K. Sarkar, J. Das, R. Ehsani, and V. Kumar, "Towards autonomous phytopathology: Outcomes and challenges of citrus greening disease detection through close-range remote sensing," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2016.

[7] T. Hague, N. Tillett, and H. Wheeler, "Automated crop and weed monitoring in widely spaced cereals," Precision Agriculture, vol. 7, no. 1, pp. 21-32, 2006.

[8] J.-F. Lalonde, N. Vandapel, and M. Hebert, "Automatic three-dimensional point cloud processing for forest inventory," Robotics Institute, p. 334, 2006.

[9] S. Singh, M. Bergerman, J. Cannons, B. Grocholsky, B. Hamner, G. Holguin, L. Hull, V. Jones, G. Kantor, H. Koselka, et al., "Comprehensive automation for specialty crops: Year 1 results and lessons learned," Intelligent Service Robotics, vol. 3, no. 4, pp. 245-262, 2010.

[10] S. Nuske, K. Wilshusen, S. Achar, L. Yoder, S. Narasimhan, and S. Singh, "Automated visual yield estimation in vineyards," J. of Field Robotics, vol. 31, no. 5, pp. 837-860, 2014.

[11] D. Font, M. Tresanchez, D. Martínez, J. Moreno, E. Clotet, and J. Palacín, "Vineyard yield estimation based on the analysis of high resolution images obtained with artificial illumination at night," Sensors, vol. 15, no. 4, pp. 8284-8301, 2015.

[12] I. Sa, Z. Ge, F. Dayoub, B. Upcroft, T. Perez, and C. McCool, "Deepfruits: A fruit detection system using deep neural networks," Sensors, vol. 16, no. 8, p. 1222, 2016.

[13] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski, "Building Rome in a day," in Intl. Conf. on Computer Vision (ICCV), 2009.

[14] Y. Furukawa and J. Ponce, "Accurate, dense, and robust multi-view stereopsis," IEEE Trans. Pattern Anal. Machine Intell., vol. 32, no. 8, pp. 1362-1376, 2010.

[15] K. Sakurada, T. Okatani, and K. Deguchi, "Detecting changes in 3D structure of a scene from multi-view images captured by a vehicle-mounted camera," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.

[16] P. F. Alcantarilla, S. Stent, G. Ros, R. Arroyo, and R. Gherardi, "Street-view change detection with deconvolutional networks," in Robotics: Science and Systems (RSS), 2016.

[17] W. Xiao, B. Vallet, M. Bredif, and N. Paparoditis, "Street environment change detection from mobile laser scanning point clouds," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 107, pp. 38-49, 2015.

[18] T. Pollard and J. L. Mundy, "Change detection in a 3-D world," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.

[19] A. O. Ulusoy and J. L. Mundy, "Image-based 4-D reconstruction using 3-D change detection," in European Conf. on Computer Vision (ECCV), 2014.

[20] A. Taneja, L. Ballan, and M. Pollefeys, "Image based detection of geometric changes in urban environments," in Intl. Conf. on Computer Vision (ICCV), 2011.

[21] A. Taneja, L. Ballan, and M. Pollefeys, "City-scale change detection in cadastral 3D models using images," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.

[22] A. Taneja, L. Ballan, and M. Pollefeys, "Geometric change detection in urban environments using images," IEEE Trans. Pattern Anal. Machine Intell., vol. 37, no. 11, pp. 2193-2206, 2015.

[23] M. Golparvar-Fard, F. Pena-Mora, and S. Savarese, "Monitoring changes of 3D building elements from unordered photo collections," in ICCV Workshop on Computer Vision for Remote Sensing of the Environment, 2011.

[24] S. Griffith, F. Dellaert, and C. Pradalier, "Robot-enabled lakeshore monitoring using visual SLAM and SIFT flow," in RSS Workshop on Multi-View Geometry in Robotics, 2015.

[25] S. Griffith and C. Pradalier, "Reprojection flow for image registration across seasons," in British Machine Vision Conf. (BMVC), 2016.

[26] M. J. Milford and G. F. Wyeth, "SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2012.

[27] T. Naseer, B. Suger, M. Ruhnke, and W. Burgard, "Vision-based Markov localization across large perceptual changes," in European Conf. on Mobile Robots (ECMR), 2015.

[28] S. Lowry, N. Sunderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, "Visual place recognition: A survey," IEEE Trans. Robotics, vol. 32, no. 1, pp. 1-19, 2016.


[29] C. Beall and F. Dellaert, "Appearance-based localization across seasons in a metric map," IROS Workshop on Planning, Perception and Navigation for Intelligent Vehicles (PPNIV), 2014.

[30] R. Martin-Brualla, D. Gallup, and S. M. Seitz, "Time-lapse mining from internet photos," in ACM SIGGRAPH, 2015.

[31] R. Martin-Brualla, D. Gallup, and S. M. Seitz, "3D time-lapse reconstruction from internet photos," in Intl. Conf. on Computer Vision (ICCV), 2015.

[32] G. Schindler, F. Dellaert, and S. Kang, "Inferring temporal order of images from 3D structure," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.

[33] G. Schindler and F. Dellaert, "Probabilistic temporal inference on reconstructed 3D scenes," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2010.

[34] K. Matzen and N. Snavely, "Scene chronology," in European Conf. on Computer Vision (ECCV), 2014.

[35] A. O. Ulusoy, O. Biris, and J. L. Mundy, "Dynamic probabilistic volumetric models," in Intl. Conf. on Computer Vision (ICCV), 2013.

[36] D. Lowe, "Distinctive image features from scale-invariant keypoints," Intl. J. of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[37] M. Muja and D. G. Lowe, "Scalable nearest neighbor algorithms for high dimensional data," IEEE Trans. Pattern Anal. Machine Intell., vol. 36, no. 11, pp. 2227-2240, 2014.

[38] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, second ed., 2004.

[39] F. Dellaert and M. Kaess, "Square Root SAM: Simultaneous localization and mapping via square root information smoothing," Intl. J. of Robotics Research, vol. 25, pp. 1181-1203, Dec 2006.

[40] L. Carlone, Z. Kira, C. Beall, V. Indelman, and F. Dellaert, "Eliminating conditionally independent sets in factor graphs: A unifying perspective based on smart factors," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2014.

[41] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation," Robotics: Science and Systems (RSS), 2015.

[42] S. Anderson and T. D. Barfoot, "Full STEAM ahead: Exactly sparse Gaussian process regression for batch continuous-time trajectory estimation on SE(3)," IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2015.

[43] T. Barfoot, C. H. Tong, and S. Sarkka, "Batch continuous-time trajectory estimation as exactly sparse Gaussian process regression," Robotics: Science and Systems (RSS), 2014.

[44] X. Yan, V. Indelman, and B. Boots, "Incremental sparse GP regression for continuous-time trajectory estimation & mapping," Proc. of the Intl. Symp. of Robotics Research (ISRR), 2015.

[45] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. Leonard, and F. Dellaert, "iSAM2: Incremental smoothing and mapping using the Bayes tree," Intl. J. of Robotics Research, vol. 31, pp. 217-236, Feb 2012.

[46] V. Indelman, E. Nelson, J. Dong, N. Michael, and F. Dellaert, "Incremental distributed inference from arbitrary poses and unknown data association: Using collaborating robots to establish a common reference," IEEE Control Systems, vol. 36, no. 2, pp. 41-74, 2016.

[47] H. Badino, D. Huber, and T. Kanade, "The CMU visual localization data set." http://3dvis.ri.cmu.edu/data-sets/localization, 2011.

[48] M. F. Fallon, H. Johannsson, M. Kaess, D. M. Rosen, E. Muggler, and J. J. Leonard, "Mapping the MIT Stata Center: Large-scale integrated visual and RGB-D SLAM," in RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, 2012.

[49] N. Carlevaris-Bianco, A. K. Ushani, and R. M. Eustice, "University of Michigan north campus long-term vision and lidar dataset," Intl. J. of Robotics Research, vol. 35, no. 9, pp. 1023-1035, 2016.

[50] Y. Li, X. Fan, N. J. Mitra, D. Chamovitz, D. Cohen-Or, and B. Chen, "Analyzing growing plants from 4D point cloud data," ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH Asia, vol. 32, no. 6, p. 157, 2013.

[51] L. Carlone, J. Dong, S. Fenu, G. Rains, and F. Dellaert, "Towards 4D crop analysis in precision agriculture: Estimating plant height and crown radius over time via expectation-maximization," in ICRA Workshop on Robotics in Agriculture, 2015.