arXiv:1803.07349v2 [cs.CV] 10 Jul 2018

Progressive Structure from Motion

Alex Locher (1), Michal Havlena (1), Luc Van Gool (1,2)

(1) Computer Vision Laboratory, ETH Zurich, Switzerland; (2) VISICS, KU Leuven, Belgium

Abstract

Structure from Motion, or the sparse 3D reconstruction out of individual photos, is a long studied topic in computer vision. Yet none of the existing reconstruction pipelines fully addresses a progressive scenario where images only become available during the reconstruction process and intermediate results are delivered to the user. Incremental pipelines are capable of growing a 3D model but often get stuck in local minima due to wrong (binding) decisions taken based on incomplete information. Global pipelines, on the other hand, need access to the complete viewgraph and are not capable of delivering intermediate results. In this paper we propose a new reconstruction pipeline working in a progressive manner rather than in a batch processing scheme. The pipeline is able to recover from failed reconstructions in early stages, avoids taking binding decisions, delivers a progressive output, and yet maintains the capabilities of existing pipelines. We demonstrate and evaluate our method on diverse challenging public and dedicated datasets, including those with highly symmetric structures, and compare to the state of the art.

1. Introduction

3D reconstruction from individual photographs is a long studied topic in computer vision [30, 1, 12]. The field of Structure from Motion (SfM) deals with the intrinsic and extrinsic calibration of sets of images and recovers a sparse 3D structure of the scene at the same time. Traditional methods are usually designed as batch processing algorithms, where image acquisition and image processing are separated into two independent steps. This contrasts with current demand, where one would like to be able to convert an object or a scene into a 3D model anytime and anywhere, just by using the mobile phone one is carrying in her pocket. Recent developments in mobile technology and the availability of 3D printers have raised the need for 3D content even more and underline the importance of 3D modeling. In a user-centric scenario, images are taken on the spot and processed by a 3D modeling pipeline on-the-fly [15, 18]. Any feedback which becomes available to the user helps to guide

[figure: (a) incremental SfM points, (b) progressive SfM points, (c) progressive SfM clusters, (d) progressive SfM viewgraph]

Figure 1. As opposed to existing methods, a favorable image order is not crucial for the proposed progressive SfM pipeline. While the baseline method (a) fails to recover the structure of the scene, our method successfully reconstructs the temple (b). Individual clusters are identified in the viewgraph (d) and merged (c) based on a lightweight optimization.

her acquisition and, even more importantly, assures that the 3D model represents the real-world object in the desired quality. In a collaborative scenario, multiple users acquire pictures of the same object and images are gathered on the reconstruction server in the cloud. The 3D model is progressively built and intermediate reconstruction results are shown to the user. Images become available as they are taken and the SfM pipeline never has access to the complete set of information in the dataset. Moreover, the whole reconstruction process might have a starting point, but no predefined end point. Users might always decide to add more images to an existing model.

In this work we therefore propose a progressive SfM



pipeline which avoids taking (potentially fatal) binding decisions and therefore is as independent of the input image order as possible. Moreover, the proposed pipeline reuses already computed intermediate results in later steps and is suited for delivering progressive modeling results back to the user within seconds.

1.1. Related Work

Classical SfM pipelines are typically not suited to be used in such a progressive, multiuser-centric scenario. Global pipelines [21, 28] start by estimating poses of all cameras in the dataset and estimate the structure in a second step. Evidently, this relies on access to the complete dataset, which contradicts the idea of progressive 3D modeling. Sequential SfM pipelines, sometimes termed SLAM [7, 22], are inherently suitable for processing (potentially infinite) streams of images. Nevertheless, the underlying assumption often is that images neighboring in the sequence are spatially close in the scene, which is easily violated when streams from multiple users are combined together. Incremental SfM pipelines [30, 35, 26] build a 3D model by initializing the structure from a small seed and gradually growing it by adding additional cameras. This scheme is closer to the requested progressive scenario but, unfortunately, is strongly dependent on the order in which images are added to the model [20, 27]. View selection algorithms [33] carefully determine the image order, usually by employing the global matching information which is not available in the progressive case. Hierarchical SfM pipelines [10, 8] try to overcome the problem of improper seed selection by starting from several seed locations at the same time and eventually merging the partial 3D models into a single global model in later stages. Sweeney et al. [32] group multiple individual cameras and optimize them jointly as a distributed camera. However, all hierarchical methods require the knowledge of all images in advance. Our proposed method is partially inspired by these approaches but, in order to provide the progressive capability, the hierarchy is not fixed but rather re-defined every time new images are added to the reconstruction process.

Due to the lack of global information, an incremental pipeline connects new images to the existing model based on incomplete information, which often causes corrupted 3D models [36, 25, 14, 11]. Even more importantly, incremental pipelines cannot recover from wrong decisions taken based on missing information and therefore can easily get stuck in only locally optimal reconstructions. The only reliable solution would be re-running an incremental or global pipeline from scratch whenever new images become available, which leads to an impractical algorithm runtime. Heinly et al. [11] detected erroneous reconstructions of scenes with duplicate structure in a post-processing step by evaluating different splits of the model into submodels and potentially merged them in the correct configuration by leveraging conflicting observations. Our method, in contrast, is a full-fledged SfM pipeline which avoids getting trapped in a local minimum in the first place. Faulty configurations are detected and corrected on-the-fly and not in a post-processing step.

Our work is based on a lightweight representation of the complete scene as a viewgraph. Many existing approaches investigated robustification of the global viewgraph by filtering out bad epipolar geometries [33, 6] and enforcing loop constraints [37]. Wilson and Snavely [34] are able to reconstruct scenes with repetitive structures by scoring repetitive features using local clustering. The recent work of Cui et al. [5] takes a similar approach to our pipeline: the Hybrid SfM (HSfM) pipeline estimates all rotations of the viewgraph in a community-based global algorithm. The estimated orientations are leveraged in the second phase of the pipeline by estimating the translations and structure of the scene in an incremental scheme with reduced dimensionality. While sharing the idea of combining global and incremental schemes, HSfM is a pure batch processing algorithm: the global rotation averaging needs access to the complete viewgraph in advance, which is not available in a progressive scheme. In our work we combine a dynamic global viewgraph with a local clustering based on a connectivity score and combine the advantages of incremental and global structure from motion. In order to accommodate the demands of a progressive pipeline, we allow for flexibility in already reconstructed parts of the model and constantly verify the local reconstructions against the globally optimized viewgraph.

1.2. Contributions

We propose a novel progressive SfM pipeline which enables 3D reconstruction in an anytime, anywhere, multiuser-centric scenario. Unlike traditional pipelines, the proposed approach avoids taking binding decisions, does not depend on the order of incoming images, and is able to recover from wrong decisions taken due to the lack of information. Moreover, the computed intermediate results are propagated along the reconstruction process, resulting in an efficient use of computational resources.

2. Progressive Reconstruction

In the following section we give an overview of our progressive SfM pipeline and detail individual components and key aspects later on.

2.1. Overview

Our progressive SfM pipeline takes an ordered sequence of images with their geometrically verified correspondences as the input and delivers a sparse pointcloud and calibrated camera poses as the output (see Figure 2). The resulting sparse configuration is updated with every image added to


Figure 2. An overview of the main steps of the progressive SfM pipeline and its involved components. [figure: an image sequence in any order passes through view pre-processing, viewgraph creation / update (maintaining the global viewgraph), viewgraph clustering (yielding localized clusters), cluster reconstruction (yielding local submodels), and global cluster pose estimation]

the scene. A viewgraph with nodes being images and edges connecting pairs of matched images is gradually built and serves as the global knowledge throughout the whole reconstruction. On every iteration of the algorithm, the viewgraph is clustered based on local connectivity and individual clusters are processed locally. In each of the local clusters, a robust rotation averaging scheme filters out wrong two-view geometries and the 3D structure is estimated using either an incremental or a global SfM pipeline. The cluster configuration between two time steps is tracked and the already estimated parts of the 3D model are passed to the next stage. The global configuration of individual clusters is estimated in the last stage by robustly estimating 7 DoF similarity transforms between them using the remaining inter-cluster constraints. Generally, the local incremental method enables robust and efficient reconstructions, while the viewgraph combined with the robust rotation averaging injects the global knowledge and allows for correction of corrupted 3D models.

2.2. Progressive Viewgraph

The algorithm takes a (randomly) ordered sequence of images I = (I_0, I_1, I_2, ...) as the input. Every incoming image is matched against the most relevant images already present in the scene and geometrically verified pairwise correspondences are obtained. A viewgraph G(V_t, E_t) with images as vertices V_t = {I_i | i <= t} is maintained at every time step t. Two vertices (V_i, V_j) are connected by an undirected edge E_ij iff there exists a minimum number of correspondences (M_ij > η ∧ i < j <= t) between them, where M ∈ R^{t×t} is the matching matrix. Every edge E_ij has an associated relative rotation R_ij and translation direction t_ij, obtained by decomposing either the estimated essential matrix in the calibrated case or the fundamental matrix when the focal length is not known.
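To make the bookkeeping concrete, here is a minimal sketch of the edge update in Python; the data layout and the `min_matches` threshold name are our own illustration, not the paper's implementation:

```python
from collections import defaultdict

def update_viewgraph(edges, matches, new_image, min_matches=60):
    """Connect the incoming image to every earlier image that shares
    more than `min_matches` geometrically verified correspondences.
    `matches` maps an ordered pair (i, j), i < j, to its correspondence
    count; `min_matches` is an illustrative stand-in for the paper's
    threshold on M_ij."""
    for (i, j), count in matches.items():
        if j == new_image and count > min_matches:
            edges[i].add(j)
            edges[j].add(i)
    return edges
```

In the full pipeline, each accepted edge would additionally carry the relative rotation R_ij and translation direction t_ij obtained from the decomposed two-view geometry.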

The order in which images are fed to a reconstruction pipeline plays an important role and every snapshot of the viewgraph only captures the past information of the reconstruction process. This is why filtering of supposedly wrong two-view geometries at this stage can be very dangerous. It might happen that a geometry is inconsistent with other local geometries in the neighborhood [1] at the current time step, but connecting images added in later steps may show that the two-view geometry was actually correct. As a wrong decision in the global viewgraph could lead to a local minimum in the reconstruction, i.e. a corrupted 3D model, we do not conduct any outlier rejection and defer robustification to a later stage.

2.3. Clustering

Motivated by the general observation that densely connected regions of the viewgraph are likely to form a 3D model worth reconstructing, the viewgraph is clustered in a second step. The distance d_ij between two vertices V_i, V_j is based on the weighted Jaccard distance of the adjacent edges, where connections to neighboring vertices are weighted by the number of verified correspondences:

    d_ij = 1 − ( Σ_{n ∈ N} (M_in + M_nj) ) / ( Σ_{n=0}^{t} (M_in + M_nj) ),   N = {k | M_ik · M_kj > 0}   (1)

A set of clusters C_t is hierarchically grown by single-linkage clustering until no single edge with d_ij < η between two clusters exists. Single-linkage clustering tends to generate clusters with a chain-like topology where the two ending nodes might have a distance way larger than the defined threshold. While this might be a disadvantage in other applications, it is actually beneficial in ours, as the local (incremental) reconstruction pipeline performs well for such graph structures.
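A toy Python sketch of Eq. (1) and the single-linkage merge follows; the matrix layout, the simple union-find, and the relaxed threshold in the example are our own (the paper uses η = 0.2 on real match counts):

```python
def jaccard_distance(M, i, j):
    """Weighted Jaccard distance of Eq. (1): vertices sharing many
    strongly matched common neighbors are considered close.
    M is a symmetric matrix (list of lists) of correspondence counts."""
    t = len(M)
    common = sum(M[i][n] + M[n][j] for n in range(t) if M[i][n] * M[n][j] > 0)
    total = sum(M[i][n] + M[n][j] for n in range(t))
    return 1.0 if total == 0 else 1.0 - common / total

def single_linkage(M, edges, eta=0.2):
    """Greedily merge clusters while any inter-cluster edge has d < eta."""
    cluster = {i: i for i in range(len(M))}  # minimal union-find forest
    def find(x):
        while cluster[x] != x:
            x = cluster[x]
        return x
    for i, j in sorted(edges, key=lambda e: jaccard_distance(M, *e)):
        if jaccard_distance(M, i, j) < eta:
            cluster[find(i)] = find(j)
    return {i: find(i) for i in cluster}
```

In the example below, three mutually well-matched views form one cluster while a weakly attached fourth view stays separate.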

Incremental Clustering

Due to the single-linkage hierarchical clustering, a simplified incremental scheme can be used to update an existing cluster topology with a new node. While generally a single extra node can cause a complete change in the topology of the

[1] This can, e.g., be checked by computing the cumulative rotation of loops in which the edge is participating [37].


clustered graph, the changes are limited to clusters which are connected to the new node. As a result, only clusters with an edge connecting to the new node have to be re-clustered, which leads to an efficient and scalable implementation. Note that in the worst case the whole graph still might be updated, but in most cases a new image only connects to a few clusters and therefore most of the existing clusters remain untouched.

2.4. Cluster Tracking and Recycling

Between the transition of two timesteps t_b and t_a = t_b + 1, the topology of clusters can undergo large changes, but mostly it will either stay the same or be extended by the new image. All changes in the clusters have to be propagated to the eventually estimated 3D structure. We therefore keep track of the nodes changing cluster between the two timesteps and add or remove the corresponding images in the local reconstruction. The recycling of intermediate (partial) 3D reconstructions is realized by a merge and split scheme. If a group of interconnected nodes transfers together from one cluster to another, the corresponding cameras in the 3D model can be separated from the rest and merged into the potentially existing structure of the new cluster.
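The node-tracking step can be sketched as a simple difference between two cluster labelings; the helper below is purely illustrative:

```python
def track_clusters(labels_before, labels_after):
    """For each (old cluster, new cluster) pair, collect the nodes that
    moved together between two timesteps, so that their partial 3D
    structure can be split off and recycled. Hypothetical helper;
    labels map node id -> cluster id."""
    moved = {}
    for node, old in labels_before.items():
        new = labels_after.get(node)
        if new is not None and new != old:
            moved.setdefault((old, new), set()).add(node)
    return moved
```

Each group of nodes returned for a pair (old, new) corresponds to a set of cameras whose structure can be merged into the target cluster.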

2.5. Cluster Reconstruction

Once individual clusters C_i have been identified by the hierarchical clustering, the 3D structure of the images and correspondences of every cluster C is estimated. The cluster reconstruction process is only triggered if the cluster's topology has changed, meaning either some nodes were added or some nodes were removed from the cluster. If the topology of the cluster is unchanged between the two time steps t_b and t_a, the reconstruction step is skipped as a whole.

A sub-graph V_C capturing all vertices and edges of the cluster C is extracted in a first step. All following operations are restricted to the scope of the extracted graph V_C.

Robust Rotation Averaging

The unfiltered viewgraph is potentially corrupted by outliers and has to be cleaned up for further usage. As this step is performed in every iteration, it is safe to reject outlier two-view geometries. As a result of the repetitive execution of the filtering stage, it is important to use a computationally efficient filtering scheme. As proposed by Chatterjee et al. [3], we estimate the global rotations of the subgraph with a robust ℓ1 optimization. Global rotations R are initialized by concatenating the relative rotations of a Maximum Spanning Tree (MST) extracted from the viewgraph V_C using the number of verified inliers M_ij as edge weights. The global rotations are then optimized by minimizing the relative rotation errors ρ_ij:

    arg min_R Σ_{(i,j) ∈ E_C} ρ_ij(R_ij, R_j R_i^{-1})   (2)

By using the Lie-algebraic approximation of the relative rotation, the error can be expressed as a difference of the corresponding rotations ω, where ω = Θ n ∈ so(3) denotes a rotation by angle Θ around the unit axis n:

    ω_ij ≈ ω_j − ω_i   (3)

The relative rotations can be encoded in a sparse matrix A where each row has only two nonzero entries, −1 and +1. The robust ℓ1 norm, combined with an edge weighting ρ depending on the number of verified correspondences, allows us to obtain the global rotations of the cluster C in an iterative scheme by optimizing Eqn. 4 in every step. For more details we refer the reader to the original publication [3].

    arg min_{ω_g} ‖ A Δω_g − ρ ω_rel ‖_{ℓ1}   (4)

The obtained global rotations can optionally be refined by the Iteratively Reweighted Least Squares (IRLS) method using a robustified ℓ2 norm. As we are primarily interested in rejecting outliers, the result of the ℓ1 optimization is accurate enough and we omit the refinement step.

Finally, edges with a relative error above ρ_gmax are removed from the local viewgraph.
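As a rough sketch of the linearized model of Eqs. (3)-(4), the fragment below solves the small-angle system with an ordinary least-squares solve and a fixed gauge; the paper instead uses a robust iterative ℓ1 scheme with edge weighting, so this is only an illustration of the structure of A:

```python
import numpy as np

def average_rotations(n, edges, omega_rel):
    """Solve A @ omega = omega_rel in the least-squares sense, where each
    edge (i, j) contributes the linearized constraint
    omega_j - omega_i ~= omega_ij (Eq. 3). Node 0 is pinned to the
    origin to remove the gauge freedom."""
    A = np.zeros((3 * len(edges) + 3, 3 * n))
    b = np.zeros(3 * len(edges) + 3)
    for row, (i, j) in enumerate(edges):
        A[3 * row:3 * row + 3, 3 * i:3 * i + 3] = -np.eye(3)
        A[3 * row:3 * row + 3, 3 * j:3 * j + 3] = np.eye(3)
        b[3 * row:3 * row + 3] = omega_rel[row]
    A[-3:, 0:3] = np.eye(3)  # gauge constraint: omega_0 = 0
    omega, *_ = np.linalg.lstsq(A, b, rcond=None)
    return omega.reshape(n, 3)
```

With exact, outlier-free relative measurements the linear system is consistent and the solve recovers the ground-truth angle-axis vectors up to the fixed gauge; the robust ℓ1 variant replaces the single solve with iteratively reweighted updates of Δω_g.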

Reconstruction

Using the global rotations and the pairwise matches as the input, the structure of the individual clusters is estimated with different methods depending on the current status (Figure 3).

New: If we could not recover any structure from t_b and the cluster contains at least µ_min nodes but less than µ_max images, a new reconstruction is started using an incremental SfM pipeline similar to [30, 35]. If the number of images in the cluster is larger than µ_max, incremental pipelines get inefficient and a global pipeline is more suited. We therefore refine the global rotations and estimate global positions using the 1DSfM method [34]. The structure is then obtained by triangulating consistent feature tracks and the configuration is refined by a bundle adjustment step.

Extend: In most cases an existing local reconstruction can be extended by one or multiple images which are either new or transferred from another cluster. In order to extend an existing 3D reconstruction with a new image, we extend existing feature tracks by the new correspondences and estimate the 3D position of the camera by the P3P algorithm [16]. New tracks are added and potentially conflicting tracks are split up. A local bundle


Figure 3. Depending on the status of an individual cluster, its structure is estimated with different methods. [figure labels: incremental / global reconstruction & verification; cluster tracking; transfer of partially estimated structure; migration of complete clusters; addition of new & loose nodes]

adjustment refines the structure and calibration of the newly added cameras. If the model has grown by more than η_grow percent, a bundle adjustment step over the local reconstruction refines the whole structure.

Transfer: If a large part (more than µ_min nodes) of an already estimated local model is transferred to another cluster, its already estimated structure is kept and transferred to the new cluster. Commonly estimated tracks are fused and a bundle adjustment step refines the structure. Potentially unestimated cameras are then added in an incremental scheme as described before.
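The case distinction above can be summarized as a small dispatch function; the threshold values used here (µ_min = 5, µ_max = 50) are the ones the paper reports, while the function itself is an illustrative simplification:

```python
def reconstruction_mode(has_structure, n_nodes, n_transferred,
                        mu_min=5, mu_max=50):
    """Pick the reconstruction strategy for a cluster, mirroring the
    New / Extend / Transfer cases. An illustrative sketch, not the
    paper's actual control flow."""
    if n_transferred > mu_min:
        return "transfer"         # keep and merge the moved structure
    if has_structure:
        return "extend"           # register new images via P3P
    if mu_min <= n_nodes < mu_max:
        return "new-incremental"  # seed an incremental reconstruction
    if n_nodes >= mu_max:
        return "new-global"       # 1DSfM-style global pipeline
    return "wait"                 # too few images to start
```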

Detection of Bad Configurations

Due to the incremental reconstruction scheme for cluster reconstruction, we can reuse most of the intermediate results from an earlier stage and propagate them to later stages. The incremental scheme is highly dependent on the actual image ordering, and therefore some unfortunate decisions taken in early stages (e.g. wrongly connecting an image due to missing information) cannot be recovered in later stages. Global clustering usually solves this problem for us, as the wrongly connected image or sub-model is likely to be transferred to another cluster at a later stage. But it might also happen that the image actually belongs to the same cluster and yet is wrongly connected. Without any additional countermeasures, we would end up with a corrupted reconstruction in such cases.

By using the relative global rotations of the local viewgraph (which are independent of the image order), we can evaluate an error measure between the rotations R̂ of the current local reconstruction and the global rotations R:

    ρ_l = ‖ { R̂_ij R_ij^{-1} | (i, j) ∈ E_C } ‖_∞   (5)

If the rotation error ρ_l is above a certain limit ρ_lmax, the reconstruction is likely to be erroneous. Hence it is reset and re-estimated in the next phase of the reconstruction.
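A sketch of the consistency check of Eq. (5), assuming relative rotations are given as 3×3 matrices keyed by edge (the names and data layout are our own):

```python
import numpy as np

def rotation_angle(R):
    """Angular magnitude in degrees of a rotation matrix, via its trace."""
    c = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(c))

def is_corrupted(local_rel, global_rel, rho_lmax=10.0):
    """Eq. (5): flag the reconstruction when the worst disagreement
    between locally reconstructed and globally averaged relative
    rotations exceeds rho_lmax (10 degrees in the paper). Inputs are
    dicts mapping edge (i, j) -> 3x3 rotation matrix."""
    worst = max(rotation_angle(local_rel[e] @ global_rel[e].T)
                for e in local_rel)
    return worst > rho_lmax
```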

2.6. Cluster Pose Estimation

The output of the pipeline so far is multiple locally highly consistent but disconnected model parts. In order to merge them into a final representation, we make use of the remaining inter-cluster connections. As this step of merging multiple models is often rather fragile and can easily go wrong, several methods employed rather conservative merging criteria such as multiple cameras of overlap [10] or a high number of common points [35]. Instead, we estimate pairwise similarity transforms T_ab ∈ Sim(3) between clusters C_a and C_b and optimize the global positions of the cluster centers. For robustification, we use a two-stage verification scheme for individual inter-cluster constraints.

In the first stage, a similarity transform based on 3D-to-3D correspondences from every single edge E_ij between clusters C_a and C_b is estimated in a RANSAC scheme [13, 9]. The vertices (camera centers) of C_a are afterwards transformed using the estimated transformation and the camera configuration of the obtained combined model is compared to the individual configurations before. A neighborhood similarity measure s evaluates how many of the cameras p_k in the merged cluster would retain their closest neighbors NN_a(p_k) and NN_b(p_k) for the cameras from clusters C_a and C_b, respectively. p_k denotes the camera center of the node k and the operator T_ij ◦ p_k transforms a point by the similarity transform.

    T_ij   ∀ i ∈ C_a ∧ j ∈ C_b ∧ (i, j) ∈ E_C   (6)

    P_a = { p_k | k ∈ C_a }   (7)
    P_b = { p_k | k ∈ C_b }   (8)
    P_c = P_b ∪ { T_ij ◦ p_k | k ∈ C_a }   (9)

    s = (1 / |P_c|) Σ_{p_k ∈ P_c} s_k   (10)

    s_k = 1  if k ∈ C_a ∧ T_ij ◦ NN_a(p_k) = NN_c(T_ij ◦ p_k)
          1  if k ∈ C_b ∧ NN_b(p_k) = NN_c(p_k)
          0  otherwise   (11)

Edges with a neighborhood similarity measure below λ_s are rejected as outliers for the final constraint estimation. The measure is motivated by the observation that nearby cameras should already have been clustered by the original algorithm and prevents merged models where the cameras are extensively interleaved.
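The neighborhood similarity measure of Eqs. (10)-(11) can be sketched in a simplified 2D form, assuming P_a has already been transformed by T_ij; the helper names are our own:

```python
from math import dist

def nearest(p, points):
    """Closest other point to p (camera centers as coordinate tuples)."""
    return min((q for q in points if q != p), key=lambda q: dist(p, q))

def neighborhood_similarity(Pa_transformed, Pb):
    """Fraction of cameras that keep their nearest neighbor after the
    merge (Eqs. 7-11, simplified). Values near 1 indicate a clean,
    non-interleaved merge; heavily interleaved camera sets score low."""
    Pc = Pa_transformed + Pb
    score = 0
    for group in (Pa_transformed, Pb):
        for p in group:
            if nearest(p, group) == nearest(p, Pc):
                score += 1
    return score / len(Pc)
```

In the example below, two well-separated camera groups score 1.0, while two interleaved groups score 0.0 and would be rejected by the λ_s = 0.9 threshold.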

In the second stage, a final transformation T_ij between the clusters is estimated using all correspondences of edges


that survived the first filtering stage. All relative constraints are gathered and a cluster graph G_C with local reconstructions as nodes and pairwise inter-cluster constraints as edges is created.

    arg min_{P_C} Σ_{(a,b) ∈ G_C} ‖ δ( T_ab P_b P_a^{-1} ) ‖²   (12)

The global poses P_C of the clusters are afterwards obtained by iteratively minimizing the pairwise relative pose constraints T_ab, where δ(·) denotes the robust Huber error function.

While this in the optimal case leads to a single 3D model, the flexibility and efficiency of the global optimization allows us to choose a much less conservative threshold for cluster merging. In addition to its efficiency, the cluster graph representation can easily incorporate additional extrinsic constraints, e.g., the gravity vector for orientation with respect to the ground plane or coarse cluster-wise GPS locations for geo-referencing. Such additional constraints can help to put individual models in relative perspective even if purely vision-based inter-cluster constraints are not (yet) available.

3. Experiments and Results

In the following section, we first give some implementation details of the proposed solution and then describe the conducted experiments.

3.1. Implementation

We implemented the proposed pipeline as C++ software. SIFT [19] features are detected on all images and described with the RootSIFT descriptor [2]. The pipeline is set up for the stream processing dictated by the progressive reconstruction scheme. Every incoming image is indexed by a 1M-word vocabulary trained on the independent Oxford5k [23] dataset and the top 100 nearest neighbors are retrieved using fast tf-idf scoring [29]. Matches are geometrically verified by estimating the fundamental or essential matrix (depending on the availability of the focal length) as well as a homography matrix in a RANSAC scheme using the general USAC framework [24]. The relative rotation and translation direction are obtained by decomposing the fundamental, essential, or homography matrix (depending on the number of inliers). The incremental and global reconstruction of cluster centers is based on the publicly available Theia library [31] and the final cluster graph optimization is realized in the g2o [17] framework. The following parameters were used in all conducted experiments: µ_min = 5, µ_max = 50, λ_s = 0.9, ρ_lmax = ρ_gmax = 10°, η = 0.2, η_grow = 0.15.
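For reference, the parameter set can be collected in a single configuration object; the field names are our own shorthand for the symbols above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProgressiveSfMParams:
    """The paper's reported parameter values, gathered in one place.
    Field names are illustrative shorthand for the Greek symbols."""
    mu_min: int = 5         # minimum cluster size to start a reconstruction
    mu_max: int = 50        # switch from incremental to global SfM
    lambda_s: float = 0.9   # neighborhood-similarity acceptance threshold
    rho_lmax: float = 10.0  # max local-vs-global rotation error [deg]
    rho_gmax: float = 10.0  # rotation-averaging edge rejection [deg]
    eta: float = 0.2        # Jaccard distance threshold for clustering
    eta_grow: float = 0.15  # model growth triggering full bundle adjustment
```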

3.2. Baselines

Throughout our experiments we compare the proposed progressive pipeline against two linear pipelines (VisualSFM [35] and incremental Theia [31]) as well as against the recently published hybrid pipeline HSfM [5]. For an unbiased comparison, all geometrically verified matches were precomputed and imported into the various pipelines. The streaming part of the pipeline was skipped and matches were directly loaded from disk.

Within the experiments we run the pipelines in differentmodes:

batch mode: All images are added to the pipeline at once and the model is reconstructed without intermediate results. This is the classical SfM operation mode.

progressive mode: Images are fed to the pipeline in a certain order and intermediate reconstructions are enforced. In VisualSFM this is realized by the "Add Image" option and in Theia we dictated the order in which views are added to an ongoing reconstruction by a slight code modification.

3.3. Randomized Image Order

One of the key contributions of the proposed pipeline is its ability to recover from wrong connections between image pairs made in previous reconstruction cycles. Wrong connections often occur in symmetric environments, which is why we chose the publicly available TempleOfHeaven dataset [14], with a rotationally symmetric structure, for the first experiment. The dataset consists of 341 images taken at a regular spacing and perfectly demonstrates problems arising from unpleasant orderings. As a reference model, we reconstructed the scene using VisualSFM and restricted the matches to sequential matching in the original order. Figure 5 shows the development of the model as a function of the timestep t. We report the number of clusters before and after global cluster position estimation as well as the number of outlier cameras. A camera is considered to be an outlier if its position error is larger than half the minimum distance between two cameras in the reference model.

As expected, both methods perform well in the "linear" case, where the images are fed to the pipeline in the original ordering. A single cluster is reconstructed and a maximum of 8 cameras are classified as outliers. In a second experiment we randomly shuffled the order of images. After an initial set of 20 images, the baseline merges all clusters into a single one and as a result the number of outlier cameras increases linearly. The final 3D model is highly corrupted and only covers about a third of the temple (Figure 1a).

In contrast, the proposed pipeline creates up to 6 individual but locally consistent clusters. After 84 out of the 341


(a) t = 10 (b) t = 35 (c) t = 36 (d) t = 95

Figure 4. Individual snapshots showing the viewgraph (top) and the sparse structure (bottom) of a reconstruction of the temple scene using a very unfortunate image ordering document the algorithm's recovery ability. Two separate parts of the temple are initially wrongly reconstructed in a single model (a-b), are later successfully separated (c), and the real structure is recovered (d).

[Figure 5: three plots over timestep t (0-350) showing the number of clusters, the number of registered images, and the number of outliers for progSfM / VSfM with linear and random orderings.]

Figure 5. Behavior of the incremental and progressive SfM pipelines on the TempleOfHeaven dataset. In the well-behaving "linear" case (blue), both pipelines reconstruct the temple. If the input order is shuffled, the incremental pipeline gets stuck in a local minimum (green) whereas the proposed pipeline recovers and reconstructs the whole scene (red).

Figure 6. Comparison of the proposed progressive pipeline (left, center) to the incremental SfM pipeline (right) on the Stanford dataset.

images the clusters are correctly localized into a single effective cluster by the cluster positioning (dashed line). Figure 1 shows the resulting reconstruction after 100 images as well as the corresponding clustered viewgraph.

3.4. Very Unfortunate Image Order

In order to demonstrate the recovery capabilities of our proposed pipeline, we conducted an additional experiment on the TempleOfHeaven dataset. The temple has a strong rotational symmetry which repeats every 60 degrees. One of the worst conditions for an algorithm would be if the image ordering had exactly this 60 degree periodicity. To test the algorithm, we created an artificial image sequence by feeding the images alternately in the following order: (I0, I60, I1, I61, ...). Figure 7 shows the number of clusters, images, and outliers at every timestep. Due to the high number of matches between the symmetric parts and the lack of images in between, the clustering algorithm sees enough evidence for putting all images into a single cluster (Figure 4a) and roughly every second image is an outlier. After the addition of the 36th image, the connectivity has sufficiently changed so that the two clusters are recognized and the 3D models are separated.

Due to the highly interleaved configuration, the structure of the individual models cannot be recycled in this case and both clusters are reconstructed from scratch. At this stage no inter-cluster constraints are available yet, which is why the global cluster positioning does not succeed and the models are displayed as independent sub-models (Figure 4c). With the 95th image, enough inter-cluster evidence is available such that the models can be placed into a common coordinate system (Figure 4d), and with the 101st image a single cluster is formed (this time the structure of the individual sub-models can be recovered and a simple merging is needed). The experiment shows that the proposed pipeline can recover from a wrong local minimum of the reconstruction.
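The adversarial ordering (I0, I60, I1, I61, ...) can be generated as follows; a sketch under the assumption that image index 60 corresponds to one symmetry period, with the continuation beyond the interleaved prefix being our own choice rather than specified in the paper:

```python
def symmetric_order(n_images, period):
    """Interleave image i with its symmetric counterpart i + period,
    yielding I0, I_period, I1, I_period+1, ... with no duplicates."""
    order, seen = [], set()
    for i in range(n_images):
        for j in (i, i + period):
            if j < n_images and j not in seen:
                seen.add(j)
                order.append(j)
    return order

# symmetric_order(341, 60) starts with [0, 60, 1, 61, 2, 62, ...]
# and eventually enumerates all 341 images exactly once.
```

Feeding images in this order maximizes the matches between the two symmetric temple parts before any bridging views arrive, which is exactly the failure mode the experiment probes.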


[Figure 7: two plots over timestep t (0-120) showing the number of clusters and the number of registered and outlier images.]

Figure 7. Evolution of the clusters and images in a very unfortunately ordered image sequence. Despite the heavy rotational symmetry of the temple dataset, the pipeline is able to recover from a wrong configuration (timestep 36).

3.5. Real-World Progressive Reconstruction

While the experiments so far demonstrated the capabilities of the progressive pipeline for unfortunate image orderings, its behavior in real-world applications remains to be shown. We therefore collected a series of 4516 images with three different mobile devices (HTC Nexus 9, LG Nexus 5X, and Google Pixel), consisting of 29 sequences on the Main Quad of Stanford University. The sequences were then progressively reconstructed in an interleaved order (simulating multiple users collaboratively acquiring the pictures) as well as in the batch-processing mode. An additional similar experiment was conducted using the publicly available Quad dataset [4] consisting of 6514 images, mainly taken by an iPhone 3G. Images were shuffled randomly and pushed to the reconstruction algorithms.
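Feeding the 29 sequences in an interleaved order amounts to a round-robin merge over the per-device streams; a minimal sketch of this simulation (the actual interleaving scheme used in the experiment is not specified in more detail):

```python
from itertools import chain, zip_longest

def interleave(sequences):
    """Round-robin merge of image sequences: take one image from each
    sequence in turn, simulating multiple users uploading concurrently.
    Shorter sequences simply drop out once exhausted."""
    _SKIP = object()  # sentinel padding for exhausted sequences
    merged = chain.from_iterable(zip_longest(*sequences, fillvalue=_SKIP))
    return [img for img in merged if img is not _SKIP]
```

For example, interleaving `[["a1","a2","a3"], ["b1"], ["c1","c2"]]` yields `["a1", "b1", "c1", "a2", "c2", "a3"]`.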

Table 1 shows the proposed pipeline compared to the incremental Theia [31] pipeline in batch and progressive modes. In addition, we compare against the HSfM [5] pipeline, purely in batch-processing mode. The proposed algorithm is the only pipeline that can deal with the progressive scheme, while the others get stuck in a local minimum and only reconstruct a very small subpart of the scene2. In the case of the progressive pipeline, we also report the number of failure recoveries during the whole reconstruction (the number of resets of a local cluster due to major topological changes). Figure 6 illustrates the final result of the proposed pipeline versus the local minimum of the incremental pipeline. The compute times of the proposed pipeline are comparable to those of existing pipelines despite the continuous flow of intermediate results. This is due to the fact that our pipeline practically never operates on the whole image set, but only on the local clusters, which allows significantly faster execution of the bundle adjustment.

We furthermore ran our method on several of the datasets presented by [11]. We used random ordering of the crowd-

2Experiments were repeated using multiple random orderings.

Figure 8. Result on the Radcliffe dataset with HSfM (left), Theia (incr.) (center), and the proposed method (right).

                                  Stanford  Quad [4]  BigBen  Radcliffe  Alexander   Brandenburg  Arc de
                                                              Camera     Nevsky      Gate         Triomphe
                                                                         Cathedral
total imgs [#]                    4516      6514      402     282        448         175          434

batch:
  HSfM [5]            imgs [#]    2,140     4,832     375     272        430         173          441
                      time [s]    11,490    23,570    1,932   1,654      1,900       389          2,131
  Theia (incr.) [31]  imgs [#]    3,298     5,462     394     278        443         173          410
                      time [s]    34,749    158,853   2,385   1,307      3,687       1,018        3,348

progressive:
  Theia (incr.) [31]  imgs [#]    51        527       394     280        418         173          416
                      time [s]    7,605     215,064   2,026   1,473      2,268       1,141        3,516
  proposed            imgs [#]    3,165     3,894     285     279        427         173          405
                      time [s]    13,276    25,713    552     1,121      2,163       627          377
                      clusters [#]   76     92        5       2          4           2            5
                      recoveries [#] 14     29        1       0          5           1            1

Table 1. Evaluation of our pipeline versus the incremental Theia pipeline and HSfM. Only the proposed pipeline can successfully handle the progressive scenario.

sourced data and pushed them to the progressive pipeline as done in the Quad [4] experiment. Figure 8 shows a sample result on the Radcliffe scene. While both HSfM and the incremental baseline method got confused by the duplicate structure, our pipeline successfully separated the front and rear views of Radcliffe. The two clusters were then merged into the correct configuration by the global cluster pose optimization. In contrast to [11], this is not done in a post-processing step; instead, erroneous connections are detected on-the-fly during the reconstruction. Table 1 reports the corresponding numbers for these datasets, analogous to the previous experiment. Our method was able to separate duplicate structure in all datasets. As the reconstruction backbone of the presented pipeline is based on the well-known incremental [31] and global [34] pipelines, the accuracy of the resulting structure equals the accuracy of these pipelines.

4. Conclusions

We proposed a novel progressive SfM pipeline which addresses a multiuser-centric scenario, where a 3D model is simultaneously reconstructed from multiple image streams handled by a cloud-based reconstruction service. In contrast to existing work, our pipeline does not depend on the image order and does not require any a-priori global knowledge about image connectivity. The progressive pipeline avoids taking any binding decisions and is able to recover from erroneous configurations. A global viewgraph is incrementally built and maintained. The graph is clustered based on the local connectivity of cameras, and individual clusters are reconstructed using either an incremental or a global reconstruction pipeline. In the last step, individual models are merged using a lightweight posegraph optimization on just the cluster centers. We demonstrated the effectiveness and efficiency of our pipeline on multiple datasets and compared it to existing solutions.


References
[1] S. Agarwal et al. Building Rome in a day. Commun. ACM, pages 105-112, 2011.
[2] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, pages 2911-2918, 2012.
[3] A. Chatterjee and V. M. Govindu. Efficient and robust large-scale rotation averaging. In ICCV, pages 521-528, 2013.
[4] D. Crandall, A. Owens, N. Snavely, and D. Huttenlocher. SfM with MRFs: Discrete-continuous optimization for large-scale structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(12):2841-2853, December 2013.
[5] H. Cui, X. Gao, S. Shen, and Z. Hu. HSfM: Hybrid structure-from-motion. In CVPR, pages 1212-1221, 2017.
[6] Z. Cui and P. Tan. Global structure-from-motion by similarity averaging. In ICCV, pages 864-872, 2015.
[7] A. J. Davison. Real-time simultaneous localisation and mapping with a single camera. In ICCV, pages 1403-1410, 2003.
[8] M. Farenzena, A. Fusiello, and R. Gherardi. Structure-and-motion pipeline on a hierarchical cluster tree. In ICCV Workshops, pages 1489-1496, 2009.
[9] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381-395, 1981.
[10] M. Havlena, A. Torii, and T. Pajdla. Efficient structure from motion by graph optimization. In ECCV, pages 1-14, 2010.
[11] J. Heinly, E. Dunn, and J.-M. Frahm. Correcting for duplicate scene structure in sparse 3D reconstruction. Lecture Notes in Computer Science, 8692:780-795, 2014.
[12] J. Heinly, J. L. Schonberger, E. Dunn, and J.-M. Frahm. Reconstructing the world* in six days. In CVPR, 2015.
[13] B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. J. Opt. Soc. Am. A, 4(4):629, 1987.
[14] N. Jiang, P. Tan, and L. F. Cheong. Seeing double without confusion: Structure-from-motion in highly ambiguous scenes. In CVPR, pages 1458-1465, 2012.
[15] Z. Kang and G. Medioni. Progressive 3D model acquisition with a commodity hand-held camera. In WACV, pages 270-277, 2015.
[16] L. Kneip, D. Scaramuzza, and R. Siegwart. A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation. In CVPR, 2011.
[17] R. Kummerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard. g2o: A general framework for graph optimization. In ICRA, pages 3607-3613, May 2011.
[18] A. Locher, M. Perdoch, and L. Van Gool. Progressive prioritized multi-view stereo. In CVPR, pages 3244-3252, 2016.
[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[20] D. Martinec and T. Pajdla. Robust rotation and translation estimation in multiview reconstruction. In CVPR, 2007.
[21] P. Moulon, P. Monasse, and R. Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. In ICCV, pages 3248-3255, 2013.
[22] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147-1163, 2015.
[23] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[24] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm. A universal framework for random sample consensus: USAC. PAMI, pages 1-17, 2013.
[25] R. Roberts, S. N. Sinha, R. Szeliski, and D. Steedly. Structure from motion for scenes with large duplicate structures. In CVPR, pages 3137-3144, 2011.
[26] J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, 2016.
[27] J. L. Schonberger, F. Radenovic, O. Chum, and J.-M. Frahm. From single image query to detailed 3D reconstruction. In CVPR, pages 5126-5134, 2015.
[28] S. N. Sinha, D. Steedly, and R. Szeliski. A multi-stage linear approach to structure from motion. Lecture Notes in Computer Science, 6554:267-281, 2012.
[29] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470-1477, 2003.
[30] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. International Journal of Computer Vision, 80(2):189-210, 2008.
[31] C. Sweeney. Theia multiview geometry library: Tutorial & reference. http://theia-sfm.org.
[32] C. Sweeney, V. Fragoso, T. Hollerer, and M. Turk. Large scale SfM with the distributed camera model. In 3DV, pages 230-238, 2016.
[33] C. Sweeney, T. Sattler, T. Hollerer, M. Turk, and M. Pollefeys. Optimizing the viewing graph for structure-from-motion. In ICCV, pages 801-809, 2015.
[34] K. Wilson and N. Snavely. Robust global translations with 1DSfM. Lecture Notes in Computer Science, 8691:61-75, 2014.
[35] C. Wu. Towards linear-time incremental structure from motion. In 3DV, pages 127-134, 2013.
[36] C. Zach, A. Irschara, and H. Bischof. What can missing correspondences tell us about 3D structure and motion? In CVPR, 2008.
[37] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In CVPR, 2010.