


Efficient Multi-View 3D Dense Matching for Large-Scale Aerial Images Using a Divide-and-Conquer Scheme

Junshi XUE, Space Engineering University, Beijing, China ([email protected])
Xiangning CHEN, Space Engineering University, Beijing, China ([email protected])
Hui YI, Space Engineering University, Beijing, China ([email protected])

Abstract—This paper proposes a novel multi-view 3D dense matching method for large-scale aerial images using a divide-and-conquer scheme. Firstly, the original sparse reconstruction result is divided into several sub-clusters based on the relationship of the camera projection locations, and the bounding box of each sub-cluster is obtained. An efficient patch-based stereo matching strategy is then performed, followed by multi-photo geometrically constrained (MPGC) matching optimization, to generate a depth map for each image in the sub-clusters, with the patch expansion range limited according to the bounding box of the sub-cluster. Redundant points are removed by enhanced depth consistency across different views, which contributes to high-accuracy depth map fusion. Lastly, the dense points of each sub-cluster can easily be grouped together thanks to the determined boundaries. The method can easily be parallelized at the image level, and is highly suitable for the large-scale reconstruction of aerial images. The experimental results show that the proposed method has advantages over state-of-the-art methods in terms of reconstruction accuracy and efficiency.

Keywords—machine vision; patch-based stereo matching; dense reconstruction; depth map fusion; limited patch expansion range

I. INTRODUCTION
The 3D reconstruction of multi-view images is a vitally important research field in computer vision and photogrammetry. With the popularization of unmanned aerial vehicles (UAVs) and digital cameras, it is becoming increasingly convenient to obtain high-quality aerial images, and the 3D reconstruction of high-resolution aerial images is becoming increasingly feasible. Structure from motion (SFM) and multi-view stereo (MVS) are the two main processes involved in 3D reconstruction [1]. SFM recovers camera positions and sparse 3D structures from multiple views, through a pipeline composed of feature extraction and matching, camera registration, 3D triangulation, and bundle adjustment. Based on the camera parameters and sparse point clouds obtained from SFM, MVS can generate dense point clouds from the calibrated images. After decades of effort, SFM algorithms can now achieve large-scale sparse reconstruction for tens of thousands, or even millions, of high-resolution images [2], although MVS algorithms for large-scale high-resolution images require further study, due to their large computational burden and memory consumption.

Existing MVS methods can be divided into two main categories: volume-based methods and point-cloud-based methods [3]. Volume-based methods mainly optimize a surface through a photometric consistency aggregation function within a certain volume to achieve multi-view dense reconstruction. This type of method is more suitable for the reconstruction of small objects with initial conditions such as a bounding box, silhouette contour, or visual hull [4]. However, due to their low computational efficiency and large memory consumption, it is difficult to perform 3D reconstruction of large-scale urban scenes with volume-based methods. There are two different types of point-cloud-based reconstruction methods: feature-point growing and depth map fusion. Methods based on feature-point growing first perform quasi-dense reconstruction using the extracted feature points and then expand to nearby pixels, while outliers are filtered out by photometric consistency and geometric constraints to obtain a high-accuracy dense reconstruction. The patch-based multi-view stereo (PMVS) algorithm is the state-of-the-art dense reconstruction method based on feature-point growing [5]. PMVS first extracts Harris and DoG features, and then iteratively performs patch expansion and outlier filtering three times to reconstruct a dense point cloud. The reconstruction accuracy of this method is high, while its reconstruction completeness is poor, especially in textureless regions. Depth-map-fusion-based methods minimize the matching cost between pixels in the reference image and the searching images to generate a depth map for each image, and then generate a dense point cloud by depth map fusion. Multi-View Environment (MVE) designs an effective view-selection strategy to enhance the scalability of the algorithm [6], as well as an MPGC-based region-growing process to generate high-accuracy depth maps. COLMAP proposes a joint estimation method of depth and normal information, which estimates depth values and patch normals by iterative propagation across multiple views [7]. COLMAP can be considered the state-of-the-art depth-map-fusion-based MVS method due to its high accuracy and completeness. Depth-map-fusion-based methods generate highly detailed geometry along with large amounts of redundancy, which ensures completeness but also makes the computation of depth maps a heavy burden.

It is obviously impossible, and unnecessary, to perform dense reconstruction for all of the images registered by SFM at once, due to limited computational resources and the locality of camera clusters. Given thousands of high-resolution images, each image only overlaps spatially with a few others. The scalability of MVS methods can be improved by taking advantage of this locality of image overlaps [8,9]. At present, the overlapping relationship between cameras in the sparse reconstruction result is mainly used to perform clustering, with which the whole scene can be divided into several sub-clusters. These sub-clusters can then be individually reconstructed by a state-of-the-art MVS method. The CMVS algorithm is an early attempt at this scheme, and mainly performs camera clustering based on the number of matching points between images [9].


CMVS can cooperate with the PMVS algorithm to achieve dense reconstruction for large scenes.

The dense reconstruction pipeline produced in this study is as follows. Firstly, the whole SFM result is divided into sub-clusters, whose bounding boxes are also obtained, by the proposed clustering method. The sparse points are used to estimate the best-fitting projection plane; subsequently, by projecting the cameras onto this plane, the overlapping relationship between cameras is obtained by analyzing the projection regions, and camera clustering is performed to divide the whole scene. Each sub-cluster is then reconstructed by an effective patch-based depth map fusion method. Raw depth maps are generated by expanding the initial patches, and are further refined by the MPGC algorithm. Additionally, redundant points are filtered out by enhanced depth consistency across multiple views before depth map fusion is performed. Finally, all sub-regions are merged together to form the final result for the whole scene. The proposed MVS method is modified during the patch expansion process: by judging whether a newly expanded patch center is located inside or outside the bounding box, the reconstruction efficiency is improved and the boundaries of the generated sub-regions are kept neat.

II. CAMERA CLUSTERING
In order to carry out 3D reconstruction of large-scale aerial images under limited computing resources, the input sparse reconstruction result must be divided into sub-clusters, and each sub-cluster is then reconstructed separately. After all sub-clusters have been reconstructed, the results are merged into a whole. According to the characteristics of aerial photogrammetry, this paper proposes a camera clustering method based on the locations of camera projections. The method consists of the following three steps.

Figure 1. Camera clustering. (a) The result of the input sparse reconstruction, in which the blue cones represent the cameras and the gray points represent the sparse point cloud; (b) projection of a camera $O_i$ onto the estimated plane $\pi$, where the red quadrilateral is its coverage; (c) after the projection of all cameras is completed, the bounding box of the region is divided into sub-regions.

A. Projection Plane Estimation
The dense reconstruction of aerial images usually involves urban, mountainous, or rural farmland scenes, in which the region is distributed over a surface as a whole, so a plane model can be used to estimate the approximate ground plane from the sparse points. According to the Hessian normal form of the plane model, for a point $(x, y, z)$ in space and plane coefficients $(A, B, C, D)$, the plane equation is:

$$Ax + By + Cz + D = 0 \qquad (1)$$

However, since there are a large number of non-coplanar points in the scene, such as buildings, trees, and noise points, the RANSAC-based robust estimation method is required.
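For illustration, a minimal C++ sketch of such a RANSAC plane fit follows (function and parameter names are ours, and Eigen is assumed for the linear algebra; this is not the paper's implementation):

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <random>
#include <vector>

// Sketch: fit the plane (A, B, C, D) of Eq. (1) to the sparse points with
// RANSAC, keeping the hypothesis with the most inliers under `threshold`.
Eigen::Vector4d RansacPlane(const std::vector<Eigen::Vector3d>& pts,
                            double threshold, int iterations = 1000) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> pick(0, pts.size() - 1);
    Eigen::Vector4d best = Eigen::Vector4d::Zero();
    size_t bestInliers = 0;
    for (int it = 0; it < iterations; ++it) {
        // Hypothesize a plane from three random sparse points.
        const Eigen::Vector3d& p0 = pts[pick(rng)];
        const Eigen::Vector3d& p1 = pts[pick(rng)];
        const Eigen::Vector3d& p2 = pts[pick(rng)];
        Eigen::Vector3d n = (p1 - p0).cross(p2 - p0);
        if (n.norm() < 1e-9) continue;           // nearly collinear sample
        n.normalize();
        double d = -n.dot(p0);                   // so that n.dot(p) + d = 0
        // Count points within `threshold` of the hypothesized plane.
        size_t inliers = 0;
        for (const auto& p : pts)
            if (std::abs(n.dot(p) + d) < threshold) ++inliers;
        if (inliers > bestInliers) {
            bestInliers = inliers;
            best << n.x(), n.y(), n.z(), d;      // (A, B, C, D)
        }
    }
    return best;
}
```

In practice the winning plane would typically be re-estimated by least squares on its final inlier set.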

B. Camera Projection
After the plane estimation is completed, all cameras are projected onto the plane using the pinhole imaging model, and the spatial overlapping relationship of the images can then be analyzed. In the input sparse reconstruction result, the rotation matrix of image $i$ is $R_i$, the translation vector is $C_i$, and the camera intrinsic matrix is $K_i$. The homogeneous coordinates of a point $p$ in the image are $x_p = [u_p, v_p, 1]^T$, and its corresponding 3D point is $P_p = [X_p, Y_p, Z_p, 1]^T$. We adopt the following pinhole imaging model:

$$P = K_i R_i x_p + C_i \qquad (2)$$

Supposing the length and width of image $i$ are $H$ and $W$, respectively, the four corner points of the image are $(0,0), (0,W), (H,0), (H,W)$, and the corresponding four points $P_i^1, P_i^2, P_i^3, P_i^4$ in 3D space can be obtained by Equation (2). The camera center in 3D space is $O_i = -R_i C_i$. As shown in Figure 1b, the intersections of the four rays passing through the camera center and $P_i^1, P_i^2, P_i^3, P_i^4$ with the estimated plane $\pi$ determine a quadrilateral, which is the projection of camera $i$ onto the plane $\pi$. Figure 1c shows the camera projections for the aerial image reconstruction of Dayan Tower, in which the red quadrilaterals are the camera projections, the green points are points within the plane, and the blue lines are the boundaries of the clustering division.
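The intersection of a corner ray with the plane $\pi$ can be computed in closed form; the following sketch (hypothetical helper names, Eigen assumed) returns one corner of the projected quadrilateral:

```cpp
#include <Eigen/Dense>
#include <cmath>

// Sketch: intersect the ray from the camera center O_i through a
// back-projected corner point P (Eq. (2)) with the plane (A, B, C, D).
bool IntersectRayWithPlane(const Eigen::Vector3d& origin,    // O_i
                           const Eigen::Vector3d& through,   // P
                           const Eigen::Vector4d& plane,     // (A, B, C, D)
                           Eigen::Vector3d& hit) {
    const Eigen::Vector3d n = plane.head<3>();
    const Eigen::Vector3d dir = through - origin;
    const double denom = n.dot(dir);
    if (std::abs(denom) < 1e-12) return false;   // ray parallel to the plane
    const double t = -(n.dot(origin) + plane(3)) / denom;
    if (t <= 0.0) return false;                  // plane behind the camera
    hit = origin + t * dir;                      // a corner of the quadrilateral
    return true;
}
```

Calling this for the four corner rays of image $i$ yields the red quadrilateral of Figure 1b.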

C. Camera Clustering
As shown in Figure 1c, after the plane estimation and camera projection, the bounding box of the sparse points in the plane can be divided into several sub-regions, according to the preset number of clusters. If the projection area of a camera overlaps with the bounding box of a sub-cluster, then the image belongs to that sub-cluster. After traversing all the images and sub-clusters, the 3D sparse points in each sub-cluster are retrieved according to the 2D feature points in the images and the 2D–3D matching traces in the sparse reconstruction results. Once the 3D sparse points of a sub-cluster are determined, the bounding box of each sub-region can be obtained; i.e., the 3D space range of a sub-cluster is


determined by $[x_{\min}, x_{\max}] \times [y_{\min}, y_{\max}] \times [z_{\min}, z_{\max}]$. Many images belong to several different sub-clusters, which would lead to repeated reconstruction and redundancy. In Section III, an efficient MVS method is proposed based on a limited patch expansion range, which can effectively reduce computational complexity and improve computational efficiency. Another advantage of this method is that the boundaries between sub-clusters are clear, so the reconstructed data are easy to organize and post-process.
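As a sketch of the assignment rule described above (structure and function names are our assumptions, not the paper's code), the overlap test between a camera's projected quadrilateral and a sub-region reduces to a 2D rectangle intersection on the plane:

```cpp
// Axis-aligned rectangle on the estimated plane (plane coordinates).
struct Rect2D { double xmin, xmax, ymin, ymax; };

// True if the bounding rectangle of a camera's projected quadrilateral
// overlaps the sub-region rectangle; the image then joins that sub-cluster.
bool Overlaps(const Rect2D& a, const Rect2D& b) {
    return a.xmin <= b.xmax && b.xmin <= a.xmax &&
           a.ymin <= b.ymax && b.ymin <= a.ymax;
}
```

An image may satisfy this test for several sub-regions; this is exactly the redundancy that the limited patch expansion range of Section III suppresses.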

III. MVS METHOD WITH LIMITED PATCH EXPANSION RANGE

In this study, a multi-view dense reconstruction algorithm is proposed, consisting of view selection, initial patch generation and expansion, depth map calculation and optimization, and depth map fusion. For each image in a sub-cluster, we select a set of searching views for stereo computation. Raw depth maps generated by patch expansion contain high levels of noise and errors, and are refined by an MPGC-based method. Finally, dense point clouds are generated with high accuracy after depth map fusion.

A. View Selection
Each image in the reconstructed dataset is selected as a reference image in turn, and the other images serve as searching images in stereo pairs. As mentioned above, a given image is only connected with a few images in the dataset. When performing reconstruction, it is appropriate to select a small set of images to compute stereo with the reference image, rather than the whole image dataset. Given that the image sequence for reconstruction is $\mathcal{I} = \{I_1, I_2, \ldots, I_N\}$ and the reference image is $I_p \in \mathcal{I}$, the searching image set $V(p)$ is selected from $\mathcal{I}$ for depth map computation. The selection of the searching image set not only improves the efficiency of dense reconstruction, but also has an important impact on reconstruction accuracy. The selection is mainly based on the results of the sparse reconstruction, and is determined by the number of matching points between images, the baseline distance, and the angle between view directions [10].
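A possible scoring scheme combining these three cues is sketched below; the weights, thresholds, and names are our assumptions only, since the paper defers the details to [10]:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One candidate searching image, with the cues named above precomputed
// from the sparse reconstruction.
struct Candidate {
    int image;          // index into the image sequence I
    int sharedPoints;   // sparse matching points shared with the reference
    double baseline;    // baseline distance to the reference camera
    double angleDeg;    // angle between the two view directions
};

// Keep the k best-scoring candidates as the searching set V(p).
std::vector<int> SelectSearchingImages(std::vector<Candidate> cands, size_t k) {
    // Prune candidates with too few shared points or too oblique views
    // (threshold values are illustrative only).
    cands.erase(std::remove_if(cands.begin(), cands.end(),
                               [](const Candidate& c) {
                                   return c.sharedPoints < 30 || c.angleDeg > 45.0;
                               }),
                cands.end());
    // Score: many shared points, penalizing near-zero baselines, which
    // give poorly conditioned triangulation.
    std::sort(cands.begin(), cands.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.sharedPoints * std::min(a.baseline, 1.0) >
                         b.sharedPoints * std::min(b.baseline, 1.0);
              });
    if (cands.size() > k) cands.resize(k);
    std::vector<int> out;
    for (const auto& c : cands) out.push_back(c.image);
    return out;
}
```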

B. Patch Generation and Expansion
A "patch" refers to the tangential plane at a 3D point of the reconstructed scene. As the basic dense-reconstruction element, a patch consists of its center $X(p)$ and a normal vector $n(p)$ pointing toward the camera. If the patch is visible in the reference image, its center $X(p)$ establishes a matching relationship with a point $x_p$ in the reference image $I_R(p)$. Denoting the patch as $f_p \leftarrow \{X(p), n(p), I_R(p)\}$, with the homogeneous coordinates of $x_p$ being $[x_u, x_v, 1]^T$, in the camera coordinate system of the reference image we have $X(p) = \lambda_p K_p^{-1} x_p$, where $\lambda_p$ is the depth value corresponding to the pixel $x_p$ and $K_p^{-1}$ is the inverse of the camera intrinsic matrix. The seed patches are initialized from the sparse points, and raw depth maps are calculated by patch expansion. In order to retrieve the sparse point cloud effectively, a Delaunay triangulation of the sparse point cloud is performed. The normal vector $n_i$ at a point $X(p)$ can be obtained by the cross product of the vectors corresponding to the two edges at $X(p)$ on a triangulation face.
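In code, a patch and its seed normal could be represented as follows (a minimal sketch with illustrative names; Eigen assumed):

```cpp
#include <Eigen/Dense>

// A patch as defined above: center X(p), normal n(p), reference image I_R(p).
struct Patch {
    Eigen::Vector3d center;   // X(p)
    Eigen::Vector3d normal;   // n(p), oriented toward the camera
    int refImage;             // index of I_R(p)
};

// Seed normal from one Delaunay triangle (v0, v1, v2) incident to the point:
// the cross product of the two edge vectors at v0, flipped toward the camera.
Eigen::Vector3d TriangleNormal(const Eigen::Vector3d& v0,
                               const Eigen::Vector3d& v1,
                               const Eigen::Vector3d& v2,
                               const Eigen::Vector3d& cameraCenter) {
    Eigen::Vector3d n = (v1 - v0).cross(v2 - v0).normalized();
    if (n.dot(cameraCenter - v0) < 0.0) n = -n;
    return n;
}
```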

Patches generated from the sparse point cloud are used as seed patches. According to the matching measure $f_w$, the seed patches are sorted in descending order before being added to the reconstruction. This prioritization is important: considering patches with higher matching confidence first helps to avoid expansion into unstable regions. The aggregated matching measure adopted in this paper is the Zero Normalized Cross Correlation (ZNCC), given as:

$$f_w(x, x') = 1 - \frac{\sum_{i \in W} [I(x+i) - \bar{I}(x)]\,[I'(x'+i) - \bar{I}'(x')]}{\left\{ \sum_{i \in W} [I(x+i) - \bar{I}(x)]^2 \sum_{i \in W} [I'(x'+i) - \bar{I}'(x')]^2 \right\}^{1/2}} \qquad (3)$$

where $W$ is the square window centered on pixel $x$, with size $\omega \times \omega$ (we set $\omega = 7$ pixels in the following experiments), and $\bar{I}(x)$ is the average greyscale value of the pixels in $W$. High-resolution aerial images provide reliable matching results, so ZNCC is sufficient as the matching cost.
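A direct implementation of Equation (3) over two pre-sampled windows might look as follows (a sketch; the window sampling itself is omitted):

```cpp
#include <cmath>
#include <vector>

// ZNCC cost f_w of Eq. (3) for two flattened omega x omega windows.
// Returns a value in [0, 2]; 0 is a perfect match.
double ZnccCost(const std::vector<double>& w0, const std::vector<double>& w1) {
    const size_t n = w0.size();
    double mean0 = 0.0, mean1 = 0.0;
    for (size_t i = 0; i < n; ++i) { mean0 += w0[i]; mean1 += w1[i]; }
    mean0 /= n;
    mean1 /= n;
    double num = 0.0, den0 = 0.0, den1 = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d0 = w0[i] - mean0;
        const double d1 = w1[i] - mean1;
        num  += d0 * d1;
        den0 += d0 * d0;
        den1 += d1 * d1;
    }
    const double den = std::sqrt(den0 * den1);
    if (den < 1e-12) return 2.0;   // textureless window: worst possible cost
    return 1.0 - num / den;
}
```

With this convention, the 0.3 rejection threshold used below corresponds to requiring a normalized correlation of at least 0.7.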

During patch initialization, outliers and mismatched points are inevitably present in the SFM results. When adding patches to the queue, initial patches whose matching cost is greater than 0.3 are deleted. Patch expansion mainly provides initial values for the optimization process, and good initial values lead to fast convergence.

Let $f_q \leftarrow \{X(q), n(q), I_R(q)\}$ be one of the initial patches, whose corresponding pixel in the reference image is $x_q$. Its extending plane is $\varsigma = (V^T, 1)^T$, where $V = -\dfrac{n_q}{n_q^T X_q}$. When performing expansion, the neighboring pixels of $x_q$ (the blue area in Figure 2) form the priority expansion area. The intersection of the ray passing through the camera center $O_R$ and a neighboring pixel $x_{q'}$ with the plane $\varsigma$ determines the center $X(q')$ of the new expanded patch $f_{q'}$. The normal vector of the seed patch is used as the normal vector of the new patch, $n(q') = n(q)$. The other pixels in the same triangle as the seed patch are initialized in the same way, and can be traversed by advanced rasterization. The raw depth map following expansion is shown in Figure 6c, in which it can be seen that there are blurred objects in the scene that need to be refined.

Before expansion, the bounding box of the sub-cluster should be enlarged by 10%, so as to avoid bias caused by the clustering; the enlarged parts are deleted in subsequent processing (such as meshing or texture mapping). When performing patch expansion, a newly generated patch should not exceed the extended bounding box. That is, if the newly generated patch is within the range $1.1 \times [x_{\min}, x_{\max}] \times [y_{\min}, y_{\max}] \times [z_{\min}, z_{\max}]$, it is retained; otherwise, it is deleted and no longer expanded in that direction.
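The containment test implementing this rule is simple; a sketch (names are ours, Eigen assumed):

```cpp
#include <Eigen/Dense>

// Axis-aligned bounding box of a sub-cluster.
struct Box3D { Eigen::Vector3d min, max; };

// Enlarge the box by `factor` (1.1 for the 10% margin) about its center.
Box3D Enlarge(const Box3D& b, double factor) {
    const Eigen::Vector3d c = 0.5 * (b.min + b.max);
    const Eigen::Vector3d h = 0.5 * factor * (b.max - b.min);
    return {c - h, c + h};
}

// Keep a newly expanded patch only if its center lies inside the box.
bool InsideBox(const Eigen::Vector3d& X, const Box3D& b) {
    return (X.array() >= b.min.array()).all() &&
           (X.array() <= b.max.array()).all();
}
```

During expansion, a patch center failing InsideBox against the enlarged box is discarded, and expansion stops in that direction.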



Figure 2. Patch expansion to neighboring pixels.

C. Depth Map Calculation and Optimization
The patch corresponding to a pixel in the reference image is defined in 3D space, which avoids image rectification. The homography between the reference image and the searching image can be obtained from the support region corresponding to the patch, and the aggregated matching cost can be compared and propagated across multiple images. Supposing that the camera parameters of the stereo pair $\{I_i, I_j\}$ are $\{K_i, R_i, C_i\}$ and $\{K_j, R_j, C_j\}$, and the patch is $f_p \leftarrow \{X(p), n(p), I_R(p)\}$, the homography matrix between matching points is as follows [12]:

$$H_{ij} = K_j R_j \left( R_i^{-1} + \frac{(C_j - C_i)\, n^T}{n^T X} \right) K_i^{-1} \qquad (4)$$
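Equation (4) can be transcribed directly into code (a sketch under the paper's camera model; Eigen assumed):

```cpp
#include <Eigen/Dense>

// Plane-induced homography H_ij of Eq. (4) for a patch with normal n = n(p)
// and center X = X(p), mapping reference-image pixels into image j.
Eigen::Matrix3d PatchHomography(const Eigen::Matrix3d& Ki, const Eigen::Matrix3d& Ri,
                                const Eigen::Vector3d& Ci,
                                const Eigen::Matrix3d& Kj, const Eigen::Matrix3d& Rj,
                                const Eigen::Vector3d& Cj,
                                const Eigen::Vector3d& n, const Eigen::Vector3d& X) {
    const Eigen::Matrix3d M =
        Ri.inverse() + ((Cj - Ci) * n.transpose()) / n.dot(X);
    return Kj * Rj * M * Ki.inverse();
}
```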

In Figure 3, $C_R$ is the camera center of the reference image, and $C_1, C_2, \ldots, C_N$ are the camera centers of the searching images. The supporting window $W$ in the reference image is centered at $p$, while the matching windows $W_1, W_2, \ldots, W_N$ are centered at $p_1, p_2, \ldots, p_N$.


Figure 3. The principle of MPGC optimization.

If the pixel $\mathbf{x}_p = [x_p, y_p]$ in the reference image and the pixel $\mathbf{x}_i = [x_i, y_i]$ in the searching image are corresponding matching points, then, without considering distortion, the points have the following relationship [31]:

$$\begin{bmatrix} x_i \\ y_i \end{bmatrix} = A \begin{bmatrix} x_p \\ y_p \end{bmatrix} + B \qquad (5)$$

where

$$A = \begin{bmatrix} h^i_{10} & h^i_{11} \\ h^i_{20} & h^i_{21} \end{bmatrix}, \qquad B = \begin{bmatrix} h^i_{12} \\ h^i_{22} \end{bmatrix},$$

and $h^i_{jk}$ is an element of the homography matrix in Equation (4). $A$ is the upper-left $2 \times 2$ block of the homography matrix $H$, and $B$ is composed of the first two elements of its last column. After the initial value of the patch is obtained by expansion, the homography matrix $H$ between the two points can be calculated using Equation (4), and the initial values $(A_0, B_0)$ of $(A, B)$ can be obtained. These parameters are then optimized by minimizing Equation (6):

$$\min_{A, B}\ \varepsilon = \int_{x_q \in W} \left[ I_R(x_q) - c_i I_i(A x_q + B) \right]^2 dx \qquad (6)$$

where $c_i$ is the color scale. In aerial image reconstruction, illumination changes are generally limited, so we set $c_i = 1$. $I_i(A x_q + B)$ is linearized at $(A_0, B_0)$ as follows:

$$I_i(A x_q + B) = I_i(A_0 x_q + B_0) + \nabla I_i^T \, \frac{\partial x_i}{\partial a} \, (a - a_0) \qquad (7)$$

where $a = (h^i_{10}, h^i_{11}, h^i_{20}, h^i_{21}, h^i_{12}, h^i_{22})$ and $\nabla I_i = (I_{ix}, I_{iy})^T$. Substituting Equation (7) into Equation (6) and setting the integrand to zero gives:

$$I_R(x_q) - I_i(A_0 x_q + B_0) - \nabla I_i^T \, \frac{\partial x_i}{\partial a} \, (a - a_0) = 0 \qquad (8)$$

After simplification, Equation (8) can be written as:

$$G\,a = g \qquad (9)$$

where $g = \int F(x_q)\,(I_q + k_1 I_{ix} + k_2 I_{iy})\, dx$, $G = \int F(x_q) F(x_q)^T dx$, $k_1 = h^i_{10} x_q + h^i_{11} y_q$, $k_2 = h^i_{20} x_q + h^i_{21} y_q$, and $F(x_q) = [x_q I_{ix},\ y_q I_{ix},\ x_q I_{iy},\ y_q I_{iy},\ I_{ix},\ I_{iy}]^T$.

According to Equation (9), the final value of $(A, B)$ can be refined using Newton–Raphson iteration. The position of the matching pixel $(x_i, y_i)$ in the searching image can then be determined by Equation (5). If the matching cost is less than 0.3, the result is considered reliable.
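One Newton–Raphson step then amounts to accumulating $G$ and $g$ over the support window and solving the resulting $6 \times 6$ system; a sketch (input struct and names are ours, Eigen assumed):

```cpp
#include <Eigen/Dense>
#include <vector>

// Per-pixel terms of Eq. (9), precomputed over the support window.
struct MpgcSample {
    double xq, yq;   // pixel coordinates x_q, y_q in the reference window
    double Ix, Iy;   // searching-image gradient I_ix, I_iy at the current match
    double Iq;       // intensity term entering g for this pixel
};

// Accumulate G and g of Eq. (9) and solve G a = g for the parameter vector
// a = (h10, h11, h20, h21, h12, h22); a0 holds the current estimate.
Eigen::Matrix<double, 6, 1> SolveMpgcStep(const std::vector<MpgcSample>& samples,
                                          const Eigen::Matrix<double, 6, 1>& a0) {
    Eigen::Matrix<double, 6, 6> G = Eigen::Matrix<double, 6, 6>::Zero();
    Eigen::Matrix<double, 6, 1> g = Eigen::Matrix<double, 6, 1>::Zero();
    for (const auto& s : samples) {
        Eigen::Matrix<double, 6, 1> F;   // F(x_q) as defined above
        F << s.xq * s.Ix, s.yq * s.Ix, s.xq * s.Iy, s.yq * s.Iy, s.Ix, s.Iy;
        const double k1 = a0(0) * s.xq + a0(1) * s.yq;  // h10*xq + h11*yq
        const double k2 = a0(2) * s.xq + a0(3) * s.yq;  // h20*xq + h21*yq
        G += F * F.transpose();
        g += F * (s.Iq + k1 * s.Ix + k2 * s.Iy);
    }
    return G.ldlt().solve(g);   // a satisfying G a = g
}
```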

After the above derivation, when a pixel of the reference image is not a seed point, the corresponding pixel in the searching image can be obtained by updating the parameters $(A, B)$, and the depth map can be refined by precisely matching the corresponding points. After refinement, all the depth maps are merged to obtain the 3D dense points of the whole scene.
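The paper does not spell out the enhanced depth-consistency rule applied before fusion; for concreteness, a common formulation is sketched here (the reprojection callback is a hypothetical interface, not the paper's API): a depth is kept only if enough neighboring views agree with it.

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Reprojection oracle for one neighboring view: given a 3D point (x, y, z),
// report the point's depth in that view and the depth stored at the hit
// pixel; return false if the point projects outside the image.
using ProjectFn = std::function<bool(const double*, double*, double*)>;

// A point X is kept only if at least `minViews` neighboring depth maps store
// a depth at its reprojection that agrees within a relative tolerance.
bool DepthConsistent(const double X[3], const std::vector<ProjectFn>& neighbors,
                     int minViews, double relTol /* e.g. 0.01 */) {
    int agreeing = 0;
    for (const auto& project : neighbors) {
        double dPoint = 0.0, dStored = 0.0;
        if (!project(X, &dPoint, &dStored)) continue;   // outside this view
        if (std::abs(dPoint - dStored) < relTol * dStored) ++agreeing;
    }
    return agreeing >= minViews;
}
```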

IV. EXPERIMENT AND ANALYSIS
Two groups of experiments are carried out in this paper. Firstly, the DTU dataset [11] is used to evaluate the performance of the proposed MVS algorithm in terms of completeness and accuracy. The effectiveness of the presented clustering algorithm and limited patch expansion scheme is subsequently verified experimentally and by comparative analysis. Finally, three sets of large-scale aerial images are reconstructed to evaluate the performance of the proposed method. COLMAP and PMVS are the state-of-the-art methods used for the comparative experiments; both are open programs with available code, and are simple and convenient to run. Their default parameter settings were used in our experiments. Our approach was


implemented using C++ on a PC with Intel Xeon(R) i7 2.0 GHz processors (32 threads), 96 GB of RAM, and a 500 GB hard disk drive for data storage.

A. Reconstruction Accuracy and Completeness
Without considering the clustering division, the DTU datasets were used to evaluate the accuracy and completeness of the MVS algorithm proposed in this paper. The DTU benchmark fully considers the influence of various factors and provides evaluation protocols for large-scale MVS algorithms. It contains 80 scenes, each consisting of 49 or 64 images with a resolution of 1200 × 1600 under different illumination conditions, together with high-accuracy ground-truth point clouds obtained by structured-light scanning for each scene. The evaluation protocol is carefully designed, with available MATLAB code to calculate the distance metrics. The mean and median of the differences between the ground-truth point clouds and the reconstructed point clouds serve as performance measures of the MVS algorithms. Percentage metrics of reconstruction accuracy and completeness are also considered.

Table 1. Quantitative comparison of the different methods on the DTU datasets. Acc. refers to accuracy, and Comp. to completeness.

          Mean Distance (mm)   Median Distance (mm)   Percentage (< 3 mm)   Percentage (< 5 mm)
Method    Acc.     Comp.       Acc.     Comp.         Acc.     Comp.        Acc.     Comp.
PMVS      0.472    0.613       0.275    0.298         73.15    54.52        80.79    69.33
COLMAP    0.418    0.377       0.241    0.276         75.47    65.35        83.95    77.62
Ours      0.353    0.336       0.237    0.265         81.43    68.71        88.54    81.28

A total of 25 groups of datasets were selected from the DTU datasets, and dense reconstruction was performed using our algorithm and the PMVS and COLMAP algorithms. Five groups of the reconstruction results are shown in Figure 4, in which the green boxes mark reduced reconstruction completeness and the red boxes indicate low reconstruction accuracy. As can be seen from Figure 4, PMVS can generate high-accuracy point clouds, yet its completeness is poor, while COLMAP improves both reconstruction accuracy and completeness, yet generates large numbers of redundant points, especially at the edges of objects.

Figure 4. Reconstruction results for scans 1, 2, 6, and 37 of the DTU datasets.

Our method shows an advantage over COLMAP and PMVS in both completeness and accuracy. Additionally, our method can generate complete point clouds even in textureless regions. A quantitative comparison of the different methods on the DTU datasets is shown in Table 1. A total of 81.43% of the points generated by our proposed method have accuracies under 3 mm, which outperforms the other methods, and its completeness values are also the best of the three methods.

B. Clustering Scheme Effectiveness

In order to verify the effectiveness of the clustering algorithm and the limited patch expansion range scheme, the Dayan Tower dataset was used for dense reconstruction, and the CMVS–PMVS algorithm was tested as a comparison. The Dayan Tower dataset is composed of 1045 images, captured by a UAV with five SONY DSC-QX100 cameras at different angles.

The SFM results of the dataset were divided into 16 sub-clusters using both CMVS and our proposed method. Each sub-cluster divided by our method was then reconstructed with both a limited and an unlimited patch expansion range. Additionally, each sub-cluster divided by CMVS was reconstructed by PMVS for comparison. As illustrated in Table 2, our proposed method is more than eight times faster than the CMVS–PMVS method, and its required RAM is much lower. With a limited patch expansion range, our method generates fewer redundant points and runs more efficiently. A limited patch expansion range is thus an effective modification of the proposed MVS method for dealing with large-scale aerial images in cooperation with the camera clustering method.

The merged results for four neighboring sub-regions generated by the above methods are shown in Figure 5. It can be seen that the boundaries of the sub-regions reconstructed by our method with a limited patch expansion range are clean, which facilitates data organization and makes further processing more convenient.

Table 2. Statistics of the reconstruction results for the different reconstruction methods.

Method                N_p          N_c   Time (min)   Peak Memory (GB)   N_i
Our Proposed (LR)     6,988,324    16    194          4.35               169
Our Proposed (ULR)    17,344,090   16    325          7.24               169
CMVS-PMVS             2,630,367    16    1592         62.47              105

$N_p$ is the average number of points in each sub-cluster after reconstruction, $N_c$ is the number of sub-clusters, and $N_i$ is the average number of images in each sub-cluster; LR and ULR denote the limited and unlimited patch expansion range, respectively.


Figure 5. The reconstruction results for Dayan Tower using different methods: (a) Our Proposed method (limited expansion range); (b) Our Proposed method (unlimited expansion range); (c) The CMVS–PMVS method.

Using the same acquisition configuration as for the Dayan Tower dataset, a dataset of a town in Guangzhou was acquired with a ground resolution of 0.05 m. The SFM result was divided into 30 sub-clusters, which were reconstructed one by one using our proposed method. In Figure 6, we select four sub-clusters, which can be divided into two groups according to their adjacency. The merged result of each group, and the result of the whole scene, can be seen in Figure 6. It can be concluded that the clustering method and the limited patch expansion scheme can effectively deal with large-scale aerial images registered by SFM.

Figure 6. Reconstruction result of the Guangzhou dataset based on clustering.

V. CONCLUSIONS
In this paper, a novel algorithm is presented for the 3D reconstruction of dense point clouds from large-scale aerial images. A clustering scheme is proposed to divide the sparse result into sub-clusters, according to the characteristics of aerial image acquisition. To reconstruct the sub-clusters, an efficient patch-based stereo matching strategy is adopted, followed by MPGC optimization, to generate a depth map for each image, with the patch expansion range limited according to the bounding box of the sub-cluster. The experimental results demonstrate the completeness, efficiency, and high accuracy of the proposed method, which outperforms the state-of-the-art methods in terms of reconstruction accuracy and efficiency.

REFERENCES
[1] Locher A, Perdoch M, Van Gool L. Progressive prioritized multi-view stereo[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2016, 3244-3252.
[2] Heinly J, Schönberger J L, Dunn E, et al. Reconstructing the world* in six days[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2015, 3287-3295.
[3] Strecha C, Von Hansen W, Van Gool L, et al. On benchmarking camera calibration and multi-view stereo for high resolution imagery[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2008, 1-8.
[4] Kolev K, Brox T, Cremers D. Fast joint estimation of silhouettes and dense 3D geometry from multiple images[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34, 493-505.
[5] Furukawa Y, Ponce J. Accurate, dense, and robust multi-view stereopsis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32, 1362-1376.
[6] Goesele M, Snavely N, Curless B, et al. Multi-view stereo for community photo collections[C]. IEEE 11th International Conference on Computer Vision, 2007, 1-8.
[7] Schönberger J L, Zheng E, Frahm J M, et al. Pixelwise view selection for unstructured multi-view stereo[C]. European Conference on Computer Vision, 2016, 501-518.
[8] Furukawa Y, Curless B, Seitz S M, et al. Towards internet-scale multi-view stereo[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2010, 1434-1441.
[9] Snavely N, Seitz S M, Szeliski R. Modeling the world from internet photo collections[J]. International Journal of Computer Vision, 2008, 80, 189-210.
[10] Li J, Li E, Chen Y, et al. Bundled depth-map merging for multi-view stereo[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2010, 2769-2776.
[11] Aanæs H, Jensen R R, Vogiatzis G, et al. Large-scale data for multiple-view stereopsis[J]. International Journal of Computer Vision, 2016, 120, 153-168.