Hierarchical Image Geo-Location
On a World-Wide Scale
A Dissertation Presented by
Alexandru Nicolae Vasile
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Electrical Engineering
Northeastern University
Boston, Massachusetts
December 2014
Keywords: Internet Data, Image Geo-location, Scene Classification, Image Retrieval, 3D Reconstruction, Structure from Motion, 3D Registration, Sensor Fusion, 3D Ladar, Geiger-Mode, Data Filtering, Coincidence Processing
Abstract
Hierarchical Image Geo-Location On a World-Wide Scale
by Alexandru N. Vasile
Doctor of Philosophy in Electrical Engineering
Northeastern University, December 2014
Advisor: Dr. Octavia Camps
There are increasingly vast amounts of imagery and video collected from a variety of sensor modalities. Considering that each individual image may contain a considerable amount of information, the ability to interpret, understand and extract scene information is highly beneficial. In order to enable automated scene understanding, there is a need for an organizing principle to store, visualize and exploit the data. Three-dimensional geometry provides such an organizing principle, as imagery and video have inherent 3D structure and can be associated with geographic coordinates. In this thesis, we leverage multiple large geo-spatial databases to create a 3D world model and develop a hierarchical image geo-location framework using a coarse-to-fine localization approach. Starting at the coarsest level, a query image is geo-located to regions of the world through a probabilistic terrain classification approach using a 6.5 million image Flickr database. Next, a novel medium-scale localization method is developed to rule out most of the regions and establish candidate geo-locations with geo-positioning accuracy at the city level. Results from the combined hierarchical classifier demonstrate a 10% improvement over the current state of the art. A fine-scale geo-location stage was also developed to determine the pose of a query image to street-level geo-positioning accuracy. The fine-scale algorithm introduces an efficient structure-from-motion (SfM) 3D reconstruction approach that scales to city-sized image databases, incorporating both ground and aerial video imagery for a more complete 3D city model. The newly developed SfM approach is demonstrated to have an order-of-magnitude computational speed-up compared to prior work, and is validated to produce a 3D city model that is absolutely geo-located to within 3 meters of 3D Laser radar (Ladar) truth imagery.
The fine geo-location stage was also tested using a 500 image hold-out set and demonstrated to geo-locate close to 80% of query images to within 100m, exceeding the system goal of street-level geo-positioning accuracy. As a proof of concept, we demonstrate improved image understanding by leveraging the newly developed 3D world model to transfer information to example query images from other geo-located, labeled data sources.
In support of fine-scale geo-location validation, we also developed an algorithm to process 3D Ladar data using a novel 3D noise filtering technique that is shown to be a significant improvement over current state of the art, resulting in a 9x improvement in signal-to-noise ratio, a 2-3x improvement in angular and range resolution, a 21% improvement in ground detection and a 5.9x improvement in computational efficiency.
Acknowledgements
I would like to thank my thesis advisor, Professor Octavia Camps, for her advice, guidance and patience throughout the 7-year duration of this research. I would also like to thank my Lincoln Lab group leaders Dr. Richard Heinrichs and Dr. M. Jalal Khan for funding and enthusiastically supporting my research. Many of my colleagues and other researchers helped me in developing the presented algorithms. In particular, Prof. James Hays from Brown University was kind enough to share the image database and baseline code. Also, my former colleague, Karl Ni, was instrumental in getting the funding for the fine-scale geo-location data collects and also contributed to algorithm development. Furthermore, I have had many white-board discussions with my colleague and friend, Luke J. Skelly. Some of the results in this thesis, namely the 3D Ladar processing, were due to our joint collaboration in code and algorithm development. Also, the everyday advice on how to navigate and successfully finish a PhD from my colleague, friend, neighbor and former Master’s thesis advisor, Richard Marino, was invaluable for keeping my sanity through the long process. Finally, I would like to thank my family. My wife, Osiris Vasile, and my parents, Stefan and Eliza Vasile, have been a source of constant support, encouragement and inspiration. On numerous occasions, they provided me with advice on how to conduct my thesis research based on their own research experience. Also, my kids, Owen Vasile and Oksana Vasile, brought a lot of joy to my life and kept me well grounded.
Table of Contents

Abstract ....... iii
List of Figures ....... vii
List of Tables ....... x

1 Introduction ....... 1
   1.1 Data ....... 8

2 Coarse Scale Geo-location ....... 9
   2.1 Background and Related Work ....... 9
      2.1.1 Determining Geo-spatial Coordinates Given Terrain Type ....... 12
      2.1.2 Recognizing Terrain Types from an Image ....... 15
      2.1.3 Towards a Coarse Scale Geo-location Approach ....... 19
   2.2 Coarse Scale Geo-location Approach ....... 20
   2.3 Coarse Scale Geo-location Experimental Setup and Results ....... 32

3 Medium Scale Geo-location ....... 35
   3.1 Background ....... 35
   3.2 Learning for Scene Classification and Image Retrieval ....... 36
   3.3 Medium Scale Geo-location Algorithm and Results ....... 39

4 Fine Scale Geo-location ....... 49
   4.1 Background and Related Work ....... 49
   4.2 Fine Scale Geo-Registration Approach ....... 53
      4.2.1 3D Reconstruction from Video Imagery ....... 53
      4.2.2 3D Reconstruction Merge ....... 58
      4.2.3 Geo-locating a New Image ....... 59
   4.3 Fine Scale Geo-location Results ....... 61
      4.3.1 Reconstruction and Geo-location of Aerial Imagery ....... 62
      4.3.2 Reconstruction and Geo-location of Ground Imagery ....... 64
      4.3.3 Geo-Registration of Combined Aerial-Ground Imagery ....... 67
      4.3.4 Geo-locating a New Image ....... 70
   4.4 Towards Improved Image and Scene Understanding ....... 71

5 3D Ladar Processing: An Extension to 2D Image Geo-location ....... 74
   5.1 Background and Related Work ....... 74
   5.2 3D Ladar Background ....... 76
   5.3 3D Ladar Processing Approach ....... 79
   5.4 3D Filtered Results and Discussion ....... 85
      5.4.1 Qualitative Results ....... 85
      5.4.2 Quantitative Results ....... 87

6 Conclusion ....... 90
   6.1 Contributions ....... 90
   6.2 Recent Developments ....... 93
   6.3 Future Work ....... 95

References ....... 97
List of Figures

1.1 3D world-model representation ....... 2
1.2 Mapping Flickr ....... 2
1.3 Proposed hierarchical geo-location approach ....... 5
2.1 Examples of land coverage and terrain type maps ....... 13
2.2 Images afflicted with various types of noise ....... 16
2.3 Computing a Gist feature ....... 17
2.4 Examples of terrain labeled images ....... 18
2.5 Geographical distribution of photos from Flickr database ....... 19
2.6 Block diagram of coarse-scale geo-location algorithm ....... 22
2.7 UNEP Mountains and Tree cover in Mountain Regions 2002 Database ....... 24
2.8 Pseudo-code for the first training stage ....... 25
2.9 Results of processing GLCC land-cover database into 4 general land-types ....... 26
2.10 Results of processing GLCC land-cover database into 5 general land-types ....... 28
2.11 Pseudo-code for the proposed coarse geo-location algorithm ....... 31
3.1 Image attributes that might be used for medium-scale geo-location ....... 35
3.2 Mean shift clustering of urban-only database using 200km bandwidth ....... 41
3.3 Accuracy as function of urban database size for proposed algorithm ....... 44
3.4 Accuracy of geo-location estimates once lower ranked clusters are considered ....... 44
3.5 Sample of 2k random data set, showing 30 images ....... 45
3.6 Geo-location results 1 ....... 46
3.7 Geo-location results 2 ....... 47
3.8 Geo-location results 3 ....... 48
4.1 Example of data used by the 3D reconstruction system ....... 52
4.2 Structure from motion 3D reconstruction pipeline for video imagery ....... 54
4.3 3D Geo-location method ....... 57
4.4 3D Merge Method ....... 59
4.5 Geo-locating a new image ....... 60
4.6 Aerial 3D reconstruction of 1x1km area of Lubbock, Texas ....... 62
4.7 Qualitative geo-registration results of aerial reconstruction ....... 63
4.8 Quantitative geo-registration results of aerial reconstruction ....... 63
4.9 Qualitative results of ground reconstruction ....... 65
4.10 Initial geo-location of ground reconstruction ....... 66
4.11 Improvement in geo-location after applying 3-D Merge algorithm ....... 68
4.12 Examples of merged aerial-ground reconstruction ....... 69
4.13 Fine-scale geo-location accuracy for a 500 image test subset ....... 71
4.14 Towards improved image understanding ....... 73
5.1 3D Laser Radar (Ladar) system concept ....... 75
5.2 3D Ladar concept of operations ....... 76
5.3 Line-of-sight (LOS) coordinate systems for various sensor platforms ....... 78
5.4 Raw Ladar data showing salt and pepper noise ....... 79
5.5 Raw 3D Ladar point cloud color-coded by scan pattern induced output variation ....... 80
5.6 Method for correcting for photon and detector range attenuation effects ....... 82
5.7 Computation of laser-detector 3D point spread function ....... 83
5.8 MPSCP algorithm block diagram ....... 85
5.9 Visual comparison of MAPCP versus MPSCP results ....... 86
5.10 Coincidence processing quantitative results ....... 88
List of Tables

2.1 Terrain Coverage Numbers of 48 Contiguous States ....... 14
2.2 Mapping from USGS land-use terrain classes to a reduced set of terrain classes ....... 23
2.3 Geo-Label database statistics ....... 29
2.4 Confusion Matrix for baseline method ....... 33
2.5 Confusion Matrix for proposed coarse-scale geo-location method ....... 33
3.1 Medium-scale Geo-Location Confusion Matrix at the city level ....... 42
4.1 Timing comparison of 3D Reconstruction method to prior state of art ....... 67
Chapter 1
Introduction
In the last decade, there has been an explosion in the amount of digital imagery and
video. Vast numbers of photos and videos, shot with increasingly high-quality
digital cameras and smartphones, can now be accessed via the web using online databases
such as Flickr, Facebook, Instagram and YouTube. Though the total number of images
and videos is not well known, tens of billions of images and videos are now accessible on
the World Wide Web. Considering that each individual image may contain a considerable
amount of information, the ability to interpret, understand and extract scene information
is highly beneficial for many communities, including, but not limited to, online social
networking sites, intelligence agencies and companies dealing with large-scale data
mining.
Image understanding algorithms are designed to take the burden off human analysts and
process the data automatically, in a timely manner. With such a vast volume of image
data, some organizing principle is needed to enable efficient navigation, understanding
and exploitation of these large imagery archives. Fortunately, three-dimensional
geometry provides such an organizing principle. For example, suppose we have a set of
photos of some ground scene. Those photos represent 2D projections of the 3D world
structure onto a variety of image planes. If the geometry of the scene is captured in a 3D
map, it can be utilized to mathematically relate the different photos to one another.
Moreover, the 3D map connects together data collected by completely different sensors at
different times, places and perspectives. For instance, one can relate a photo of a city shot
by a ground camera with a corresponding satellite image or a 3D Ladar point cloud.
Thus, the 3D map can add a lot of context to a scene and improve scene understanding
through the process of information transference from one data modality to another. But,
we can only get this improvement in scene context and understanding if all these data
products are geo-located with the 3D map. Figure 1.1 captures this common 3D world
model representation.
Fig 1.1 - 3D world-model representation for organizing data from multiple sensing modalities. The 3D map provides a geometrical framework for organizing imagery collected at different times, places and perspectives, enabling improved image context and scene understanding.
Fig 1.2 - Mapping Flickr. A map of geo-tagged images from Flickr, as of April 2011, with data binned into 0.5 x 0.5 degree latitude-longitude squares (about 55x55km at the Equator) on the Earth’s surface [2]. Certain regions, color-coded in magenta, have upwards of 1 million images, translating to thousands of images per square km.
In order to initialize this 3D world model representation, we need to start out with enough
imagery that not only has geo-spatial metadata but also samples a wide variety of scenes
and locations across the entire world. Fortunately, there is a wealth of imagery available
online that has geo-spatial metadata. For example, Flickr has enabled geo-tagging of
images since August 2006; within the first day, 1.2 million images were geo-tagged [1].
A more recent map of Flickr from April 2011 in Figure 1.2 reveals an explosion in the
number of geo-tagged images [2], with millions of images now available, with some
locations having upwards of thousands of images per square kilometer.
However, most images on the web usually do not have geo-spatial metadata available,
leaving the task up to a human operator to manually annotate the data, which can often be
tedious or impractical. In the absence of metadata, we need to rely on scene content to
deduce geo-spatial information. Depending on the scene content, such as how many
features we might be able to extract and how salient those features are, we can geo-locate
an image to various levels of geo-spatial accuracy, such as to a particular continent,
region, city, street or even an actual camera pose.
The problem of image geo-location has been addressed by several authors. Most of the
geo-location research falls into two categories: 1. localization by landmark recognition
using local image features and 2. localization by similar image retrieval using global
features that capture whole image content. Geo-location by landmark recognition
[3][4][5][6] tends to focus on limited image datasets (hundreds of thousands of images) that are already
highly localized around a set of landmarks or comprised of images from a single city.
Many of these methods apply feature matching and structure-from-motion techniques to
estimate camera location and pose. For the problem of geo-location on a world-wide
scale using millions of images, direct localization by landmark recognition is not
computationally tractable. Another drawback of the above methods is that they require
extremely dense sampling of the world, with at least one or two instances that have the
exact scene content as the query image. Unfortunately, currently available image
databases with world-wide coverage do not have such dense image sampling, leading to
poor localization performance.
The second research category, of image geo-location by similar image retrieval, holds
more promise for geo-location on a world-wide scale. Seminal work by Hays et al. [7][8]
demonstrated the feasibility and potential for image localization on a worldwide scale.
The method applied a single-stage unsupervised algorithm on a multi-million image
world-wide database to directly geo-locate a query image to a set of likely locations in
the world. One of the drawbacks of the method was the use of a single stage classifier,
resulting in the need for both a high-dimensional feature space to separate highly
complex classes and the need to use an unsupervised classification method for
computational efficiency. Applying an unsupervised classification method in high
dimensions is not ideal as such methods are known to suffer in classification performance
as feature dimensionality increases due to their inability to discard irrelevant feature
dimensions for a given task [8].
To improve on the previously reported methods in [7][8], we propose to use a
hierarchical image geo-location approach. From an algorithmic perspective, developing a
hierarchical geo-location framework has several advantages. Rather than resorting to the
use of a high-dimensional feature space to separate highly complex geographic classes in
one step, by implementing multiple hierarchical stages we solve multiple simpler
classification problems, each in a lower dimensional feature space to avoid the curse of
dimensionality [9]. Furthermore, a hierarchical approach has the potential for improved
geo-location accuracy by allowing for both the use of simple, unsupervised classifiers for
the initial stages, as well as more complex classifiers for the later stages. Establishing
such a hierarchical framework also makes sense from a computational point of view.
Because several models relating to specific locations can exist, comparing all models
over the vast space of all possible images may not be computationally feasible. Paring
down the search space using coarse geo-location models with rough spatial descriptors on
large databases, followed by increasingly complex descriptors applied on reduced-size
databases makes the geo-location problem much more computationally tractable.
From a classification standpoint, the problem of geo-locating an image with no geo-
spatial metadata to a city-sized geographic class is very challenging. There are many
thousands such city-sized geographic classes in the world that we need to separate.
Besides the sheer number of classes, the boundary between geographic classes (e.g. is
this Bangkok or Paris?) is extremely complex because it must divide a spectrum of scene
types (indoors, outdoors, close-up, perspective, street, highway, tall and short buildings)
that might be present in both locations.
Fig. 1.3 - Proposed hierarchical geo-location approach. The method starts out with a query image and a 3D world model representation composed of several geo-spatial databases with millions of data samples. The coarse-scale geo-location method applies a computationally efficient terrain classifier to the query image in order to reduce the search space. On the resulting reduced-size database, the medium and fine scale geo-location methods apply more complex classifiers in order to obtain improved geo-location accuracy, with eventual localization at the city to street-level scale.
In order to overcome this complex classification problem, in this thesis we propose a
novel hierarchical image classification approach that geo-locates a query image of an
urban scene to a particular city location in the world. As shown in Figure 1.3, we start out
with a query image and a 3D world model representation composed of several large
databases, namely the 6.5 million image geo-spatial image database from [7][8], a world-
wide land-coverage and terrain type database and a terrain-labeled image database. At the
coarse scale, we consider a query image as a whole by extracting rough scene content to
assign the image to a land class type, such as urban, forest, coast, country or mountain.
Once a terrain label is obtained, such as ‘urban’ for instance, we can reduce the image
and geo-spatial search space by filtering the larger database for images with geo-tags in
close proximity to urban areas. This has the effect of reducing the geo-spatial and
database search space anywhere from 70 to 90%. For the medium-scale geo-location
method, additional image content is extracted through the use of multiple low-level
features per image. To obtain geo-location accuracy at city level those features are
matched against a pre-computed feature database using a novel supervised classifier to
reduce the geographical search space by up to 99%. Once a city location is determined,
we further refine the geo-location of the query image to a pose with accuracy at the
metric scale using an improved structure from motion 3D reconstruction pipeline.
The key contributions of our work are:
1. The development of a new geo-tagged and terrain-labeled large-scale image database
to represent the 3D world model and the application to a novel coarse geo-location
method, with terrain classification results that are an improvement of 6% over previously
reported results. The coarse geo-location method has several advantages over prior non-
hierarchical approaches, namely: (a) the method is robust to noisy geo-labels, (b) the
method works in a low dimensional feature space to avoid the curse of dimensionality [9]
and (c) the method reduces the database size in order to allow for more complex follow-
on stages to be computationally tractable.
2. The development of a medium scale geo-location algorithm that improves upon
previous image retrieval techniques to geo-locate a query image to city-level accuracy.
The hierarchical coarse and medium geo-location framework was tested on a geo-tagged
6.5 million image database and demonstrated to have an improvement of 10% in geo-
location accuracy compared to previous methods applied up to city level geo-location.
3. The development of a fine-scale geo-location approach that is an order of magnitude
more computationally efficient, as well as the development of a novel method to pre-
process a city database using both aerial and ground video imagery to effectively cover
and more uniformly sample a whole city, allowing for the localization of the query image
to meter scale accuracy. The technique is demonstrated with ground video imagery as
well as aerial video imagery. Geo-location performance for the reconstructed 3D city
model is validated in a systematic manner over a 1x1km area using 20x40km of 3D Ladar
data as truth. Our contributions here are: 1. a method to geo-locate images over a
wide scale city area, incorporating both aerial as well as ground imagery for a more
complete city model, 2. systematic validation of geo-location using wide area 3D Ladar
truth data.
4. A novel method to process noisy 3D Ladar imagery collected from an operational
airborne Ladar sensor, in support of fine scale geo-location validation. The 3D Ladar
filtering method is shown to be a marked improvement in terms of image quality and
speed compared to prior methods. Our contribution to the 3D Ladar processing area was
to develop an algorithm that is a significant improvement over prior methods, with a 9x
improvement in signal-to-noise ratio, a 2-3x improvement in angular and range
resolution, a 21% improvement in ground detection and a 5.9x improvement in
computational efficiency.
The thesis is organized as follows:
- Chapter 2 discusses the coarse-scale geo-location algorithm and presents
quantitative results to demonstrate improvement over prior methods.
- Chapter 3 describes the medium scale geo-location algorithm and presents results
using the hierarchical coarse and medium scale geo-location to demonstrate
improvement over prior state of the art.
- Chapter 4 describes the fine-scale geo-location method and presents qualitative as
well as quantitative results that demonstrate a large speedup compared to prior
work as well as high geo-location accuracy, validated against 3D Ladar truth
imagery.
- Chapter 5 describes a novel method to process noisy 3D Ladar imagery followed
by a qualitative imagery comparison as well as quantitative metrics that
demonstrate a large improvement over prior state of the art.
- Chapter 6 concludes with a discussion of the significance of results obtained so
far and explores areas of future work.
In the remainder of this chapter, we give an overview of the data sources and data sets
used for the rest of the thesis.
1.1 Data
In order to achieve state-of-the-art performance in terms of image classification and geo-
location on a world-wide scale, as well as to validate geo-location using truth 3D Ladar data,
we need to leverage as many available training and truth data sets as possible. We utilize
several sources of data, with varying degrees of labeling accuracies in terms of geo-
spatial content as well as image content, namely:
1. A previously existing world-wide terrain classification geospatial database with
1km resolution, in support of the coarse localization stage.
2. A previously existing truth database of 2689 images with accurate terrain type
labels, in support of the coarse localization stage.
3. A 6.5 million image Flickr database with low to medium accuracy geo-spatial data
(city to street level accuracy), in support of both the coarse and medium geo-
location stages.
4. A new 125,000-frame ground-collected 1Hz video database with accurate GPS metadata
(~10m), using a 60 Megapixel multi-camera system, in support of the fine geo-
location stage.
5. A new 72,000-frame aerial-collected 2Hz video database, with highly accurate GPS/INS
(~2m) metadata using a 66 Megapixel multi camera system integrated into an
airborne platform, in support of the fine geo-location stage.
6. A new 20x40km 3D Ladar map at 1 meter resolution with sub-meter geo-location
accuracy, in support of truth validation of the fine geo-location stage.
7. A new 40 GB 3D Ladar data point database, corresponding to 2 billion raw 3D
points to showcase the results of an improved algorithm for processing 3D Ladar
data collected using an operational airborne 3D Ladar sensor, in support of
validation of the fine-scale geo-location results.
Chapter 2
Coarse Scale Geo-location
Chapter Summary. We present the coarse geo-location method, in which we geo-locate a query image to a particular region of the world by classifying the terrain type in that image based on image content. To achieve the goal of image geo-location by terrain classification, we first create a 3D world model representation composed of a large training database of geo-tagged, terrain-labeled images. This database is created by merging knowledge from three publicly available databases: a geo-spatial terrain type and land coverage database, a 6.5 million image database that is only geo-tagged, and a database of terrain-labeled images. We develop a coarse geo-location method that uses the generated 3D world model, test it on a hold-out set of 5000 images, and demonstrate an improvement over the current state of the art in terrain classification, with over 91% terrain classification accuracy. The resulting terrain label for the image is used to reduce the geographical search space and segment the original large database by filtering for images with geo-tags in close proximity to regions of the predicted terrain type. The reduction in search space allows the use of more complex medium-scale and fine-scale geo-location classifiers that are both accurate and computationally tractable.
2.1 Background and Related Work
There has been a significant interest in applying machine learning methods to scene
classification and image retrieval. Most of these methods use feature vectors such as
SIFT [10], texton dictionaries [11], color histograms, Gist [12] or a combination of these
[8], with feature vector dimensions on the order of 100-1000. When the training database
is on the order of tens of thousands of examples, the most common and reliable method
to train is using Kernel Support Vector Machines (SVMs). However, the problem of
image geo-location on a world-wide scale requires much larger image training databases,
on the order of millions of images, to have enough sampling of the various locations
around the world.
For computational scalability, retrieval methods that use millions of training samples
typically use the K-nearest neighbor (KNN) training algorithm in a feature space defined
by a few image feature types and then use those nearest neighbors for various tasks, such
as object recognition [13][14][15][16], image completion [8] and object classification.
Nearest neighbor techniques are attractive in that they are trivially parallelizable, require
no training, have good classification performance and perform well from a computational
perspective with query complexity that scales linearly with the size of the data set.
Nearest neighbor methods rely on feature vectors of low to medium dimensionality
(100-200), either using SIFT [10], texton vocabularies [11], Gist [12] or a combination of
these [8]. Given a new image, the same feature vector is computed and the nearest
neighbors in feature space are found from the training database. Using those neighbors, a
majority rule is implemented to determine the label for the new query image. These KNN
methods tend to work well in low- to medium-dimensional spaces, but tend to suffer
as feature dimensionality increases [8][9]. The reason for this is that nearest neighbor
methods lack one of the fundamental advantages of supervised learning methods, which
is the ability to discard irrelevant feature dimensions for a given task [8].
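As a concrete illustration, the retrieve-and-vote scheme described above can be sketched in a few lines. The brute-force distance computation, the toy 2D features and the terrain labels below are all invented for the example; a real system would use Gist or texton features of 100-200 dimensions and an approximate nearest-neighbor index.

```python
import numpy as np
from collections import Counter

def knn_majority_label(query_feat, db_feats, db_labels, k=5):
    """Label a query by majority vote over its k nearest neighbors
    in feature space (Euclidean distance, brute force)."""
    # Distance from the query to every database feature vector
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    # Indices of the k closest database entries
    nearest = np.argsort(dists)[:k]
    # Majority rule over the neighbors' labels
    votes = Counter(db_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy example: 2D features, two terrain classes
db_feats = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                     [5.0, 5.0], [5.1, 4.9]])
db_labels = ["forest", "forest", "forest", "urban", "urban"]
print(knn_majority_label(np.array([0.05, 0.05]), db_feats, db_labels, k=3))
# -> forest
```

Note that nothing here is "trained": the database itself is the model, which is why the method parallelizes trivially but cannot down-weight feature dimensions that are irrelevant to the task.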
In the context of image geo-location, the KNN algorithm has been used in seminal work
by [8] as part of a single stage algorithm that extracted a single feature vector per image
from a 6.5 million image database. To separate the highly complex city-sized classes
present in image geo-location, the feature vectors were high dimensional, of size close to
3000. Given a query image and its associated feature vector, KNN was applied in this
high-dimensional space to retrieve the k-nearest images, thereby directly obtaining the k
most likely candidate geo-locations for the query image [8]. As noted beforehand, KNN
tends to suffer as feature dimensionality increases, so working in a 3000 dimensional
feature space is not ideal. Furthermore, [8] only used data from a geo-tagged image
database and did not use additional knowledge to penalize unlikely matches.
Considering prior research work, there are several lessons to be learned, namely: 1. for
databases on order of millions of images, KNN is one of the only computationally
tractable approaches and 2. KNN performance tends to suffer as dimensionality
increases. One immediate conclusion that can be drawn is that in order to get good
classification performance, we need to use KNN in a lower dimensional space.
Considering that there are more than 3400 cities in the world with population over one
hundred thousand [17], accurate classification at the city level or even street-level using a
low-dimensional feature vector becomes challenging, if not impractical. Thus, we need to
consider solving a simpler classification problem where the world is broken into fewer
class types. This again motivates our initial proposal to implement a hierarchical image
geo-location approach.
One possible solution that reduces the world into a few general classes is to classify
images by terrain type. One advantage of classifying by terrain type is that terrains look
physically different, and thus imagery of the terrains will appear different and have
discernible attributes. For example, deserts look substantially different from forests,
which on the whole look different from urban areas. This good separation in attributes
makes the problem of terrain classification from images very attractive. Another
advantage of classifying by terrain type is that landscapes have a fairly contiguous
distribution, with low spatial variance in terrain types. This contiguous, clumped
distribution of land types is advantageous in several ways, namely: 1. most images should
have no more than one or two terrain types, making accurate classification of the whole
image as a single terrain class feasible, and 2. considering the low spatial variance in terrain
labels, low accuracy geo-tagged images with no terrain labels might be used to accurately
train a terrain classifier, enabling improved classification. Once we have determined that
an image belongs to a particular terrain type, we can outright reject large contiguous
regions of the world. This in turn reduces our search space, allowing for ever-more
complex algorithms to be used for the follow-on image geo-location stages.
It is noteworthy to mention that research by Hays in [8] transferred information from a
terrain labeled GIS database to show that the terrain label indexed from the computed
geo-location estimate correlated well with the actual image content. However, the
research presented in [8] did not use the additional information for training, but rather
showed the existence of high correlation between the geo-location derived terrain label
and image content; this only demonstrated that information transference from one GIS
database to another is beneficial, but that benefit was not exploited. Our proposed
approach is significantly different; for training, we take advantage of the heavy
correlation between scene content and the GPS-derived terrain label to obtain an improved
geo-location estimate.
To achieve the goal of image geo-location by terrain classification, we have to resolve the
following two problems: 1. determine terrain type given geo-spatial coordinates, and 2.
recognize the most prevalent terrain type(s) from a single image. Once we have a solution to
these two problems, an algorithmic chain becomes apparent:
1. Starting with a single image, determine the most likely terrain type(s).
2. Mark areas on the globe that belong to or are near to such terrain types, as likely
candidate regions.
3. Take the world-wide geo-tagged image database and down-select to an image
database composed of images with geo-tags that fall within the defined candidate
regions.
4. Pass this reduced database to a next processing stage for further geo-location
refinement.
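The four steps above can be sketched as a simple filtering pipeline; the terrain classifier and region lookup here are stubs standing in for the components developed later in this chapter, and the `Box` region class is a hypothetical stand-in for real geo-spatial regions:

```python
def geo_filter_pipeline(query_image_feat, db, classify_terrain, regions_for):
    """Down-select a geo-tagged image database to entries whose geo-tags
    fall inside candidate regions for the query's most likely terrain."""
    terrain = classify_terrain(query_image_feat)          # step 1
    candidate_regions = regions_for(terrain)              # step 2
    reduced = [img for img in db                          # step 3
               if any(r.contains(img["lat"], img["lon"]) for r in candidate_regions)]
    return terrain, reduced                               # step 4: pass on

class Box:
    """Hypothetical lat-lon bounding box used as a stand-in for a region."""
    def __init__(self, lat0, lat1, lon0, lon1):
        self.lat0, self.lat1, self.lon0, self.lon1 = lat0, lat1, lon0, lon1
    def contains(self, lat, lon):
        return self.lat0 <= lat <= self.lat1 and self.lon0 <= lon <= self.lon1

db = [{"lat": 42.3, "lon": -71.1}, {"lat": 0.0, "lon": 20.0}]
terrain, reduced = geo_filter_pipeline(
    None, db,
    classify_terrain=lambda f: "urban",
    regions_for=lambda t: [Box(40, 45, -75, -70)])
print(terrain, len(reduced))   # → urban 1
```

Only the images surviving the region filter are handed to the more expensive medium-scale stage.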
In the next section, we discuss data collections and prior research in support of
developing the above algorithmic method. In particular, we first focus on solving the
problem of determining terrain type given geospatial location, followed by recognition of
terrain types from a single image.
2.1.1 Determining Terrain Type Given Geo-spatial Coordinates
Towards the first goal of determining terrain type given geo-spatial coordinates, we need
training data that has both geo-spatial information as well as annotation of terrain types at
that geo-spatial location. This can be obtained from geological land surveys, with a
multitude of surveys available online from both the United States Geological Survey
(USGS) as well as the European Space Agency (ESA). One example of a geological
survey is the USGS “Global Land Cover Characteristics Data Base” (GLCC) world-wide
land coverage database as well as the US National Land Cover Dataset (NLCD), with
examples shown in Figure 2.1 [16,17]. Listed in Table 2.1 alongside the coverage maps in
Fig. 2.1 are some percentage breakdowns by terrain type. An example coverage area in
the Washington DC metro area is also shown in Fig. 2.1-C, where roughly 150 square km
is subdivided as a function of latitude and longitude coordinates into several
categories.
Fig. 2.1 - Examples of land coverage and terrain type maps. A) Ecosystems map based on USGS "Global Land Cover Characteristics Data Base (GLCC World)" at a 30 arc second (~1km) resolution using 17 terrain categories, B) National Land Cover Dataset 2011 (NLCD) for the contiguous 48 US states with spatial resolution of 30m and C) NLCD zoomed-in view of Washington DC area.
Table 2.1
Terrain Coverage Numbers of 48 Contiguous States (NLCD, 1992)
From the world-wide terrain classification data in Figure 2.1-A, we observe that terrain
labels exhibit low spatial variance, where large contiguous regions are labeled using a
single terrain type. This property might be exceedingly helpful, considering that we have
a large amount of non-terrain labeled image training data that has low accuracy geo-
spatial metadata, as it suggests that terrain classification performance might be insensitive
to the accuracy of an image’s geo-tag (e.g. we can accurately infer the terrain type in an
image solely based on its low-accuracy geo-tag). This key insight, that low-accuracy geo-tagged
images may be used for training a terrain classifier, will play a key role in the
development of our proposed terrain classification algorithm.
The data in Table 2.1 show that terrain types are distributed somewhat unevenly, with a
large percentage of terrain labeled into the broader categories of forest and country
areas, and low percentage representation of coastal/water regions and urban areas. At
first glance, this might appear of concern in terms of how much of an image database
reduction we can obtain if we classify an image into one of these highly represented
classes. As further explained in Section 2.2, the image database density distribution tends
to cancel out this effect, as many more images are collected in urban and coastal
environments, with each final terrain class being more or less evenly represented in the
image database. By having a more even representation of each terrain class in the image
database, we can maintain a predictable reduction in database size that is on the order of
the number of terrain classes, allowing for more complex algorithms to be used in the
next geo-location stages.
While these databases provide a direct method to index geo-spatial data to terrain type,
we still need a training database and classification method to help us recognize terrain
types from our query image.
2.1.2 Recognizing Terrain Types from an Image
While imagery of the same type of terrain can vary considerably in its details, on the whole
such imagery will look very similar when viewed in its entirety. That is, if we
ignore the fine detail of an image, we should still be able to understand the context of an
image. This is evident in Figure 2.2 where the type of terrain can still be deduced despite
the fact that the image has been significantly degraded by noise or blurred out. One
approach to the study of environmental scenes has been to model an image using a
holistic representation [20][21][22]. This area of research has been the subject of intense
study over the last decade, and has grown to model the human visual system with cross-
disciplinary applications in both the computer vision as well as cognitive science
community. The most notable development is the modeling of scene understanding by describing
an image by its spatial envelope, which captures the gist of the scene [21][22];
researchers have come to know this descriptor as the Gist feature.
Fig. 2.2 - Images afflicted with various types of noise demonstrating that we can still deduce terrain type despite loss of fine detail. A) Additive speckle applied to an urban scene B) Severe blurring applied to a forest scene.
Images have long been known to have some interesting spectral properties, though these
were not quantified rigorously until the seminal work of Aude Oliva and Antonio
Torralba [21]. The work models the shape of a scene by holistically evaluating the so-
called spatial envelope of a scene in a mathematical and computational way. The model
is a multidimensional space that defines perceptual concepts of naturalness, openness,
roughness, expansion, and ruggedness that describe the dominant scene structure. The
primary goal of the study was to define roughly what a scene actually is, that is,
the gist of a scene, hence the name Gist feature. Figure 2.3 describes how the Gist
feature is computed using orientation, color and intensity information.
Fig. 2.3 - Computing a Gist feature from an image using orientation, color and intensity information over the whole image at multiple scales. Orientation information is captured using Gabor filters at 4 angles (0,45,90,135) on 4 scales, leading to 16 sub-channels. Color information is captured using red-green and blue-yellow center surround each with 6 scale combinations, leading to 12 sub-channels. Intensity is captured using dark-bright center surround with 6 channel combinations, leading to 6 sub-channels, for a total of 34 sub-channels. Each channel is encoded by a 16 bin histogram, leading to a feature vector of length 544.
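The sub-channel bookkeeping in the caption above can be verified with a few lines of arithmetic:

```python
# Orientation: 4 Gabor angles (0, 45, 90, 135 degrees) x 4 scales.
orientation = 4 * 4   # 16 sub-channels
# Color: red-green and blue-yellow center-surround, 6 scale combinations each.
color = 2 * 6         # 12 sub-channels
# Intensity: dark-bright center-surround, 6 scale combinations.
intensity = 1 * 6     # 6 sub-channels

channels = orientation + color + intensity
print(channels)        # → 34

# Each sub-channel is encoded by a 16-bin histogram.
print(channels * 16)   # → 544, the Gist feature length
```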
Torralba et al. [21] utilized Gist to classify images into natural and man-made semantic
groups, with each semantic group further split into 4 classes, namely coast, country,
forest, mountain for natural scenes and highway, inside city, street and tall-buildings for
man-made scenes. In our research, we decided to use the Gist feature for geo-location
since it provides good terrain classification accuracy, with the Gist descriptor fast to
compute, leading to a fast training and testing phase, necessary requirements for
algorithm scalability to large data sets. Since we are implementing a multi-stage
classifier, we choose a lower number of classes for the coarse classification stage, namely
5 instead of Torralba’s 8 original classes. Our coarse geo-location classes are “coast,”
“country,” “forest,” “mountain,” and “urban,” with the urban class encapsulating
Torralba’s man-made semantic group (highway, inside city, street and tall building).
Example images of these terrain classes are shown in Figure 2.4, drawn from the Torralba
et al. [21] image-terrain label database annotated with LabelMe [23].
Fig. 2.4 - Examples of terrain-labeled images from the Torralba et al. [21] truth database.
2.1.3 Towards a Coarse Scale Geo-location Approach
We now have the necessary methods to go from an image to a rough geo-location on a
world-wide scale by first going from an image to a terrain type and then going from a
terrain type to a geo-spatial location. We leverage the Gist feature as designed by
Torralba et al. [21] to achieve the first goal of recognizing terrain types from our query
image. Once we have a terrain type, we can assign the image as belonging to certain
regions of the globe through the use of a world-wide terrain coverage classification
database. While there is quite a lot of previous research work in the area of image to
terrain classification, and extensive work on geo-spatial terrain classification, research on
the combined field of image geo-location on a world-wide scale is currently in its
infancy, with the most notable research work done by Hays et al. [5,6]. Indeed, we build
upon Hays’ work by using his 6.5 million image Flickr database as well as some of the
image features and matching techniques that he used, achieving improved terrain
classification over the current state of the art with reasonable computational
performance considering the large-scale database used for training. Figure 2.5 shows the
geospatial distribution of the Flickr image database, which we kindly obtained from
James Hays for our research work.
Fig. 2.5 - Geographical distribution of photos from the Flickr database in [8]. Photo locations are shown in cyan, with density overlaid using a jet color-map (blue indicates low density, yellow medium, red high density).

Our contribution to Hays et al. [7] and Torralba et al.’s [21] research is to improve
classification performance by developing a method to robustly upgrade Hays et al.’s [7] 6.5
million geo-tagged image database with terrain labels. For his work, Hays only used the
geo-tagged image database to directly match a new query image to the closest K images
in his feature space using a K-Nearest Neighbor classifier [24]. His method does not
make use of knowledge that could be gained from a geo-spatial land coverage database,
nor does it use information from a truth database of image to terrain labels, information
that might help by making the geo-location problem more robust to noisy GPS labels. For
our research, we propose to use additional prior knowledge sources in order to improve
both geo-location and terrain classification performance.
We propose an improved method where we first probabilistically label the 6.5 Million
geo-tagged image database used in [7] with terrain classification labels using two
additional knowledge sources, namely: a geo-spatial land-coverage database and a truth
terrain-labeled image database from [21]. The enhanced image database is used to
classify a hold-out set of query images, with a significant improvement over previous
state of the art in terrain classification performance, which in turn enables improved geo-
location capability. Section 2.2 describes the setup of the coarse geo-location stage,
while Section 2.3 describes in detail our coarse geo-location method.
2.2 Coarse Scale Geo-location Approach
The coarse geo-location method builds upon research on image terrain classification from
[21] as well as image geo-location research from [8]. Our method uses the same image
database from [8], but improves on the geo-location approach by not only avoiding the
problem of KNN in high dimensions but also using additional data sources to enrich our
3D world model representation in order to penalize unlikely matches. Starting with the
6.5 million image geo-tagged database from [8], we develop a method to probabilistically
annotate terrain labels to each of the images by combining knowledge from two
additional databases: a world-wide land-coverage and terrain-type geospatial database
and a 2689-image terrain-labeled truth database from [21]. By adding these new data
sources, we are now able to penalize unlikely matches that might otherwise happen with
an image database that only has geo-tags. For instance, the correct recognition of a
coastal image as being a coastal scene would make the image a highly unlikely match to
an image with a geo-tag from an inland area. Thus, our method can robustly discount
images with noisy geo-labels and prevent such images from negatively impacting geo-
location performance.
We create this probabilistically labeled geo-tagged/terrain data set using a two-stage
training algorithm. In the first stage, we use the geo-tags from the 6.5 million image
database, along with the world-wide land coverage geo-labeled database, to weakly label
the images as belonging to a subset of 5 terrain classes. In essence, we are using the geo-
spatial metadata embedded with the image to determine a terrain label probability prior.
In the second stage, we extract feature vectors for each of the images in a 6.5 million
image database as well as from a truth, terrain-labeled, 2689 image database. We
compute a probability that an image falls into a certain terrain class by comparing the
feature vector associated with that particular image to the feature vectors in the truth
terrain-labeled database using a KNN approach. The end-result is an enhanced 6.5
million image database that is not only geo-tagged but also probabilistically labeled by
terrain type. For classification, we again use a KNN approach to compute the most likely
terrain label given a query image. The outcome is a terrain label that helps reduce our
search space to a subset of the original multi-million image database. Figure 2.6
summarizes the overall algorithmic flow for the proposed algorithm.
In the first training stage, we start labeling the world using 5 terrain types, namely: coast,
country, forest, mountain, and urban. We primarily use the USGS GLCC Database [18]
to assign a subset of labels to each 1km x 1km land tile. From this database, we
determine layers for four of our five classes, namely urban, forest, country and coast.
Table 2.2 summarizes the mapping from the USGS Land Use labeling [25], containing 24
labels, to the reduced set of 4 labels (urban, forest, country and coast).
Fig. 2.6 - Block diagram of coarse-scale geo-location algorithm. Three databases are used to create a multi-million image terrain labeled, geo-tagged database. The training is divided into two stages. The first stage determines for each image a probability prior of a label, given the image’s geo-tag. The second stage extracts a feature vector from a terrain-labeled image truth database and a multi-million image database in order to determine the conditional probability of an image being a particular terrain class given its feature vectors. The enhanced terrain-labeled, geo-tagged database is used to classify a new query image to obtain a terrain label, resulting in a reduction in the size of the image database. The reduction in search space allows the usage of more complex medium-scale and fine-scale geo-location classifiers that are both accurate and computationally tractable.
Table 2.2
Mapping from USGS land-use terrain classes to a reduced set of terrain classes
To reduce gridding errors and improve terrain classification for wide field-of-view
images, or images that might have low-accuracy geo-tags, we allow each 1km x 1km tile
to have multiple labels. We apply a 1km image dilation operation for urban, forest and
country label regions, and a 3km dilation operation for coastal regions that are derived
from sea-land contour lines. Since the USGS GLCC Database does not contain labels for
our mountain regions class, we extract this information from UNEP, Mountains and Tree
cover in Mountain Regions 2002 Database [19]. Figure 2.7 shows the UNEP-Mountains
layer used for our method. Considering that mountains are a landscape element that
might feature prominently in images taken from neighboring non-mountainous areas, we
perform a 5km dilation operation on the mountains layer obtained from [19]. The result
of this first training stage is a world-wide multi-label image at 1x1km resolution, where
all of the 1x1km pixels are assigned to a subset of terrain labels, producing a terrain
labeled geo-spatial database henceforth referred to as GeoLabel.
Fig. 2.7 - United Nations Environment Programme, Mountains and Tree cover in Mountain Regions 2002 Database. Colors represent various sub-classes of mountain regions [19].
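The dilation operations described above can be sketched with a small binary-dilation routine; at the ~1km grid resolution of the GLCC data, one pass with a 3x3 kernel grows a mask by roughly one pixel (~1km), so the 5km mountain dilation corresponds to five passes. The toy grid and tile below are illustrative only:

```python
import numpy as np

def dilate(mask, passes=1):
    """Binary dilation with a 3x3 square kernel, applied `passes` times.
    (np.roll wraps at grid edges; acceptable for this toy example.)"""
    m = mask.astype(bool)
    for _ in range(passes):
        grown = m.copy()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                grown |= np.roll(np.roll(m, dy, axis=0), dx, axis=1)
        m = grown
    return m

urban = np.zeros((9, 9), dtype=bool)
urban[4, 4] = True                     # a single 1km x 1km urban tile
print(dilate(urban, passes=1).sum())   # → 9  (3x3 neighborhood, i.e. ~1km growth)
print(dilate(urban, passes=5).sum())   # → 81 (fills the whole 9x9 toy grid)
```

In practice a library routine such as `scipy.ndimage.binary_dilation` would be used on the full world-sized masks.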
The newly created GeoLabel terrain coverage database is now used to create probabilistic
terrain class priors on the 6.5 million geo-tagged image database, where the probability of
a terrain label for an image given that image’s geo-tagged latitude-longitude information
can be derived as follows:
P(l_c \mid lat_i, lon_i) = \frac{GeoLabel_c(lat_i, lon_i) + \varepsilon}{\sum_{c'=1}^{C}\left[GeoLabel_{c'}(lat_i, lon_i) + \varepsilon\right]}   [Eq. 2.1]
where C=5 is the number of classes, i is the ith image in the 6.5 million image database,
GeoLabel() is the terrain labeled spatial database indexed in lat-lon coordinates and ɛ is a
small value to prevent the conditional probability of any label from being set to zero
(this prevents images from being completely ignored in the case where an image has a noisy
geo-tag, or the case when an image has an accurate geo-tag but contains land coverage types
other than those predicted from the GeoLabel() database). We now have completed the
first training stage, in which we obtain a terrain classification prior by probabilistically
labeling each image given its geo-tag. Figure 2.8 provides pseudo-code with details on
the data sets and algorithmic steps that are part of the first training stage.
First Stage Training Pseudo-code:
1. Start off with GLCC land-cover database image (gusgs2_0ll.img) that has 24 land-cover types.
2. Use translation index described in Table 2.2 to create a new image with 4 general land-cover types.
3. Create binary masks for the 4 remaining classes (forest, country, coastal, urban).
4. Do 1km image dilation operation on the country, forest and urban masks using a 3x3km square kernel.
5. Create a coastal contour map. Do 1km image dilation and 4km image erosion on coastal regions by iterating multiple times using a 3x3km square mask. Save the difference image between the dilated binary mask and the eroded binary mask to define a coastal contour map.
6. Create a mask in lat-lon of mountainous regions from the UNEP Mountains Data set using Level 8 data (0.6km resolution). Do 5km image dilation of the mountain mask using 5 applications of a 3x3km square mask.
7. Create the GeoLabel database, a 5-deep image stack of dilated masks.
8. Use GeoLabel to compute the probability prior as a function of lat-lon.
Fig. 2.8 - Pseudo-code for the first training stage. The pseudo-code describes in detail the steps needed to combine information from two geo-spatial databases to obtain a GeoLabel database with 5 terrain classes, which is used to obtain a terrain label prior conditional on lat-lon location.
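Step 8 of the pseudo-code, the terrain-label prior of Eq. 2.1, might be implemented along these lines; as a simplifying assumption, GeoLabel is represented as a stack of binary masks and the lat-lon-to-pixel mapping is reduced to direct array indexing:

```python
import numpy as np

EPS = 1e-3  # small value preventing any label prior from being exactly zero

def label_prior(geolabel, row, col):
    """P(label | lat, lon): normalized mask hits at the image's geo-tag.
    geolabel: (C, H, W) boolean stack of dilated terrain masks."""
    hits = geolabel[:, row, col].astype(float) + EPS
    return hits / hits.sum()

# Toy GeoLabel stack: 5 classes on a 4x4 grid; tile (1, 1) is both
# coastal (class 0) and urban (class 4), as happens for seaside cities.
geolabel = np.zeros((5, 4, 4), dtype=bool)
geolabel[0, 1, 1] = True
geolabel[4, 1, 1] = True

prior = label_prior(geolabel, 1, 1)
print(np.round(prior, 3))  # coastal and urban split nearly all the probability mass
```

Because of the epsilon term, even a class with no mask hit retains a small nonzero prior, so a noisy geo-tag cannot fully eliminate the correct label.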
Figure 2.9 depicts the results from pseudo-code step 2 of Figure 2.8, where we create
4 general land types from the GLCC 24-type database, using the translation index
provided in Table 2.2.
Fig. 2.9 - Results of processing the GLCC land-cover database into 4 general land-types, namely coastal (blue), country (cyan), forest (yellow) and urban (red). A) Results depicting coverage for the whole world. B) Zoomed-in view depicting coverage for the north-eastern US.
Figure 2.10 captures the image masks created in pseudo-code steps 3-6 of Figure 2.8.
From the images, we note that coastal and urban regions represent a low percentage of
the globe, while large regions are classified as country, forest and mountain. Table 2.3 adds
further details on the percentages of land area, as well as percentages of images in the
database that fall into each of the five classes. The percent of land area covered varies
significantly, from 0.5% for urban to 7% for coastal and up to 64% for country. Thus,
classification into certain classes, such as urban or coastal, can lead to reductions upwards
of 200x in search area, while other classes, such as country, lead to modest reductions
closer to 2x. Nonetheless, classification into any of these land-types helps reduce the overall
search space, meeting our goal for hierarchical geo-location. Ideally, the classes would be
more uniformly distributed, with a uniform reduction of 5x (given 5 classes), which is not
the case in terms of land-area percentage. However, from an algorithm performance
perspective, uniform class distribution in terms of land area is not as important
as uniform class distribution in terms of image database size. The goal of hierarchical
geo-location is to progressively reduce image database size in order to allow for ever-
more complex algorithms to be applied to each reduced set of images. As detailed in
Table 2.3, the class distribution in terms of the number of images in the database is more
evenly balanced, with coast, forest and mountain each accounting for about a ¼ of the
database, and country and urban accounting for about ½. The obtained class distribution
in terms of database size allows anywhere from a 2x to 4x reduction in search space.
Fig. 2.10 - Results of processing the GLCC land-cover database into 5 general land-types. A) Coastal areas mask for the whole world, B) Zoomed-in view of the coastal mask over the continental US, C) Urban mask over the whole world, showing very few large hot-spots of urban areas, D) Zoomed-in view of the urban mask over the continental US, depicting the major cities, with crumb-trails of urban areas along major highways, E) Country mask, F) Forest mask, G) Mountains mask, H) Superposition of all 5 masks in a 5-bit image (32 distinct colors) with a jet color map (blue-yellow-red), using the following least significant bit to most significant bit order: mountain, country, forest, coast, urban. Open-sea is represented as dark-blue, followed by barren mountains, mountains and country areas, etc. Urban-only areas are represented as value 16 (green), with yellow and red representing urban areas with multiple labels, where red represents regions that are both urban and coastal. J) Histogram of land-type combinations for the chosen 5 classes, leading to 32 possible combinations.
Table 2.3
Geo-Label database statistics
By creating this new GeoLabel database, we have completed the first processing stage to
obtain a probability prior for each class conditional on the geographic coordinates. We
now describe the second training stage, where we attempt to find a probability for
each class label conditioned on the feature vectors extracted from each image. Towards
this goal, we first extract feature vectors for images in both the 6.5 million image geo-
tagged database as well as for images in the terrain-labeled truth database from [21]. We
utilize the Gist feature descriptor [21], which has been shown to work well for terrain
classification and scene categorization [7][8]. Using the extracted Gist feature vectors,
we initialize a KNN classifier on the 2689 image terrain-labeled database from [21]. For
each of the 6.5 million probabilistically labeled images, we run the pre-computed Gist
feature through the KNN classifier. Instead of determining a single label based on the
typical KNN majority-rule, we instead take the K nearest neighbors and their associated
truth labels to find the probability of a terrain label given the image’s Gist feature vector,
Fi:
P(l_c \mid F_i) = \frac{1}{K}\sum_{j=1}^{K}\mathbf{1}\left[\,label(T_j) = l_c\,\right]   [Eq. 2.2]
where K is the number of nearest-neighbor feature vectors in L2 distance, T_j is the j’th
nearest Gist vector from the truth database T to the Gist vector of image i, and i is the ith image
in the 6.5 million image database. We now combine the information from the first and
second training stages to express the probability of a label for an image given its geo-tag
and Gist feature as:
P(l_c \mid lat_i, lon_i, F_i) = P(l_c \mid F_i) \cdot P(l_c \mid lat_i, lon_i)   [Eq. 2.3]
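The second-stage label probability (Eq. 2.2) and its combination with the geo-tag prior (Eq. 2.3) can be sketched as follows on a toy truth database; the 2-D features and two-class setup are illustrative stand-ins for the real 80-dimensional Gist vectors and 5 terrain classes:

```python
import numpy as np

def knn_label_prob(feat, truth_feats, truth_labels, k, n_classes):
    """Eq. 2.2: fraction of the K nearest truth vectors carrying each label."""
    dists = np.linalg.norm(truth_feats - feat, axis=1)
    nearest = np.argsort(dists)[:k]
    prob = np.zeros(n_classes)
    for j in nearest:
        prob[truth_labels[j]] += 1.0 / k
    return prob

# Toy truth database: 2-D features, labels 0 ("coast") and 1 ("urban").
truth_feats = np.array([[0.0, 0], [0.1, 0], [0.9, 0], [1.0, 0]])
truth_labels = [0, 0, 1, 1]

p_feat = knn_label_prob(np.array([0.2, 0]), truth_feats, truth_labels, k=3, n_classes=2)
p_geo = np.array([0.5, 0.5])   # prior from Eq. 2.1 (here: uninformative)
p_combined = p_feat * p_geo    # Eq. 2.3 (unnormalized product)
print(p_feat)                  # two of the three nearest neighbors are "coast"
```

When the geo-tag prior is informative, the product in Eq. 2.3 suppresses labels that disagree with the image's geo-tag, which is precisely how noisy matches are penalized.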
This probabilistically labeled geo-tagged/terrain data set composed of feature vectors,
associated terrain labels and geo-tags serves as the improved representation of the 3D
world model. Next, we train an additional classifier on the world-model database and test
on a hold-out set of images, for which we have both geo-tags and terrain labels. Similar
to [8], we chose a KNN classifier to make the classification problem computationally
tractable. For each test image, we compute a Gist feature and use the KNN classifier
with K’ nearest neighbors (note that this is a different parameter than K used in Eq. 2.2
above). Unlike [8], which used KNN to label the query image by neighbor majority rule,
we choose the label for the image by computing the label likelihood over the
neighborhood Gist features as follows:
L(l_c) = \sum_{j=1}^{K'} P(l_c \mid lat_j, lon_j, F_j)   [Eq. 2.4]
where K’ is the number of neighbors used in KNN and j indexes the jth nearest Gist vector
from the 6.5 million image database to the Gist feature derived from the query image. For
reference, the coarse geo-location algorithm is shown in pseudo-code in Figure 2.11.
Fig. 2.11 - Pseudo-code for the proposed coarse geo-location algorithm.
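The classification rule of Eq. 2.4 can be sketched as: retrieve the K' nearest database images, sum their per-class label probabilities, and take the arg-max. The toy database rows below stand in for the probabilistically labeled entries of the enhanced Flickr database:

```python
import numpy as np

def classify(query_feat, db_feats, db_label_probs, k_prime):
    """Eq. 2.4: sum P(label | geo-tag, F) over the K' nearest neighbors,
    then pick the most likely terrain class."""
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k_prime]
    likelihood = db_label_probs[nearest].sum(axis=0)
    return int(np.argmax(likelihood)), likelihood

# Toy enhanced database: 4 images, each with a per-class probability row
# (class 0 = "coast", class 1 = "urban").
db_feats = np.array([[0.0, 0], [0.1, 0], [2.0, 0], [2.1, 0]])
db_label_probs = np.array([[0.9, 0.1],   # mostly "coast"
                           [0.8, 0.2],
                           [0.1, 0.9],   # mostly "urban"
                           [0.2, 0.8]])

label, lik = classify(np.array([0.05, 0]), db_feats, db_label_probs, k_prime=2)
print(label)   # → 0  (the "coast" class wins)
```

Summing probabilities rather than counting majority votes lets uncertain neighbors contribute fractionally instead of casting a full vote.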
Compared to the standard KNN terrain classification approach in [21] where KNN is
trained on a 2689-image terrain-labeled database, our proposed terrain classification method
utilizes knowledge from multiple, much larger databases. This might allow our KNN
method to better distinguish complex boundaries, as we now have 6.5 million image
samples for our training database, which is over three orders of magnitude more data for
training compared to the 2689 terrain label database in [21]. Furthermore, compared to
the direct, single step geo-location method developed by Hays in [8], our method can
penalize images with incorrect geo-tags, leading to robust classification in the presence of
databases with noisy geo-tags (e.g. an image of a coastal area with a geo-tag far inland).
2.3 Coarse-Scale Geo-location Experimental Setup and Results
The coarse geo-location method was tested on a hold-out set of 5000 test images, 1000
images per terrain class. We extracted a Gist feature for each image, using Gabor filters
at 4 angles and 4 scales (16 channels). Color information is captured using red-green and
blue-yellow center surround, each with 6 scale combinations, leading to 12 sub-channels
[21]. Intensity is captured using dark-bright center surround with 6 channel combinations,
leading to 6 sub-channels, for a total of 34 sub-channels. Each channel is encoded by a 16
bin histogram, leading to a feature vector of length 544. Using the above procedure,
feature vectors were also computed for images in Torralba’s 2689-image terrain-labeled
database as well as for the entire 6.5 million image Flickr database. To reduce
dimensionality and avoid sparsity concerns, principal component analysis (PCA) was
applied to the training database. For computational reasons, only a subset of the training
database was used for PCA analysis, namely Torralba’s entire 2689 image database as
well as a 50,000 random image selection from the multi-million image Flickr database.
Feature dimensionality was reduced from 544 to 80 dimensions. Based on results from
[12], we chose K=19 as a good value for creation of the geo-tagged/terrain labeled
database. By cross-validation, we found that a parameter of K’=200 worked well for the
k-nearest neighbors used to predict the label of a test image. We compare the results of
our method, shown in Table 2.5, to the baseline method as detailed in [21], shown in
Table 2.4. The baseline had an average accuracy of 85.5%, while our method had an
accuracy of 91.3%, an improvement of 5.7 percentage points over the baseline. In particular, our
method was able to much better classify coastal areas with an 11.3% improvement over
the baseline and country areas with a 6.4% improvement. The improved results
demonstrate that the proposed method can leverage the additional database information to
improve accuracy. Also, the high level of correct classification makes the proposed
algorithm suitable for use as a first stage in hierarchical classification chain.
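The PCA reduction described above can be sketched with an SVD-based implementation; the 544 and 80 dimensions follow the text, while the random data here is a stand-in for real Gist vectors drawn from the training subset:

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA via SVD; returns the data mean and top principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    """Project centered data onto the retained principal directions."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
train = rng.standard_normal((500, 544))   # stand-in for 544-D Gist vectors
mean, comps = pca_fit(train, n_components=80)
reduced = pca_transform(train, mean, comps)
print(reduced.shape)   # → (500, 80)
```

Fitting on a 50,000-image subset and then projecting all 6.5 million feature vectors, as the text describes, keeps the eigen-decomposition cheap while still letting every image benefit from the reduced 80-dimensional space.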
Table 2.4.
Confusion Matrix for baseline method. Numbers in red denote correct classification
Table 2.5.
Confusion Matrix for proposed coarse-scale geo-location method.
Numbers in red denote correct classification
In summary, we have developed a method to coarsely geo-locate images on a world-wide
scale by classification of terrain types. Results indicate 91.26% correct classification,
with a 71% weighted average geo-spatial search space reduction and upwards of 99.5%
reduction for urban query images. In terms of image database reduction, the algorithm
resulted in an average 66% reduction, and upwards of 76% for mountain query images.
The new method provides a significant improvement in terms of accuracy, with 5.72%
improvement over the baseline. The proposed method has several advantages over prior
non-hierarchical approaches [8], in that the method is robust to images with noisy geo-
labels, works in a low dimensional feature space to avoid the curse of dimensionality [9]
and reduces the database size in order to allow for more complex follow-on stages to be
computationally tractable. The resulting terrain label for the query image is now used to
reduce our geographical search space, as well as choose a specific medium geo-location
classifier trained to further distinguish spatial locality within that particular terrain class.
Future areas of research might include expanding the number of terrain classes, allowing
for improved data reduction and geo-location specificity. In particular, the “country” and
“urban” classes tend to account for more than half of all images in the image database
and need to be further sub-divided. Towards that goal, we might consider adding several
additional classes, namely a “savanna/arid” class, as well as further subdividing the “urban”
class into a “sub-urban” class, a “dense urban” class and possibly an “indoors”
class.
Chapter 3
Medium Scale Geo-location
Chapter Summary. We describe a medium-scale geo-location approach, where a query image has already been labeled as belonging to a certain terrain type by our coarse geo-location approach, leading to a reduction in our geospatial search space, in some cases of up to 99%. Given this reduced search space, we attempt to further geo-locate the image to a few candidate locations in the world. We develop an improved geo-location method using a classifier inspired by SVM-KNN [26][7][8] and demonstrate that the classifier, in conjunction with a set of extracted image features, improves geo-location accuracy compared to prior methods.
3.1 Background
The coarse classification in Chapter 2 enables us to distinguish between types of terrain,
such as cities, forests, or mountains. This is useful because it reduces the
geospatial search space, in some cases by up to 99%. The next step is to geo-locate within
this reduced search space. That is, now that we know this particular photograph is of an
urban scene, we ask questions such as, which city was it taken in? Or, if the image was
classified as forest, which type of forest is it (jungle, deciduous, evergreen)? Of course,
there are naturally limits as to how well a machine performs. For example, if we take a
picture of a white wall, there is not a lot that an analyst (human or machine) can do to
geo-locate that photograph. Ignoring such pathological situations, though, the information
and cues within an image offer considerable potential and can reveal much about its
geo-location (if not the exact geo-location) simply through observable image attributes.
Fig 3.1 - Image attributes that might be used for medium-scale geo-location.
The decision-making process, as apparent in Figure 3.1, relies heavily on attributes such
as the type of vegetation and leaves, architecture style, common building colors, texture,
relative height, etc. Such features are recognizable and discernible if the observer has
a priori knowledge of how they appear in digital imagery, where illumination conditions,
resolution, and picture quality play large roles.
For the medium-scale geo-location problem, we first focus on urban scenes to answer the
following question: given this image, which city was the image taken from? Before we
go further and explain the proposed classification approach, we briefly review
related work on scene classification and image retrieval using image features. We
explore which of these classification algorithms can perform well on our difficult
classification problem, where the boundaries between our classes (e.g., is this location
Bangkok or Paris?) are extremely complex because the algorithm needs to divide along a
wide range of scene types (indoors or outdoors, street or highway, tall or short
buildings) that might be present at both locations. We also explore which classification
algorithms are computationally tractable for our geo-location problem, where the
large-scale training database contains millions of images.
3.2 Learning for Scene Classification and Image Retrieval
There has been significant interest in applying machine learning methods to scene
classification and image retrieval. Most of these methods use feature vectors such as
SIFT [10], texton dictionaries [11], color histograms, Gist [12] or a combination of these
[8], with feature vector dimensions on the order of 100-1000. When the training
databases are on the order of tens of thousands of examples, the most common and
reliable training method is the Kernel Support Vector Machine (SVM). An SVM is a
supervised learning method that takes an input sample and predicts which of two
possible classes it belongs to, making the SVM a binary classifier. Given
a training set of binary-labeled input data, an SVM training algorithm builds a model that
assigns a new example to one of the two categories. In addition to performing linear
classification, SVMs can also perform non-linear classification using what is known as
the “kernel trick”. The kernel trick is a transformation where the input data is mapped
from the original feature space to a much higher dimensional feature space. The reason
for the mapping is to more easily find a separating hyperplane (a linear decision
boundary) that partitions well two classes that have a complex separation boundary in
the original low dimensional space. Thus for our geo-location problem, where the
boundaries between our classes (is this Bangkok or Paris?) are very complex, SVMs are
highly desirable.
Typical kernel SVM implementations have a training complexity of O(d·N²), where d is
the feature dimensionality and N is the number of training examples. For our
classification problem, with millions of training samples, this approach is not
computationally tractable. The machine learning literature contains some large-scale
SVM approaches, but many such methods, such as SMO [27], typically require an N²
all-pairs distance computation, which is computationally intractable with millions of
training images. One of the more promising approaches for large-scale SVMs is from
Wang et al. [28], who use a “histogram intersection kernel” coupled with an online SVM
training method to classify images into Flickr groups and PASCAL categories. With 80
thousand training images and feature dimensions on the order of d=200, they can train an
SVM in 150 seconds, with classification performance nearly as good as batch-trained
SVMs. Nonetheless, even this method does not directly scale to training databases on
the order of millions of images.
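The histogram intersection kernel referenced above can be stated in a few lines. The following is an illustrative sketch, not the implementation of [28]; it assumes NumPy and L1-normalized input histograms:

```python
import numpy as np

def hist_intersection_kernel(X, Y):
    """Histogram intersection kernel: K(x, y) = sum_i min(x_i, y_i).

    X: (n, d) and Y: (m, d) arrays of (typically L1-normalized) histograms.
    Returns the (n, m) Gram matrix.
    """
    K = np.empty((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        # Element-wise minimum of row x against every row of Y, summed per row.
        K[i] = np.minimum(x, Y).sum(axis=1)
    return K
```

For L1-normalized histograms the kernel value of a histogram with itself is 1.0, while fully disjoint histograms score 0, which makes the kernel values easy to interpret as a similarity.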
For computational scalability, retrieval methods that use millions of training samples
typically run KNN algorithms in a feature space defined by a few image feature
types and then use those nearest neighbors for various tasks, such as object recognition
[29][30][31][32][33], image completion [8] and object classification. Nearest neighbor
techniques are attractive in that they are trivially parallelizable, require no training, have
good classification performance, and are computationally efficient, with
query complexity that scales linearly with the size of the data. Nearest neighbor methods
rely on feature vectors of low to medium dimensionality (100-200), either using
SIFT [10], texton vocabularies [11], Gist [12] or a combination of these [8]. Given a new
image, the same feature vector is computed and the nearest neighbors in feature space are
found from the training database. Using those neighbors, a majority rule is implemented
to determine the label for the new query image. These KNN methods tend to work well in
low to medium dimensional spaces, but suffer as feature dimensionality
increases [8]. The reason is that nearest neighbor methods lack one of the
fundamental advantages of supervised learning methods: the ability to discard
irrelevant feature dimensions for a given task.
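The KNN majority-vote scheme just described can be sketched as follows. This is a minimal illustration assuming NumPy and an L1 feature distance; the function and variable names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_majority_label(query, feats, labels, k=5):
    """Majority-vote label among the k nearest neighbors in feature space.

    query: (d,) feature vector; feats: (n, d) training features;
    labels: length-n sequence of geo-labels.
    Uses L1 distance, as is common for Gist-style features.
    """
    dists = np.abs(feats - query).sum(axis=1)   # L1 distance to every example
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Note that every feature dimension contributes equally to the L1 distance here, which is exactly the weakness discussed above: an irrelevant dimension cannot be down-weighted without a supervised stage.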
Nonetheless, with more feature types and therefore higher feature dimensions, there is
potential gain in classification performance, as long as we have a computationally
tractable supervised training method that focuses on the relevant features for the given
task or query. One promising approach for high dimensional features is to combine the
supervised learning power of SVM with the computational efficiency of KNN. The
medium-scale geo-location method used in this thesis is inspired by SVM-KNN [8][26]
and prior KNN enhancements [34][35][36] [37][38][39][40]. The method is a hybrid of
non-parametric, KNN techniques and parametric, supervised SVM learning techniques.
The philosophy behind this method is that learning becomes easier if we focus on
examining the local space around a query instead of the entire problem domain.
Consider our image geo-location problem where we are attempting to differentiate
between multiple cities (e.g., is this location Bangkok or Paris?). The boundaries between
our classes are extremely complex because the boundary must still divide a spectrum of
scene types within a city (indoors or outdoors, close-up or perspective, street or highway,
tall or short buildings) that might be present in both locations. When looking at the
combined training data for both cities, there might not be a simple parametric boundary
between these geographic classes in feature space. However, if we were to look within a
space of similar scenes to the query image (e.g. streets), then it may become much easier
and more feasible to divide the classes. This is exactly what we intend to do with the
KNN-SVM algorithm. Given a query image, we will use KNN to roughly find a local
space of similar scenes and then use an online SVM classifier trained just on the nearest
neighbors to find a possibly non-linear parametric boundary and classify our image. The
proposed KNN-SVM algorithm will not only be computationally tractable, but will also
have the potential for significantly improved classification performance over a KNN-only
method.
3.3 Medium Scale Geo-location Algorithm and Results
Our KNN-SVM classifier builds upon the baseline method described in [8]. We first
extract a feature vector for each image using the output of several popular feature
detectors from the literature. Given a query image, its corresponding feature vector, as
well as the predicted terrain class from the coarse-scale classifier, we propose the
following KNN-SVM algorithm:
1. Reduce the original database to images with geo-tags that overlap with the predicted terrain type. Automatically label “regions” using mean-shift clustering with a 200 km bandwidth. For computational efficiency, these steps are performed offline, only once per terrain type.
2. Use the baseline KNN-SVM from [26] with K1 nearest neighbors to find a “region” label, using a minimum cluster size of 5.
3. Run KNN again with data only from that region, using K2 nearest neighbors.
4. Cluster the K2 nearest neighbors on the globe by mean-shift, using a bandwidth of 50 km and a minimum cluster size of 3. Consider each cluster a city for the SVM.
5. Compute the pair-wise distances between all K2 nearest neighbors using image features, with L1 and chi-squared distances.
6. Convert the pair-wise distances into a positive semi-definite kernel matrix using the procedure from [8] and train C 1-vs-all non-linear SVMs.
7. For each of the C classifiers, compute the distance of the query point to the decision boundary using the procedure from [8]. The class for which the distance is most positive is declared the winner.
8. Estimate the GPS location of the query as the average of all members of the winning class.
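The neighbor-local stages of the algorithm (steps 3-8) can be sketched as follows. This is a simplified illustration, not the thesis implementation: it assumes scikit-learn and NumPy, uses a single chi-squared feature distance, approximates the 50 km mean-shift bandwidth in degrees, and substitutes a generic RBF-style distance-to-kernel transform for the exact kernelization procedure of [8]:

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.svm import SVC

def chi2_dists(X, Y):
    # Pairwise chi-squared distances between rows of histogram matrices X and Y.
    D = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        D[i] = 0.5 * ((x - Y) ** 2 / (x + Y + 1e-12)).sum(axis=1)
    return D

def knn_svm_geolocate(q_feat, feats, gps, k2=200, bw_deg=0.45, min_cluster=3):
    """Steps 3-8 of the medium-scale algorithm, sketched for one terrain-reduced
    database. The regional stage (steps 1-2) is omitted for brevity."""
    # Step 3: K2 nearest neighbors of the query in feature space.
    d_q = chi2_dists(q_feat[None, :], feats)[0]
    nn = np.argsort(d_q)[:k2]

    # Step 4: mean-shift cluster the neighbors' GPS; each cluster acts as a "city".
    labels = MeanShift(bandwidth=bw_deg).fit(gps[nn]).labels_
    keep = np.array([np.sum(labels == c) >= min_cluster for c in labels])
    nn, labels = nn[keep], labels[keep]
    if len(np.unique(labels)) < 2:          # degenerate case: a single city
        return gps[nn].mean(axis=0)

    # Steps 5-6: pairwise distances -> positive semi-definite kernel (an
    # RBF-style transform here; [8] describes the exact procedure).
    D = chi2_dists(feats[nn], feats[nn])
    gamma = 1.0 / (D.mean() + 1e-12)
    K = np.exp(-gamma * D)

    # Steps 6-7: 1-vs-all SVMs on the precomputed kernel; the class with the
    # most positive decision value for the query wins.
    svm = SVC(kernel="precomputed").fit(K, labels)
    k_q = np.exp(-gamma * chi2_dists(q_feat[None, :], feats[nn]))
    pred = svm.predict(k_q)[0]

    # Step 8: GPS estimate = mean location of the winning cluster's members.
    return gps[nn[labels == pred]].mean(axis=0)
```

The local SVM is trained only on the K2 retrieved neighbors, so training cost is bounded per query regardless of total database size, which is the core computational argument for SVM-KNN.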
We tested the new algorithm using a 500-image hold-out set composed of geo-tagged
images from 5 cities, with 100 images per city. The images cover the cities of 1.
Lubbock, Texas, 2. Boston, MA, 3. Paris, France, 4. Vienna, Austria and 5. Dubrovnik,
Croatia. The images were in part downloaded from Flickr and in part selected from
ground imagery collection campaigns in support of the fine-scale geo-location algorithm
development. Similar to the criteria applied in [8], we removed undesirable images.
For this experiment, we use the Tiny Images feature, as detailed by Torralba et al. in [41],
to create 16x16 color images as one of our features. In addition, we use color histograms
of size 4x14x14 bins in CIE L*a*b* space for a total of 784 dimensions. Texton features
are also used due to their ability to distinguish well between different building textures in
cities. Similar to [8], we use a 512-entry universal texton dictionary [42], built by
clustering responses to a filter bank with 8 orientations, 2 scales and 2 elongations. Finally, we apply
the same Gist feature descriptor as detailed in Chapter 2, of size 544 dimensions. We use
L1 distance for all image features (Gist, Tiny Images), and chi-squared for histograms
(textons, color). The sub-vectors are concatenated together to create a 2096 dimensional
feature vector. For each query image, we test the image against the whole database to
determine the geo-location performance for finding the particular city amongst data from
the entire world. Successful geo-location for a query image is defined as finding a
location within 200km of the actual GPS location as specified in the geo-tagged metadata
of the query image. By cross-validation, K1=2000 and K2=200 were determined to work
well in terms of geo-location accuracy.
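The combined feature distance and the 200 km success criterion can be sketched as follows. This is an illustrative sketch: the exact sub-vector layout of the 2096-dimensional feature is not restated here, so the slice boundaries are supplied by the caller rather than hard-coded, and all names are mine:

```python
import numpy as np

def combined_distance(a, b, l1_slices, chi2_slices):
    """Distance between two concatenated feature vectors: L1 over the dense
    sub-vectors (Gist, Tiny Images) plus chi-squared over the histogram
    sub-vectors (textons, color). Slice layouts are passed in by the caller."""
    d = 0.0
    for s in l1_slices:
        d += np.abs(a[s] - b[s]).sum()
    for s in chi2_slices:
        d += 0.5 * np.sum((a[s] - b[s]) ** 2 / (a[s] + b[s] + 1e-12))
    return d

def geolocated_within_200km(pred, truth):
    """Success criterion: great-circle (haversine) distance under 200 km.
    pred, truth: (lat, lon) in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([pred[0], pred[1], truth[0], truth[1]])
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(h)) <= 200.0
```

Mixing L1 and chi-squared per sub-vector lets each feature type be compared with the metric that suits it, while still producing a single scalar distance for nearest-neighbor retrieval.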
Figure 3.2 captures the result of the mean-shift clustering for the urban-only database,
described in pseudo-code Step 1. The method was implemented in Matlab and took
approximately 16 minutes of run-time. The outcome is the division of the world-wide
data set into 977 clusters. Figure 3.2c provides visual confirmation that the chosen
clusters conceptually capture regional areas. Based on this database labeling, the rest of
the algorithm was run on the 500 image hold-out set. We compare our results to the
KNN-SVM procedure and optimal parameters used by Hays in [8]. Table 3.1 captures
geo-location performance. Results indicate that we can geo-locate a query image to a
particular city with an accuracy of 12% to 18%, with an average of 15%. We also ran the
algorithm from [8] on the urban-only database and obtained an accuracy of 12.5%. Our
method has an absolute improvement of 2.5%, leading to a 20% relative improvement
over previous results.
Fig 3.2 - Mean shift clustering of urban-only database using 200km bandwidth. A) GPS locations for all urban images. B) Clustering of GPS locations for all urban images, color-coded using a wrapped color map (multiple clusters might share the same color). The cluster center is shown with a black round marker. C) Zoom-in of the clustering for east coast and mid-west USA, confirming that the clustering captures regional areas well.
Table 3.1.
Medium-scale Geo-Location Confusion Matrix at the city level
In addition to testing the imagery on a narrow set of only 5 cities, we also tested
using a 500-image urban test set with images randomly selected from across the entire
world. We built the test set by randomly drawing 800 images from the urban-only data
set. From this initial set, we removed undesirable photos using the same methodology as
in [8]. In addition to the procedure in [8], we ensured the images did not capture the same
scene area (we visually checked images with close geo-tags). Similar to [8], to prevent
testing bias, we removed from the database not only the test images, but also all images
from the same photographers. The resulting set contained 462 images. The set was
enriched with author-collected geo-tagged images to bring the set to a total of 500
images. Using this
new set, we tested the accuracy of our proposed method as a function of database size.
For this experiment (as well as the next), we used all the im2gps features except
geometric context and 16x16 tiny images. The baseline KNN-SVM algorithm from [8]
was run using Ksl = K = 2000 to obtain a fairer comparison to the proposed KNN-SVM
method, in that both algorithms now use the same data for further SVM
classification. Results for this test are shown in Figure 3.3. From Figure 3.3, we can see
that accuracy increases with database size, similar to the results obtained in [8]. For the
entire urban-only database, our accuracy rate was 16.2%. We repeated the test using the
method described in [8] and obtained a second curve in Figure 3.3 that has the same trend
as our proposed method. For the method in [8], we obtained an accuracy rate of 12.8%
using the entire urban-only database, resulting in a modest improvement of 3.4% (25%
relative improvement) for our proposed method compared to the method from [8].
So far, our accuracy metric has been based on the top-scored regional city-refined cluster
of images. We now relax this condition to determine the accuracy with which we correctly
geo-locate the query image when considering the second through Nth mean-shift-determined
regional city-refined clusters, as well as the best cluster, defined as
the cluster spatially closest to the ground truth for the query location. We compare
those results to the baseline KNN-SVM method from [8] with Ksl = K = 2000. Results
shown in Figure 3.4 indicate a significant increase for both the baseline and proposed
algorithm in correct geo-location once lower ranked clusters are considered. The
percentage of the data set that meets the new criteria increases from 16.2% when
constrained to the rank 1 cluster to 39.2% when considering up to the top 9 clusters,
reaching 39.8% when expanding criteria to include all found regional city-refined
clusters. Correct classification for both algorithms eventually converges when all
clusters are considered, since both determine similar overlapping clusters (starting from
exactly the same K=2000 nearest neighbors). The proposed algorithm has a
slight advantage when considering up to the 4th top regional city-refined cluster, most
likely due to the refined search enabled by our proposed 2-stage (region-city) hierarchical
KNN-SVM approach. Chance performance based on random matches is also shown
for comparison. We note that the ratio of geo-location accuracy to chance performance
stays in the range of 8-16x. This demonstrates that the image search
system can obtain correct geo-locations with fairly good recall rates, while still
providing accuracy about one order of magnitude better than chance.
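The relaxed, rank-N accuracy metric used above can be sketched as follows. This is an illustrative computation under the 200 km success criterion; function names are mine:

```python
import numpy as np

def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([p[0], p[1], q[0], q[1]])
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(h))

def top_n_accuracy(ranked_estimates, truths, n, radius_km=200.0):
    """Fraction of queries whose ground truth lies within radius_km of any of
    the top-n ranked cluster estimates.

    ranked_estimates: per-query list of (lat, lon) tuples, best-scored first.
    truths: per-query (lat, lon) ground-truth locations.
    """
    hits = 0
    for ests, truth in zip(ranked_estimates, truths):
        if any(haversine_km(e, truth) <= radius_km for e in ests[:n]):
            hits += 1
    return hits / len(truths)
```

Sweeping n from 1 to the number of returned clusters produces a curve of the kind shown in Figure 3.4, trading localization specificity for recall.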
As a point of comparison to prior approaches, we also tested the accuracy of running the
whole hierarchical algorithm (coarse-scale followed by medium-scale geo-location) on
the entire 6.5 million image database using the 2K random image subset from [8]. In
order to compare with the prior results reported in [8], we used all the base im2gps
features (minus geometric context and tiny images) for the first KNN-SVM stage and
added the additional geometric-derived and SIFT-derived features explained in [8] for
the second KNN-SVM stage.
Fig 3.3 - Accuracy as a function of urban database size for the proposed algorithm (red) versus the method described in [8] (blue). The accuracy of our proposed method on the entire urban database is 16.2%, compared to 12.8% for the previous method.
Fig 3.4 - Accuracy of geo-location estimates once lower ranked clusters are considered. The proposed method had a good recall rate while still performing about an order of magnitude better than chance.
The results indicate a slightly improved accuracy rate of 15.1% compared to the accuracy
rate of 13.75% reported in [8], leading to a small 1.35% absolute improvement and a
modest relative improvement of 9.8%. Although ~15% is a low absolute accuracy rate, it
is worth noting that the 2K random test set contains many images that are extremely
difficult, if not impossible, to confidently geo-locate. Figure 3.5 shows a representative
sample of the 2K random data set.
Fig 3.5 - Sample of the 2K random data set, showing 30 images. Many of these photos are very difficult, if not impossible, to geo-locate due to a lack of geographically specific content.
In Figure 3.6, 3.7 and 3.8, we show some example geo-location results. For each
example, the query image is shown on the top left. The top 6 images from the resulting
geo-location image cluster are shown to the right of the query image. The predicted city-
refined geo-location estimate is shown as a red dot, while the actual ground truth geo-
location is denoted using concentric yellow rings of radius 200km and 750km.
In summary, we developed an improved geo-location method using a classifier inspired
by [26][7][8] and demonstrated that the classifier, in conjunction with a set of extracted
image features, improves geo-location accuracy compared to prior methods. Results with
the new method indicate a slightly improved accuracy rate of 15.1% compared to the
accuracy rate of 13.75% reported in [8], leading to a modest relative improvement of
9.8%. Future work might include adding more images to the database to determine
whether the proposed dual-stage KNN-SVM classifier can take further advantage of the
additional data. It would also be of interest to explore which features are most important
for geo-location and to discard features that have low discriminatory power.
With the development of this medium-scale classifier, we have now reduced the geo-
location problem to a city-scale and gained additional scene knowledge by having a
region and possibly a city associated with the image. In the next chapter, we will explore
a method to further geo-locate the query image from city-scale down to street-level
accuracy, or in some cases go as far as determining an actual camera pose and location.
Fig 3.6 - Geo-location results 1. A query image from the city of Boston, MA is shown on the top left. The top 6 images from the resulting geo-location image cluster are shown to the right of the query image. The predicted city-refined geo-location estimate is shown as a red dot, while the actual ground truth geo-location is denoted using concentric yellow rings of radius 200km and 750km.
Fig 3.7 - Geo-location results 2. A query image from the Grand Canyon is shown on the top left. The top 6 images from the resulting geo-location image cluster are shown to the right of the query image. The predicted city-refined geo-location estimate is shown as a red dot, while the actual ground truth geo-location is denoted using concentric yellow rings of radius 200km and 750km.
Fig 3.8 - Geo-location results 3. A query image from Paris, France is shown on the top left. The top 6 images from the resulting geo-location image cluster are shown to the right of the query image. The predicted city-refined geo-location estimate is shown as a red dot, while the actual ground truth geo-location is denoted using concentric yellow rings of radius 200km and 750km.
Chapter 4
Fine Scale Geo-location
Chapter Summary. Once we have geo-located a query to a particular city, we proceed to the final step in the geo-location progression: estimating the pose from which that particular image was taken. To achieve this, we first pre-process a training data set using structure-from-motion (SfM) techniques by extracting local features from training images, finding feature correspondences and upgrading the correspondences to 3D locations to create a 3D model of the city scene. The relative camera poses, along with the 3D reconstruction, are then geo-located using GPS image metadata that might be available with a subset of the training images in our city-wide image database. A query image can then be geo-located and attached to the training image database using a similar SfM procedure. Our contribution to the SfM research area is an efficient method for 3D reconstruction on a city-wide scale using both ground and aerial video imagery, in order to compute a more complete and self-consistent geo-registered 3D city model. The reconstruction results for a 1x1km city area, covered with a 66 Mega-pixel airborne system along with a 60 Mega-pixel ground camera system, are presented and validated to geo-register to within 3 meters of prior airborne-collected 3D Ladar data. Compared to prior approaches, the new method has a computational speed-up on the order of 4 to 14x, depending on database size. Furthermore, given a hold-out set of 500 query images, the presented method is shown to geo-locate close to 80% of query images to better than 100m, demonstrating the ability to geo-locate most query images to street-level accuracy.
4.1 Background and Related Work
Automatic 3D reconstruction and geo-location of buildings and landscapes from images
is an active research area. Recent work by [43][44][45][46] has shown the feasibility of
3D reconstruction using tens of thousands of ground-level images from both unstructured
photo collections, such as Flickr, as well as more structured video collections captured
from a moving vehicle [5][47][48][49], with some algorithms incorporating GPS data for
geo-location when available [5][50][51][52]. While 3D reconstructions from ground
imagery provide high-detail street-level views of a city, the resulting reconstructions tend
to be limited to vertical structures, such as building facades, missing a lot of the
horizontal structures, such as roofs, or flat landscape areas, thus leading to an incomplete
model of the city scene [5][46][47][48]. Furthermore, when using GPS data for geo-
location, the 3D model’s geo-registration accuracy and precision might suffer since
street-level GPS solutions are poor, particularly amidst tall buildings on narrow streets
due to multipath reflection errors [5][53]. For video collects with GPS captured from
moving vehicles, the resulting 3D model is typically composed of a single connected
component that might have internal distortions due to GPS drift or discontinuities [5][51].
For unstructured ground photo collections, the issue of geo-registration is further
exacerbated as only a subset of images might have GPS metadata, with typical city-sized
reconstructions composed of multiple unconnected 3D models representing popular
touristic sites or landmarks [43][44][46], where each connected component might have
few or no GPS tie points. Recent work by [46] attempts to resolve this problem; however,
it requires additional GIS building data as a unifying world model to connect the
various disconnected 3D scene clusters.
While ground level 3D reconstructions do not capture a complete model of a city’s
surface scene, they can be complemented by adding aerial imagery, which has wider area
coverage along with inherently more accurate aerial GPS data. A reconstruction using a
combination of aerial and ground imagery might lead to a 3D city model that has both a
high level of detail and wide-area coverage. Furthermore, by using aerial imagery
to create a reference, geo-registered and self-consistent 3D world model, we might be
able to improve both the absolute geo-registration accuracy and the precision of the
previously unconnected 3D ground reconstructions.
In this section, we describe an efficient method that utilizes both ground video imagery as
well as ultra-high resolution aerial video imagery to reconstruct a more complete 3D
model of a large (1x1km) city-sized scene. The method starts out with two similar
structure-from-motion (SFM) algorithms to process the aerial and ground imagery
separately. We developed a SFM processing chain similar to [43][44], with several
improvements to take advantage of inherent video constraints as well as GPS information
to reduce computational complexity. The two separate 3D reconstructions are then
merged using the aerial-derived 3D model as the unifying reference frame to correct for
any remaining GPS errors in the ground-derived 3D scene. To quantify the improvements
in geo-registration accuracy and precision, we compare the aerial-derived 3D model, the
ground-derived 3D model, as well as the merged 3D reconstruction to a previously
collected high-resolution 3D Ladar map, which is considered to be truth data. To the best
of our knowledge, no one has published results of city-sized reconstruction using both
aerial and ground imagery to obtain a more complete 3D model, nor has the geo-location
accuracy and precision of 3D reconstruction been quantified in a systematic manner over
large scale areas using 3D Ladar data as truth.
We utilize an airborne 66Mpixel multiple-camera sensor operating at 2Hz to capture
videos of large scale city-sized areas (2x2km), with an example image shown in Figure
4.1-A/B. In addition, we captured ground-based video data at 1Hz with five 12-MPix Nikon
D5000 cameras mounted on a moving vehicle, as shown in Figure 4.1-C/D. A total of 72000
aerial images were captured, as well as 125000 ground images in support of fine-scale
geo-location. The algorithm was tested using 250 66-MPix aerial video frames along with
34400 ground images to create a dense 3D reconstruction of a 1x1km area of Lubbock,
Texas. A Ladar map of the city at 50cm grid sampling with 0.5 meter geo-registration
accuracy is used to determine the geo-registration accuracy and precision of the various
3D reconstructed data sets. The main contributions of our research are:
1. An efficient SFM method that takes into account video constraints as well as GPS
information, in combination with a method to merge the 3D aerial and ground
reconstruction for a more complete 3D city model, with improvements in geo-
registration accuracy and precision for the ground collected data.
2. The first 3D reconstruction using both aerial and ground imagery on a large city-sized
scale (1x1km).
3. A detailed study showing geo-location improvements of the merged reconstruction,
validated in a systematic manner over a large scale area using 3D Ladar data as
truth.
Fig. 4.1 - Data used by the 3D reconstruction system. A) Example of a 66 MPix image captured at 2Hz by a multi-camera airborne sensor, covering a ground area of about 2x2km. B) Zoomed-in view of the same aerial image. C) The five-camera 12-MPix ground system, covering a 180 degree field of view, collected at 1Hz. D) Example of resulting ground imagery.
The rest of this chapter is organized as follows: Section 4.2 discusses in detail the
developed algorithm along with implementation of the system. Section 4.3 reports the 3D
reconstruction results on a 1x1 km area using aerial data, followed by the 3D
reconstruction results using only the ground imagery. Qualitative as well as quantitative
geo-registration results of the 3D reconstructions are reported for the aerial imagery,
ground collected imagery, merged ground imagery, as well as for the combined aerial-
ground reconstruction. Section 4.4 concludes with a proof-of-concept on how fine-scale
geo-location opens up new paths for improved image understanding.
4.2 Fine Scale Geo-Registration Approach
The developed algorithm can be divided into three stages. The first stage is a 3D
reconstruction pipeline that is applied similarly to both the ground and aerial imagery.
The second stage is a 3D merge method that fuses the two 3D reconstructions into a
complete city model. The third stage uses the combined 3D reconstruction to geo-locate
a new query image. Each stage is described in a separate subsection below.
4.2.1 3D Reconstruction from Video Imagery and Geo-location
The 3D reconstruction pipeline is similar to [43][44], with several improvements that
take into account temporal video constraints as well as availability of GPS information.
The processing pipeline, shown in Figure 4.2, can be broken up into the following stages:
preprocessing, feature detection, feature matching, initialization, bundle adjustment,
followed by dense 3D reconstruction.
For the pre-processing step, we first record estimates of the camera intrinsic parameters,
such as focal length, and any information related to camera extrinsic parameters, such as
GPS information. For the aerial imagery, the camera intrinsic data are determined using
prior calibration, while for the ground imagery we use JPEG-header metadata to determine
an initial estimate of the focal length, as well as record the GPS information on a per
video-frame basis. In the feature detection stage, we find points of interest for each image
using Lowe’s SIFT descriptors [10] and store those SIFT features for each image.
Fig. 4.2 - Structure from motion 3D reconstruction pipeline for video imagery. The pipeline takes as input a set of image frames from a video sequence, along with GPS information per frame. Features are detected from each image and matched across multiple sequence of images. GPS as well as time information is used to remove probable outliers as well as significantly speed up the processing time. Once feature matching is completed, bundle adjustment is run by first initializing with 2-view reconstruction and continually adding additional images to the reconstruction. The result is a sparse 3D reconstruction, which can later be upgraded to a dense reconstruction.
Next, in the feature matching stage, the SIFT descriptors of the points of interest are first
matched using Lowe’s ratio test [10], with an additional uniqueness test to filter out
many-to-one correspondences. The matches are verified using epipolar constraints in the
form of RANSAC-based estimation of the essential matrix [54][55]. Typically, the image
matching stage is the most computationally expensive stage of the process. For
unstructured collections of photos, where any image might match any other of the
remaining images, the process typically takes O(n²) computational time, where n is the
number of images in the photo collection. In our case, where we have a video sequence,
we can reduce computational complexity of the matching step by taking into account that
time-neighboring video frames capture similar perspectives of the 3D scene: there is a
high likelihood that consecutive video frames will have many feature matches to the
current video frame, while video frames further separated in time will have fewer
matches. We employ a simple data-driven model that keeps track of the maximum
number of matching features. We continue to match neighboring frames further out in
time as long as the number of matches does not fall below 25% of the maximal match
number, or until a predetermined hard threshold of frames, T_Horizon, is reached. Based
on offline tests of maximal correspondence track lengths, T_Horizon is set to 40
consecutive frames for the aerial imagery and to 20 for the ground imagery. The
moving-horizon, time-based clustering is captured in Equation 4.1 below.
[Eq. 4.1]
Do feature matching between Im_i and Im_j if and only if:
    Nm_i,j / max({Nm_i,i+1, ..., Nm_i,j-1}) >= Ratio_Corr = 0.25,
where i < j <= i + T_Horizon, T_Horizon > 0, Im_i is the image at index i, and Nm_i,j is the number of matches between Im_i and Im_j.
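The moving-horizon gating rule above can be sketched as follows (an illustrative Python fragment; the 25% ratio and the horizon lengths come from the text, while the list-of-match-counts interface is a simplification of the actual matcher):

```python
RATIO_CORR = 0.25  # stop once matches fall below 25% of the running maximum

def matching_horizon(match_counts, t_horizon):
    """Given match counts [Nm(i,i+1), Nm(i,i+2), ...] for frame i, return
    how many forward neighbors j should be kept for feature matching."""
    kept, max_matches = 0, 0
    for offset, nm in enumerate(match_counts, start=1):
        if offset > t_horizon:            # hard horizon T_Horizon reached
            break
        max_matches = max(max_matches, nm)
        if nm < RATIO_CORR * max_matches:
            break                         # fell below 25% of maximal matches
        kept = offset
    return kept

# Loop closure: the same rule is also applied against key frames i + K*m,
# with skip interval K = T_Horizon // 2 (e.g., K = 20 for aerial imagery).
```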
To account for situations where the same scene is revisited after a prolonged time, the
above data-driven matching scheme is also performed between the current frame i and
key frames i+K*m, where K is the skip-frame interval (set to T_Horizon/2). This sub-
sampling in time allows for loop closure, removing time-based propagation errors in
camera pose estimates.
Besides time-based image matching, we also employ space-based constraints. For ground
imagery, we only match images that have GPS locations that are no further than 300
meters apart (this works well in practice for urban imagery). For aerial imagery, we
employ a more sophisticated method that uses GPS and INS information to compute
projection matrices and test for camera frusta intersection between the pairs of potential
images to match. This space-based matching constraint is captured in Equation 4.2.
Do feature matching between Im_i and Im_j if and only if:
    |GPS_i - GPS_j| < T_GPS, where i and j are ground images, or
    intersect(F_i, F_j) = true, where F_i is the frustum for aerial camera image i.
[Eq. 4.2]
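The space-based gating of Equation 4.2 can be sketched as follows (illustrative Python; the 300 m threshold is from the text, while the frustum-intersection predicate is assumed to be supplied by the GPS/INS-derived projection matrices):

```python
import math

T_GPS = 300.0  # meters: maximum GPS separation for ground image pairs

def should_match_ground(gps_i, gps_j):
    """Ground images are matched only if their GPS fixes are within T_GPS.
    gps_i, gps_j: (x, y) positions in a local metric frame."""
    return math.dist(gps_i, gps_j) < T_GPS

def should_match_aerial(frustum_i, frustum_j, intersect):
    """Aerial images are matched only if their camera frusta intersect;
    `intersect` is a hypothetical predicate built from the GPS/INS-derived
    projection matrices."""
    return intersect(frustum_i, frustum_j)
```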
Furthermore, once initial matching is completed using RANSAC with essential matrix
constraints, we take advantage of any images with available GPS/INS data to remove
false matches due to building symmetry. We consider an image to be well matched to
another image as long as the GPS/INS-derived relative rotation and the essential-
matrix-derived rotation differ by no more than 20 degrees; otherwise, the images are
considered to be mismatched due to building axial symmetry, and the image pair and its
correspondences are discarded.
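The rotation-consistency test can be sketched as follows (illustrative Python/NumPy; the 20 degree threshold is from the text):

```python
import numpy as np

MAX_ROT_ERR_DEG = 20.0  # axial-symmetry rejection threshold

def rotation_consistent(R_ins, R_essential):
    """Accept an image pair only if the GPS/INS-derived relative rotation
    and the essential-matrix-derived rotation differ by < 20 degrees.
    The angle between rotations R1, R2 is arccos((tr(R1^T R2) - 1) / 2)."""
    cos_angle = (np.trace(R_ins.T @ R_essential) - 1.0) / 2.0
    angle_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return angle_deg < MAX_ROT_ERR_DEG
```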
These time-based and space-based image matching methods reduce the computational
complexity of image matching from O(N²) for unstructured photo collections to nearly
O(N), which leads to significant computational savings. Once
pair-wise matching is completed, the final step of the matching process is to combine all
the pair-wise matching information to generate consistent tracks across multiple images
[43][44][6], where each track represents a single 3D location.
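Combining pairwise matches into consistent multi-image tracks amounts to computing connected components over feature observations, which can be sketched with a union-find structure (illustrative Python; features are identified by hypothetical (image, feature) pairs):

```python
def build_tracks(pairwise_matches):
    """Merge pairwise feature matches into multi-image tracks.
    pairwise_matches: iterable of ((img_a, feat_a), (img_b, feat_b)) pairs.
    Returns tracks as lists of (img, feat) observations."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for fa, fb in pairwise_matches:
        parent[find(fa)] = find(fb)

    components = {}
    for f in list(parent):
        components.setdefault(find(f), []).append(f)
    # A consistent track observes each image at most once; drop the rest.
    return [t for t in components.values()
            if len({img for img, _ in t}) == len(t)]
```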
Once feature matching is completed, the next step is to initialize the reconstruction with a
seed model, refine the initial model and add additional images. Similar to [43][44], our
SfM method is incremental, starting with a two-view reconstruction, adding another view
and triangulating more points, while doing several rounds of non-linear least squares
optimization, known as bundle adjustment (BA), in order to minimize the re-projection
error. Similar to [43], the seed initialization starts by finding the pair of images that have
the most matches using the essential matrix constraint, while having few matches using a
homography constraint (we want an image pair with a large rotation and a large baseline,
not just pure rotation). After each bundle adjustment stage, 3D points that have re-
projection error above a certain pixel threshold are removed and a final bundle
adjustment stage is run. Next, if GPS/INS information is available, the estimated final
BA rotation between image pairs is checked against the GPS/INS derived relative
rotation to be no more than 20 degrees off. If the criterion is not met, the initial seed is
discarded and the above process repeats for successive candidate seeds until one is found
that passes all the above criteria.
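The seed-pair selection heuristic can be sketched as follows (illustrative Python; the 0.6 homography/essential inlier ratio is an assumed value, not taken from the text):

```python
MAX_H_RATIO = 0.6  # assumed: reject pairs explained mostly by a homography

def pick_seed_pair(pair_stats):
    """Choose the initial two-view seed: the pair with the most essential-
    matrix inliers among pairs whose homography inlier ratio is low
    (favoring a wide baseline over pure rotation).
    pair_stats: {(i, j): (essential_inliers, homography_inliers)}."""
    candidates = [(e, pair) for pair, (e, h) in pair_stats.items()
                  if e > 0 and h / e < MAX_H_RATIO]
    return max(candidates)[1] if candidates else None
```

In the pipeline, a seed chosen this way would still be rejected and the next candidate tried if the GPS/INS rotation check described above fails.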
Once we have an initial seed, the above process repeats again with each additional image
view. The final result of this step is a set of adjusted camera intrinsic and extrinsic
matrices along with a sparse 3D reconstruction of the 3D scene. The geo-registered
sparse 3D reconstruction is upgraded to a dense 3D reconstruction using Furukawa et
al.’s Patch-based Multi-View Stereo (PMVS) algorithm [56][57].
Fig. 4.3 - 3D Geo-location method. The 3D reconstruction, which has camera pose data in a virtual coordinate space, is geo-located to world space using the GPS metadata available with the original images. A similarity transform is found using a robust least square method that utilizes the RANSAC algorithm. The similarity transform is applied to the camera poses in the virtual coordinate space to obtain world-space camera poses that are well aligned to the GPS metadata. The similarity transform is also applied to the 3D reconstruction to obtain an absolute geo-located 3D data set. The accuracy of the geo-located data set is verified by comparing the data to previously collected truth 3D data, obtained from a high-precision 3D Laser Detection and Ranging (LiDAR) sensor.
Next, both the aerial and ground derived dense reconstructions are geo-located using the
process depicted in Figure 4.3. Geodetic data in lat-lon-alt is converted into Earth-
Centered Earth-Fixed (ECEF) world coordinates, a Cartesian coordinate system.
Geo-registration is performed by automatically finding a 7 degree-of-freedom (scale,
rotation and translation) transformation that minimizes the least-squares error between the
metric-scaled camera positions and the GPS-based ECEF camera positions. The method
uses random sample consensus (RANSAC) to remove outlier correspondences that might
be caused either by a poor GPS solution or a poor 3D camera reconstruction.
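The geodetic-to-ECEF conversion and the 7-DOF similarity fit can be sketched as follows (illustrative Python/NumPy using the standard WGS-84 ellipsoid and Umeyama's closed-form solution; in the pipeline the fit would be wrapped in a RANSAC loop rather than run on all correspondences at once):

```python
import numpy as np

# WGS-84 ellipsoid constants
A = 6378137.0
F = 1.0 / 298.257223563
E2 = F * (2.0 - F)

def geodetic_to_ecef(lat_deg, lon_deg, alt_m):
    """Convert lat-lon-alt (WGS-84) to Earth-Centered Earth-Fixed meters."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    n = A / np.sqrt(1.0 - E2 * np.sin(lat) ** 2)
    x = (n + alt_m) * np.cos(lat) * np.cos(lon)
    y = (n + alt_m) * np.cos(lat) * np.sin(lon)
    z = (n * (1.0 - E2) + alt_m) * np.sin(lat)
    return np.array([x, y, z])

def fit_similarity(src, dst):
    """Least-squares 7-DOF (scale s, rotation R, translation t) transform
    mapping src -> dst (Umeyama's method)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)            # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # keep a proper rotation
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```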
The results of the above process are multiple dense, geo-registered 3D models, one
derived from aerial imagery, possibly with multiple separate 3D models derived from
ground imagery. The geo-registration accuracy and precision of each reconstructed geo-
located model is quantified by automatically aligning each 3D model to a previously
collected, geo-located 3D data set. The alignment method used is a modified version of
the Iterative Closest Point (ICP) algorithm [58] with six degrees of freedom (rotation +
translation) as detailed in [59]. The alignment accuracy is verified manually by super-
imposing both data sets in a 3D data viewer. The truth data was derived from an airborne
3D Laser Radar (Ladar) sensor that produced data with sub-meter geo-location accuracy.
Because GPS solutions obtained from ground imagery can be poor due to multi-path
effects [53], some parts of the ground 3D reconstruction might include drift errors.
In the next section, we discuss a method to correct distortions in the ground-based 3D
reconstruction by using the aerial data as a reference frame.
4.2.2 3D Reconstruction Merge
The aerial-based reconstruction was used as a reference frame to merge the ground-
derived reconstructions in order to create a more complete and self-consistent 3D model
of the city. The first step was to roughly align the ground and aerial 3D reconstructions to
remove the overall bias between the multiple 3D models. Similar to the process of
determining geo-location accuracy, we used a modified version of the Iterative Closest
Point (ICP) algorithm [58] with six degrees of freedom (rotation + translation) as detailed in
[59]. The next step was to correct for intra-model distortions. To find optimal global
alignment between the reference aerial model and each of the 3D ground models, we use
a combination of ICP along with a multi-position search in xyz space. The result of this
method is a set of rigid transforms that best align each ground model to the aerial model.
In addition to applying the corrections to the 3D ground models, the corrections are also
applied to the extrinsic camera positions of the 2D images that were used to generate the
models. The images, along with the corrected extrinsic matrices, are now considered a
single ground component. This data is passed again to PMVS with the goal of a
higher-density reconstruction compared to the original multiple disconnected 3D ground models.
Figure 4.4 details the 3D reconstruction merge method.
Fig. 4.4 - 3D Merge Method. The ground models are first aligned to the aerial 3D data. The result is a set of rigid transforms for each of the ground models. The transform is applied to the camera parameter files associated with each model in order to create a single-component ground data set. The metadata associated with this data set is again passed to PMVS to obtain a refined dense ground reconstruction.
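The rough-alignment step can be sketched as a minimal point-to-point ICP (illustrative Python/NumPy with brute-force nearest neighbors and a Kabsch update per iteration; a simplification of the modified ICP of [58][59] and the multi-position search):

```python
import numpy as np

def icp_rigid(src, dst, iters=20):
    """Iteratively align src to dst with a 6-DOF rigid transform
    (rotation R, translation t), point-to-point variant."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        moved = src @ R.T + t
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        nn = dst[d2.argmin(axis=1)]          # nearest dst point per src point
        mu_m, mu_n = moved.mean(0), nn.mean(0)
        H = (moved - mu_m).T @ (nn - mu_n)   # cross-covariance (Kabsch)
        U, _, Vt = np.linalg.svd(H)
        S = np.eye(3)
        if np.linalg.det(U) * np.linalg.det(Vt) < 0:
            S[2, 2] = -1.0                   # keep a proper rotation
        dR = Vt.T @ S @ U.T
        dt = mu_n - dR @ mu_m
        R, t = dR @ R, dR @ t + dt
    return R, t
```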
4.2.3 Geo-locating a new image
The SfM algorithm described in Section 4.2.1 is applied to both the ground and
aerial reconstructions, followed by the merge algorithm in Section 4.2.2. The result is a
complete, geo-referenced 3D model and image database, with camera poses for each of
the reconstructed images. Given a new query image, either taken from the ground or from
an aerial platform, we are now in a position to match this new image to the pre-processed
geo-referenced image database. The procedure for achieving this is described in Figure
4.5.
Similar to the SfM reconstruction pipeline, we do feature detection using SIFT. We apply
feature matching using a RANSAC constraint based on Nistér's 5-point algorithm, as well as
the time-based constraints method described in Section 4.2.1 with a K skip-factor of 5.
However, the space-constraint method described in Section 4.2.1 is not applicable, as we
assume that no GPS metadata exists for the query image. Depending on how many
feature matches are found, there are several possibilities going forward:
1. No matches found: the image cannot be geo-located beyond city scale.
2. Between 1 and 4 matches found: Nistér's 5-point algorithm cannot be used to
compute the essential matrix. A weighted location estimate is computed based on
the number of matches and the GPS locations of the matched images.
3. Five or more matches found: the image can be added to bundle adjustment using
motion-only estimation.
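This three-way decision can be sketched as follows (illustrative Python; the per-image match records and return codes are hypothetical):

```python
def geolocate_query(image_matches):
    """image_matches: list of (num_feature_matches, (x, y, z)) records, one
    per matched database image. Returns a (mode, location) pair."""
    total = sum(n for n, _ in image_matches)
    if total == 0:
        return ("city-scale only", None)            # case 1
    if total < 5:                                   # case 2: no essential matrix
        est = tuple(sum(n * g[k] for n, g in image_matches) / total
                    for k in range(3))
        return ("weighted GPS estimate", est)
    return ("motion-only bundle adjustment", None)  # case 3
```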
Fig. 4.5 - Geo-locating a new image using the SfM reconstructed image database.
After the above stage, we now have a camera pose determined in a virtual coordinate
space. This camera pose can be upgraded to world coordinates using the pre-computed
similarity transform found for the aerial reconstruction, which is also used for geo-
location of the merged aerial-ground reconstruction. The result is a camera pose in world
coordinates that can now be compared to the GPS metadata available with the query
image.
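The pose upgrade can be sketched as follows (illustrative Python/NumPy; (s, R_w, t_w) is the precomputed similarity transform mapping the virtual SfM frame into ECEF):

```python
import numpy as np

def upgrade_pose(s, R_w, t_w, R_cam, C_cam):
    """Map a camera pose (world-to-camera rotation R_cam, camera center
    C_cam in the virtual frame) into world coordinates using the similarity
    transform X_world = s * R_w @ X_virtual + t_w."""
    C_world = s * (R_w @ C_cam) + t_w   # camera center in ECEF meters
    R_world = R_cam @ R_w.T             # world-to-camera rotation in ECEF
    return R_world, C_world
```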
4.3 Fine-scale Geo-location Results
Aerial imagery was collected over Lubbock, Texas using a 66Mpix airborne multi-
camera sensor, flying at an altitude of 800 meters in a circular race-track pattern, with a
collection line-of-sight 30 degrees off nadir. Using a single 360 degree race-track,
consisting of 250 images, a 3D reconstruction was computed using the algorithm detailed
in Section 4.2. It is worth noting that the 3D reconstruction algorithm does not require a
360 degree view of an area to perform a good reconstruction; the algorithm has been
successfully tested for other more general flight paths such as straight fly-bys over an
area. Ground video imagery was collected using a pickup truck mounted with a 60 MPix
multi-camera sensor on top of the cabin roof. Of the 125,000 images collected, 34,400
images overlapped in coverage with the aerial platform and were used to perform a 3D
reconstruction in the region of interest.
As the aerial data is collected from above at 30 degrees off nadir, one would expect that
the aerial data captures well the sides of many but not all the buildings. On the other
hand, the ground photos capture primarily the building facades. This might lead to some
concern as there might be no overlap in certain areas when merging the ground and the
aerial model. In our experience, the situation did not arise in our data set as most of the
buildings are not very tall and are fairly well spaced apart. Such a situation might be of
concern for a Manhattan-like city collect and could be resolved by imposing minimum
thresholds on scene overlap and residual error.
4.3.1 Reconstruction and Geo-location of Aerial Video Imagery
The results of the 3D aerial reconstruction are qualitatively shown in Figure 4.6. The
figure shows the Texas-Tech campus, along with its distinctive football stadium. Multiple
zoomed-in views of the stadium and other campus buildings are rendered using Meshlab
[60] to capture the quality of the 3D reconstruction results. The dense reconstruction has
approximately 23 million points, with a 20 cm ground sampling distance and a range
resolution of approximately 1 meter (AC units on rooftops can be resolved in height). One
can visually determine that we were able to find a good 3D metric reconstruction
(90° angles are preserved), to within a similarity transformation.
Fig. 4.6 - Aerial 3D reconstruction of 1x1km area of Lubbock, Texas using a 250 frame 66MPix video sequence. 3D rendering of the data was achieved using MeshLab.
The 3D model was geo-registered as described in Section 4.2, using the GPS metadata
available for each video frame. The quality of the geo-registration was also verified using
3D Ladar truth data. Figure 4.7-A shows a geo-located 3D Ladar truth data set collected
at an earlier date from a separate airborne sensor. The Ladar data is geo-located to within
0.5 meters and sampled on a rectangular grid with 50 cm grid-spacing. The range/height
resolution of the data is about 30 cm. Figure 4.7-B shows the 3D aerial reconstruction in
the same coordinate space as the 3D Ladar data, in order for the viewer to get a rough
comparison of the coverage area and notice that the two data sets appear well aligned.
Figure 4.7-C shows the two data sets superimposed to qualitatively demonstrate that we
obtained a good geo-registration.
Fig. 4.7 - Qualitative geo-registration results of aerial reconstruction. A) 3D Ladar map of Lubbock, Texas displayed using height color-coding, where blue represents low height, yellow/red represent increasing height. B) Geo-registered 3D aerial reconstruction. C) 3D Ladar truth data superimposed onto the 3D aerial reconstruction. Notice that there is no doubling of buildings or sharp discontinuities between the two data sets, indicating a good geo-registration.
Fig. 4.8 - Quantitative geo-registration results of aerial reconstruction. A) Histogram of initial geo-registration error with bias of 2.52m indicating good geo-registration accuracy. The σ=0.54 m indicates low internal geometric distortion (high precision). B) Histogram of geo-registration error after applying ICP. The bias has been reduced by 2.5 times to 0.84m.
A quantitative study is performed to determine geo-location accuracy and precision of the
3D aerial reconstruction by automatically aligning the aerial reconstruction to the 3D
Ladar truth dataset using an ICP algorithm from [59]. The results are shown in Figure
4.8-A: the bias of the geo-registration is 2.52 meters with σ =0.54 meters. The results
validate that we not only have good geo-location accuracy, to within about 3 meters, but
also indicate low geometric distortion of around 0.5 meters, which is on the order of the
accuracy of the truth data. It is noteworthy that the results in Figure 4.8-
A are the geo-registration errors prior to ICP alignment; in the above study, ICP is only
used to find 3D correspondences and verify the initial goodness of geo-registration. Thus,
using just GPS metadata collected with the aerial video imagery, we obtained good geo-
location accuracy to within 3 meters and high geo-location precision to within 0.5 meters.
Figure 4.8-B shows the geo-registration statistics after ICP alignment, with the bias
reduced by about 2.5 times to 0.84m. Taken altogether, the results suggest the 3D aerial
reconstruction might be readily fused with 3D Ladar data to obtain higher-fidelity
products.
4.3.2 Reconstruction and Geo-location of Ground Imagery
Ground video data was collected in Lubbock, Texas covering the same area as the
airborne sensor. The data was collected using five GPS-enabled 12 Mpix Canon D5000
cameras with a 1 Hz frame rate. Figure 4.9-A shows the captured GPS locations
superimposed on a satellite image visualization using Google Earth. Figure 4.9-B shows
the overall 3D reconstruction, height-color coded in shades of purple-green-red. The
reconstruction was composed of 44 separate components with a total of 25 million points;
most of the components tended to capture individual streets, with reconstruction typically
stopping when reaching video data of busy intersections. To appreciate the 3D
reconstruction quality, Figure 4.9-C/D/E capture zoomed-in views of the reconstructions
within the Texas Tech campus, with color texture derived from the underlying
RGB video frames.
Fig. 4.9 - Qualitative results of ground reconstruction. A) Ground recorded GPS points overlaid onto a satellite image. B) Ground reconstruction captured from two different views in height-above-ground color-coding (lowest height corresponds to purple, blue/green/red correspond to increasingly higher altitudes). C,D,E) Zoomed-in views of the 3D data with RGB color information obtained from the reconstructed images.
Using the same procedure as for the aerial reconstruction, we geo-registered the 3D
ground model by comparing the GPS data captured for each frame to the bundle-adjusted
camera locations. Figure 4.10-A/B qualitatively capture the initial geo-location error
(prior to ICP alignment): comparing the 3D Ladar data in Figure 4.10-A to the
superimposed 3D ground reconstruction and 3D Ladar data in Figure 4.10-B reveals large
geo-location errors, with doubling of building surfaces. The statistics of the geo-
registration error prior to ICP alignment are shown in Figure 4.10-C. From Figure 4.10-
C, we can determine that we have poor geo-location accuracy, with a geo-registration bias
of 9.63 meters, as well as poor geo-location precision, with σ=2.01 meters indicating that
significant distortions exist within the model. Thus, due to poor ground GPS solutions,
the geo-registration of the ground reconstruction is significantly worse compared to the
3D aerial reconstruction. The reason for these higher geo-registration errors is the
presence of slow-varying GPS bias and distortions due to multi-path effects [53]
especially present in urban canyons formed by streets surrounded by tall buildings, as
demonstrated in Figure 4.10-D.
Fig. 4.10 - Initial geo-location of ground reconstruction. A) Qualitative view of the 3D Ladar truth data, B) Same 3D Ladar data superimposed with ground reconstruction showing doubling of buildings due to large geo-location bias. C) Histogram of geo-registration errors: the bias is 9.63m with a σ=2.01m. D) Example of GPS errors encountered amidst taller buildings, which lead to poor geo-location of ground data.
A timing comparison was also run to determine the improvement in computational speed
between the approach described in [43] and our new method. The data was run on a
quad-core Xeon 2.8 GHz machine with 48 GB of RAM (this RAM size is needed to
ensure that a dense 3D reconstruction with PMVS is achievable). Results are
summarized in Table 4.1. For the aerial video imagery, our approach resulted in a 3.6x
speedup. The speedup can be attributed directly to the improvements in feature matching
based on video constraints, which reduced the feature matching step
from O(N²) to O(N). For the larger ground data set, our approach resulted in a
significant 14x speedup. The computational advantage embedded in our feature matching
step becomes more readily apparent with this larger image set.
Table 4.1
Timing comparison of the 3D reconstruction method to prior state of the art

Data Set                    Bundler v0.4    Our Method
250 image aerial video      306 min         84 min
34,400 image ground video   10,710 min      784 min
4.3.3 Geo-Registration of Combined Aerial-Ground Imagery
In order to obtain a better geo-location of the ground 3D data, we apply the 3D Merge
algorithm described in Section 4.2.2. The result is a merged 3D aerial-ground
reconstruction that is now self-consistent. Figure 4.11-A/B quantitatively capture the
before and after merge intra-registration errors between the ground and aerial 3D data.
The overall bias is reduced by an order of magnitude, from 8.86m to 0.83m, while the
intra-data distortion is reduced from σ=2.52m to σ=0.59m. Thus, the merge method
successfully removed the bias term and also reduced the intra-registration distortion
by 4x, producing a more self-consistent, complete 3D city model.
Fig. 4.11 - Improvement of geo-location after applying 3-D Merge algorithm. A) Intra-registration errors between the 3D ground and aerial data before 3-D Merge, with bias of 8.86m and σ=2.52m; B) Remaining intra-registration errors after 3-D Merge procedure with remaining bias of 0.83m and σ=0.59m. The merge method removed most of the bias term, and reduced distortions within the ground data set by 4x from σ=2.52m to σ=0.59m.
The combined aerial-ground 3D city model was verified against 3D Ladar data to
determine the final geo-registration error. Results indicate a geo-location accuracy (bias)
of 2.82m, with a geo-location precision of σ=0.74m. As expected, the final geo-location
accuracy of 2.82m is limited by how well the aerial data was initially geo-located, which
in our case was an accuracy/bias of 2.52m. The overall geo-location precision,
standing at σ=0.74m, is lower-bounded by both the geo-location precision of the aerial
reconstruction (σ=0.54m) and the precision of the ground reconstruction after the
3D merge (σ=0.59m). Thus, the combined aerial-ground data set has geo-registration
accuracy to within approximately 3 meters with geo-location precision on the order of
1m. Figure 4.12 shows the merged aerial-ground 3D reconstruction.
Fig. 4.12 - Examples of merged aerial-ground reconstruction. Aerial results are shown in black-and-white imagery, with the ground data super-imposed in color. Note that the aerial and ground data are very well registered, with no visible surface doubling present, visually confirming good aerial-to-ground registration.
4.3.4 Geo-locating a new image
In order to test the geo-location performance of image query against the reconstructed
database, a 500 image subset was kept for unit testing and not included in generating the
3D reconstructed image database. Furthermore, to prevent matching with other video
imagery in close chronological proximity to the selected images, all images within a ±10
second time window were removed, as were all consecutive images within a ±20m
contiguous-time space window (e.g., imagine choosing an image where the vehicle was
stopped for a prolonged period of time; all chronologically contiguous images with a
GPS value within 20m of the chosen image are removed). The algorithm described in
Section 4.2.3 was applied to the subset of images to generate the results shown in Figure
4.13. Figure 4.13 shows geo-location accuracy as the cumulative percentage of
images meeting each absolute geo-location accuracy value. The geo-location
performance is quite good, with 399 of the 500 (79.8%) images having enough matches
to obtain a pose estimate, with all such estimates having a geo-location accuracy under 84
meters. The remaining 101 images failed to have enough consistent matches, with the
most common failure modes observed being heavy occlusion of camera by oncoming
traffic or turns through intersections, where most of the scene structure failed to be static.
About 24% of images aligned to within 5 meters, and 61% were geo-located within 20
meters. Furthermore, most of the images that were added to the bundle adjustment step
(as opposed to having only 3D location estimates) were shown to have better than 20
meter geo-location accuracy, with 53% of images having enough matches to attempt
bundle adjustment and 27% of images geo-located using only a 3D location estimate. It is
also important to note that the truth GPS data is not perfect and does contain
significant GPS bias, which will tend to make the geo-location error appear worse than it
actually is. Nonetheless, the results are very encouraging; a large percentage of
the images have the potential to be added to the reconstructed image database in order to
further improve scene coverage and fidelity.
Fig. 4.13 - Fine-scale geo-location accuracy for a 500 image test subset. Geo-location performance is quite good, with 399 of the 500 (79.8%) images having enough matches to obtain a pose estimate, with all such estimates having a geo-location accuracy under 84 meters. The remaining 101 images failed to have enough consistent matches, with the most common failure modes observed being heavy occlusion of camera by oncoming traffic or turns through intersections, where most of the scene structure failed to be static.
4.4 Towards Improved Image and Scene Understanding
In the last three chapters, we have shown a complete system for image geo-location, starting
at a world-wide level and working our way to the city level and finally to the street
level, or even to an estimated camera pose. The developed system incorporated data from
different modalities, such as 2D GIS, aerial imagery, ground imagery as well as 3D
Ladar, to create a dense 3D world model representation, with enough statistics to be able
to geo-locate a new query image at varying levels of accuracy.
This system provides an immediate benefit in that the image can now be tagged as
coming from a particular region of the world, a particular country and even a particular
street. This additional metadata can be used as a prior for object detection and recognition,
tailoring a particular object detection algorithm toward recognizing objects that might be
found in that region of the world. However, considerably more information
can be extracted about the content of the geo-located image. Consider the 3D city model
generated by the SfM image reconstruction; most of the pixels in the query image can
now be associated with absolute 3D locations, allowing for fusion and information
transference from other geo-located data sources for improved scene understanding.
From geometric context, we can readily classify pixels as being road, ground, trees and
buildings by back-projecting the classified 3D data into the query image plane to assign
pixels as belonging to particular scene classes. An example of such ground and building
classification using 3D imagery is shown in Figure 4.14-A. An example of road detection
using information transference from a 1D GIS road layer is shown in Figure 4.14-B [61].
Further fusion with other data sources, such as GIS layers and 3D Ladar data, can act as
a further scene-understanding multiplier, with building names, business names and street
names now being associated with regions of the image, as demonstrated in Figure 4.14-C
[61]. Higher level scene understanding can also be gained, such as occlusion, missing
data reasoning and shadowing effects as demonstrated in Figure 4.14-D [62]. Using this
additional information, it might also be possible to better detect changes due to
people, cars or new construction, as regions of the query image that match
poorly to the 3D underlay back-projected into the query image camera space.
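Back-projecting classified 3D points into the query image plane can be sketched with a pinhole camera model (illustrative Python/NumPy; K is the 3x3 intrinsic matrix, R and C the world-to-camera rotation and camera center):

```python
import numpy as np

def backproject_labels(K, R, C, points, labels, width, height):
    """Project labeled 3D world points into the image and record the class
    label at each hit pixel (depth ordering omitted for brevity)."""
    label_img = {}
    cam = (points - C) @ R.T            # world -> camera coordinates
    for (x, y, z), lab in zip(cam, labels):
        if z <= 0:                      # behind the camera
            continue
        u, v, w = K @ np.array([x, y, z])
        px, py = int(u / w), int(v / w)
        if 0 <= px < width and 0 <= py < height:
            label_img[(px, py)] = lab
    return label_img
```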
Fig. 4.14 - Towards improved image understanding using information transference from other data modalities. A) Scene classification using 3D imagery. The blue color coding represents buildings, red and yellow represent ground. This classification can be back-projected into the query image plane to assign pixels a class label. B) Examples of 1D GIS road network information transference to identify pixels that are roads. C) Further information, such as building and street names, can be transferred, adding additional scene understanding. D) Occlusion, missing data reasoning and shadowing effects can be inferred from the fusion of the 2D to 3D imagery. The figure shows 3D data in-painted with a single 2D image. Occluded areas of the 3D map are shown in two shades of gray, with dark gray representing shadowed areas computed based on the sun location at the time the image was collected. This additional information can be used to better detect or track moving objects such as vehicles and persons that might be present in the query image.
Chapter 5
3D Ladar Processing: An Extension to 2D Image Geo-location
Chapter Summary: In support of geo-accuracy validation for the fine-scale geo-location method, we present a novel 3D Ladar processing method using data collected by an airborne 3D Ladar sensor. Data collected by 3D Laser Radar (Ladar) systems, which utilize arrays of avalanche photo-diode detectors operating in either Linear or Geiger mode, may include a large number of false detector counts or noise from temporal and spatial clutter. We developed an improved algorithm for noise removal and signal detection, called Multiple-Peak Spatial Coincidence Processing (MPSCP). Field data, collected using an airborne Ladar sensor in support of the 2010 Haiti earthquake operations, were used to test the MPSCP algorithm against the current state-of-the-art, Maximum A-posteriori Coincidence Processing (MAPCP). Qualitative and quantitative results are presented to determine how well each algorithm removes image noise while preserving signal and reconstructing the best estimate of the underlying 3D scene. The MPSCP algorithm is shown to have a 9x improvement in signal-to-noise ratio, a 2-3x improvement in angular and range resolution, a 21% improvement in ground detection and a 5.9x improvement in computational efficiency compared to MAPCP.
5.1 Background and Related Work
Three-dimensional Laser Radar (3-D Ladar) sensors output range images, which provide
explicit 3-D information about a scene [63][64][65]. MIT Lincoln Laboratory has built a
functional airborne 3-D Ladar system, with an array of avalanche photo-diodes (APDs)
operating in Geiger mode, that actively illuminates an area using a passively Q-switched
micro-chip laser with a short pulse width [66][67]. On each laser pulse, light
from the laser travels to the target area and some reflects back and is detected by an array
of Geiger-mode APDs. Figure 5.1 captures the 3D Ladar system concept.
Recent field tests using the sensor have produced high-quality 3-D imagery of targets for
extremely low signal levels [68][69]. Though there are many advantages to using single-
photon sensitive detector technology, the data collected using these Geiger-mode APDs
are often noisy with unwanted temporal or spatial clutter. It has been shown in previous
publications that by identifying spatial coincidences in data from as few as three laser
pulses, we can significantly reduce the probability of false alarms by several orders of
magnitude [70][71][72].
Fig. 5.1 - 3D Laser Radar (Ladar) system concept. A laser sends out a pulse of light to a target. Some of that light is reflected back and detected by an APD array. The time of flight between the send and receive of the laser pulse is recorded and converted to metric units to create a range image.
The method of finding signal in the presence of noise and clutter by using coincident
spatial data is known as coincidence processing. The more 3D points returned at the same
spatial location, the more likely that the points came from a real scene surface. In this
chapter, we discuss the implementation of a novel processing algorithm, known as Multi-
Peak Spatial Coincidence Processing (MPSCP), and test it against the current state-of-the-art
Maximum A-posteriori Coincidence Processing (MAPCP) algorithm [73]. The
contributions of this chapter are as follows:
1. A set of general methods to address typical 3D Ladar processing challenges that are
relevant to most 3D Ladar sensor systems (Linear and Geiger mode).
2. An improved 3D Ladar filtering algorithm that is shown to have a significant
improvement over current state-of-the-art, with qualitative and quantitative results
shown.
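The basic coincidence test (the more returns that land in the same spatial bin, the more likely they come from a real surface) can be sketched as voxel counting. The voxel size and return threshold below are illustrative assumptions, not the MPSCP settings:

```python
import numpy as np
from collections import Counter

def coincidence_filter(points, voxel_size=1.0, min_returns=3):
    """Keep points whose voxel accumulates at least `min_returns` returns;
    isolated noise detections rarely repeat in the same voxel."""
    keys = [tuple(k) for k in np.floor(points / voxel_size).astype(int)]
    counts = Counter(keys)
    keep = np.array([counts[k] >= min_returns for k in keys])
    return points[keep]

# Toy cloud: 5 coincident returns from a surface plus 2 isolated noise points.
surface = np.tile([10.0, 10.0, 5.0], (5, 1)) + 0.05 * np.arange(5)[:, None]
noise = np.array([[50.0, 1.0, 90.0], [3.0, 70.0, 40.0]])
filtered = coincidence_filter(np.vstack([surface, noise]))
```

The two noise points each occupy their own voxel and are discarded, while the five clustered surface returns survive.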
In the remainder of this section, we first discuss the challenges of processing 3D Ladar
data and describe how our algorithm addresses those challenges. Quantitative and
qualitative results are shown using data collected over Haiti in support of earthquake
rescue operations using the Airborne Ladar Research Test-bed (ALIRT) platform [74].
5.2 3D Ladar Background
There exists a cause-effect relationship between 3D Ladar system design / data collection
methods and the inherent processing challenges that arise and need to be subsequently
addressed. First we introduce the 3D Ladar system design, where an airborne sensor
stares at a pre-designated ground target from multiple perspectives in order to get a more
complete measurement of the scene. This concept of operations is shown in Figure 5.2.
Fig. 5.2 - 3D Ladar concept of operations. An airborne platform stares at a pre-designated ground target from multiple perspectives in order to get a more complete measurement of the scene.
Due to limitations in APD array size (number of pixels), for each viewing perspective the
detector array needs to be scanned using a sinusoid pattern in angle-angle space to get a
higher resolution 3D image of the target. Given the above system design and data
collection methods, a list of processing challenges becomes apparent:
1. Varying signal and noise levels: the scanning pattern can lead to large variations
in the absolute output level (3D point density). Background light and/or detector thermal
excitation can lead to 3D noise points with high spatial coincidence that need to be
filtered out. There is a need to know the output level to dynamically determine the
statistical significance of coincident returns.
2. Photon attenuation: Obscurants in the range direction might reduce the
probability of transmitted photons reaching a ground target. Knowledge of photon
attenuation can be used to dynamically adjust statistical significance of coincident
returns.
3. Detector-specific range attenuation: due to the nature of Geiger-mode APDs, once a
pixel is triggered at a closer range, no hits at further ranges are possible as the pixel needs
to be reset, leading to output level attenuation in the range direction.
4. Laser-detector Point Spread Function (PSF) can lead to 3D blurring of imaged
objects. A method is needed to de-blur the 3D image.
5. Platform attitude errors (GPS/INS) can add further blur to the 3D image.
6. Platform motion and signal aggregation from multiple perspectives: To increase
the signal-to-noise ratio (SNR), a method is needed to evaluate the output level from each
perspective that contributes to a particular 3D location.
7. Automatically determine optimal processing parameters: A data-driven method is
needed to obtain good, reproducible, single-run results without the need for human
intervention.
Before further discussing and addressing each individual challenge, it is crucial to notice
that most of these challenges are related by a common denominator: sensor line of sight
(LOS). Variations in signal/noise levels are orthogonal to the LOS, while photon/detector
signal/noise attenuation are along the LOS. The Laser-Detector PSF is oriented along the
LOS direction: error in range due to laser-detector timing jitter is, by definition, aligned
to the LOS, while platform attitude errors, such as GPS and inertial navigation system
(INS) errors, can also be readily thought of as orthogonal to the LOS.
The crucial insight is that processing in an appropriate line-of-sight coordinate system
plays an important role in decoupling the effects of the various processing challenges
listed above, so that each challenge can be independently addressed. Depending on
airborne platform velocity, range-to-target and target collection size, a LOS coordinate
system can be chosen to approximate the true line-of-sight while avoiding the
computational expense of ray tracing each individual APD array LOS vector and storing
the information in a 3D volumetric signal map.
Fig. 5.3 - Line-of-sight (LOS) coordinate systems for various sensor platforms. A) For slow-moving platforms, a spherical coordinate system (angle-angle-range) gives a good approximation of the LOS, while B) a skewed-cylindrical coordinate space (heading-angle-range) gives a good approximation of the LOS for fast-moving airborne platforms.
Figure 5.3 depicts several LOS coordinate systems that might be used. For instance, for
airborne platforms that are slow-moving in comparison to the range-to-target distance
and target area, a sensor-centered spherical coordinate system (angle-angle-range) can
best approximate the collection volume which tends to resemble a solid angle. For fast-
moving airborne platforms, where target area size is on the same order as platform
motion, a skewed-cylindrical coordinate system can best approximate the LOS
independent of range-to-target. Another possibility to consider for small target areas is an
inverted target-centered spherical coordinate system.
Since the ALIRT system is hosted on a fast-moving airplane and uses the airplane’s
forward heading motion to scan a target area of the same approximate size, we utilize a
skewed-cylindrical coordinate system to best approximate the LOS.
5.3 3D Ladar Processing Approach
The proposed MPSCP algorithm advances current state of the art in 3D Ladar processing
by addressing each of the processing challenges noted in Section 5.2. Figure 5.4-A/B
shows an example of raw 3D data to allow the reader to visually appreciate the large
amount of noise and clutter present.
The noisy 3D data is initially stored in the Universal Transverse Mercator (UTM) coordinate
system, a 3D space that can be locally approximated as a Cartesian coordinate space [75].
The first processing step of MPSCP is to transform the data from UTM space to an
appropriate line of sight space, which for our sensor is a skewed cylindrical coordinate
space. Using metadata, such as airborne sensor position, a LOS coordinate basis is
created and the data is transformed. We now proceed to explain how utilizing this data-
defined LOS coordinate space leads to improved computational efficiency as well as
improved 3D filtering results.
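A minimal sketch of such a skewed-cylindrical transform follows, assuming a straight-line flight segment and an angle measured from nadir about the flight axis; the exact ALIRT basis construction may differ:

```python
import numpy as np

def utm_to_skewed_cylindrical(points, sensor_start, sensor_end):
    """Approximate (heading, angle, range) LOS coordinates for a fast-moving
    airborne platform flying from sensor_start to sensor_end."""
    u = sensor_end - sensor_start
    u = u / np.linalg.norm(u)                 # heading axis along flight path
    rel = points - sensor_start
    heading = rel @ u                         # distance flown along the path
    lateral = rel - np.outer(heading, u)      # component orthogonal to heading
    rng = np.linalg.norm(lateral, axis=1)     # approximate range along the LOS
    # Angle about the flight axis, measured from straight down (nadir).
    down = np.array([0.0, 0.0, -1.0])
    down = down - (down @ u) * u
    down = down / np.linalg.norm(down)
    side = np.cross(u, down)
    angle = np.arctan2(lateral @ side, lateral @ down)
    return np.column_stack([heading, angle, rng])
```

For a point directly below the mid-point of a level flight line, the sketch returns an angle of zero and a range equal to the platform altitude, as expected.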
Fig. 5.4 - Raw Lidar data showing salt and pepper noise. A) Height color-coded example of raw 3D data (dark grey low altitude, white high altitude). The target area is obscured due to the heavy amount of noise. B) Zoomed-in version of same data set showing the measured 3D structure embedded in high levels of noise and clutter.
The next MPSCP processing step is to determine the expected 3D output level. MPSCP uses
the output level estimate to determine statistical significance of spatially coincident
returns. Statistical significance is determined in terms of a maximum likelihood estimator
given the expected output level. In this fashion, the MPSCP algorithm can dynamically
adjust its internal noise suppression thresholds to work well under most signal conditions.
Variations in output level due to the scanning pattern, photon attenuation and
detector attenuation can be accounted for accurately using the data-defined LOS coordinate
system. We first determine an initial output level, Oinitial, due to variations in scan pattern
dwell times. In our LOS coordinate system, Oinitial varies in only 2 of the 3 dimensions,
namely heading and angle, but not range. Compared to output level estimation in 3D
Cartesian coordinates, which would have required computationally expensive 3D ray
tracing and storage of a volumetric 3D array of values, the problem of estimating output
level reduces to a 2D matrix in heading-angle space using our LOS coordinate system.
This leads to increased computational efficiency as well as an implementation with
significantly lower memory overhead. An example of the computed
Oinitial output level map is shown in Figure 5.5, with output level back-projected to 3D
from 2D heading-angle space on a per raw 3D point basis.
Fig. 5.5 - Raw 3D Lidar point cloud color-coded by scan pattern induced output variation. A) Side-view of a target area, showing a notional scan pattern and the estimate of Oinitial (dark grey - low value, white - high value) obtained using the LOS space. B) Heading-angle view of the same target area. The LOS is shown to be accurately estimated, with high output levels (white) at the edge of the sinusoid scan pattern due to decreased angular velocity as the scan mirror changes direction, leading to an increase in 3D point density level.
The output level is also affected by photon attenuation as well as detector attenuation in
range. Photon attenuation due to line-of-sight blocking needs to be taken into account
when computing the statistical significance of coincident returns. Detector-specific output
attenuation in range has a similar effect as photon attenuation, reducing expected output
level at further range values along the LOS. To account for these data-dependent effects,
the data at a particular heading-angle location (which can be visually represented as a
chimney of data in the range direction) is binned into a range histogram H. For each
range bin i of histogram H, MPSCP keeps track of returns that have occurred at closer
ranges versus returns that have occurred at further ranges to determine an expected output
attenuation value. Equation 5.1 numerically captures the method for determining
attenuated output level, Oattenuated, as a function of range along the LOS, while Figure 5.6
visually describes the method to account for photon and detector range attenuation
effects.
Oattenuated(i) = Oinitial(h1, a2) · [1 − CH(i) / CH(N)] [Eq. 5.1]
where, Oattenuated(i) is the attenuation-corrected expected output level, Oinitial is the expected output level at a particular heading-angle location [h1, a2] determined solely based on the scan pattern (no attenuation correction), CH(i) is the cumulative histogram of range histogram H at range bin i and N is the last (furthest) range histogram bin.
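Eq. 5.1 can be sketched numerically as follows; whether CH(i) includes bin i itself is a bin-boundary convention we assume here:

```python
import numpy as np

def attenuated_output(o_initial, range_hist):
    """Eq. 5.1: scale the scan-pattern output level Oinitial by the fraction
    of the chimney's returns not already consumed at closer ranges."""
    c = np.cumsum(range_hist)          # cumulative histogram CH(i)
    return o_initial * (1.0 - c / c[-1])

# A chimney of 10 returns: the 4 + 4 near-range returns attenuate the
# expected output level of the farther bins.
levels = attenuated_output(10.0, np.array([4, 4, 2]))
```

With this convention the near bin keeps most of the initial level while the farthest bin is fully attenuated.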
Fig. 5.6 - Method for correcting for photon and detector range attenuation effects. A). Example range histogram and cumulative range histogram for a localized heading-angle location. The cumulative histogram is used to compute an attenuated output level compared to the initial level, in order to account for photon and range blocking effects. B). Side-view of raw 3D Lidar point cloud color-coded by output attenuation in range and C) view orthogonal to the LOS, showing the effects of photon / detector attenuation as a function of range. Notice how the output attenuation changes from low (dark grey) to high (white) as the line-of-sight passes through obscuration.
Having determined the expected output level, a method is needed to find spatially
coincident returns to distinguish signal from noise. Spatial coincidence of points is
affected by the laser-detector 3D Point Spread Function (PSF) as well as platform attitude
errors, leading to blurring of the 3D image. The laser-detector 3D (angle-angle-range)
PSF can be decoupled into the angular response to a step-response in range (such as a
ground to building edge), followed by the range response to a flat surface. Figure 5.7
captures the methodology used to determine the 3D PSF. MPSCP uses the PSF as a 3D
matched filter to integrate signal and find 3D locations that have enough returns to be
considered statistically significant. Since our LOS coordinate system is already well
aligned to the 3D PSF, the 3D matched filter can be efficiently applied to the data. The
matched filter is also used for sub-voxel estimation of the filtered return 3D location,
effectively removing the PSF-induced blur.
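The matched-filter scoring step can be illustrated with a small pure-numpy sketch; the separable Gaussian PSF below is an illustrative stand-in for the fitted Gamma/Gaussian responses of Figure 5.7:

```python
import numpy as np

def gaussian_1d(n, sigma):
    """Unnormalized 1-D Gaussian kernel samples."""
    x = np.arange(n) - n // 2
    return np.exp(-0.5 * (x / sigma) ** 2)

def matched_score(volume, psf, center):
    """Correlation of the normalized PSF with a return-count volume at one
    voxel; a single-voxel stand-in for the full 3D matched-filter pass."""
    kh, ka, kr = psf.shape
    h, a, r = center
    patch = volume[h - kh // 2: h + kh // 2 + 1,
                   a - ka // 2: a + ka // 2 + 1,
                   r - kr // 2: r + kr // 2 + 1]
    return float((patch * psf).sum() / psf.sum())

# Illustrative separable PSF: broad in the two angular axes, tight in range.
psf = (gaussian_1d(5, 1.0)[:, None, None]
       * gaussian_1d(5, 1.0)[None, :, None]
       * gaussian_1d(3, 0.5)[None, None, :])

volume = np.zeros((15, 15, 15))
volume[6:9, 6:9, 6:9] = 1.0   # coincident returns on a compact surface
volume[2, 2, 12] = 1.0        # one isolated noise return
```

Scoring the cluster center against the isolated return shows how PSF-shaped integration separates signal from noise before thresholding.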
Fig. 5.7 - Computation of laser-detector 3D point spread function. A) Angular response to a 3D edge and its Gamma-function fit, B) Range response to a flat plate and its associated Gaussian fit.
Another source that affects the spatial coincidence of points is platform attitude errors.
These errors occur due to drift in the GPS/INS solution, as well as due to errors induced
by the scanning hardware: such errors occur during changes in view-point perspective,
which require a sharp step-response in angular space from the scanning hardware. Due to
insufficient bandwidth, small angular errors can occur. These angular errors, combined
with GPS/INS drift, can lead to blurring that can be several times bigger than the 3D
PSF-induced blur. MPSCP corrects for these blurring errors by employing a two-stage
filtering process. Figure 5.8 shows the overall MPSCP processing block diagram. In the
first stage filter, data from each single viewpoint is processed independently: starting
with a noisy 3D data set per viewpoint, a unique data-defined LOS coordinate system is
created and the data is processed along the line of sight to produce a filtered 3D data set
per viewpoint. A secondary output is also created, which consists of the original 3D
noisy data appended with LOS statistics per point, such as the expected output level value
as visualized in Figures 5.5 and 5.6. Using the single-viewpoint 3D filtered data sets, we
align all data sets to a single reference view, chosen as the data set with the largest
amount of data. The alignment method uses a variant of the Iterative Closest Point
[58][59] algorithm with six degrees of freedom (3D rotation and 3D translation), which
produces results with sub-pixel error correction. The six-degree-of-freedom transformation is also
applied to the raw 3D point cloud data that has been appended with LOS statistics per
point. To detect weak signals that might have been missed when processing data on a
single-viewpoint basis, a second-stage filter takes the aggregated, de-blurred, multi-
viewpoint data set and processes the data in a similar manner to the first-stage. Since the
data is taken from multiple perspectives, MPSCP defaults to using a UTM-aligned 3D
Cartesian coordinate space to process the aggregated data. The second-stage coincidence
processor uses the expected output level saved on a per-point basis from the first stage
filter to determine a statistical noise threshold to filter the multi-viewpoint aggregated
data set. Compared to the first-stage LOS filter, in the multi-viewpoint second stage the
noise filtering and detection are performed along the Z direction using a histogram
composed of a vertical chimney of data.
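The viewpoint-alignment step above relies on a six-degree-of-freedom ICP variant. As a didactic stand-in (not the accelerated implementation of [58][59]), a minimal ICP can be sketched as:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Kabsch/SVD solve for R, t minimizing ||R @ src_i + t - dst_i||."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # proper rotation (det = +1)
    return R, cd - R @ cs

def icp(src, dst, iters=20):
    """Six-degree-of-freedom ICP with brute-force nearest neighbors
    (O(N*M) per iteration; fine for a sketch, not for city-sized clouds)."""
    cur = src.copy()
    for _ in range(iters):
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        matched = dst[d2.argmin(axis=1)]      # closest dst point per src point
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
    return cur
```

Given a cloud perturbed by a small rotation and translation, the sketch recovers the alignment to the reference view.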
As shown in the block diagram, MPSCP requires a single input parameter: processing
resolution in meters. This processing resolution is used to automatically determine a
binning size in the LOS coordinate space for the first-stage coincidence processing filter
as well as 3D PSF matched filter size. An accurate output level estimate is computed
directly from the data, which takes into account photon and detector output attenuation
effects, allowing MPSCP to dynamically adjust its noise-suppression thresholds to filter
most of the noise while keeping weak signals. In comparison, the MAPCP algorithm,
which represents the current state of the art in 3D Ladar processing, does not have this
type of automatic, data-dependent parameter tuning, with the user required to manually
determine the size of the 3D matched filter and manually choose an optimum threshold.
This typically requires multiple runs per data set for an operator to determine a good set
of parameters. Furthermore, since MAPCP does not take into account photon or detector
output attenuation effects, the algorithm has difficulty in keeping weak signals in low
output level regions while at the same time removing noise in high output level regions.
Fig. 5.8 - MPSCP algorithm block diagram. MPSCP has a two-stage filtering process. Data from each viewpoint is first processed independently in its own unique LOS coordinate system. Two outputs are created, namely a filtered 3D data set per viewpoint and the original noisy 3D data with LOS statistics, such as output level on a per point basis. The individual filtered data sets are aligned to remove attitude errors, with the transform applied to the noisy 3D data set. A second filtering stage ingests the aggregated data set to detect weak signals that might have been missed by the first stage filter, leading to a final 3D filtered output.
5.4 3D Filtered Results and Discussion
The MPSCP algorithm was tested against MAPCP on multiple data sets collected over
Port-au-Prince, Haiti, as part of the 2010 earthquake response. The data was used to
determine the navigability of streets as well as to quickly respond to population
movement into tent cities that sprang up overnight. By accurately counting the
number of tents, an accurate assessment could be made of the quantity of essential
supplies needed for each tent city.
5.4.1 Qualitative Results
Figure 5.9-A shows height-intensity color-coded MAPCP results for a target-mode data
set collected from multiple perspectives. Figure 5.9-B shows the MPSCP results for
visual comparison. From the results, one can visually discern that MPSCP has
significantly better angular resolution as well as range resolution compared to MAPCP,
with sharper palm tree branches and building edges, and better resolved car shapes. In
addition, the MPSCP results have almost all the noise removed, while the MAPCP
algorithm still has a large amount of noise present (visually seen as salt-and-pepper noise
above the road and other open areas). In Figure 5.9-C/D, we show the same data set, now
zoomed-in and cropped in the z-direction to reveal the presence of tents. The MPSCP
results, shown in Figure 5.9-D, demonstrate improved 3D scene coverage and
reconstruction under weak signal conditions compared to MAPCP.
Fig. 5.9 - Visual comparison of MAPCP versus MPSCP results on a Haiti tent city collected in January 2010. A) MAPCP results and B) MPSCP results. C) Zoomed-in view of the center of the target area showing the tent city under obscurant using MAPCP, and D) same view of MPSCP results. The MPSCP results are shown to have less noise, have sharper edges with less blurring on buildings, cars, and palm trees lining the street, and have better 3D scene coverage in weak signal areas under obscuration (fewer no-signal voids, shown as black pixels in the image).
5.4.2 Quantitative Results
Using metrics developed by Lopez et al. [73], we quantitatively evaluated the data sets
shown in Figure 5.9. Signal-to-noise ratio (SNR) was measured in a flat area out in the open:
processed 3D points that fell within a height envelope above and below the ground were
considered valid detections; points above or below were considered noise. MPSCP had
an SNR of 97x while MAPCP had an SNR of 10.8x. MPSCP has a 9x improvement in
SNR, close to an order of magnitude better than MAPCP.
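The SNR metric just described can be sketched as follows; the height-envelope half-width is an illustrative assumption:

```python
import numpy as np

def ground_snr(z, ground_z, envelope=0.5):
    """SNR over a flat open area: points within +/- envelope of the known
    ground height are valid detections, everything else counts as noise."""
    signal = np.abs(z - ground_z) <= envelope
    noise = (~signal).sum()
    return signal.sum() / max(noise, 1)
```

For example, nine returns near the ground and one stray high return yield an SNR of 9x.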
Figure 5.10-A shows the results of a line spread function (LSF) metric to evaluate
angular resolution. The LSF results indicate that MPSCP has a 3x improvement in
angular resolution. Range resolution was measured by segmenting out the roof-top of a
building, followed by slope-bias removal using principal-component analysis to align the
plane normal axis to the z (up) direction. The resulting MAPCP and MPSCP range
histograms are shown in Figure 5.10-B; MPSCP has a 2x improvement in range
resolution. Ground scene reconstruction was also evaluated, as shown in Figures 5.10-C
and 5.10-D. The 3D data was cropped in the z direction to include only 3D returns on the
ground and tents; the data was binned in the x-y directions to create a binary filled vs.
empty pixel image. Results indicate that MPSCP found 21% more ground cover
compared to MAPCP.
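The ground-coverage metric can be sketched as a filled-pixel fraction; the pixel size and extent handling are illustrative assumptions:

```python
import numpy as np

def ground_coverage(points_xy, extent, pixel=1.0):
    """Fraction of x-y pixels containing at least one cropped ground/tent
    return (the filled vs. empty binary image described above)."""
    (x0, x1), (y0, y1) = extent
    nx = int(np.ceil((x1 - x0) / pixel))
    ny = int(np.ceil((y1 - y0) / pixel))
    ix = ((points_xy[:, 0] - x0) / pixel).astype(int).clip(0, nx - 1)
    iy = ((points_xy[:, 1] - y0) / pixel).astype(int).clip(0, ny - 1)
    filled = np.zeros((nx, ny), dtype=bool)
    filled[ix, iy] = True
    return filled.mean()
```

Comparing this fraction between two filtered outputs over the same extent gives the relative ground-cover improvement reported above.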
The improved ground signal detection of MPSCP compared to MAPCP, while retaining
high-frequency information out in open areas, can be attributed to the use of dynamic
thresholding based on an accurate output-level estimate that takes into account photon
and detector output attenuation effects due to obscuration. The use of dynamic
thresholding allows the MPSCP algorithm to detect weak signals under obscuration,
while still removing heavy noise in high output level areas. By contrast, MAPCP does not
employ data-driven noise thresholding, leading the algorithm to have difficulty in
keeping weak signals in low output level areas while at the same time removing noise in
high output level areas.
Fig. 5.10 - Coincidence processing quantitative results. A) MAPCP vs. MPSCP line spread function (LSF), showing that MPSCP has an improvement of about 3x in angular resolution. B) MAPCP range resolution versus MPSCP range resolution, showing an improvement in the MPSCP result of 2x. C) Ground coverage for MAPCP and D) MPSCP, with voids shown as black pixels. MPSCP recovered 21% more ground cover compared to MAPCP.
A timing analysis was run on 4 multi-viewpoint data sets using a 12-core, 3 GHz Intel
Xeon machine. Both MPSCP and MAPCP were run at the same processing resolution
with the default processing parameters. The overall conclusion from the timing results is
that MPSCP is about 6 times faster than MAPCP. Besides extensive testing on 4 multi-
viewpoint data sets collected in Haiti, the algorithm has been successfully tested on a
large-scale 3D map data set covering approximately 30 square km of Port-au-Prince,
Haiti. The MPSCP algorithm produced good, single-run results without the need for
parameter tweaking. The removal of the need for human intervention is of tremendous
importance for algorithm scalability to the large amounts of 3D Ladar data generated
in the field.
In summary, we have described a set of general methods to process 3D Ladar data that
are relevant to most 3D Ladar sensor systems, with either Linear-mode or Geiger-mode
APDs. We have also described in detail a novel 3D Ladar filtering algorithm that is
shown to be a significant improvement over the current state of the art. Qualitative results
indicate sharper 3D images with building and tree structure better resolved. The algorithm
was also able to remove more noise while preserving weak signal areas as visually
demonstrated in the form of improved ground coverage under obscuration. The use of
automatic, data-driven parameter tuning allows MPSCP to produce good, single-run
results without the need for human intervention.
Chapter 6
Conclusion
In this chapter, we summarize the contributions of the research work in this thesis, review
recent developments in literature and discuss promising directions for future research.
6.1 Contributions
Image geo-location on a world-wide scale is a very challenging problem. Besides being
an interesting problem in itself, it can be tremendously useful for many other vision tasks,
such as image retrieval, object detection and recognition. For instance, the distribution of
likely geo-locations of a particular image provides additional context, such as terrain
type, population density, and prominent cultural markers. This additional metadata can be
used as priors for object detection and recognition to tailor a particular object detection
algorithm toward recognizing objects that might be found in that particular region of the
world.
For this research, we developed a hierarchical image geo-location and 3D reconstruction
framework using a coarse-to-fine localization approach on a 6.5 million image
database. By design, the approach presented is scalable to larger databases and may be
highly beneficial for many research communities; such communities include, but are not
limited to, online social networking sites, intelligence agencies and companies dealing
with large-scale data mining.
The presented approach starts off with a coarse geo-location method, where a query
image is roughly geo-located to a particular region of the world by classifying the terrain
type in that particular image. To achieve the goal of image geo-location by terrain
classification, we first create a 3D world model representation composed of a large
training database of geo-tagged, terrain labeled images. This database is created by
merging knowledge from three publicly available databases, namely a geo-spatial terrain
type and land coverage database, a 6.5 million image database that is only geo-tagged and
a database of terrain-labeled images. We developed a coarse geo-location method that
uses the generated 3D world model to test a hold-out set of 5000 images. We
demonstrated an improvement over the current state of the art in terrain classification,
achieving over 91% classification accuracy, a significant improvement of 5.72% over
the baseline. The proposed method has several advantages over prior approaches [8], in
that the method is robust to images with noisy geo-labels, works in a low dimensional
feature space to avoid the curse of dimensionality [9] and reduces the database size in
order to allow for more complex follow-on stages to be computationally tractable.
A medium-scale geo-location method was implemented that improves upon previous
image retrieval techniques to geo-locate a query image to city-level accuracy. We
developed an improved KNN-SVM approach that is not only computationally tractable,
but also provides significantly improved classification performance over a KNN only
method. The hierarchical coarse and medium geo-location framework was tested on a
geo-tagged 6.5 million image database and demonstrated to have a relative improvement
of 10% in geo-location accuracy compared to previous methods applied up to city level
geo-location. Results summarizing the coarse and medium geo-location method are
published in [76].
Once we have geo-located a query image to a particular city, we go to the final step in the
geo-location progression by attempting to estimate the pose from where that particular
image was taken. To achieve this, we first process a training data set using Structure-
from-Motion (SfM) techniques, where we take our training images for a particular city,
find feature correspondences and upgrade our correspondences to 3D locations to create a
3D model of the city scene. The relative camera poses, along with the 3D reconstruction,
are then geo-located using GPS image metadata that might be available with a subset of
the training images in our city-wide image database. A query image can then be geo-
located and attached to the training image database using a similar SfM procedure. Our
contribution to the SfM research area is an efficient method for 3D
reconstruction on a city-wide scale using ground video imagery as well as aerial video
imagery in order to compute a more complete and self-consistent geo-registered 3D city
model. The reconstruction results of a 1x1 km city area, covered with a 66 Mega-pixel
airborne system along with a 60 Mega-pixel ground camera system, are presented and
validated to geo-register to within 3 meters of prior airborne-collected Ladar data.
Compared to prior approaches, the new method has a computational speed-up on the
order of 4 to 14x depending on database size. Results summarizing the fine-scale geo-
location approach are published in [77]. As a proof-of-concept, we leveraged the newly
developed 3D world model to perform information transference from other geo-located
labeled data sources to the respective query image in order to demonstrate improved
image understanding.
In support of validation of our fine geo-location method, we developed a novel 3D Ladar
processing method using data collected by an airborne 3D Ladar sensor. Data collected
by 3D Laser Radar (Ladar) systems, which utilize arrays of avalanche photo-diode
detectors operating in either Linear or Geiger mode, may include a large number of false
detector counts or noise from temporal and spatial clutter. We present an improved
algorithm for noise removal and signal detection, called Multiple-Peak Spatial
Coincidence Processing (MPSCP). Field data, collected using an airborne Ladar sensor in
support of the 2010 Haiti earthquake operations, were used to test the MPSCP algorithm
against current state-of-the-art, Maximum A-posteriori Coincidence Processing
(MAPCP). Qualitative and quantitative results are presented to determine how well each
algorithm removes image noise while preserving signal and reconstructing the best
estimate of the underlying 3D scene. The MPSCP algorithm is shown to have 9x
improvement in signal-to-noise ratio, a 2-3x improvement in angular and range
resolution, a 21% improvement in ground detection and a 5.9x improvement in
computational efficiency compared to MAPCP. Results summarizing the 3D Ladar
processing approach are published in [78] [79].
6.2 Recent Developments
In this section, we review recent work in the literature that is related to the work
presented in this thesis.
Altwaijry H. et al. tackled the problem of image geo-location using Google-glass
imagery [80]. They extracted different features and in one case did testing using a
much higher dimensional vector (upwards of 300K dimensions instead of around
2K dimensions as in our case). The training data set used was quite small at
1204 images (vs. 6.5 million in our case). For their data set, they obtained good
geo-location accuracy (70-80%), though it was not clear how dense the images
were collected and if the high geo-location accuracy was due to instance level
learning versus the more desirable general image learning. For our geo-location
method, we might consider using some of the features proposed in [80].
G. Baatz et al. focused on geo-location of images in mountainous terrain at the
country level [81]. The research utilizes a high-resolution digital terrain model to
form a sky contour. The sky contour is quantized into a feature vector that can be
matched to a database of GPS labeled images with pre-computed sky-contours. A
variant of ICP [69] is applied to deal with the lack of rotation invariance for the
sky contours. The approach proves to have good geo-location performance with
upwards of 80% of images having a geo-location error lower than 1 km. At its
essence, the method allows for virtual sampling of the earth to densify sparse
remote regions on the globe (a problem that is apparent in our Flickr data set) in
order to allow for good image matching. The technique should work well for
mountainous areas and possibly coastal areas, and might be used to improve our
geo-location method once we detect the presence of mountainous/coastal terrain
from the coarse-scale classifier. However, the method presented in [81] will not
work for other terrain types such as urban, forest or countryside, which typically
contain sky contours that vary strongly with small changes in viewing
perspective or, in the case of forests, may not be well defined due to the lack of a
contiguous sky region.
T.-Y. Lin et al. presented work on cross-view image geo-localization, where
satellite imagery was used as a truth database and matched to ground-based
imagery [82]. In particular, a land cover database was used to label the satellite
imagery, with that information being used to predict which regions might best
match the query ground-based imagery. The method has some similarities to our
approach, in that the land-cover database is used to extract further information
that can reduce the geo-spatial search. The method was applied to small region
(city scale) and obtained accuracy results that were 2x better than chance. The
research work has some overlap with our approach and provides another method
to use land-cover databases to improve geo-location of ground based imagery.
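The land-cover pruning step can be illustrated with a toy sketch (the cell structure, labels and compatibility table below are hypothetical, not those of [82]): candidate geo-cells whose land-cover label is incompatible with the terrain class predicted from the query image are dropped before any expensive image matching.

```python
# Each candidate geo-cell carries a land-cover label from a GIS database
# (labels here are illustrative, not the actual classes used in [82]).
candidates = [
    {"cell": (42.35, -71.06), "cover": "urban"},
    {"cell": (44.27, -71.30), "cover": "forest"},
    {"cell": (42.36, -71.05), "cover": "urban"},
    {"cell": (41.90, -70.50), "cover": "water"},
]

# Terrain classes an image-level classifier might output, mapped to the
# land-cover labels they are compatible with (hypothetical mapping).
compatible = {
    "urban":    {"urban"},
    "mountain": {"forest", "barren"},
    "coast":    {"water", "urban"},
}

def prune_by_land_cover(predicted_terrain, cells):
    """Drop geo-cells whose land cover cannot match the predicted terrain,
    shrinking the search space before expensive image matching."""
    ok = compatible.get(predicted_terrain, set())
    return [c for c in cells if c["cover"] in ok]

print([c["cell"] for c in prune_by_land_cover("urban", candidates)])
# keeps only the two urban cells
```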
Research work in [83][84] focused on matching ultra-wide-baseline aerial
imagery in urban environments, an issue that is not directly addressed by our fine-
scale aerial video geo-location method, where we typically have a narrow baseline
between images. The research reported promising results for matching images
that have gone through large rotational (more than 30 degrees) and translational
changes, circumstances under which SIFT matching is known to fail. For our
approach, we did in fact use SIFT matching, as we did not need to address the
issue of matching over wide baselines: the training imagery is composed of
high-frame-rate aerial and ground-based video imagery. However, the research in
[83][84] might be used as an extension to the fine-scale geo-location method in
Chapter 4 to further improve both ground and aerial geo-location in cases where
a query image has a wide baseline relative to any previously collected
training imagery.
Our review of recent work on image geo-location confirms that other authors are starting
to present geo-location methods that use multiple GIS data sources, along the lines of our
proposed approach. In particular, other researchers have found that land-cover
databases can add significant information that is helpful to ground-based image geo-
location. The presented methods are quite different from the one proposed in this thesis,
but the work is very much complementary and can be brought into the framework of the
hierarchical geo-location approach.
6.3 Future Work
Future areas of investigation will focus on further improving the coarse geo-location by
upgrading the KNN classifier to a KNN-SVM classifier similar to the one used for the
medium-scale geo-location. We would also like to consider expanding the number of
terrain classes, allowing for improved data reduction and geo-location specificity. In
particular, the “country” and “urban” classes tend to account for more than half of all
images in the image database and need to be further subdivided. Toward that goal, we
might add several classes, namely a “savanna/arid” class, as well as
further subdivide the “urban” class into “suburban” versus “dense urban” classes.
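As a sketch of the proposed upgrade, the following illustrates the basic KNN-SVM idea of [26]: find the query's K nearest neighbors, and train an SVM on just those neighbors so the decision boundary is local. It is a minimal stand-in, not the actual classifier: it uses a deterministic subgradient solver for a binary linear SVM (labels +1/-1), and all parameters and data are illustrative.

```python
import math

def knn_svm_classify(query, data, labels, k=5, epochs=200, lam=0.01):
    """KNN-SVM sketch: restrict an SVM to the query's K nearest neighbors.
    Binary labels are assumed to be +1 / -1 (a simplification)."""
    # 1. Find the K nearest neighbors by Euclidean distance.
    idx = sorted(range(len(data)), key=lambda i: math.dist(query, data[i]))[:k]
    ys = [labels[i] for i in idx]
    if len(set(ys)) == 1:          # all neighbors agree: no SVM needed
        return ys[0]
    xs = [data[i] for i in idx]
    # 2. Train a linear SVM on the neighbors by batch subgradient descent
    #    on the regularized hinge loss.
    d = len(query)
    w, b = [0.0] * d, 0.0
    for t in range(1, epochs + 1):
        eta = 1.0 / (lam * t)
        gw = [lam * wj for wj in w]
        gb = 0.0
        for x, y in zip(xs, ys):
            if y * (sum(w[j] * x[j] for j in range(d)) + b) < 1:
                for j in range(d):
                    gw[j] -= y * x[j] / len(xs)
                gb -= y / len(xs)
        w = [wj - eta * gj for wj, gj in zip(w, gw)]
        b -= eta * gb
    # 3. Classify the query with the local decision boundary.
    score = sum(w[j] * query[j] for j in range(d)) + b
    return 1 if score >= 0 else -1

# Toy 2D example: two well-separated clusters.
data = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [-1, -1, -1, 1, 1, 1]
print(knn_svm_classify((0.5, 0.5), data, labels, k=4))  # -1
```

With k=4 the neighborhood mixes both classes, so the local SVM is actually trained; with k=3 all neighbors agree and the shortcut applies, which is part of what makes KNN-SVM cheap in practice.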
In regards to medium-scale geo-location, future work might include adding additional
features, learning which features are more important for geo-location and discarding
features that have low discriminatory power. For the fine-scale geo-location method,
recent work in the literature suggests that a further hierarchical approach can be applied
within the 3D SfM reconstruction to achieve a higher percentage of images that are part
of the initial 3D reconstruction. We are actively pursuing similar methods for real-time
reconstruction of 2D imagery from small UAVs.
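One simple way to realize the feature-pruning idea above, offered only as an illustrative sketch rather than the method we would necessarily adopt, is to rank features by a Fisher-style discriminability score (between-class scatter over within-class scatter) and discard the lowest-ranked features:

```python
from statistics import mean, pvariance

def fisher_scores(samples, labels):
    """Score each feature by between-class scatter over within-class
    scatter. Higher scores indicate more discriminative features;
    low-scoring features are candidates for removal."""
    classes = sorted(set(labels))
    d = len(samples[0])
    overall = [mean(s[j] for s in samples) for j in range(d)]
    scores = []
    for j in range(d):
        between = within = 0.0
        for c in classes:
            vals = [s[j] for s, y in zip(samples, labels) if y == c]
            between += len(vals) * (mean(vals) - overall[j]) ** 2
            within += len(vals) * pvariance(vals)
        scores.append(between / within if within > 0 else float("inf"))
    return scores

# Feature 0 separates the (hypothetical) terrain classes; feature 1 is noise.
X = [(0.0, 0.3), (0.1, 0.9), (0.9, 0.2), (1.0, 0.8)]
y = ["country", "country", "urban", "urban"]
s = fisher_scores(X, y)
assert s[0] > s[1]
```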
Furthermore, in this thesis we have only dealt with geo-locating a single query image.
Research in [7] has extended this to a sequence of images collected over multiple days.
We would like to continue along that line of research by extending the presented thesis
work to the problem of geo-locating a short video sequence, where most of the
information is much more narrowly localized in both time and space, making the problem
more challenging.
In terms of 3D Lidar processing, we are currently making great strides toward improved
signal detection. In particular, the multi-viewpoint 3D filter described in this thesis is not
very good at capturing vertical surfaces, since we perform peak detection on a histogram
built from a vertical chimney of data. We are developing a new multi-viewpoint filter
that addresses this shortcoming. Furthermore, we are developing algorithms that process
large amounts of data (1 GB/sec) in real time, as well as pursuing extreme Lidar platform
SWaP (size, weight and power) reductions on the order of 10^9, to obtain 3D data
collection capabilities on a small UAV similar to those of prior airborne systems that
required a much larger, manned, airborne platform.
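The vertical-surface limitation can be seen in a toy version of the filter (a deliberate simplification, not the actual implementation; resolutions and data are illustrative): each (x, y) ground cell collects the z values of its points into a histogram, and only the peak bin survives, so a wall that spreads its returns across many z bins within a single chimney is mostly discarded.

```python
from collections import Counter, defaultdict

def chimney_filter(points, xy_res=1.0, z_res=0.5):
    """Toy single-peak noise filter: for each (x, y) ground cell, histogram
    the z values of all points in that vertical 'chimney' and keep only the
    peak bin. Horizontal surfaces (many returns at one height) survive; a
    vertical wall spreads its returns over many z bins and is lost."""
    chimneys = defaultdict(list)
    for x, y, z in points:
        chimneys[(int(x // xy_res), int(y // xy_res))].append(int(z // z_res))
    kept = []
    for (cx, cy), zbins in chimneys.items():
        peak, _ = Counter(zbins).most_common(1)[0]
        kept.append((cx, cy, peak * z_res))
    return kept

# A flat roof: 10 returns at z = 5 in one chimney -> survives as one point.
roof = [(0.2, 0.1 * i, 5.0) for i in range(10)]
# A vertical wall: 10 returns spread over z in another chimney -> only one
# z bin (and hence almost none of the wall) survives.
wall = [(2.5, 0.5, 0.5 * i) for i in range(10)]
out = chimney_filter(roof + wall)
```

Because each chimney emits at most one point, the wall's ten returns collapse to a single output, which is exactly the behavior the new multi-viewpoint filter aims to fix.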
In addition, we are developing classification algorithms that use fused 2D and 3D data for
improved scene understanding and object recognition. These algorithms are to be used to
distinguish natural from man-made content, with further sub-classification into object
classes such as trees, rivers, buildings, cars, roads and trails.
References
1. http://blog.flickr.net/en/2006/08/29/geotagging-one-day-later/
2. Graham, M., Hale, S. A. and Stephens, M. (2011) Geographies of the World’s Knowledge.
London, Convoco! Edition.
3. W. Zhang and J. Kosecka. Image Based Localization in Urban Environments, 3DPVT 2006
4. Amir Roshan Zamir and Mubarak Shah, Accurate Image Localization Based on Google Maps
Street View, ECCV, 2010
5. M. Pollefeys, D. Nister, J. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D.
Gallup, S. Kim, P. Merrell, Detailed Real-Time Urban 3D Reconstruction from Video. IJCV,
Volume 78, Issue 2-3:143–167, July 2008
6. Noah Snavely: Scene Reconstruction and Visualization from Internet Photo Collections,
Doctoral thesis, University of Washington, 2008
7. James Hays, Alexei A. Efros. IM2GPS: estimating geographic information from a single
image. CVPR 2008.
8. James Hays, Large Scale Scene Matching for Graphics and Vision, CMU PhD Thesis, 2009.
9. Richard Ernest Bellman; Rand Corporation (1957). Dynamic programming. Princeton
University Press. ISBN 978-0-691-07951-6
10. D. Lowe, Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110,
2004
11. De Bonet, J.S. and Viola, P. 1997. Structure driven image database retrieval. Advances in
Neural Information Processing, 10:866–872.
12. A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in
recognition. In Visual Perception, Progress in Brain Research, volume 155, 2006.
13. Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak
geometric consistency for large scale image search. In European Conference on Computer
Vision, volume I, pages 304–317, Oct 2008.
14. D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, volume 2,
pages 2161–2168, 2006.
15. J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving
particular object retrieval in large scale image databases. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2008.
16. C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: label transfer via dense scene
alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2009.
17. http://unstats.un.org/unsd/demographic/products/dyb/dybsets/2012.pdf
18. Global Land Cover Characterization Database, http://edc2.usgs.gov/glcc/glcc.php
19. United Nations Environment Programme, Mountains and Treed cover in Mountain Regions
(2002) http://www.unep-wcmc.org/mountains-and-tree-cover-in-mountain-regions-
2002_724.html
20. N. Rasiwasia and N. Vasconcelos, “Holistic context modeling using semantic co-
occurrences," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2009).
21. A. Oliva and A. Torralba, “Modeling the Shape of the Scene: A Holistic Representation of the
Spatial Envelope," International Journal of Computer Vision 42(3), 145–175 (2001), URL
http://dx.doi.org/10.1023/A:1011139631724.
22. A. Torralba, “Understanding visual scenes," Video Lecture (2009), URL
http://videolectures.net/nips09_torralba_uvs/.
23. B.C. Russell, A. Torralba, K.P. Murphy, and W.T. Freeman, “LabelMe: a database and
web-based tool for image annotation," International Journal of Computer Vision 77(1-3), 157–173
(2008).
24. S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A.Y. Wu, “An optimal algorithm
for approximate nearest neighbor searching in fixed dimensions," in ACM-SIAM Symposium on
Discrete Algorithms (1994), pp. 573–582.
25. USGS Land Use/Land Cover System Legend,
http://edc2.usgs.gov/glcc/globdoc2_0.php#app3
26. Hao Zhang, Alexander C. Berg, Michael Maire, and Jitendra Malik. Svm-knn: Discriminative
nearest neighbor classification for visual category recognition. CVPR ’06, 2006.
27. John C. Platt. Sequential minimal optimization: A fast algorithm for training support vector
machines, 1998.
28. G. Wang, D. Hoeim, and D. A. Forsyth. Learning image similarity from flickr groups using
stochastic intersection kernel machines. In ICCV, 2009.
29. Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak
geometric consistency for large scale image search. In European Conference on Computer Vision,
volume I, pages 304–317, Oct 2008.
30. D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, volume 2,
pages 2161–2168, 2006.
31. J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving
particular object retrieval in large scale image databases. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2008.
32. C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: label transfer via dense
scene alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2009.
33. Antonio Torralba, Rob Fergus, and Yair Weiss. Small codes and large image databases for
recognition. In CVPR, 2008.
34. Carlotta Domeniconi and Dimitrios Gunopulos. Adaptive nearest neighbor classification
using support vector machines. In NIPS, 2001.
35. J. H. Friedman. Flexible metric nearest neighbor classification. Technical report, Stanford,
Nov. 1994.
36. Pascal Vincent and Yoshua Bengio. K-local hyperplane and convex distance nearest neighbor
algorithms. In NIPS, 2002.
37. Chris Atkeson, Andrew Moore, and Stefan Schaal. Locally weighted learning. AI Review, 11:11–
73, April 1997.
38. T. Hastie and R Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE
PAMI, 18:607–616, 1996.
39. Craig Stanfill and David Waltz. Toward memory-based reasoning. Communications of the
ACM, 29(12):1213–1228, 1986.
40. P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. CVPR, June 2008.
41. A. Torralba, R. Fergus, and W. T. Freeman. Tiny images. MIT-CSAIL-TR-2007-024, 2007.
42. D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images
and its application to evaluating segmentation algorithms and measuring ecological statistics. In
Proc. ICCV, July 2001.
43. Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, Richard Szeliski: Building
Rome in a Day. ICCV 2009
44. Jan-Michael Frahm, Pierre Georgel, David Gallup, Tim Johnson, Rahul Raguram,
Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, Marc Pollefeys:
Building Rome on a Cloudless Day, ECCV 2010
45. Martin Bujnak and Zuzana Kukelova and Tomas Pajdla: 3D reconstruction from image
collections with a single known focal length, ICCV 2009
46. C. Strecha, T. Pylvanainen, P. Fua: Dynamic and Scalable Large Scale Image
Reconstruction, CVPR 2010.
47. Jan-Michael Frahm, Marc Pollefeys, Svetlana Lazebnik, Christopher Zach, David Gallup,
Brian Clipp, Rahul Raguram, Changchang Wu, Tim Johnson: Fast Robust Large-scale Mapping
from Video and Internet Photo Collections, ISPRS 2010
48. Micusik B., Kosecka J.: Piecewise Planar City 3D Modeling from Street View Panoramic
Sequences, CVPR 2009
49. T. Lee: Robust 3D Street-View Reconstruction using Sky Motion Estimation. 3DIM2009 in
conjunction with ICCV, 2009
50. C. Fruh and A. Zakhor: An Automated Method for Large-scale, Ground-based City Model
Acquisition. IJCV, 60(1), 2004
51. M. Agrawal and K. Konolige: Real-time localization in outdoor environments using stereo
vision and inexpensive GPS,” ICPR, Vol. 3, pp. 1063–1068, 2006
52. Yuji Yokochi, Sei Ikeda, Tomokazu Sato, Naokazu Yokoya: Extrinsic Camera Parameter
Based-on Feature Tracking and GPS Data, ICPR, pp. 369–378, 2006
53. M. Modsching, R. Kramer, and K. ten Hagen: Field trial on GPS Accuracy in a medium size
city: The influence of built-up, WPNC 2006
54. Richard I. Hartley and Andrew Zisserman. Multiple View Geometry. Cambridge University
Press, Cambridge, UK, 2004
55. D. Nistér: An efficient solution to the five-point relative pose problem, IEEE Transactions on
Pattern Analysis and Machine Intelligence (PAMI), 26(6):756-770, June 2004
56. Yasutaka Furukawa and Jean Ponce: Accurate, Dense, and Robust Multi-View Stereopsis,
IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009
57. Yasutaka Furukawa and Jean Ponce: Patch-based Multi-View Stereo Software,
http://grail.cs.washington.edu/software/pmvs
58. P. Besl and N. McKay. A method of registration of 3-D shapes. IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 12, no. 2, pp. 239-256, February 1992
59. S. Rusinkiewicz, M. Levoy, Efficient variants of the ICP algorithm, in: Third International
Conference on 3D Digital Imaging and Modeling (3DIM), June 2001, pp. 145–152
60. MeshLab, http://meshlab.sourceforge.net/
61. Peter Cho, Noah Snavely, “Enhancing Large Urban Photo Collections with 3D Ladar and GIS
Data”, International Journal of Remote Sensing Applications (IJRSA) 2013.
62. A. Vasile, F. R. Waugh, D. Greisokh, and R. M. Heinrichs. “Automatic alignment of color
imagery onto 3d laser radar data”. In AIPR ’06: Proceedings of the 35th Applied Imagery and
Pattern Recognition Workshop, page 6, Washington, DC, USA, 2006. IEEE Computer Society.
63. A.G. Gschwendtner and W.E. Keicher, “Development of Coherent Laser Radar at
Lincoln Laboratory,” Linc. Lab. J. 12 (2), 2000, pp. 383–396.
64. R.M. Marino, T. Stephens, R.E. Hatch, J.L. McLaughlin, J.G. Mooney, M.E.
O’Brien, G.S. Rowe, J.S. Adams, L. Skelly, R.C. Knowlton, S.E. Forman, and W.R.
Davis, “A Compact 3D Imaging Laser Radar System Using Geiger-Mode APD Arrays:
System and Measurements,” SPIE 5086, 2003, pp. 1-15.
65. M.A. Albota, B.F. Aull, D.G. Fouche, R.M. Heinrichs, D.G. Kocher, R.M. Marino, J.G.
Mooney, N.R. Newbury, M.E. O’Brien, B.E. Player, B.C. Willard, and J.J. Zayhowski, “Three-
Dimensional Imaging Laser Radars with Geiger-Mode Avalanche Photodiode Arrays,” Lincoln
Laboratory Journal, vol. 13, no. 2, 2002, pp. 351-370.
66. J.J. Zayhowski, “Passively Q-Switched Microchip Lasers and Applications,” Rev. Laser Eng.
29 (12), 1988, pp. 841-846.
67. J.J. Zayhowski, “Microchip Lasers,” Lincoln Laboratory Journal, vol 3, no. 3, 1990, pp. 427-
446.
68. R.M. Heinrichs, B.F. Aull, R.M. Marino, D.G. Fouche, A.K. McIntosh, J.J. Zayhowski, T.
Stephens, M.E. O’Brien, and M.A. Albota, “Three-Dimensional Laser Radar with APD Arrays,”
SPIE 4377, 2001, pp. 106-117.
69. M.A. Albota, R.M. Heinrichs, D.G. Kocher, D.G. Fouche, B.E. Player, M.E. O’Brien,
B.F.Aull, J.J. Zayhowski, J. Mooney, B.C. Willard, and R.R. Carlson, “Three-Dimensional
Imaging Laser Radar with a Photon-Counting Avalanche Photodiode Array and Microchip
Laser,” Appl. Opt. 41 (36), pp. 7671-7678.
70. B.F. Aull, A.H. Loomis, D.J. Young, R.M. Heinrichs, B.J. Felton, P.J. Daniels, and D.J.
Landers, “Geiger-Mode Avalanche Photodiodes for Three-Dimensional Imaging,” Linc.
Laboratory Journal, vol. 13, no. 2, 2002, pp. 335-350.
71. K.A. McIntosh, J.P. Donnelly, D.C. Oakley, A. Napoleone, S.D. Calawa, L.J. Mahoney, K.M.
Molvar, E.K. Duerr, S.H. Groves, and D.C. Shaver, “InGaAsP/InP Avalanche Photodiodes for
Photon Counting at 1.06 μm,” Appl. Phys. Lett. 81, 2505-2507 (2002).
72. D.G. Fouche, “Detection and False-Alarm Probabilities for Laser Radars That Use Geiger-
Mode Detectors,” Appl. Opt. 42 (27), pp. 5388-5398.
73. Jeffrey R. Stevens, Norman A. Lopez, Robin R. Burton, “Quantitative Data Quality Metrics
for 3D Laser Radar Systems”, SPIE Proceedings, 2010, Volume 8037.
74. www.ll.mit.edu/publications/technotes/TechNote_ALIRT.pdf
75. C. F. F. Karney, “Transverse Mercator with an accuracy of a few nanometers,” Journal of
Geodesy, 2011, Volume 85, Number 8, Pages 475-485
76. Alexandru N. Vasile and Octavia Camps, “Hierarchical Image Geo-Location on a World-
Wide Scale”, ISVC 2013, Part II, LNCS 8034, pp. 266-277, 2013
77. Alexandru N. Vasile, Luke J. Skelly, Karl Ni, Richard Heinrichs and Octavia Camps,
“Efficient City-sized 3D Reconstruction from Ultra-high Resolution Aerial and Ground Video
Imagery”, ISVC 2011, Part I, LNCS 6938, pp. 350–362, 2011
78. Alexandru N. Vasile, Luke J. Skelly, Michael E. O’Brien, Dan G. Fouche, Richard M.
Marino, Robert Knowlton, M. Jalal Khan and Richard M. Heinrichs, “Advanced Coincidence
Processing of 3D Laser Radar Data”, ISVC 2012, Part I, LNCS 7431, pp. 382-393, 2012
79. Alexandru N. Vasile, Luke J. Skelly, Michael E. O’Brien, Dan G. Fouche, Richard M.
Marino, Robert Knowlton, M. Jalal Khan and Richard M. Heinrichs, “Coincidence Processing of
3D Lidar Data for Foliage Penetration Applications”, MSS-EO 2012
80. Altwaijry H., Moghimi M., Belongie S., "Recognizing Locations with Google Glass: A Case
Study", IEEE Winter Conference on Applications of Computer Vision (WACV), Steamboat
Springs, Colorado, March, 2014.
81. G. Baatz, O. Saurer, K. Köser, M. Pollefeys, Large scale visual geo-localization of images in
mountainous terrain, In Proceedings of the 12th European Conference on Computer Vision -
Volume Part II, (2012), pp. 517–530
82. T.-Y. Lin, S. Belongie, J. Hays. Cross-view image geolocalization, in IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (Portland, OR, June 2013)
83. Altwaijry H., Belongie S., "Ultra-wide Baseline Aerial Imagery Matching in Urban
Environments", British Machine Vision Conference (BMVC), Bristol, September, 2013.
84. Mayank Bansal, Kostas Daniilidis, and Harpreet Sawhney. Ultra-wide baseline façade
matching for geo-localization. In ECCV 2012.