
[IEEE IGARSS 2012 - 2012 IEEE International Geoscience and Remote Sensing Symposium - Munich, Germany (2012.07.22-2012.07.27)]

IMPORTANCE-WEIGHTED MULTI-SCALE TEXTURE AND SHAPE DESCRIPTOR FOR OBJECT RECOGNITION IN SATELLITE IMAGERY

Grant J. Scott
University of Missouri
Center for Geospatial Intelligence
Columbia, MO, USA

Derek T. Anderson
Mississippi State University
Dept. of Electrical & Computer Engineering
Mississippi State, MS, USA

ABSTRACT

We present a sliding window-based, per-pixel importance-weighted, multi-scale, cell-structured feature descriptor and demonstrate its performance for recognizing different aircraft in remotely sensed imagery. Opening and closing differential morphological profiles are constructed, then fused with the Choquet integral to create a soft segmentation. A per-pixel importance map is derived from the soft segmentation and used in the calculation of histograms of oriented gradients, local binary patterns, invariant object moments, and Haar-like features. Superiority is demonstrated in comparison to flat, single-scale, and non-importance-weighted representations, with encouraging results for both cross-validation and blind testing. Results show that the pyramid, cell-structured importance weighting performs better than traditional approaches in the difficult problem space of recognizing objects in remote sensing imagery.

Index Terms— Object recognition, feature weighting, texture and shape descriptors, satellite imagery

1. INTRODUCTION

The last decade has seen unprecedented growth in the number of high-resolution satellite imagery sensors and the corresponding geospatial image libraries. Along with this growth in high-resolution imagery, there has been increasing research in object-based image analysis in remote sensing imagery [1-3]. Geospatial object-based image analysis represents an alternative method of exploiting remote sensing imagery: instead of analyzing and classifying pixels at the landcover level, the emphasis is on the segmentation and analysis of coherent objects present in the imagery. A particularly challenging task for object-based image analysis in remotely sensed imagery is the robust recognition of objects. Recognition challenges derive from the variety of contextual settings in which an object may exist, such as seasonal and geographic variations. The variability of perspective view of the objects in the imagery presents an additional challenge. To overcome these variabilities, object extraction and the generated object features must adequately handle: 1) changes in perspective view from changes in sensor azimuth and/or elevation; 2) the variable scale of objects within imagery from off-nadir captures; and 3) the surrounding imagery context in which an object may be found. The goal of this research is to develop a robust object descriptor that can be used to automatically analyze large volumes of high-resolution remote sensing imagery for objects of interest.

Much of the research regarding object recognition in images focuses on classifying or categorizing an image based upon characteristic objects therein. Often, objects of interest are composed into training sets that include thousands of regions of interest for a handful of object classes. The goal of training in this regard is to present as many variations of the target object as possible, in both perspective and context (e.g., background). In the case of satellite imagery, the perspective views are primarily top-down, subject to moderate obliqueness. However, the limited pixel information due to object size versus resolution, the variability of image context, and the size of imagery databases present unique challenges for remote sensing archives. For these reasons, techniques that require thousands of training samples per object class, or tens of thousands of features per object, do not scale and are not suitable for remote sensing imagery.

Our image database consists of panchromatic, high-resolution, orthorectified, georeferenced commercial satellite imagery from DigitalGlobe's QuickBird sensor. For our training data, 97 objects are selected from four scenes representing four different capture times of a single region, Kabul International Airport. Our training data includes three classes of target objects: 34 commercial jets, 23 military cargo planes, and 40 helicopters; as well as 800 random image chips that are absent of the three target classes. Figure 1 shows example targets from our training and test data. Object recognition in satellite imagery presents numerous challenges, such as lighting variations, shadow effects, the variability of object scale, and, most significantly, the variability of object context.

978-1-4673-1159-5/12/$31.00 ©2012 IEEE    IGARSS 2012


Commercial Jet | Helicopter | Military/Cargo

Fig. 1. Example objects from our database, showing the variabilities within object classes as well as the complicated imagery context (all objects shown at the same resolution).

2. OBJECT EXTRACTION

During sliding window-based object recognition, there exist two different types of information in the window: the object of interest, and everything else in the window, such as landcover and incidental objects; in other words, the surrounding context. Automatic extraction of objects from remotely sensed imagery is a challenging task due to the high variability of the image collection environment, land cover, and types of objects. In order to develop descriptors that are robust and can be transferred between background contexts, a descriptor should be designed to prioritize information about the object centered in the window. Specifically, we do not want the descriptor to measure background context. This presents a challenge in remote sensing domains, where the variability of imagery in databases precludes brute-force approaches that attempt to train for all possible poses of an object in all possible settings. Our goal is to develop image descriptors that are weighted based on the soft segmentation of objects from within the imagery.

The differential morphological profile (DMP) [4] is able to exploit contrast edges between objects and their surrounding context to extract objects. Using geodesic morphological reconstruction, the DMP extracts objects that are lighter (opening) or darker (closing) than their surrounding image content. The DMP produces a set of scale-attributed responses using a geodesic disk of size r_m, where m ∈ M defines the scales of a morphological structuring element (SE). Opening or closing the imagery with an SE removes objects that are smaller than the structuring element. By computing the piece-wise differential, e.g., the response at scale r_m minus that at r_{m-1}, we can find objects that survive up to some SE scale and are then obliterated at a subsequent scale. Therefore, each level in the DMP is the set of objects extracted during a particular geodesic scale transition. We use geodesic disks of radii 1, 3, 5, 7, 9, and 11 m for the DMP.
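As one way to make the opening profile concrete, it can be sketched as greyscale erosion followed by geodesic reconstruction-by-dilation, with differences taken between successive scales. This is a minimal sketch, assuming SciPy, pixel-unit radii, and square connectivity; the helper names and default radii are illustrative, not the paper's implementation:

```python
import numpy as np
from scipy import ndimage as ndi

def disk(r):
    """Boolean disk structuring element of radius r (pixels)."""
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    return (yy ** 2 + xx ** 2) <= r ** 2

def opening_by_reconstruction(img, radius):
    """Erode with a disk SE, then geodesically dilate back under the image."""
    marker = ndi.grey_erosion(img, footprint=disk(radius))
    prev = marker
    while True:  # iterate geodesic dilation until stability
        cur = np.minimum(ndi.grey_dilation(prev, footprint=np.ones((3, 3))), img)
        if np.array_equal(cur, prev):
            return cur
        prev = cur

def opening_dmp(img, radii=(2, 6, 10, 14, 18, 22)):  # radii in pixels
    """Stack of opening DMP levels; level m holds objects erased at that scale."""
    levels, prev = [], img.astype(float)
    for r in radii:
        opened = opening_by_reconstruction(img.astype(float), r)
        levels.append(prev - opened)
        prev = opened
    return np.stack(levels)
```

The closing profile follows the same pattern with dilation/erosion and reconstruction-by-erosion applied to the complemented relation.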

Once the DMP has been generated, the levels are fused into a soft segmentation (confidence) of objects in the region of interest. We fuse the DMP stack with a Choquet fuzzy integral [5]. This allows us to fuse a finite set of information sources, such as the levels of the DMP, and map them into the [0, 1] confidence domain. We process the fused DMP with a morphological 3 m radial dilation and object reconstruction to further enhance the soft segmentation and isolate the object of interest. This soft segmentation becomes the per-pixel weighting for the object feature descriptor contribution. We then compute the eigenvalues and eigenvectors of the soft segmentation to perform a rotational alignment of the importance map as well as the source image window. Figure 2 shows an example input image chip, the first three DMP levels, the Choquet integral fused result, and the resulting importance map (after alignment). Each of the subsequently described features is computed by applying the per-pixel weights of the importance map during feature extraction.
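The discrete Choquet integral sorts the source values and weights the successive differences by a fuzzy measure evaluated on the corresponding source subsets. The sketch below shows the per-pixel computation for one pixel's stack of DMP responses; the cardinality-based measure is an assumption for illustration, as the paper does not specify the measure it uses:

```python
import numpy as np

def choquet(values, measure):
    """Discrete Choquet integral of `values` w.r.t. a fuzzy measure.

    `measure` maps a frozenset of source indices to g(A) in [0, 1],
    with g(empty set) = 0 and g(all sources) = 1.
    """
    order = np.argsort(values)[::-1]  # visit sources largest-first
    result, prev, subset = 0.0, 0.0, []
    for idx in order:
        subset.append(int(idx))
        g = measure(frozenset(subset))  # measure of the top-k source set
        result += values[idx] * (g - prev)
        prev = g
    return result

# Illustrative measure: g(A) = (|A| / n) ** 0.5 for n = 3 DMP levels
card_measure = lambda A: (len(A) / 3) ** 0.5
```

Applying `choquet` independently at every pixel of the DMP stack yields the [0, 1] confidence image; note that for identical inputs the integral is idempotent (it returns the shared value).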

3. WEIGHTED FEATURE EXTRACTION

A common issue in computer vision is deciding from which scale to extract image features. At low spatial resolutions, image features typically model shape characteristics (directly or indirectly) or landcover surface patterns (e.g., texture). At very high spatial resolutions, the image features can be used to measure properties of surface textures of an object of interest. We produce a multiple spatial resolution image pyramid over the object of interest: the native 0.5 m resolution, as well as downsampled resolutions of 1 m and 2 m.

We start with the object centered in a 100 m x 100 m window, then form a three-level image pyramid, i.e., full resolution, half resolution, and quarter resolution. We then extract various features, as detailed below, from the different spatial resolutions, creating a multi-scale set of features. In the base pyramid level (full resolution), the window is organized into a 3x3 cell structure with 30% cell overlap. In the second pyramid level (half resolution), we extract a 2x2 cell structure with 20% cell overlap. The third pyramid level (quarter resolution) is treated as a single cell.
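The pyramid and cell layout above can be sketched as follows. The 200-pixel window (100 m at 0.5 m resolution), the decimation-based downsampling (a real pipeline would low-pass filter first), and the overlap arithmetic are assumptions for illustration:

```python
import numpy as np

def cell_slices(size, grid, overlap):
    """Slices for a grid x grid layout with a fractional cell overlap."""
    if grid == 1:
        return [slice(0, size)]
    cell = int(round(size / (grid - (grid - 1) * overlap)))
    step = int(round(cell * (1 - overlap)))
    return [slice(i * step, min(i * step + cell, size)) for i in range(grid)]

def pyramid_cells(window):
    """3x3 cells at full, 2x2 at half, one cell at quarter resolution."""
    levels = [(window, 3, 0.30),            # full resolution, 30% overlap
              (window[::2, ::2], 2, 0.20),  # half resolution, 20% overlap
              (window[::4, ::4], 1, 0.0)]   # quarter resolution, single cell
    cells = []
    for img, grid, ov in levels:
        rows = cell_slices(img.shape[0], grid, ov)
        cols = cell_slices(img.shape[1], grid, ov)
        cells.extend(img[r, c] for r in rows for c in cols)
    return cells
```

For a 200 x 200 window this yields 9 + 4 + 1 = 14 cells, each of which feeds the per-cell feature extractors described next.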

Once the image chip pyramid is generated, we compute a multi-scale object descriptor using importance-weighted extensions of state-of-the-art image region and blob features. These include invariant object moments (IOM) [6], local binary patterns (LBP) [7], histograms of oriented gradients (HOG) [8], and Haar-like features (HLF) [9]. Secondly, we use a cell-structured pyramid to provide spatial context for features within the image chip. Cell-structured image descriptors are commonly used in computer vision [8] because they preserve the spatial relationship between sub-regions and allow portions of descriptors to specialize to the sub-regions. A typical plane provides an illustrative example: the two wings are symmetrically opposed about a fuselage, and a tail structure exists on one end of the fuselage. We use the cell structure described above for the generation of the LBP, HOG, and HLF. The IOM are computed over the final importance map from the object segmentation step.

Fig. 2. Example soft object extraction: input chip, opening DMP levels, dilated Choquet fusion result, and rotated importance map.

IOM are specifically designed to be rotation-invariant features describing the overall object shape characteristics. Image moments are commonly computed over a subset of image pixels, often chosen via segmentation methods. As described previously, the fused DMP result represents a soft segmentation of the object from the background context, which in turn is a natural candidate for the generation of object moments. We therefore compute the normalized central moments of the importance map, then the traditional first and fourth through seventh invariant moments, as well as an additional invariant moment defined in [6].
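The underlying computation can be sketched by treating the soft importance map directly as pixel weights. Only the first two Hu-style invariants are shown; the paper's particular subset (first and fourth through seventh, plus a moment from [6]) follows the same pattern and is omitted here:

```python
import numpy as np

def normalized_central_moments(w, max_order=3):
    """Scale-normalized central moments eta_pq of a weight image w."""
    ys, xs = np.mgrid[:w.shape[0], :w.shape[1]]
    m00 = w.sum()
    xb, yb = (xs * w).sum() / m00, (ys * w).sum() / m00  # weighted centroid
    eta = {}
    for p in range(max_order + 1):
        for q in range(max_order + 1):
            if 2 <= p + q <= max_order:
                mu = (((xs - xb) ** p) * ((ys - yb) ** q) * w).sum()
                eta[(p, q)] = mu / m00 ** (1 + (p + q) / 2)
    return eta

def hu_moments(eta):
    """First two Hu rotation invariants; the rest follow the same pattern."""
    n = lambda p, q: eta[(p, q)]
    phi1 = n(2, 0) + n(0, 2)
    phi2 = (n(2, 0) - n(0, 2)) ** 2 + 4 * n(1, 1) ** 2
    return phi1, phi2
```

Because the moments are computed from the soft segmentation rather than a hard mask, low-confidence background pixels contribute proportionally less to the shape description.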

LBP capture generalized textures as well as the occurrence of types of edges within a region of pixels. The features are useful for capturing characteristics representative of the edge curvatures of objects relative to their background context. LBP^u_{n,r} denotes the LBP feature that samples n points at radius r and has no more than u 0-1 transitions (i.e., a uniform pattern). The LBP code value is calculated as LBP_n = ∑_{k=0}^{n-1} s(i_k - i_c) 2^k, where i_c is the center value, i_k is the value of the k-th neighbor, and the function s(x) is 1 when x ≥ 0, else 0. The calculation of LBP_{n,r} involves sampling each neighbor, i_k, on a circle of radius r. We compute the LBP^2_{8,3} for each pyramid cell as a summed, pixel-weight-normalized histogram, resulting in 59 features per cell.
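The weighted uniform-LBP histogram can be sketched as below. Integer neighbor offsets (rather than interpolated circular sampling) and the bin ordering are simplifications of the LBP^2_{8,3} described above; the 59 bins are the 58 uniform 8-bit patterns plus one shared bin for all non-uniform codes:

```python
import numpy as np

def uniform_map(n=8):
    """Map each 8-bit LBP code to one of 58 uniform bins or a shared 59th."""
    table, nxt = {}, 0
    for code in range(2 ** n):
        bits = [(code >> k) & 1 for k in range(n)]
        transitions = sum(bits[k] != bits[(k + 1) % n] for k in range(n))
        if transitions <= 2:          # uniform pattern: own bin
            table[code] = nxt
            nxt += 1
        else:                         # non-uniform: shared last bin
            table[code] = 58
    return table

def weighted_lbp_hist(img, weights, r=3):
    """Importance-weighted LBP(8, r) histogram over one cell."""
    offsets = [(-r, 0), (-r, r), (0, r), (r, r),
               (r, 0), (r, -r), (0, -r), (-r, -r)]
    table = uniform_map()
    hist = np.zeros(59)
    h, w = img.shape
    for y in range(r, h - r):
        for x in range(r, w - r):
            code = sum(int(img[y + dy, x + dx] >= img[y, x]) << k
                       for k, (dy, dx) in enumerate(offsets))
            hist[table[code]] += weights[y, x]   # per-pixel importance weight
    return hist / max(hist.sum(), 1e-12)         # weight-normalized histogram
```

Each pixel thus votes into the histogram with its importance weight instead of a unit count, so background texture contributes little to the cell descriptor.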

HOG capture the trends in the image surface, i.e., spectral value transitions. For each cell, gradients are computed using orthogonal 1-D point derivative filters to compute the X and Y gradient components. The HOG is then constructed as a histogram of 16 orientation bins. For each gradient, the pixel-weighted magnitude is contributed proportionally to a primary and a secondary bin. The gradient direction determines the primary bin, and the distance from the gradient direction to the primary and secondary bin centers determines how the magnitude is distributed.
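A sketch of the weighted 16-bin histogram with two-bin linear interpolation follows. The [-1, 0, 1] derivative filters match the text; the bin-center placement and the signed 0 to 2π orientation range are assumptions for illustration:

```python
import numpy as np

def weighted_hog(cell, weights, bins=16):
    """Importance-weighted HOG for one cell with two-bin interpolation."""
    gx = np.zeros_like(cell, dtype=float)
    gy = np.zeros_like(cell, dtype=float)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]      # 1-D point derivatives [-1, 0, 1]
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]
    mag = np.hypot(gx, gy) * weights              # importance-weighted magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)   # orientation in [0, 2*pi)
    pos = ang / (2 * np.pi) * bins                # fractional bin position
    lo = np.floor(pos - 0.5).astype(int) % bins   # primary bin
    hi = (lo + 1) % bins                          # secondary bin
    frac = np.mod(pos - 0.5, 1.0)                 # distance past the primary center
    hist = np.zeros(bins)
    np.add.at(hist, lo.ravel(), (mag * (1 - frac)).ravel())
    np.add.at(hist, hi.ravel(), (mag * frac).ravel())
    return hist
```

A gradient falling exactly on a bin center contributes entirely to that bin; anywhere in between, the weighted magnitude is split between the two nearest bin centers in proportion to their distances.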

HLF effectively capture textures from the integral image computed from a particular pyramid cell. From each cell we use a selection of twelve Haar-like filters to generate a set of features. Figure 3 shows six of the filters, each of which is rotated 90 degrees to generate the remaining six. It should be noted that multi-scale features can be generated by varying the size of the Haar-like filters used over the integral image, without generating an image pyramid. The light portions of a filter represent the positive portion of the filter response, and the dark portions the negative portion. For each pyramid cell, the importance-weighted integral image is generated by factoring the importance map weights into the integral image. The filters are processed over the weighted integral image to accumulate responses and generate statistical measures, specifically the mean and standard deviation, resulting in 24 features per cell.

Fig. 3. Six Haar-like filters are shown. We also use a 90-degree rotation of each, making twelve total filters.
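The weighted integral image and a single filter response can be sketched as below; any rectangle sum then costs four lookups. The horizontal light/dark edge filter is one assumed example of the twelve filters, not a filter taken from Fig. 3:

```python
import numpy as np

def weighted_integral_image(img, weights):
    """Integral image of the importance-weighted pixel values."""
    return np.cumsum(np.cumsum(img * weights, axis=0), axis=1)

def box_sum(ii, top, left, h, w):
    """Sum of the weighted image over rows [top, top+h), cols [left, left+w)."""
    total = ii[top + h - 1, left + w - 1]
    if top > 0:
        total -= ii[top - 1, left + w - 1]
    if left > 0:
        total -= ii[top + h - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def horizontal_edge_response(ii, top, left, h, w):
    """Light (top half) minus dark (bottom half) two-rectangle filter."""
    half = h // 2
    return box_sum(ii, top, left, half, w) - box_sum(ii, top + half, left, half, w)
```

Sliding such filters over a cell and accumulating the mean and standard deviation of the responses yields the per-cell HLF statistics described above.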

These four feature sets are concatenated into a single feature vector representing the shape and structural characteristics of the object at three scales. Once the features are extracted from the training image chips, we train support vector machine (SVM) classifiers, one per object class. This allows each SVM to develop the appropriate weighting of the features that best discriminate each class from the others.
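The one-SVM-per-class scheme can be sketched as below, using a small Pegasos-style linear SVM in plain NumPy as a stand-in for the paper's SVM implementation (the paper also evaluates an RBF kernel); the hyperparameters are illustrative assumptions:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Sub-gradient descent on the hinge loss; y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w, b, t = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)                  # decaying step size
            if y[i] * (X[i] @ w + b) < 1:          # margin violated
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b += eta * y[i]
            else:
                w = (1 - eta * lam) * w
    return w, b

def train_one_vs_rest(X, labels):
    """One SVM per target class: that class versus the rest of the database."""
    return {c: train_linear_svm(X, np.where(labels == c, 1.0, -1.0))
            for c in np.unique(labels)}
```

At test time, each window's descriptor is scored by every class model, mirroring the target-versus-rest evaluation described in the next section.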

4. EXPERIMENTAL RESULTS

We evaluated different combinations of stages in our approach using SVMs with two kernels, linear and radial basis function (RBF). In each test case, the SVM was trained as a two-class classifier and tested for the target class versus the rest of the database. Additionally, to provide insight into the effects of the weights versus the multi-scale pyramid cells, we generated four distinct object recognition databases: R1, a flat image chip without importance weights; R2, a flat image chip with importance weights; R3, pyramid cells without importance weights; and R4, pyramid cells with importance weights. The cross-validation dataset was generated from four different scenes, collected at different times at Kabul International Airport. For cross-validation, we divide the database in half,


Table 1. SVM cross-validation recognition rates: linear (LNR) and radial basis function (RBF) kernels.

            Com. Jet      Heli.         Cargo
            TAR   FAR     TAR   FAR     TAR   FAR
R1: LNR     0.15  0.94    0.91  0.97    0.54  0.96
    RBF     0.29  0.00    0.15  0.00    0.26  0.00
R2: LNR     0.03  0.98    0.50  0.95    0.00  1.00
    RBF     0.00  0.00    0.35  0.00    0.13  0.00
R3: LNR     0.01  0.92    0.50  0.96    0.41  0.95
    RBF     0.94  0.00    0.35  0.00    1.00  0.00
R4: LNR     1.00  0.00    0.64  0.54    0.98  0.00
    RBF     1.00  0.00    1.00  0.00    1.00  0.00

then alternate training and testing with each half. The second dataset is composed of commercial jets from three scenes at different locations: a Tehran, Iran airport scene and two Denver, Colorado, USA airport scenes.

Our object recognition results for the cross-validation tests show that our multi-scale pyramid with per-pixel importance-weighted features significantly outperforms single-scale (flat), unweighted features. We increased the target recognition rates (TAR) and reduced the false alarm rates (FAR) for all classes (see Table 1). However, the differing characteristics of the linear versus RBF SVM are also notable. Specifically, the RBF kernel SVM has significantly lower false alarm rates for all classes of objects, across each of the descriptor sets. Additionally, it achieves perfect classification on the importance-weighted pyramid cells, which may be indicative of over-fitting in binary classification schemes. However, in the case of our second test set, we observe that the linear SVM generalizes better for databases R1, R2, and R4. In these cases, the TAR are: R1 linear 0.54, RBF 0.18; R2 linear 0.73, RBF 0.45; and R4 linear 0.93, RBF 0.00. For R3, both the linear and RBF SVM performed the same, at 0.82 TAR.

5. CONCLUSION

We have presented context-insensitive recognition of aircraft objects from remotely sensed imagery using a multi-scale, pyramid-based, per-pixel importance-weighted feature descriptor. A per-pixel importance map is generated by fusing levels of the DMP, then used to weight the various features during extraction. Our object descriptor is built from per-pixel weighted IOM, HOG, LBP, and HLF descriptors. We compared the target recognition and false alarm rates against single-scale and non-importance-weighted representations, with encouraging results for both cross-validation and blind testing. Table 1 demonstrates that the pyramid, cell-structured importance weighting performs well versus traditional approaches.

Our future research will further develop and refine the parameterization of the feature extraction, such as the LBP sampling r and n, and the Haar-like filters and their associated statistics. Additionally, we will begin to examine more elegant classification schemes. These will include a cascaded recognition system that keeps the feature type and scale of extracted features in distinct feature spaces and fuses the recognition information. This will allow us to train more robust classifiers at each scale and potentially within separate feature spaces.

6. REFERENCES

[1] T. Blaschke, "Object based image analysis for remote sensing," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 65, no. 1, pp. 2-16, 2010.

[2] Geoffrey J. Hay, Guillermo Castilla, Michael A. Wulder, and Jose R. Ruiz, "An automated object-based approach for the multiscale image segmentation of forest scenes," International Journal of Applied Earth Observation and Geoinformation, vol. 7, no. 4, pp. 339-359, 2005.

[3] G.J. Scott, M.N. Klaric, C.H. Davis, and Chi-Ren Shyu, "Entropy-balanced bitmap tree for shape-based object retrieval from large-scale satellite imagery databases," IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 5, pp. 1603-1616, May 2011.

[4] M. Pesaresi and J. A. Benediktsson, "A new approach for the morphological segmentation of high-resolution satellite imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 39, pp. 309-320, 2001.

[5] M. Grabisch, T. Murofushi, and M. Sugeno, Fuzzy Measures and Integrals: Theory and Applications, Physica-Verlag, Heidelberg, 2000.

[6] J. Flusser and T. Suk, "Rotation moment invariants for recognition of symmetric objects," IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3784-3790, Dec. 2006.

[7] Timo Ojala, Matti Pietikäinen, and David Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.

[8] Navneet Dalal and Bill Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005, pp. 886-893.

[9] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in CVPR, 2001, vol. 1, pp. 511-518.
