SPIE DSS 2013 Corrected


    2. RELATED WORK AND THEORY

In this section, we give a brief description of monogenic signal analysis for the computation of local phase information, and provide the relevant background on object detection from aerial imagery along with a few well-known feature descriptors used in object detection.

    2.1 Local Phase from Monogenic Signal Analysis

The importance of phase information in representing the structure of an image is illustrated in Gonzalez and Woods.1 The magnitude information is a measure of the strength of the signal; however, the estimation of local phase in an image is not trivial. To define the local structure of a one-dimensional signal, the analytic signal representation given in Equation 1 is of central importance.

f_A(x) = f(x) + i f_H(x) \quad (1)

Here f(x) is the original signal and f_H(x) is its Hilbert transform, which can be computed in the frequency domain as given in Equation 2.

F_H(u) = H_1(u) F(u) \quad (2)

Here F(u) is the frequency-domain representation of f(x) and H_1(u) = -i \, \mathrm{sgn}(u) is the transfer function of the Hilbert transform. Therefore, from Equation 1, the phase of the signal can be computed as

\varphi(x) = \arctan(f_H(x), f(x)) \quad (3)
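As a concrete illustration of Equations 1-3, the local phase of a 1-D signal can be computed with an FFT-based Hilbert transform. The sketch below is plain NumPy, not the paper's code; for a pure cosine it recovers the expected linearly ramping (wrapped) phase.

```python
import numpy as np

def local_phase_1d(f):
    """Local phase of a 1-D signal via the analytic signal f_A = f + i*f_H.

    The Hilbert transform f_H is obtained in the frequency domain by
    multiplying F(u) with H1(u) = -i*sgn(u), as in Equations 1-3.
    """
    N = len(f)
    F = np.fft.fft(f)
    u = np.fft.fftfreq(N)             # signed frequencies
    H1 = -1j * np.sign(u)             # Hilbert transfer function
    fH = np.real(np.fft.ifft(H1 * F))
    return np.arctan2(fH, f)          # phase = arctan(f_H, f)

# For f(x) = cos(x), the Hilbert transform is sin(x), so the local
# phase is x wrapped to (-pi, pi].
x = np.linspace(0, 4 * np.pi, 256, endpoint=False)
phase = local_phase_1d(np.cos(x))
```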

There have been multiple techniques that attempt to extend the analytic signal representation to multiple dimensions, such as the use of steerable filters. However, those techniques are not purely isotropic in nature. The isotropic extension of the analytic signal representation is given by the monogenic signal2 as in Equation 4.

f_M(x_1, x_2) = (f, f_R)(x_1, x_2) \quad (4)

where f_R(x_1, x_2) = (h * f)(x_1, x_2) and h = (h_1, h_2) is the Riesz kernel. The spatial- and frequency-domain representations of the Riesz kernel are given in Equations 5 and 6 respectively.

(h_1, h_2)(x_1, x_2) = \left( \frac{x_1}{2\pi |x|^3}, \frac{x_2}{2\pi |x|^3} \right), \quad x = (x_1, x_2) \in \mathbb{R}^2 \quad (5)

(H_1, H_2)(u_1, u_2) = \left( \frac{i u_1}{|u|}, \frac{i u_2}{|u|} \right), \quad u = (u_1, u_2) \in \mathbb{R}^2 \quad (6)

As in Equation 3, the local phase can be computed for the two-dimensional signal from the monogenic signal representation as in Equation 7.

r(x) = \frac{f_R(x)}{|f_R(x)|} \arctan\left( \frac{|f_R(x)|}{f(x)} \right) = \varphi(x) \exp(i \theta(x)) \quad (7)

where \varphi(x) is the local phase and \theta(x) is the local orientation. Both are computed from the Riesz transform of the signal as shown in Equations 8 and 9.

\varphi = \arctan\left( \sqrt{R_1^2(f) + R_2^2(f)}, \; f \right) \quad (8)

\theta = \arctan\left( \frac{R_2(f)}{R_1(f)} \right), \quad \theta \in [0, \pi) \quad (9)

A measure of the local contrast can be estimated as shown in Equation 10.

A = \sqrt{f^2(x) + |f_R(x)|^2} \quad (10)

In the monogenic signal representation, the implicit assumption is that any signal, in a local sense, is intrinsically one dimensional. The local amplitude, local phase and local orientation are computed for this intrinsic 1D signal. Therefore, the design of the bandpass filter used in constructing the monogenic signal representation is significant.
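The quantities in Equations 4-10 can be sketched directly in NumPy. The following is an illustrative implementation (ours, not the paper's code) that applies the Riesz transfer functions in the frequency domain; in practice the input should first be bandpass filtered (e.g. with a log-Gabor filter) so that the local measures are stable.

```python
import numpy as np

def monogenic(f):
    """Local amplitude, phase and orientation from the monogenic signal.

    The Riesz transform is applied in the frequency domain with transfer
    functions H_j(u) = i*u_j/|u| (Equations 4-6); phase, orientation and
    amplitude then follow Equations 8-10.
    """
    rows, cols = f.shape
    u1 = np.fft.fftfreq(rows).reshape(-1, 1)
    u2 = np.fft.fftfreq(cols).reshape(1, -1)
    mag = np.hypot(u1, u2)
    mag[0, 0] = 1.0                                  # avoid divide-by-zero at DC
    F = np.fft.fft2(f)
    R1 = np.real(np.fft.ifft2(1j * u1 / mag * F))    # Riesz component R1(f)
    R2 = np.real(np.fft.ifft2(1j * u2 / mag * F))    # Riesz component R2(f)
    fR = np.hypot(R1, R2)                            # |f_R|
    amplitude = np.hypot(f, fR)                      # Eq. 10
    phase = np.arctan2(fR, f)                        # Eq. 8
    orientation = np.arctan2(R2, R1) % np.pi         # Eq. 9, folded into [0, pi)
    return amplitude, phase, orientation
```

For an intrinsically 1D input such as a horizontal cosine grating, the local amplitude is constant and the orientation is constant modulo pi, as the representation predicts.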

    Proc. of SPIE Vol. 8745 87451U-3

Downloaded From: http://spiedigitallibrary.org/ on 11/07/2014 Terms of Use: http://spiedl.org/terms


    2.2 Object Detection in Aerial Imagery

Some of the recent works in detecting objects address the complex, large variations in the appearance of the object. Yao and Zhang3 developed a general approach to detecting objects in aerial imagery using semi-supervised learning from contextual information. The context-based object detection exploits the fact that objects in aerial imagery are often surrounded by a homogeneous background region. The main motivation for a semi-supervised learning scheme is the absence of a large number of labelled training samples of objects captured in aerial imagery. From a set of unlabelled training samples, the semi-supervised classification scheme can adaptively label the unlabelled samples, thereby improving the classification accuracy. This can form the basis for an ad-hoc training scheme where online training incorporates new test samples into the training set during the test phase.

Khan et al.4 used a 3D model based object classification scheme to detect vehicles in aerial imagery. Appearance-based information about the aerial object is used to compute a 3D model from which a set of salient location markers is determined. By simulating the scene conditions through 3D model rendering, the various salient locations are used to create a Histogram of Oriented Gradients (HOG) based feature classifier. By computing a match score such as the Salient Feature Match Distribution Matrix between the features in rendered and real test scenes, the vehicles in the test scene are classified.

Another approach to detecting objects in aerial imagery is to use an extensive global feature description which is invariant to the viewpoint, scale and orientation of the object. The texture of objects, which can represent the spatial structure in aerial imagery, can be used to represent an object or vehicle. Guo et al.5 proposed a rotation-invariant texture classification scheme using the well-known local binary pattern (LBP).6 As opposed to the original LBP texture descriptor, the local binary pattern variance (LBPV) descriptor retains both the global spatial information and the local information, where the textural descriptor brings out the local contrast information. The feature extraction scheme uses globally rotation-invariant matching with locally variant LBP texture features.

Mathew and Asari7 proposed a local intensity histogram based descriptor for tracking an object in very low resolution videos. One of the main challenges is that there is large global camera motion and the imagery contains poor gradient and texture information. Their descriptor uses an intensity histogram which encodes both spatial and intensity information. While the application was tracking, the feature descriptor used to represent the object of interest can also be used for object detection and classification in low-resolution aerial imagery. The algorithm uses a more robust feature comparison metric known as the Earth Mover's Distance.8
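For equal-mass 1-D histograms such as intensity or phase histograms, the Earth Mover's Distance reduces to the L1 distance between the cumulative distributions, which is what makes it cheap enough for per-window comparisons. A minimal sketch (ours, not the cited authors' implementation; the general EMD solves a transportation problem):

```python
import numpy as np

def emd_1d(h1, h2):
    """Earth Mover's Distance between two 1-D histograms over the same bins.

    Both histograms are normalized to equal mass; for the 1-D case the EMD
    is the sum of absolute differences of the cumulative distributions.
    """
    h1 = np.asarray(h1, float)
    h2 = np.asarray(h2, float)
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return np.abs(np.cumsum(h1 - h2)).sum()

# Moving mass one bin further away costs proportionally more, unlike a
# plain bin-wise distance, which treats both shifts identically.
assert emd_1d([1, 0, 0], [0, 1, 0]) < emd_1d([1, 0, 0], [0, 0, 1])
```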

A rotation and scale invariant object recognition methodology has been proposed by Matungka et al.,9 where image feature extraction is combined with a log-polar wavelet mapping. Here, the log-polar mapping converts a rotation in the cartesian coordinates to a translation in the log-polar coordinates; a translational shift is easier to determine than a rotational one. However, changes in the image origin in the cartesian domain can greatly influence the log-polar mapping.
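The rotation-to-translation property of the log-polar mapping, and its dependence on the chosen origin, can be seen in a few lines (an illustrative sketch, not the cited method):

```python
import numpy as np

def to_log_polar(x, y, cx=0.0, cy=0.0):
    """Map cartesian (x, y) to log-polar (log r, theta) about origin (cx, cy).

    A rotation by alpha about the origin becomes a pure shift of theta by
    alpha; a uniform scaling by s becomes a shift of log r by log s. Both
    coordinates depend on (cx, cy), which is why a change of image origin
    strongly affects the mapping.
    """
    dx, dy = x - cx, y - cy
    return np.log(np.hypot(dx, dy)), np.arctan2(dy, dx)

# Rotating the point (1, 0) by 90 degrees to (0, 1) shifts theta by pi/2
# and leaves log r unchanged.
lr1, t1 = to_log_polar(1.0, 0.0)
lr2, t2 = to_log_polar(0.0, 1.0)
```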

    3. METHODOLOGY

The general approach to object detection from aerial imagery is to use a novel object representation scheme which uniquely describes an object's shape, structure, color and texture and is invariant to non-uniform illumination, viewpoint, orientation and noise. These artifacts are mainly due to the image acquisition process in the optical sensor onboard the aircraft. One of the main issues in designing an object representation is invariance to lighting, i.e., invariance to non-uniform illumination: for an object present in good lighting conditions and another present in dark lighting conditions, the representations of both objects should ideally be similar. One such representation is the use of phase information computed from frequency spectrum image analysis. A more localized version of phase can be computed by monogenic signal analysis, where the local phase represents the local structure of the object irrespective of the lighting present in the image. This is illustrated in Figure 3. We see that the local phase brings out the structural details of the backhoe irrespective of the lighting present and helps in distinguishing it from the surrounding objects in the background region such as trees, shrubs, buildings etc. The characteristic of the local phase is that, since it is illumination invariant, it is not affected by over-exposure to lighting or very low illumination conditions, and it projects the regular edges and corners associated with the description and



    (a) Backhoe 1 (b) Local Phase (c) Backhoe 2 (d) Local Phase 2

Figure 3: Top Row: Backhoe captured in Flight 8 (left) and the corresponding local phase computed in that region. Bottom Row: Backhoe captured in Flight 6 (left) and the corresponding local phase information computed in that region. (Courtesy of Vendor 1)

    (a) Excavator1 (b) Excavator2 (c) Excavator3

Figure 4: Images of the Excavator illustrating the various constraints that occur in aerial imagery for both Vendor 4 (left three) and Vendor 1 (right three)

representation of the object. A feature descriptor extracted from the local phase information tends to preserve this illumination invariance, and hence this representation is very effective for describing objects captured by optical sensors at an altitude of 500-3000 feet. A constraint in using the local phase domain is that the computation of local phase depends on the following factors:

• Size of the object region: The sampling frequency refers to the sampling used to create the monogenic filters used in the computation of the local phase information, which in turn is related to the size of the region of interest containing the object or construction machinery.

• Orientation of the object: The local phase changes with the orientation of the object. Since the local phase inherently depends on the frequency spectrum of the object, a change in the orientation of the object causes its frequency spectrum to shift, thereby changing the local structure in a square neighborhood region.

• Image resolution: Variation in the resolution of the object captured in the scene can also cause changes in the computed local phase. More specifically, the frequency content captured by the monogenic signal analysis shifts to a different band in the frequency spectrum as the resolution of the object changes. So, to extract similar local phase information from two similar objects appearing at different image resolutions, the frequency band at which the local phase operates must be varied.

So, any object descriptor computed from the local phase information needs to be normalized for scale (related to the size of the object), orientation and image resolution. Illustrations of these constraints are shown in Figure 4. To counter them, we use a multi-stage approach where at each stage a suitable type of descriptor is extracted to incorporate rotation, scale and viewpoint invariance.


    Figure 5: Block diagram illustrating the detection framework

    4. DETECTION FRAMEWORK

The detection framework used to automatically locate construction vehicles in the pipeline right of way follows a three-stage approach. The detection framework is preceded by a training stage where the template for each piece of construction equipment is computed. The template is extracted from the local phase information of a high-resolution image and stored in a multi-scale fashion.

• Local phase based template matching: This is a preliminary stage where a possible set of regions for the location of the object is noted by matching a template of the object from the training set to the test image.

• Selection of orientation and cluster voting: Here, the orientation of the object present in the possible regions is determined and a shortlist of such regions is made through hierarchical clustering.

• Final detection by cluster selection and feature matching with the Histogram of Oriented Phase (HOP): From the final set of clusters, we extract a feature descriptor known as the Histogram of Oriented Phase from the local phase information for feature matching with the original template.

    4.1 Training

The training stage involves selecting a suitable image for the creation of the template. We use a high-resolution, nadir-view (top-view) image of the construction vehicle as the training image. The selection of the high-resolution image enables us to create a multi-scale template pyramid with each level corresponding to a lower resolution. This scaling enables the algorithm to search for objects at a different resolution than the object images present in the training set. The local phase information of the training image is computed, and the template is selected from a closely cropped region containing the local phase of the actual object. This template from a high-resolution image is down-sampled to different lower sizes to create a local phase template pyramid. An illustration of the template selection is shown in Figure 6. The steps involved in computing the local phase are given below.

• Generation of log-Gabor filters and monogenic filters for local phase computation.

• Computation of the local phase of the training image using the log-Gabor filters and monogenic filters to create a frequency-scale space representation.

• Selection of the template region in the local phase domain.

• Creation of a template pyramid by down-sampling the template obtained from the local phase of the high-resolution training image.
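The first and last of the steps above can be sketched as follows. The radial log-Gabor transfer function is standard, but the parameter values (f0, sigma_ratio, pyramid depth) and the nearest-neighbour down-sampling are illustrative assumptions, not the paper's choices.

```python
import numpy as np

def log_gabor(shape, f0=0.1, sigma_ratio=0.55):
    """Radial log-Gabor bandpass filter in the frequency domain.

    f0 is the centre frequency (cycles/pixel) and sigma_ratio controls the
    bandwidth; varying f0 over a few octaves gives the frequency-scale
    space used for the local phase computation.
    """
    rows, cols = shape
    u1 = np.fft.fftfreq(rows).reshape(-1, 1)
    u2 = np.fft.fftfreq(cols).reshape(1, -1)
    radius = np.hypot(u1, u2)
    radius[0, 0] = 1.0                        # avoid log(0) at DC
    G = np.exp(-np.log(radius / f0) ** 2 / (2 * np.log(sigma_ratio) ** 2))
    G[0, 0] = 0.0                             # zero DC response (bandpass)
    return G

def template_pyramid(template, levels=3):
    """Down-sample a local-phase template by 2x per level (nearest neighbour)."""
    pyr = [template]
    for _ in range(levels - 1):
        pyr.append(pyr[-1][::2, ::2])
    return pyr
```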

    4.2 Local Phase-based Template Matching

In this preliminary stage, we locate possible regions where construction equipment can be found by searching the entire image in a window-based approach. This is done by template matching in a sub-region of the image (a particular window) in the local phase domain using normalized cross-correlation. Normalized cross-correlation based template matching is a fast technique which finds the location with the best match. Since the object can occur in different orientations, the template matching is performed for every 5-degree rotation of the sub-image, where for each rotation we get an optimal match. This template matching scheme is applied at every scale of the template to obtain matches corresponding to similar objects with a smaller size or lower resolution. An illustration of the local phase based template matching is shown in Figure 7.
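A minimal sketch of window-based normalized cross-correlation; the exhaustive spatial search below omits the rotation and scale loops described above, and all names and parameters are ours.

```python
import numpy as np

def ncc(window, template):
    """Zero-mean normalized cross-correlation score in [-1, 1] for one window."""
    w = window - window.mean()
    t = template - template.mean()
    denom = np.sqrt((w ** 2).sum() * (t ** 2).sum())
    return (w * t).sum() / denom if denom > 0 else 0.0

def best_match(image, template):
    """Exhaustive NCC search; returns ((row, col), score) of the best window.

    A real implementation would run this per 5-degree rotation and per
    pyramid scale; FFT-based NCC is the usual speed-up.
    """
    th, tw = template.shape
    best, loc = -np.inf, (0, 0)
    for i in range(image.shape[0] - th + 1):
        for j in range(image.shape[1] - tw + 1):
            s = ncc(image[i:i + th, j:j + tw], template)
            if s > best:
                best, loc = s, (i, j)
    return loc, best
```

Cropping a patch out of an image and searching for it should return the patch's own position with a score of 1.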


[Diagram labels: Multi-Scale Multi-Orientation Matching; Select Orientation; Single-Orientation Multi-Scale Detections; Voting Scheme; HOP Matching; Selected Cluster Group; Zoomed-in View of Final Detection]

    Figure 8: Orientation Selection and Cluster Voting

    Figure 9: Cluster Selection and Detection using Histogram of Oriented Phase

    4.4 Cluster Selection and Detection using Histogram of Oriented Phase

From the previous stage, we obtain a set of clusters containing locations of possible object regions, where each cluster is associated with a certain number of votes depending on how close the detections are to the multi-scale template. A further stage of pruning the clusters is to look at the number of votes each cluster attained; by setting a specific threshold on the number of votes, we can eliminate certain clusters. The idea behind this elimination is that the clusters with fewer votes correspond to possible regions with background variation that was projected by the local phase. In short, noise in the background was matched to the training template at a certain scale, but the match distance was too large. After pruning such clusters, a different feature set is extracted from the detections of the remaining clusters and matched with the training template. This feature vector, known as the Histogram of Oriented Phase (HOP), is a weighted histogram over the local orientation (computed from the monogenic signal analysis) with the weights corresponding to the local phase. This dense descriptor can uniquely identify an object and can be matched more closely to the corresponding descriptor of the training template. So, in each cluster, we identify the detections whose HOP descriptor matches the template region within a certain extent and count the number of HOP hits the cluster has. Clusters which have fewer than a certain number of HOP hits are discarded, and the remaining clusters are considered the final object locations. An illustration is shown in Figure 9.
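A sketch of such a phase-weighted orientation histogram; the bin count and the absence of any block normalization are our assumptions, not values from the paper.

```python
import numpy as np

def hop_descriptor(phase, orientation, n_bins=9):
    """Histogram of Oriented Phase: a histogram over local orientation
    [0, pi), with each pixel weighted by its local phase value.
    """
    # Map each orientation to a bin index; clamp the pi edge into the last bin.
    bins = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), phase.ravel())   # accumulate phase weights
    s = hist.sum()
    return hist / s if s > 0 else hist
```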

    5. EXPERIMENTAL RESULTS AND ANALYSIS

The construction equipment detection framework has been tested on three different datasets, each containing images captured at around 1000-3000 feet by three different vendors: 1, 2 and 3. The imagery captured by these vendors differs in the type of sensor used and the height at which the images were captured. One of the main characteristics in which the sensors differ is the Ground Spatial Distance (GSD), or the spatial resolution of the image. This determines the resolution of the object present in the image. Moreover, the angle at which the image has been captured (which depends on the orientation of the sensor mount on the aircraft) is also


Figure 6: Training Phase: Template Selection

    Figure 7: Local Phase based Template Matching

    4.3 Orientation Selection and Cluster Voting

From the sub-regions obtained by overlapping windows in the previous stage, we get clusters of detections at every orientation and at every scale. In this stage, we select the appropriate orientation among the set of detections in each sub-region. The local phase of the detected location at each rotation of the image is compared with the template local phase by phase histogram matching using the Earth Mover's Distance. The rotation of the sub-image which yields the smallest distance to the template is then selected as the correct orientation of the detected object. Some regions (sub-images) do not have enough detections and may contain only a few detections corresponding to a particular scale, while other sub-images have a cluster of detections at a particular location and orientation. The latter corresponds to the presence of construction equipment, as the template matching scheme fires at every possible scale for a single orientation. The former corresponds to a false detection by the template matching scheme, where the detection may have happened due to noise in the background at a particular scale. This creates a scenario where clusters of detections exist in some parts of the image, while only one or two detections are present in others. This brings the need to remove the single detections and retain the clusters. So a hierarchical clustering scheme is employed which evaluates the cluster of detections in each sub-region against the total number of detections required for a particular orientation. Only the clusters which satisfy a minimum requirement on the number of possible detections are retained, while the rest are discarded. A voting scheme is then applied to the retained clusters where a particular detection in a cluster is given a certain vote or weight depending on how close the detection is to the template region. The closeness is the matching distance computed between the phase histograms of the template and the detected region. The detections which are closer to the template have a higher probability of being the actual construction equipment, and so a higher vote is given to them. As shown in Figure 8, the color scheme applied to the detections represents the voting mechanism, with green having the highest number of votes, red the least, and yellow a moderate number.


an important factor, as different viewing angles of the camera lead to different viewpoints. So, testing on these three datasets provides a good evaluation of the construction equipment detection framework described in the previous sections. In this section, we evaluate the algorithm by testing it on the three datasets for the detection of the Backhoe and provide statistics on the accuracy, false detection rate and miss rate. The accuracy is reported in percentages. The false detection rate is the number of detections incorrectly identified as construction equipment in the final stage. The miss rate is the number of pieces of construction equipment (Backhoe) which were not located by the automated algorithm. As mentioned earlier, the algorithm has a training stage and 3 stages in the testing phase:

    Stage 1: Local Phase based Template Matching.

    Stage 2: Orientation Selection and Cluster Voting.

    Stage 3: Cluster Selection and Matching by Histogram of Oriented Phase.

    5.1 Test on Vendor 1 Dataset

The algorithm was tested on the dataset provided by Vendor 1; this dataset had fairly decent resolution imagery but covered a large area of the Pipeline Right of Way. The imagery was captured at a height of around 1000-2000 feet above the Pipeline Right of Way. One of the main challenges in this dataset is the dark bands or regions that appear at the edges of the images, probably due to the encasing of the sensor used to capture the image. Thus, construction equipment appearing at the edges of the images had very low illumination or lighting present on the object. Our algorithm tackles this illumination problem by using the local phase. Moreover, a change in the elevation at which the images were captured during different flights results in a change in the spatial resolution as well.

5.2 Test on Vendor 2 Dataset

The algorithm was also tested on the dataset provided by Vendor 2; this dataset had higher resolution imagery, as it was captured at a height of around 500-1000 feet above the Pipeline Right of Way. Thus, the construction equipment present in the imagery is much better defined and has more structural detail for the object to be detected. The challenges in this dataset are slight illumination variations and orientation and position changes, along with slight changes in the spatial resolution of the object. Again, we evaluated our algorithm by running it on the test images containing the Backhoe. An illustration of the detection results is shown in Figure 11.

    5.3 Test on Vendor 3 Dataset

The imagery provided by Vendor 3 was of two different kinds, one set taken at a height of 6000 feet and the other taken at a height of around 1000-2000 feet. For the evaluation of the algorithm, we use the first set of imagery, captured at 6000 feet in Flights 1-4, where each flight corresponds to a single pass over the Pipeline Right of Way at Gary, Indiana. The challenge in this dataset is that the spatial resolution is very poor, which leaves fewer of the structural details required for detection. Moreover, there are illumination variations such as over-exposure on the object to be detected. So, the algorithm is evaluated on this challenging set of imagery by applying it to detect the largest type of construction equipment, the Backhoe. An illustration of the detection procedure at each stage is shown in Figure 12.

5.4 Statistics for Detection of Backhoe

The statistics we have computed are the detection accuracy and the number of false positives obtained by testing on images containing the construction equipment (the Backhoe) using only one training image. The selection of the training image depends on the resolution of the object present. In the proposed algorithm, we use the object sample with the highest spatial resolution. The tables below give the detection accuracy and the false positive rate attained for each dataset.



    (a) Training Image. (b) Test image with manual annotation.

(c) Stage 1: Local Phase Template Matching (d) Stage 2: Orientation Selection and Cluster Voting

(e) Stage 3: Cluster Selection and HOP Matching (f) Stage 3 Detection for test image from Flight 4

Figure 10: Detection of the Backhoe at different stages on sample images. Courtesy of Vendor 1


    (a) Training Image. (b) Test image with manual annotation.

(c) Stage 1: Local Phase Template Matching (d) Stage 2: Orientation Selection and Cluster Voting

(e) Stage 3: Cluster Selection and HOP Matching (f) Stage 3 Detection for test image from Flight 5

Figure 11: Detection of the Backhoe at different stages on sample images. Courtesy of Vendor 2



    (a) Training Image. (b) Test image with manual annotation.

(c) Stage 1: Local Phase Template Matching (d) Stage 2: Orientation Selection and Cluster Voting

(e) Stage 3: Cluster Selection and HOP Matching (f) Stage 3 Detection for another test image

Figure 12: Detection of the Backhoe at different stages on sample images. Courtesy of Vendor 3


Table 1: Statistics for Vendor 1 Dataset

Backhoe Instance (Flight No) | Stage 1 | Stage 2 | Stage 3 | False Positives
1 (1)  | Y | Y | Y | 0
2 (2)  | Y | Y | Y | 0
3 (3)  | Y | X | X | 0
4 (4)  | Y | Y | Y | 1
5 (5)  | Y | Y | Y | 0
6 (6)  | Y | Y | Y | 1
7 (7)  | Y | Y | Y | 0

Total False Positives: 2
True Detection Rate: 100% | 85.71% | 85.71%

Table 2: Statistics for Vendor 2 Dataset

Backhoe Instance (Flight No) | Stage 1 | Stage 2 | Stage 3 | False Positives
1 (1)  | X | X | X | 0
2 (2)  | X | X | X | 0
3 (3)  | Y | Y | Y | 0
4 (4)  | Y | Y | Y | 0
5 (5)  | Y | Y | Y | 0
6 (6)  | Y | Y | Y | 1
7 (7)  | Y | Y | Y | 1
8 (8)  | Y | Y | Y | 0

Total False Positives: 2
True Detection Rate: 75% | 75% | 75%

Table 3: Statistics for Vendor 3 Dataset

Backhoe Instance (Flight No) | Stage 1 | Stage 2 | Stage 3 | False Positives
1 (1)  | Y | Y | Y | 0
2 (1)  | Y | Y | Y | 0
3 (1)  | Y | Y | Y | 2
4 (1)  | Y | Y | Y | 4
5 (1)  | X | X | X | 0
6 (2)  | Y | Y | Y | 0
7 (2)  | Y | Y | Y | 2
8 (2)  | Y | Y | Y | 4
9 (3)  | Y | X | X | 0
10 (3) | Y | Y | X | 0
11 (3) | Y | X | X | 0

Total False Positives: 12
True Detection Rate: 91% | 73% | 64%


    6. CONCLUSIONS

We have proposed an algorithm which can autonomously detect construction equipment under various lighting conditions and equipment orientations using a multi-layer framework. The framework is based on feature extraction from the local phase information generated by a monogenic analysis of the image. The local phase information brings out the spatial structure of the object, projects it against the surrounding homogeneous background, and is invariant to the illumination present in that region. By computing the histogram of phase and the Histogram of Oriented Phase (HOP) along with a template matching scheme, we have successfully detected construction equipment such as the Backhoe in three different datasets provided by Vendors 1, 2 and 3. Future work will include the detection of other construction equipment, such as the Excavator, Mini-Excavator and Trencher, on the pipeline right of way (ROW), which is more challenging as their size is considerably smaller than the Backhoe's.

    ACKNOWLEDGMENTS

This project has been funded by the Pipeline Research Council International (PRCI), with the test imagery captured in Gary, Indiana. (Project No: PR-433-133700)

    REFERENCES

[1] Gonzalez, R. C. and Woods, R. E., [Digital Image Processing], Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd ed. (1992).

[2] Felsberg, M. and Sommer, G., "The monogenic signal," IEEE Transactions on Signal Processing 49(12), 3136-3144 (2001).

[3] Yao, J. and Zhang, Z., "Semi-supervised learning based object detection in aerial imagery," in [Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on], 1, 1011-1016 (2005).

[4] Khan, S., Cheng, H., Matthies, D., and Sawhney, H., "3D model based vehicle classification in aerial imagery," in [Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on], 1681-1687 (2010).

[5] Guo, Z., Zhang, L., and Zhang, D., "Rotation invariant texture classification using LBP variance (LBPV) with global matching," Pattern Recognition 43, 706-719 (Mar. 2010).

[6] Pietikainen, M., Hadid, A., Zhao, G., and Ahonen, T., [Computer Vision Using Local Binary Patterns], Springer (2011).

[7] Mathew, A. and Asari, V., "Local region statistical distance measure for tracking in wide area motion imagery," in [Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on], 248-253 (2012).

[8] Rubner, Y., Tomasi, C., and Guibas, L. J., "The Earth Mover's Distance as a metric for image retrieval," International Journal of Computer Vision 40 (2000).

[9] Matungka, R., Zheng, Y., and Ewing, R., "Object recognition using log-polar wavelet mapping," in [Tools with Artificial Intelligence, 2008. ICTAI '08. 20th IEEE International Conference on], 2, 559-563 (2008).