
Research Article

A Deep Multiscale Fusion Method via Low-Rank Sparse Decomposition for Object Saliency Detection Based on Urban Data in Optical Remote Sensing Images

Cheng Zhang 1 and Dan He 2

1 City Institute, Dalian University of Technology, China
2 Dalian University of Finance and Economics, China

    Correspondence should be addressed to Cheng Zhang; [email protected]

    Received 6 February 2020; Accepted 16 April 2020; Published 8 May 2020

    Academic Editor: Qingchen Zhang

Copyright © 2020 Cheng Zhang and Dan He. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Urban data provides a wealth of information that can support people's life and work. In this work, we study object saliency detection in optical remote sensing images, which is conducive to the interpretation of urban scenes. Saliency detection selects the regions with important information in remote sensing images, closely imitating the human visual system, and it plays a powerful supporting role in other image processing tasks; it has achieved notable results in change detection, object tracking, temperature inversion, and other applications. Traditional methods suffer from poor robustness and high computational complexity. Therefore, this paper proposes a deep multiscale fusion method via low-rank sparse decomposition for object saliency detection in optical remote sensing images. First, we execute multiscale segmentation for remote sensing images. Then, we calculate the saliency value, and the proposal regions are generated. The superpixel blocks of the remaining proposal regions of the segmentation map are input into a convolutional neural network. By extracting the deep features, the saliency value is calculated and the proposal regions are updated. The feature transformation matrix is obtained based on the gradient descent method, and high-level semantic prior knowledge is obtained by using a fully convolutional neural network. The process is iterated continuously to obtain the saliency map at each scale. The low-rank sparse decomposition of the transformed matrix is carried out by robust principal component analysis. Finally, the weighted cellular automata method is utilized to fuse the multiscale saliency maps with the saliency map calculated from the sparse noise obtained by the decomposition. Meanwhile, the object prior knowledge can filter out most of the background information, reduce unnecessary deep feature extraction, and meaningfully improve the saliency detection rate. The experimental results show that the proposed method can effectively improve the detection effect compared to other deep learning methods.

    1. Introduction

With the rapid advance of information technology, urban data has become one of the important information sources for human beings, and the amount of information received by people has increased exponentially [1, 2]. How to select the object regions of human interest from the mass of urban image information has become a significant research topic. Studies have found that, in a complex scene, the human visual processing system focuses on several objects, termed the region of interest (ROI) [3].

The ROI is relatively close to human visual perception. Saliency, as an image pretreatment process, can be widely applied in remote sensing areas such as visual tracking, image classification, image segmentation, and target relocation.

Saliency detection methods mainly fall into two categories: top-down and bottom-up. The top-down saliency detection methods [4–6] are task-driven processes; ground-truth images are labeled manually for supervised training, and more human perception is integrated to obtain the salient map.



In contrast, the bottom-up methods are data-driven processes and pay more attention to image features such as contrast, position, and texture to compute the saliency map (SM). Itti et al. [7] proposed a spatial visual model that takes full advantage of local contrast and obtains the saliency map via the image differences from the center to the surrounding. Hou and Zhang [8] put forward a saliency detection algorithm based on the Spectral Residual (SR). Achanta et al. [9] proposed a frequency-tuned (FT) method based on the image frequency domain to calculate saliency. A detection method combining histograms was presented to calculate global contrast [10]. Furthermore, other relevant methods were proposed and showed better effects [11–15]. However, these methods do not analyze the image across multiple feature dimensions.

Yan et al. [16] treated the saliency region of the image as sparse noise and the background as a low-rank matrix. They calculated the saliency of the image by using sparse representation and the robust principal component analysis algorithm. Firstly, the image was decomposed into 8 × 8 blocks. Every image block was sparsely encoded and merged into a coding matrix. Then, the coding matrix was decomposed by robust principal component analysis. Finally, the sparse matrix obtained by the decomposition was used to establish the saliency factor of the corresponding image block. However, because a large-size saliency object contains many image blocks, the saliency object in each image block no longer satisfies the sparsity assumption, which greatly affects the detection effect. Lang et al. [17] utilized a multitask low-rank recovery approach for saliency detection. The multitask low-rank representation algorithm was used to decompose the feature matrix while constraining the consistency of all feature sparse components in the same image blocks. The algorithm used the consistency information of multifeature descriptions, and its effect was improved. However, since a large-size target contains a large number of feature descriptions, the features are no longer sparse; the reconstruction error could not solve this problem, so this method could not completely detect saliency objects of large size. To improve on the above methods, Shen and Wu [18] proposed a low-rank matrix recovery (LRMR) algorithm combining bottom-up and top-down cues (providing low-level and high-level information, respectively). First, superpixel segmentation was performed on the image and several features were extracted. Then, the feature transformation matrix and prior knowledge, including size, texture, and color, were obtained by network learning to transform the feature matrix. Finally, the low-rank and sparse decomposition of the transformed matrix was carried out by using the robust principal component analysis algorithm. This method improved on the deficiency to some extent. However, due to the limitation of the center prior and the failure of the color prior in complex scenes, this algorithm was not ideal for detecting images with complex backgrounds.

Saliency detection methods using different low-level features are usually only effective for a specific type of image and are not suitable for multiobject images in complex scenes [19–21]. Figure 1 shows an instance of saliency detection. The low-level features of visual stimuli lack an understanding of the nature of saliency objects and cannot represent features at a deeper level. Noisy objects in the image that are similar in low-level features but do not belong to the same category are often wrongly detected as saliency objects. Yang et al. [22] presented a bag-of-words model to detect saliency. Firstly, a prior probability saliency map was obtained from the object features, and a bag-of-words model representing middle-level semantic features was established to calculate the conditional probability saliency map. Finally, the two saliency maps were combined by Bayesian inference. The middle-level semantic features could represent the image content more accurately than bottom-level features, so the detection effect was more accurate. Jiang et al. [23] treated saliency detection as a regression problem and integrated regional attributes, contrast, and feature vectors of regional background knowledge under multiscale segmentation conditions. The saliency map was obtained by supervised learning. Due to the introduction of background knowledge features, the algorithm had a better ability to identify background objects and thus obtained more accurate foreground detection results.

Deep learning (DL) combines low-level features to form more abstract high-level features; a typical representative is the convolutional neural network (CNN). Many saliency detection methods have adopted CNNs to optimize the result. Li et al. [24] proposed a deep CNN to detect saliency. Firstly, region and edge information were obtained by using a superpixel algorithm and bilateral filtering. A DCNN was utilized to extract region and edge features from raw images. Finally, the region confidence map and edge confidence map generated by the CNN were integrated into a conditional random field to judge saliency. Wang et al. [25] proposed recurrent fully convolutional networks (RFCN) for saliency detection, which mainly included two steps: pretraining and fine-tuning. The RFCN was trained on the original image to correct the saliency prior map. Then, a traditional algorithm was used to further optimize the modified saliency map.

    Figure 1: Saliency detection instance.


Lee et al. [26] proposed a deep saliency (DS) algorithm for saliency detection using low- and high-level information in a unified CNN framework. VGG-Net was used to extract the high-level features, while the low-level features were used to build a distance map that was then encoded by a CNN. Finally, the encoded low-level distance map was concatenated with the high-level features, and a fully connected CNN classifier was adopted to evaluate the feature information and obtain the saliency map [27]. The above DL methods show excellent performance in terms of saliency detection rate, but they still have disadvantages such as slow speed and highly complex calculations.

In this paper, we propose a deep multiscale fusion method via low-rank sparse decomposition for object saliency detection in optical remote sensing images. The main contributions are as follows.

(a) First, multiscale segmentation is executed for remote sensing images. For the first segmentation map, the deep features of all the superpixel blocks are extracted by a CNN

(b) Then, we calculate the saliency value, and the proposal regions are generated. The superpixel blocks of the remaining proposal regions of the segmentation map are input into the CNN. By extracting the deep features, the saliency value is calculated and the proposal regions are updated. Meanwhile, the mean values of the color, texture, and edge features of all the pixels in each superpixel are calculated to construct the feature matrix. In order to make the image background facilitate low-rank sparse decomposition, the above feature matrices need to be transformed so that the background can be represented as a low-rank matrix in the new feature space

(c) To make use of high-level information and improve the detection effect of the ROI, the fully convolutional neural network is used to learn features, and the high-level semantic prior knowledge matrix is obtained. The feature matrix is transformed by using the feature transformation matrix and the high-level semantic prior knowledge. The robust principal component analysis algorithm is used to carry out the low-rank sparse decomposition of the transformed matrix to obtain a saliency map. The process is iterated continuously to obtain the saliency map at each scale

(d) Finally, the weighted cellular automata method fuses the multiscale saliency maps. It is shown that the proposed method can effectively improve the detection effect compared to other DL methods

The remainder of the paper is organized as follows. Section 2 outlines the proposed deep multiscale fusion framework for saliency detection. Section 3 describes saliency region extraction based on multiscale segmentation and saliency computation from deep features. Section 4 evaluates the performance and robustness. The conclusion is drawn in Section 5.

2. Deep Multiscale Fusion for Saliency Detection

The proposed deep multiscale fusion framework for saliency detection in optical remote sensing images is shown in Figure 2.

Figure 2: The framework of the proposed saliency detection method.

Firstly, the image I is segmented into a small number of superpixel blocks by using the superpixel segmentation algorithm. Deep features are extracted from all the superpixel blocks. The mean values of the color, texture, and edge features of all the pixels in each superpixel are calculated to construct the feature matrix. In order to make the image background facilitate low-rank sparse decomposition, the above feature matrix needs to be transformed so that the background can be represented as a low-rank matrix in the new feature space. The multidimensional features containing the key information of the image are extracted by PCA (principal component analysis). The rough segmentation saliency map is obtained based on the calculation of the key features, from which we extract the initial saliency region to obtain the superpixel set Suppix. Then, we adopt Suppix to measure the similarity degree between a superpixel and the nonobject region. The input image is segmented at different scales, and the regions containing the superpixel blocks in the Suppix set are selected for deep feature extraction. Saliency maps and Suppix sets at the next scale are obtained by the same method. Robust PCA is used to carry out the low-rank sparse decomposition of the transformed matrix to obtain a saliency map. Weighted cellular automata fusion is used to obtain the final SM $M_{final}$.

3. Saliency Region Extraction Based on Multiscale Segmentation

Superpixel segmentation gathers adjacent similar pixels into image regions of different sizes according to low-level features such as brightness, thus reducing the complexity of the saliency calculation. Superpixel segmentation algorithms mainly include the watershed [28] and simple linear iterative clustering (SLIC) [25] methods. We combine their respective characteristics in this study: the SLIC method is used to obtain segmentation results with regular shape and uniform size during rough segmentation, and the watershed algorithm is used to obtain better object contours during fine segmentation, as sketched below.
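The following is a minimal sketch (not the authors' exact settings) of these two segmentation stages using scikit-image; the file name, segment count, compactness, and seed spacing are illustrative assumptions.

```python
# SLIC for the rough scale, marker-based watershed for the finer scales.
import numpy as np
from skimage import io, color, filters, segmentation, util

img = util.img_as_float(io.imread("remote_sensing_tile.png"))[:, :, :3]  # hypothetical file

# Rough scale: SLIC yields superpixels of regular shape and uniform size.
labels_coarse = segmentation.slic(img, n_segments=200, compactness=10, start_label=1)

# Finer scale: watershed on the gradient image adheres better to object contours.
gradient = filters.sobel(color.rgb2gray(img))
markers = np.zeros(gradient.shape, dtype=int)
step = max(1, int(np.sqrt(gradient.size / 800)))          # ~800 evenly spaced seeds
ys, xs = np.mgrid[step // 2:gradient.shape[0]:step, step // 2:gradient.shape[1]:step]
markers[ys, xs] = np.arange(1, ys.size + 1).reshape(ys.shape)
labels_fine = segmentation.watershed(gradient, markers)
```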

For $N$ segmentation scales $(s_1, \dots, s_N)$, let $\mathrm{Sup}^{j} = \{Sp_i^{j}\}_{i=1}^{N_j}$ denote the superpixel set obtained at segmentation scale $s_j$, where $N_j$ is the number of superpixels at scale $s_j$, and $Sp_i^{j}(v) = \{R, G, B, L, a, b\}$ is the color feature vector of pixel $v$ in the superpixel.

For the input image, we extract color, texture, and edge features to construct the feature matrix.

(i) Color feature. The gray values of R, G, and B, hue, and saturation are extracted to describe the color feature of the image

(ii) Edge feature. A steerable pyramid filter is used to decompose the image at multiple scales and directions. Filters with 3 scales and 4 directions are selected to obtain 12 responses as the edge features of the image

(iii) Texture feature. A Gabor filter is used to extract texture features at different scales and directions. Here, 3 scales and 12 directions are selected to obtain 36 responses as the texture features

The mean value of all pixel features in each superpixel is calculated to represent the eigenvalue $f_i$. All the eigenvalues constitute the eigenmatrix $F = [f_1, f_2, \dots, f_N]$, $F \in \mathbb{R}^{d \times N}$, as sketched below.
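As an illustration of how such a feature matrix can be assembled, the sketch below computes per-superpixel means of color, edge, and texture responses. The steerable-pyramid edge responses of the paper are replaced by a simple Sobel response, and the Gabor bank is reduced to 2 scales × 4 directions for brevity, so the feature dimension differs from the 5 + 12 + 36 channels described above; both substitutions are assumptions made to keep the example short.

```python
import numpy as np
from skimage import color, filters

def feature_matrix(img, labels):
    """Return F (d x N) of per-superpixel feature means for an RGB float image."""
    gray = color.rgb2gray(img)
    hsv = color.rgb2hsv(img)
    channels = [img[..., 0], img[..., 1], img[..., 2], hsv[..., 0], hsv[..., 1]]  # colour
    channels.append(filters.sobel(gray))                       # edge (Sobel stand-in)
    for freq in (0.1, 0.25):                                   # reduced Gabor texture bank
        for theta in np.linspace(0, np.pi, 4, endpoint=False):
            real, _ = filters.gabor(gray, frequency=freq, theta=theta)
            channels.append(real)
    stack = np.stack(channels, axis=-1)                        # H x W x d
    ids = np.unique(labels)
    F = np.stack([stack[labels == i].mean(axis=0) for i in ids], axis=1)  # d x N
    return F, ids
```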

The saliency region of the image is regarded as sparse noise and the background as a low-rank matrix. In a complex background, however, the similarity degree of the image background after clustering is still not high, so the features of the original image are not conducive to low-rank sparse decomposition. In order to find a suitable feature space in which most image backgrounds can be represented as low-rank matrices, the eigentransformation matrix is obtained in this paper based on the gradient descent method. The process of obtaining the eigentransformation matrix is as follows:

(a) Construct the marker matrix $Q = \mathrm{diag}\{q_1, q_2, \dots, q_N\}$. If the superpixel $p_i$ is within the manually marked saliency region, $q_i = 0$; otherwise, $q_i = 1$

(b) According to the following formula, the optimal transformation matrix $T$ is learned from the features of the raw images:

$$T_{\mathrm{optimal}} = \arg\min_{T} O(T) = \frac{1}{K} \sum_{k=1}^{K} \left\| T F_k Q_k \right\|_{*} - \gamma \left\| T \right\|_{*} \tag{1}$$

where $F_k \in \mathbb{R}^{d \times N_k}$ is the feature matrix of the $k$th image, $N_k$ represents the superpixel number of the $k$th image, and $Q_k \in \mathbb{R}^{N_k \times N_k}$ is the label matrix of the $k$th image. $\|\cdot\|_{*}$ represents the nuclear norm of a matrix, that is, the sum of all its singular values, and $\gamma$ is a weight coefficient. The $\ell_2$ norm $\|T\|_2$ is kept at a constant $c$ to prevent $T$ from arbitrarily increasing or decreasing. If the eigentransformation matrix $T$ is appropriate, then $TFQ$ is low rank. The term $-\gamma \|T\|_{*}$ avoids obtaining a trivial solution in which the rank of $T$ is arbitrarily small.

(c) Find the gradient descent direction of $O(T)$, that is,

$$\frac{\partial O(T)}{\partial T} = \frac{1}{K} \sum_{k} \frac{\partial \left\| T F_k Q_k \right\|_{*}}{\partial T} - \gamma \frac{\partial \left\| T \right\|_{*}}{\partial T} \tag{2}$$

(d) Adopt the following formula to update the eigentransformation matrix $T$ until the algorithm converges to a local optimum, where $\alpha$ is the step size; a small numerical sketch of this procedure is given after the list

$$T_{t+1} = T_t - \alpha \frac{\partial O(T)}{\partial T} \tag{3}$$

3.1. Extracting Proposal Regions. The segmentation map at a rough segmentation scale $s_j$ is taken as input. The saliency map $Map_j$ is obtained by deep feature extraction and saliency value calculation. $Map_j$, as the object prior knowledge for the next segmentation, is used to guide the proposal region extraction. The saliency map $Map_j$ is binarized: the values of $Map_j$ are divided into $K$ channels by an adaptive threshold strategy, $p(i)$ denotes the number of pixels in channel $i$, and the channel $k$ with the largest number of pixels among all channels is determined. The threshold value $T$ is calculated by formula (4):

$$T = (k + 1)/K \tag{4}$$

In order to prevent $T$ from becoming too large, so that salient pixels are not binarized to 0 when the saliency object occupies most of the image, the pixel number in each channel must satisfy $p(i)/\mathrm{area}(I) < \varepsilon$, where $\mathrm{area}(I)$ is the pixel number of image $I$ and $\varepsilon \in [0.6, 0.9]$ is an empirical value. The binarized object prior map is denoted as $MapB_j$, and we adopt $MapB_j$ as the prior knowledge. The corresponding superpixel areas of the superpixel set $\mathrm{Sup}^{j+1} = \{Sp_i^{j+1}\}_{i=1}^{N_{j+1}}$ at the next scale $s_{j+1}$ constitute the proposal saliency superpixel set $Suppix^{j+1} = \{Sp_i^{j+1}\}_{i=1}^{M_{j+1}}$, where $M_{j+1}$ is the number of proposal saliency superpixels at scale $s_{j+1}$ and $M_{j+1} \le N_{j+1}$. If the proportion of salient pixels of $MapB_j$ within a superpixel is greater than 0.5, the superpixel at the corresponding position is considered to belong to $Suppix^{j+1}$.

3.2. Region Optimization. The proposal object superpixel set may contain some background areas or miss some saliency areas, so the proposal object area needs to be optimized: possible background areas are removed from $Suppix^{j+1}$, and possible saliency areas in the background are added. According to the Euclidean distance between the color-space feature vectors of two superpixels, the difference matrix $Difmat$ is defined; it is a symmetric matrix of order $N_{j+1}$.

$$Difmat(i, j) = Difmat(Sp_i, Sp_j) = \sqrt{\sum_{k=1}^{6} \left( F_{i,k} - F_{j,k} \right)^2} \tag{5}$$

where $F_{i,k}$ is the $k$th feature of superpixel region $Sp_i$, and $k = 1, \dots, 6$ corresponds to $R$, $G$, $B$, $L$, $a$, and $b$, respectively. For $Sp_k \in Suppix^{j+1}$, the local average dissimilarity degree is calculated through equation (6):

$$MavDif(Sp_k) = \frac{\sqrt{\sum_{l=1, l \neq k}^{M_{j+1}} Difmat(Sp_k, Sp_l)^2}}{M_{j+1}} \tag{6}$$

where $Sp_k, Sp_l \in Suppix^{j+1}$ and $M_{j+1}$ is the superpixel number in the proposal saliency region set $Suppix^{j+1}$. We also calculate the average dissimilarity degree between each superpixel $Sp_k$ in $Suppix^{j+1}$ and its adjacent background region:

$$MavDif'(Sp_k) = \frac{\sqrt{\sum_{l=1, l \neq k}^{M'_{j+1}} Difmat(Sp_k, Sp_l)^2}}{M'_{j+1}} \tag{7}$$

where $Sp_k \in Suppix^{j+1}$, $Sp_l \notin Suppix^{j+1}$, and $Sp_k$ is adjacent to $Sp_l$. $M'_{j+1}$ represents the number of superpixels adjacent to $Sp_k$ in the background area. If $MavDif'(Sp_k) > MavDif(Sp_k)$, it indicates that $Sp_k$ is more similar to the adjacent background area, and $Sp_k$ is removed from $Suppix^{j+1}$.

Similarly, for any $Sp_h \notin Suppix^{j+1}$, the average dissimilarity $MavDif'(Sp_h)$ between $Sp_h$ and the adjacent background region, and the average dissimilarity $MavDif(Sp_h)$ between $Sp_h$ and the adjacent proposal saliency region, can be calculated. If the condition $MavDif'(Sp_h) > MavDif(Sp_h)$ is satisfied, then the similarity between $Sp_h$ and the adjacent saliency region is higher than that with the background regions, and $Sp_h$ is added to $Suppix^{j+1}$. $Suppix^{j+1}$ is constantly updated by comparing the superpixels in $Suppix^{j+1}$ with the other saliency regions and background regions, until the superpixels in $Suppix^{j+1}$ no longer change. A small sketch of this iterative optimization is given below.

3.3. Deep Feature Extraction of Proposal Regions. The deep feature extraction method based on CNN is shown in Figure 3.

Figure 3: Deep feature extraction based on CNN. The regions Rect_self, Rect_local, and Rect_global are passed through the GoogleNet model to produce the deep features F_self, F_local, and F_global.

In the first superpixel segmentation, the deep features of all superpixels are extracted. In the subsequent deep feature extraction steps, only the superpixels in the Suppix set are processed. Under a given segmentation strategy, the computation is thus greatly reduced and the computation speed is increased.

Assuming it is not the first segmentation, local and global features are extracted for each superpixel $Sp_i$. The local features of the superpixel include two parts: (1) the deep feature $F_{self}$ of its own region; (2) the deep feature $F_{local}$ of itself and the adjacent superpixel regions.

First, according to the Suppix set, the minimum bounding rectangle $Rect_{self}$ of each superpixel $Sp_i$ is extracted. Since most superpixels are not regular rectangles, the extracted rectangles inevitably contain other pixels; these pixels are represented by the average value of the superpixel. The deep feature $F_{self}$ of the superpixel's own region can then be obtained through the deep CNN.

Using only $F_{self}$ to compute a saliency value is not meaningful, because it is impossible to determine whether a region is salient without comparing it with adjacent superpixels. Therefore, $Rect_{local}$ is also extracted to obtain the deep local feature $F_{local}$. The location of a region in the image is another important factor in judging saliency: an area in the center of the image is generally more likely to be salient than a region at the edge. Therefore, the whole image is taken as input, and the deep feature $F_{global}$ of the global region is extracted.

If only bottom-level features are used to compute the saliency map, the result is not ideal because of the many interfering objects. Therefore, high-level information needs to be added to improve the detection effect. The adopted high-level semantic prior knowledge mainly predicts the most likely ROI based on previous experience (i.e., training samples). An FCNN is used to train the high-level semantic prior knowledge, which is integrated into the feature transformation process to optimize the final saliency map. Higher-order features can be learned from the raw data without preprocessing in the multistage global training process of the CNN.

An FCNN can accept input images of any size. The difference between an FCNN and a CNN is that deconvolution layers replace the fully connected layers. Finally, pixel-level classification is carried out on the upsampled feature map: a binary prediction is produced for each pixel, and a classification result at the pixel level is output, which solves the problem of image segmentation at the semantic level. The semantic prior is important high-level information for ROI detection and can assist it. Therefore, this paper adopts an FCNN to obtain high-level semantic prior knowledge and applies it to the detection of the ROI.

The network structure of the FCNN is shown in Figure 4. Based on the original classifier, this paper utilizes the back-propagation algorithm to fine-tune the parameters in all FCNN layers. In the network structure, the first row obtains the feature map after seven convolutional layers alternating with five pooling layers; the final deconvolution layer upsamples the feature map with a stride of 32 pixels, and this network structure is denoted as FCNN-32s. It is found that the precision decreases because of the maximum pooling operations: directly upsampling the downsampled feature map results in very rough output and loss of detail. Therefore, in this paper, the stride-32 features obtained from the upsampling are enlarged by a factor of 2 and summed with the stride-16 features; the combined features are then recovered to the original image size for training, and the FCNN-16s model is obtained, which gives more accurate detail than FCNN-32s. The same method is used to train the network to obtain the FCNN-8s model, whose prediction of detail is more accurate still. Experiments show that although fusing lower-level features for training can make detail prediction more accurate, the effect on the low-rank sparse decomposition result is not significantly improved, while the training time increases sharply. Therefore, this paper adopts the FCNN-8s model to acquire the high-level prior knowledge of images; a sketch of this skip fusion is given below.
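The following PyTorch sketch illustrates the FCNN-8s style skip fusion described above: the stride-32 prediction is upsampled and summed with predictions from the stride-16 and stride-8 pooling features before a final 8× upsampling. Layer names, channel counts, and the two-class output are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class FCN8sHead(nn.Module):
    """Skip-fusion head: expects pool3 (stride 8), pool4 (stride 16), conv7 (stride 32)."""
    def __init__(self, c3=256, c4=512, c7=4096, num_classes=2):
        super().__init__()
        self.score7 = nn.Conv2d(c7, num_classes, 1)   # prediction from conv6-7 features
        self.score4 = nn.Conv2d(c4, num_classes, 1)   # prediction from pool4
        self.score3 = nn.Conv2d(c3, num_classes, 1)   # prediction from pool3
        self.up2a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, pool3, pool4, conv7):
        s = self.up2a(self.score7(conv7)) + self.score4(pool4)   # fuse at stride 16
        s = self.up2b(s) + self.score3(pool3)                    # fuse at stride 8
        return self.up8(s)                                       # back to input resolution
```

The spatial sizes line up exactly when the input side lengths are divisible by 32, which is the usual assumption for FCN-style training.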

The deep CNN model comprises an input layer, multiple convolution layers, downsampling layers, a fully connected layer, and an output layer.

Figure 4: FCNN model (the FCNN-32s, FCNN-16s, and FCNN-8s structures with 32×, 16×, and 8× upsampled predictions).

The downsampling layers and convolution layers form the intermediate structure of the neural network; the former are used for feature extraction and the latter for feature calculation. The fully connected layer is connected to the downsampling layer and outputs the features. The output of a convolution layer is:

$$d_n^l = f\left( \sum_{\forall m} d_m^{l-1} \ast k_{m,n}^l + b_n^l \right) \tag{8}$$

where $d_n^l$ and $d_m^{l-1}$ are the feature maps of the current layer and the previous layer, $k_{m,n}^l$ is the convolution kernel of the model, $f(x) = 1/(1 + e^{-x})$ is the neuron activation function, and $b_n^l$ is the neuron bias. The feature extraction result of the downsampling layer is:

$$d_n^l = f\left( k_n^l \cdot \frac{1}{s^2} \sum_{s \times s} d_n^{l-1} + b_n^l \right) \tag{9}$$

where $s \times s$ is the downsampling template scale and $k_n^l$ is the template weight. In this paper, the trained GoogleNet model is used to extract the deep features of the proposal object regions. On the strength of this model, the labeled output layer is removed to obtain the deep features. The convolution layer C1 uses 96 filters of size $11 \times 11 \times 3$ to filter the input image of size $224 \times 224 \times 3$. The convolution layers C2, C3, C4, and C5 each take the output of the preceding downsampling layer as their input; the convolution processing is carried out by the corresponding filters, and several output feature maps are obtained and transmitted to the next layer. The fully connected layers F6 and F7 have 4096 features each. The output of each fully connected layer can be denoted as:

$$d_n^{out} = f\left( \sum \left( d_n^{out-1} \times k_{m,n}^{out} \right) + b_n^{out} \right) \tag{10}$$

3.4. Saliency Calculation Based on Deep Features. PCA [28] is a common method for dimension reduction of high-dimensional data, which can replace $p$ high-dimensional features with a smaller number $m$ of features. For $n$ superpixels, the output features constitute a sample matrix $W$ of dimension $n \times p$. The correlation coefficient matrix $R = (r_{ij})_{p \times p}$ of the sample is calculated by formula (11):

$$r_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2 \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2}}, \quad i, j = 1, 2, \dots, p \tag{11}$$

where $\bar{x}_i = \frac{1}{n}\sum_{k=1}^{n} x_{ki}$. By solving the equation $|\lambda I - R| = 0$, we find the eigenvalues and order them. Then, we calculate the contribution rate and cumulative contribution rate of each eigenvalue $\lambda_i$:

$$conrate = \lambda_i \Big/ \sum_{k=1}^{p} \lambda_k, \quad cumrate = \sum_{k=1}^{i} \lambda_k \Big/ \sum_{k=1}^{p} \lambda_k, \quad i = 1, 2, \dots, p \tag{12}$$

We calculate the corresponding orthogonal unit vector $z_i = [z_{i1}, z_{i2}, \dots, z_{ip}]^T$ of each eigenvalue $\lambda_i$. The unit vectors corresponding to the first $m$ features with a cumulative contribution rate of 95% are selected to form the transformation matrix $Z = [z_1, z_2, \dots, z_m]_{p \times m}$. The high-dimensional matrix is reduced by formula (13), and $Sp_i(df) = (f_{i,1}, f_{i,2}, \dots, f_{i,m})$ denotes the $m$-dimensional principal component feature. The principal component features are extracted by the same transformation matrix in the segmentation maps at different scales.

$$Sp_i(df) = W_{n \times p} Z_{p \times m} \tag{13}$$

3.5. Contrast Feature. The contrast feature reflects the degree of difference between a region and its adjacent regions. The contrast feature $w_c(Sp_i)$ of the superpixel $Sp_i$ is defined by its distance from the features of the other superpixels, as given in equation (14):

$$w_c(Sp_i) = \frac{1}{n - 1} \sum_{k=1, k \neq i}^{n} \left\| Sp(df)_i - Sp(df)_k \right\|_2 \tag{14}$$

where $n$ denotes the number of superpixels and $\|\cdot\|_2$ is the 2-norm.

3.6. Spatial Feature. The human visual system pays different amounts of attention to different spatial positions. The distance between pixels at different positions and the image center satisfies a Gaussian distribution. For any superpixel $Sp_i$, its spatial feature $w_s(Sp_i)$ is calculated as:

$$w_s(Sp_i) = e^{-\frac{d(Sp_{i,x},\, c)}{\sigma^2}} \tag{15}$$

where $Sp_{i,x}$ is the central coordinate of superpixel $Sp_i$ and $c$ is the central region of the image. The smaller the average distance from the image center, the larger the spatial feature. The saliency value of the superpixel $Sp_i$ is then:

$$Map(Sp_i) = w_c(Sp_i) \times w_s(Sp_i) \tag{16}$$

We obtain the SM of the first segmented image and use it as the object prior knowledge to guide the proposal region extraction and optimization.
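A sketch of the saliency computation of equations (14)-(16) follows; the center distance is normalized by the image half-diagonal, which is an assumption not specified above.

```python
import numpy as np

def saliency_values(feats, centres, image_shape, sigma=0.25):
    """feats: n x m principal-component features; centres: n x 2 superpixel centroids."""
    n = feats.shape[0]
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    contrast = dists.sum(axis=1) / (n - 1)                   # equation (14)
    centre = np.array(image_shape[:2]) / 2.0
    diag = np.linalg.norm(centre)                            # half-diagonal for normalisation
    spatial = np.exp(-np.linalg.norm(centres - centre, axis=1) / (diag * sigma ** 2))  # eq. (15)
    return contrast * spatial                                # equation (16)
```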

3.7. Saliency Detection Based on Low-Rank Sparse Decomposition. The background in the image can be expressed as a low-rank matrix, while the saliency region can be regarded as sparse noise. For an original image, the eigenmatrix $F = [f_1, f_2, \dots, f_N] \in \mathbb{R}^{d \times N}$ and the eigentransformation matrix $T$ are obtained. Then, we use the FCNN to obtain the high-level prior knowledge $P$. The low-rank sparse decomposition of the transformed matrix is carried out by robust PCA:

$$(L^*, S^*) = \arg\min_{L, S} \left( \| L \|_{*} + \lambda \| S \|_1 \right) \quad \text{s.t.} \quad TFP = L + S \tag{17}$$


where $F$ is the eigenmatrix, $T$ is the learned eigentransformation matrix, $P$ is the high-level prior knowledge matrix, $L$ is a low-rank matrix, and $S$ represents the sparse matrix. $\|\cdot\|_{*}$ represents the nuclear norm of a matrix, that is, the sum of all its singular values, and $\|\cdot\|_1$ represents the $\ell_1$-norm of a matrix, the sum of the absolute values of all its elements. Supposing that $S^*$ is the optimal solution for the sparse matrix, the saliency map can be calculated by the following equation:

$$Sal(p_i) = \left\| S^*(:, i) \right\|_1 \tag{18}$$

where $Sal(p_i)$ represents the saliency value of superpixel $p_i$, and $\|S^*(:, i)\|_1$ represents the $\ell_1$-norm of the $i$th column vector of $S^*$, that is, the sum of the absolute values of all the elements in that vector.

3.8. Saliency Map Fusion Based on Weighted Cellular Automata. Wang and Wang [29] adopted multilayer cellular automata (MCA) for object fusion. Each pixel represents a cell. In an $m$-layer cellular automaton, each cell of a saliency map has $m - 1$ neighbors, located at the same positions in the other saliency maps.

If cell $i$ is labeled as foreground, the foreground probability of its neighbor $j$ at the same position in the other SMs is $\lambda = P(\eta_i = +1 \mid i \in F)$. Saliency maps obtained by different methods are considered to be independent, and when updating synchronously, all saliency maps are usually considered to have the same weight. However, there are guiding and refining relationships between the saliency maps at different segmentation scales, so the weights cannot be treated as equal during the fusion process. Across the segmentation scales, it is assumed that the weight of the SM obtained at the first segmentation scale is $\lambda_1$, represented by $w_1 = \lambda_1$. The SM weight at each scale is expressed as:

$$w_i = \lambda_{i-1} + \left( 1 - o_i / O_i \right), \quad i = 1, 2, \dots, 6 \tag{19}$$

where $O_i$ denotes the total pixel number in the proposal object set and $o_i$ is the superpixel number in the $i$th saliency map. We set $\lambda_1 = 1$. The synchronous updating mechanism $f: Map^{M-1} \rightarrow Map$ is defined as:

$$l\left( Map_m^{t+1} \right) = w_m \sum_{k=1, k \neq m}^{M} \mathrm{sign}\left( Map_k^{t} - \gamma_k \cdot I \right) \cdot \ln\frac{\lambda}{1 - \lambda} + l\left( Map_m^{t} \right) \tag{20}$$

where $Map_m^{t} = [Map_{m,1}^{t}, \dots, Map_{m,H}^{t}]^T$ represents the saliency values of all the cells of the $m$th SM at time $t$, and $I$ is a matrix with $H$ elements. If the neighbors of a cell are judged as foreground, then its saliency value should be increased. We obtain the final saliency map by formula (21), where $T_2$ denotes the final updating time.

$$Map_{final} = \frac{1}{N} \sum_{m=1}^{M} \left( Map_m^{T_2} + Sal(p_i) \right) \tag{21}$$
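A hedged sketch of this weighted fusion is given below: each map is updated in the logit domain by the sign votes of its neighbors in the other maps, scaled by the per-scale weight as in equation (20), and the updated maps are averaged together with the sparse-noise saliency of equation (18) as in equation (21). The per-map thresholds $\gamma_k$ are taken as the map means, and the sparse-noise saliency is assumed to be already mapped back to pixel resolution; both are simplifying assumptions.

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def fuse_maps(maps, weights, sal, lam=0.6, steps=5):
    # Normalise each scale's saliency map to [0, 1].
    maps = [m.astype(float) / (m.max() + 1e-12) for m in maps]
    gammas = [m.mean() for m in maps]          # per-map thresholds (simplifying assumption)
    for _ in range(steps):
        new = []
        for i, m in enumerate(maps):
            # Sign votes of the neighbours in the other maps, cf. equation (20).
            vote = sum(np.sign(maps[k] - gammas[k]) for k in range(len(maps)) if k != i)
            new.append(logit(m) + weights[i] * vote * np.log(lam / (1 - lam)))
        maps = [1.0 / (1.0 + np.exp(-l)) for l in new]
    # Equation (21): average the updated maps together with the sparse-noise saliency.
    return sum(m + sal for m in maps) / len(maps)
```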

The proposed deep multiscale fusion method for object saliency detection is summarized in Algorithm 1.

Input: Raw image I, the number of segmentation scales N, and the segmentation parameters at each scale.
Output: Saliency map.
for i = 1 : N {
  if i = 1 then
    (1) According to the determined parameters, use SLIC to segment image I;
    (2) Determine the input regions Rect_self, Rect_local, Rect_global of each superpixel;
    (3) Input them into GoogleNet to extract the deep features F_self, F_local, F_global;
    (4) The deep features of all superpixels constitute a matrix W; calculate the transformation matrix Z of W using PCA to obtain the principal component features;
    (5) According to the principal component features, calculate the saliency values without object priors to obtain the first segmentation saliency map Map_1;
  else
    (6) According to the determined parameters, use the watershed algorithm to segment the image;
    (7) Take the saliency map Map_{i-1} as the object prior map; extract and optimize the proposal object set Suppix;
    (8) Determine the input regions Rect_self, Rect_local, Rect_global in Suppix;
    (9) Input them into GoogleNet to extract the deep features F_self, F_local, F_global;
    (10) The deep features of all superpixels constitute a matrix W; calculate the transformation matrix Z of W using PCA to obtain the principal component features;
    (11) According to the principal component features, calculate the saliency values with object priors and obtain the saliency map Map_i;
  end if
}
(12) Calculate the saliency map weight w_i at each scale;
(13) Adopt weighted cellular automata to fuse the obtained N saliency maps and get the final SM.

Algorithm 1: Proposed saliency detection method.


  • 4. Experiments and Analysis

In this section, the experimental data are obtained from Google Earth. The remote sensing image sizes range from 512 × 512 pixels to 2048 × 2048 pixels, with a spatial resolution of 1 m. The experimental environment is an Intel(R) Core(TM) i7-8750 CPU at 2.2 GHz and a GeForce GTX 1060, with the MATLAB 2017a platform.

4.1. Evaluation Index and Parameter Setting. In the experiments, the PR curve, F-measure, and mean absolute error (MAE) of the saliency map are compared to evaluate the effect of saliency detection and to select a better segmentation scale.

Precision and Recall are the two most commonly used evaluation criteria in image saliency detection. The higher the PR curve, the better the effect of saliency detection; otherwise, it is poor. For a given manually labeled Ground Truth G and saliency map S, Precision and Recall are defined in equation (22):

$$Precision = \frac{sum(S, G)}{sum(S)}, \quad Recall = \frac{sum(S, G)}{sum(G)} \tag{22}$$

where $sum(S, G)$ represents the sum of the pixelwise products of the visual feature map $S$ and $G$, $sum(S)$ is the sum of all pixels in the visual feature map $S$, and $sum(G)$ represents the sum of all pixels in $G$.

When calculating the F-measure, the adaptive threshold $T$ of each image is used to segment the image:

$$T = \frac{2}{W \cdot H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y) \tag{23}$$

where $W$ and $H$ denote the width and height of the image, respectively. The average precision and recall of the SM are calculated, and the average F-measure value is calculated according to equation (24). The better the F-measure value, the better the saliency effect. The F-measure is used for the comprehensive evaluation of precision and recall, and $\beta^2$ is often set to 1.

$$F = \frac{(1 + \beta^2) \, Precision \cdot Recall}{\beta^2 \, Precision + Recall} \tag{24}$$

MAE is used to evaluate the saliency model by comparing the difference between the SM and the GT. We use formula (25) to compute the MAE value of each input image. The calculated MAE values can be used to draw a histogram. A lower MAE value indicates a better algorithm.

$$MAE = \frac{1}{W \cdot H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right| \tag{25}$$
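A numpy sketch of these evaluation indexes follows; precision and recall are computed on the map binarized with the adaptive threshold of equation (23), which matches the F-measure procedure described above.

```python
import numpy as np

def evaluate(S, G, beta2=1.0):
    """S: saliency map, G: ground truth; both 2-D arrays scaled to [0, 1]."""
    S = S.astype(float) / (S.max() + 1e-12)
    G = (G > 0.5).astype(float)
    mae = np.abs(S - G).mean()                       # equation (25)
    T = 2.0 * S.mean()                               # adaptive threshold, equation (23)
    B = (S >= T).astype(float)
    precision = (B * G).sum() / (B.sum() + 1e-12)    # equation (22)
    recall = (B * G).sum() / (G.sum() + 1e-12)
    f_measure = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-12)  # eq. (24)
    return precision, recall, f_measure, mae
```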

4.2. Segmentation Scale Determination. The main parameter of this algorithm is the segmentation scale. Too many segmentation scales increase the computational complexity, while too few scales affect the accuracy of saliency detection.

Therefore, 15 segmentation scales are set according to experience. We conduct experiments on randomly selected remote sensing image data, extract the deep features of all superpixels in the segmentation maps, and calculate the saliency maps. The histogram of the PR curve at different segmentation scales is shown in Figure 5. Three segmentation scales with better effects are selected from them. Through comparative analysis, it is found that the segmentation scales 10, 11, and 12 have a relatively better saliency detection effect, so these three scales are selected as the final segmentation scales of the proposed method.

4.3. PCA Parameter Determination. To verify the effectiveness of PCA in selecting principal components from the deep features, this section adopts the deep features extracted from each superpixel block as the data set. The percentage of explained variance (PEV) is used to measure the importance of the principal components in the overall data, as in formula (26). PEV is a main index to describe the distortion rate of the data.

$$PEV = \sum_{i=1}^{m} R_{ii}^2 \Big/ \mathrm{tr}\left( \Sigma \right) \tag{26}$$

where $R_{ii}^2$ comes from the right matrix of the principal component matrix $M'$ after singular value decomposition and $\Sigma$ denotes the covariance matrix. Figure 6 shows the relation between the PEV and the top 50 principal components.

Figure 5: Histogram of the PR curve at different segmentation scales.

Figure 6: Relation between the PEV and the number of principal components.

It reveals that, with the increase in the number of principal components, the PEV shows an upward trend, but the trend grows slowly. When the number of principal components exceeds 20, the PEV reaches 90%, which is considered to represent the overall information of the data. Therefore, in this paper, the top 20 principal components are selected for the saliency calculation.

Figure 7: Test images: airplane1, cloud, vehicle, playground, airplane2, boat. (a) Raw remote sensing images; (b) Ground Truth.

Figure 8: Comparison of saliency images. (a) RA. (b) RB. (c) SC. (d) RAD. (e) SCLR. (f) Proposed.

4.4. Saliency Detection Comparison with Other State-of-the-Art Methods. In this section, five state-of-the-art methods, namely RA [30], RB [31], SC [32], RAD [33], and SCLR [34], are compared with the proposed deep multiscale fusion method. We conduct experiments on several optical remote sensing images based on urban data, namely airplane (512 × 512 pixels), playground (1024 × 1024 pixels), boat (1024 × 1024 pixels), vehicle (512 × 512 pixels), and cloud (2048 × 2048 pixels). Due to limited space, we only display the results for several remote sensing objects. The test images along with their relevant GT maps are shown in Figure 7, and Figure 8 displays the saliency results of the different methods.

Figure 8 shows the comparison of saliency detection results of the different methods. It can be seen that the detection effect of the proposed algorithm is obviously better than that of the other algorithms.

Table 1 presents the F-measure results. As the Recall changes, the Precision of the proposed method keeps a high level. In terms of F-measure, our method is 7.18% higher than the second-best method. Under complex background conditions, both the PR curve values and the F-measure values of the proposed method are significantly higher than those of the other algorithms, which fully demonstrates the advantages of the proposed algorithm for relatively complex image information. Similarly, the MAE of the proposed algorithm is lower than that of the other algorithms. Figures 9–14 show the per-object evaluation results for the six objects.

We also adopt the IoU (Intersection over Union) to illustrate the effectiveness of the proposed method [35, 36]. The IoU is calculated as follows:

$$IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}} \tag{27}$$

A greater IoU indicates a better effect. The results are shown in Table 2.
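For completeness, a small sketch of this measure for binarized prediction and ground-truth masks:

```python
import numpy as np

def iou(B, G):
    """Intersection over Union of two boolean masks, equation (27)."""
    inter = np.logical_and(B, G).sum()
    union = np.logical_or(B, G).sum()
    return inter / union if union else 0.0
```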

Table 1: Different indexes with different methods on different objects.

Object       Method     Precision   Recall   F-measure   MAE
Airplane1    RA         79.9%       73.5%    72.8%       19.3%
             RB         81.1%       75.8%    76.4%       17.2%
             SC         81.8%       74.6%    77.2%       15.7%
             RAD        87.4%       77.4%    79.5%       14.6%
             SCLR       91.7%       75.4%    81.8%       12.5%
             Proposed   95.6%       65.3%    82.5%       9.8%
Cloud        RA         84.6%       68.9%    73.8%       21.2%
             RB         89.1%       71.8%    75.4%       17.8%
             SC         90.7%       74.1%    76.2%       14.1%
             RAD        91.6%       73.7%    78.4%       12.6%
             SCLR       93.6%       72.8%    80.9%       11.3%
             Proposed   97.4%       71.5%    83.6%       7.6%
Vehicle      RA         89.2%       77.4%    78.7%       19.5%
             RB         91.5%       79.9%    80.8%       17.6%
             SC         93.1%       79.3%    82.1%       13.8%
             RAD        93.6%       79.5%    82.5%       13.1%
             SCLR       94.3%       78.6%    83.7%       11.7%
             Proposed   98.2%       72.4%    89.1%       8.7%
Playground   RA         85.7%       77.8%    81.9%       16.5%
             RB         87.8%       74.2%    83.7%       14.8%
             SC         89.9%       74.6%    84.1%       13.1%
             RAD        91.8%       73.3%    84.5%       12.5%
             SCLR       92.4%       71.8%    86.7%       10.2%
             Proposed   97.2%       59.6%    89.7%       9.4%
Airplane2    RA         86.4%       78.9%    77.1%       15.8%
             RB         87.6%       78.3%    78.6%       14.6%
             SC         88.2%       77.1%    79.4%       13.5%
             RAD        88.3%       76.6%    80.8%       12.9%
             SCLR       91.3%       75.8%    81.6%       11.9%
             Proposed   96.3%       62.4%    83.9%       8.1%
Boat         RA         85.4%       84.1%    72.4%       24.5%
             RB         88.6%       78.2%    74.1%       21.2%
             SC         89.7%       76.1%    76.4%       19.4%
             RAD        91.6%       74.8%    79.2%       15.7%
             SCLR       92.7%       73.4%    82.5%       11.4%
             Proposed   95.2%       68.3%    89.7%       7.4%

Figure 9: Airplane1 comparison with different methods.

Figure 10: Cloud comparison with different methods.

From Table 2, we can see that our proposed method has a better saliency detection effect than the other methods.

There are also apparent differences in the detection time among the different algorithms. In terms of the speed of saliency detection, the proposed method is faster than the other methods, as shown in Figure 15. Although deep learning based algorithms need to train on many samples, compared with the other deep learning methods, the processing efficiency is improved by about 4%. Overall, the deep multiscale fusion method has a better effect on saliency detection for remote sensing images.

    5. Conclusions

Saliency detection algorithms based on DL can overcome the shortcomings of traditional saliency detection algorithms, but their detection efficiency is often insufficient. Therefore, we present a deep multiscale fusion method for object saliency detection in optical remote sensing images based on urban data. Through deep feature extraction, we calculate the saliency values and use weighted cellular automata to integrate and optimize the multiscale saliency maps. The results reveal that the proposed method acquires better saliency detection results more efficiently than the other methods. In the future, new models based on deep learning will be researched, and the new methods will be applied to practical engineering.

Figure 11: Vehicle comparison with different methods.

Figure 12: Playground comparison with different methods.

Figure 14: Boat comparison with different methods.

Table 2: IoU comparison.

Object       RA      RB      SC      RAD     SCLR    Proposed
Airplane1    69.3%   72.5%   76.7%   77.5%   79.8%   82.4%
Cloud        68.4%   72.7%   77.1%   79.2%   81.6%   83.5%
Vehicle      69.7%   73.8%   76.5%   78.5%   79.4%   81.6%
Playground   66.4%   72.9%   80.1%   81.4%   83.7%   86.4%
Airplane2    63.9%   71.3%   74.9%   76.4%   78.2%   81.9%
Boat         65.4%   70.8%   73.2%   75.9%   77.2%   79.5%

Figure 15: Time comparison with different methods.

Figure 13: Airplane2 comparison with different methods.


  • Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

    Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

    References

[1] Q. Zhang, L. Yang, Z. Chen, and P. Li, "Incremental Deep Computation Model for Wireless Big Data Feature Learning," IEEE Transactions on Big Data, 2019.
[2] P. Li, Z. Chen, L. T. Yang, Q. Zhang, and M. J. Deen, "Deep Convolutional Computation Model for Feature Learning on Big Data in Internet of Things," IEEE Transactions on Industrial Informatics, vol. 14, no. 2, pp. 790–798, 2018.
[3] S. Yin, Y. Zhang, and K. Shahid, "Large Scale Remote Sensing Image Segmentation Based on Fuzzy Region Competition and Gaussian Mixture Model," IEEE Access, vol. 6, pp. 26069–26080, 2018.
[4] H. Quan, S. Feng, and B. Chen, "Two Birds With One Stone: A Unified Approach to Saliency and Co-Saliency Detection via Multi-Instance Learning," IEEE Access, vol. 5, pp. 23519–23531, 2017.
[5] C. Lang, J. Feng, S. Feng, J. Wang, and S. Yan, "Dual Low-Rank Pursuit: Learning Salient Features for Saliency Detection," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1190–1200, 2016.
[6] J. Zhu, Y. Qiu, R. Zhang, J. Huang, and W. Zhang, "Top-Down Saliency Detection via Contextual Pooling," Journal of Signal Processing Systems, vol. 74, no. 1, pp. 33–46, 2014.
[7] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[8] X. Hou and L. Zhang, "Saliency detection: a spectral residual approach," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, IEEE, Minneapolis, MN, USA, 2007.
[9] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, "Frequency-tuned salient region detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1597–1604, IEEE, Miami, FL, USA, 2009.
[10] M. Cheng, G. Zhang, N. Mitra, X. Huang, and S. Hu, "Global contrast based salient region detection," in Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pp. 409–416, IEEE, Providence, RI, USA, 2011.
[11] Q. Zhang, C. Bai, T. Yang, Z. Chen, P. Li, and H. Yu, "A Unified Smart Chinese Medicine Framework for Healthcare and Medical Services," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019.
[12] P. Li, Z. Chen, L. T. Yang, J. Gao, Q. Zhang, and M. J. Deen, "An Incremental Deep Convolutional Computation Model for Feature Learning on Industrial Big Data," IEEE Transactions on Industrial Informatics, vol. 15, no. 3, pp. 1341–1349, 2019.
[13] G. Lee, Y. W. Tai, J. Kim et al., "ELD-Net: An Efficient Deep Learning Architecture for Accurate Saliency Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 7, pp. 1599–1610, 2018.
[14] Q. Zhang, C. Bai, Z. Chen et al., "Deep Learning Models for Diagnosing Spleen and Stomach Diseases in Smart Chinese Medicine with Cloud Computing," Concurrency and Computation: Practice and Experience, article e5252, 2019.
[15] W. Wang, J. Chen, J. Wang, J. Chen, and Z. Gong, "Geography-Aware Inductive Matrix Completion for Personalized Point of Interest Recommendation in Smart Cities," IEEE Internet of Things Journal, 2019.
[16] J. Yan, M. Zhu, H. Liu, and Y. Liu, "Visual saliency detection via sparsity pursuit," IEEE Signal Processing Letters, vol. 17, no. 8, pp. 739–742, 2010.
[17] C. Lang, G. Liu, J. Yu, and S. Yan, "Saliency detection by multitask sparsity pursuit," IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1327–1338, 2012.
[18] X. Shen and Y. Wu, "A unified approach to salient object detection via low rank matrix recovery," in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 853–860, IEEE, Providence, RI, USA, 2012.
[19] C. Dai, X. Liu, J. Lai, P. Li, and H. Chao, "Human Behavior Deep Recognition Architecture for Smart City Applications in the 5G Environment," IEEE Network, vol. 33, no. 5, pp. 206–211, 2019.
[20] W. Wang, J. Chen, J. Wang, J. Chen, J. Liu, and Z. Gong, "Trust-Enhanced Collaborative Filtering for Personalized Point of Interests Recommendation," IEEE Transactions on Industrial Informatics, 2019.
[21] C. Dai, X. Liu, and J. Lai, "Human action recognition using two-stream attention based LSTM networks," Applied Soft Computing, vol. 86, article 105820, 2020.
[22] S. Yang, C. Zhao, and W. Xu, "A novel salient object detection method using bag-of-features," Acta Automatica Sinica, vol. 42, pp. 1259–1273, 2016.
[23] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: a discriminative regional feature integration approach," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2083–2090, IEEE, Portland, OR, USA, 2013.
[24] Y. Li, Y. Xu, S. Ma, and H. Shi, "Saliency detection based on deep convolutional neural network," Journal of Image and Graphics, vol. 21, pp. 53–59, 2016.
[25] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, "Saliency detection with recurrent fully convolutional networks," in Computer Vision – ECCV 2016, Lecture Notes in Computer Science, vol. 9908, pp. 825–841, Springer, Amsterdam, Netherlands, 2016.
[26] G. Lee, Y. Tai, and J. Kim, "Deep saliency with encoded low level distance map and high level features," in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 660–668, IEEE, Las Vegas, NV, USA, 2016.
[27] P. Li, Z. Chen, L. T. Yang, J. Gao, Q. Zhang, and M. J. Deen, "An Improved Stacked Auto-Encoder for Network Traffic Flow Classification," IEEE Network, vol. 32, no. 6, pp. 22–27, 2018.
[28] C. Yang, J. Pu, Y. Dong, G. S. Xie, Y. Si, and Z. Liu, "Scene classification-oriented saliency detection via the modularized prescription," The Visual Computer, vol. 35, no. 4, pp. 473–488, 2019.
[29] A. Wang and M. Wang, "RGB-D Salient Object Detection via Minimum Barrier Distance Transform and Saliency Fusion," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 663–667, 2017.
[30] S. Chen, X. Tan, B. Wang et al., "Reverse attention for salient object detection," in European Conference on Computer Vision, Springer, Cham, 2018.
[31] W. Zhu, S. Liang, Y. Wei, and J. Sun, "Saliency optimization from robust background detection," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2814–2821, 2014.
[32] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, "Deeply supervised salient object detection with short connections," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5300–5309, 2017.
[33] X. Hu, L. Zhu, J. Qin, C.-W. Fu, and P.-A. Heng, "Recurrently aggregating deep features for salient object detection," in Thirty-Second AAAI Conference on Artificial Intelligence, pp. 6943–6950, 2018.
[34] N. Liu and J. Han, "A deep spatial contextual long-term recurrent convolutional network for saliency detection," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3264–3274, 2018.
[35] J. Gao, P. Li, Z. Chen, and J. Zhang, "A Survey on Deep Learning for Multimodal Data Fusion," Neural Computation, vol. 32, no. 5, pp. 829–864, 2020.
[36] J. Gao, P. Li, and Z. Chen, "A canonical polyadic deep convolutional computation model for big data feature learning in Internet of Things," Future Generation Computer Systems, vol. 99, pp. 508–516, 2019.

