

2354 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 27, NO. 5, MAY 2018

Hyperspectral Image Classification With Markov Random Fields and a Convolutional Neural Network

Xiangyong Cao, Feng Zhou, Member, IEEE, Lin Xu, Deyu Meng, Member, IEEE, Zongben Xu, and John Paisley, Member, IEEE

Abstract— This paper presents a new supervised classification algorithm for remotely sensed hyperspectral images (HSI) which integrates spectral and spatial information in a unified Bayesian framework. First, we formulate the HSI classification problem from a Bayesian perspective. Then, we adopt a convolutional neural network (CNN) to learn the posterior class distributions using a patch-wise training strategy to better use the spatial information. Next, spatial information is further considered by placing a spatial smoothness prior on the labels. Finally, we iteratively update the CNN parameters using stochastic gradient descent and update the class labels of all pixel vectors using an α-expansion min-cut-based algorithm. Compared with other state-of-the-art methods, the proposed classification method achieves better performance on one synthetic data set and two benchmark HSI data sets in a number of experimental settings.

Index Terms— Hyperspectral image classification, Markov random fields, convolutional neural networks.

I. INTRODUCTION

Hyperspectral remote sensors capture digital images in hundreds of continuous narrow spectral bands to produce a high-dimensional hyperspectral image (HSI). Since HSI provides detailed information on spectral and spatial distributions of distinct materials [48], it has been used for many applications, such as land-use [10], [33], [47] and land-cover mapping [33], forest inventory [42], and urban area problems [52]. All these applications require the material class label of each hyperspectral pixel vector, and thus HSI classification has been an active research topic in the field of remote sensing. The aim of HSI classification is to categorize each hyperspectral pixel vector into a discrete set of meaningful classes according to the image contents.

Manuscript received June 21, 2017; revised November 12, 2017 and December 30, 2017; accepted January 11, 2018. Date of publication January 29, 2018; date of current version February 21, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 61661166011, Grant 61603292, Grant 11690011, Grant 61721002, Grant 91330204, Grant 11131006, and Grant 61603235, and in part by the National Basic Research Program (973 Program) of China under Grant 2013CB329404. The work of X. Cao was supported by the China Scholarship Council under Grant 2016013581. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Peter Tay. (Corresponding author: Deyu Meng.)

X. Cao, D. Meng, and Z. Xu are with the School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China (e-mail: [email protected]; [email protected]; [email protected]).

F. Zhou is with the National Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China (e-mail: [email protected]).

L. Xu is with the NYU Multimedia and Visual Computing Laboratory, New York University Abu Dhabi 129188, United Arab Emirates (e-mail: [email protected]).

J. Paisley is with the Department of Electrical Engineering and the Data Science Institute, Columbia University, New York, NY 10027 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2018.2799324

In the last few decades, many methods have been proposed for HSI classification. These can be roughly divided into two categories: spectral based methods and spectral-spatial based methods. We first briefly review these HSI classification approaches and then discuss the contributions of our proposed method.

A. Related Work: Spectral vs. Spectral-Spatial Based Methods

Many classical HSI classification approaches are based only on spectral information [2], [39], [61]. Among these, pure spectral classification methods without any band reduction have often been proposed in the literature [6], [11], [18], [43]. Other pure spectral methods first extract spectral features using techniques such as principal component analysis [39], independent component analysis [61] and linear discriminant analysis [2]. However, these approaches consider only the spectral information and ignore the spatial correlations among distinct pixels in the image, which tends to decrease their classification performance relative to methods that consider both. In this paper we instead focus on designing a spectral-spatial based classification method.

Spectral-spatial based methods can help improve the classification performance since they incorporate additional spatial information from the HSI. It has often been observed that the spatial information is more crucial than the spectral signatures in the HSI classification task [23], [48], [49], [58]. Therefore, spectral-spatial methods have been proposed that additionally consider spatial correlation information. One approach is to extract the spatial dependence in advance using a spatial feature extraction method before learning a classifier. The patch-based feature extraction method is a representative example of this approach [15], [16], [54], [56]. Here, features are extracted from groups of neighboring pixels using a subspace learning technique such as low-rank matrix factorization [63], dictionary learning [15], [54] or subspace clustering [31]. Compared with the original spectral vector, the features extracted by patch-based methods have higher spatial smoothness [12].

Another popular approach for incorporating spatial information is to use a Markov random field (MRF) to post-process the classification map. The MRF is an undirected graphical model [4] that has been applied in a variety of fields from physics to computer vision and machine learning. In particular, MRFs have been widely used for image processing tasks such as image registration [69], image restoration [5], image compression [50] and image segmentation [13]. In the image segmentation task, MRFs encourage neighboring pixels to have the same class label [38]. This has been shown to greatly improve the classification accuracy in the HSI classification task [12], [37], [60].

1057-7149 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Another family of methods is based on spatial regularization (anisotropic smoothing) prior to spectral classification of the regularized image [22], [44]. Other spatial-spectral methods include using 3-dimensional discrete wavelets [12], [49], 3-dimensional Gabor wavelets [53], morphological profiles [3], attribute profiles [20] and manifold learning methods [19], [41].

Although the previous approaches perform well, a drawback of many of them is that the extracted features are hand-crafted. Specifically, these approaches extract the features of the HSI using pre-specified strategies that do not use the label information. They therefore depend heavily on prior knowledge of the specific domain and are often sub-optimal [67]. In the last few years, deep learning has become a powerful machine learning technique for learning feature representations from the data itself [25]. Such methods have been widely applied to image processing and computer vision problems, such as image classification [36], image segmentation [13], action recognition [30] and object detection [57].

Recently, a few deep learning methods have been introduced for the HSI classification task. For example, unsupervised feature learning methods such as stacked autoencoders [14] and deep belief networks [17] have been proposed. Although these two unsupervised learning models can extract deep hierarchical features, the 3-dimensional (3D) patch must first be flattened into a 1-dimensional (1D) vector in order to satisfy the input requirement, thereby losing some spatial information. A supervised autoencoder method that uses label information during feature learning was also proposed [40]. Similar methods have been proposed based on the convolutional neural network (CNN). Examples include a five-layer CNN trained on 1D spectral vectors without using spatial information [29], a modified spectral-spatial deep CNN (SS-DCNN) [66] with 3D patch input, and deep CNNs with spatial pyramid pooling (SPP) [27] (SPP-DCNN) [65]. Other methods based on CNNs [46], [67] and recurrent neural networks (RNNs) [45] have also been proposed.

B. Contributions of Our Approach

As mentioned, compared with the traditional spectral-spatial HSI classification methods, deep learning can directly learn data-dependent and hierarchical feature representations from raw data. However, none of the previously mentioned deep methods formulates the HSI classification task in a Bayesian framework, where deep learning and MRF are considered simultaneously. In this paper, we propose a new Bayesian supervised HSI classification model based on deep learning and Markov random fields. Our contributions are twofold:

First, we formulate the HSI classification problem from a Bayesian perspective and derive its corresponding optimization algorithm. We propose a new model which combines a CNN with a smooth MRF prior. The CNN is used to extract spectral-spatial features from 3D patches, and a smooth MRF prior is placed on the labels to further exploit spatial information. Optimizing the final model can be done by iteratively updating the CNN parameters and the class labels of all the pixel vectors. In this way, we integrate the MRF with the CNN. To our knowledge, this is the first approach that integrates the MRF with the CNN for the HSI classification problem.

Fig. 1. The network structure of the CNN used as our discriminative classifier.

Second, our experimental results on one synthetic HSI dataset and two real HSI datasets demonstrate that the proposed method outperforms other state-of-the-art methods for the HSI classification problem, in particular the three deep learning methods SS-DCNN [66], SPP-DCNN [65] and DC-CNN [67].

The rest of the paper is organized as follows: In Section II we formulate the HSI classification problem from a Bayesian perspective. In Section III we describe our proposed approach. In Section IV we conduct experiments on one synthetic dataset and two benchmark real HSI datasets and compare with state-of-the-art methods. We conclude in Section V.

II. PROBLEM FORMULATION

Before formulating the HSI classification problem, we first define the problem-related notation used throughout the paper.

A. Problem Notation

Let the HSI dataset be $\mathbf{H} \in \mathbb{R}^{h \times w \times d}$, where $h$ and $w$ are the height and width of the spatial dimensions, respectively, and $d$ is the number of spectral bands. The set of class labels is defined as $\mathcal{K} = \{1, 2, \ldots, K\}$, where $K$ is the number of classes for the given HSI dataset. The set of all patches extracted from the HSI data $\mathbf{H}$ is denoted as $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, where $\mathbf{x}_i \in \mathbb{R}^{k \times k \times d}$, $k$ is the patch size in the spatial dimension, and $n = hw$ is the total number of extracted patches (the sliding step is 1 and padding is used). Each patch $\mathbf{x}_i$ is first fed into a CNN (shown in Figure 1), which outputs a flattened vector $\mathbf{z}_i$. The corresponding label of $\mathbf{z}_i$ is $y_i \in \mathcal{K}$, which is the label of the spectral vector at the center of $\mathbf{x}_i$. For simplicity, we refer to $(\mathbf{x}_i, y_i)$ as a sample; this does not mean that all pixel vectors in $\mathbf{x}_i$ belong to the same class $y_i$, but that the extracted spatial-spectral feature vector $\mathbf{z}_i$ belongs to class $y_i$. In other words, we use the 3D patches only to obtain spatial-spectral features for the central pixel vector of each patch. The label set is defined as $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$. We define the training set for class $k$ as $\mathcal{D}^{(k)}_{l^{(k)}} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{l^{(k)}}, y_{l^{(k)}})\}$, and thus the entire training set is $\mathcal{D}_l = \{\mathcal{D}^{(1)}_{l^{(1)}}, \mathcal{D}^{(2)}_{l^{(2)}}, \ldots, \mathcal{D}^{(K)}_{l^{(K)}}\}$, where $l = \sum_{k=1}^{K} l^{(k)} \ll n$ is the total number of training patches.
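As an illustration of the patch set defined above, the following minimal sketch extracts one $k \times k \times d$ patch per pixel with a sliding step of 1, so that $n = hw$. The function name `extract_patches`, the reflect padding mode and the restriction to odd $k$ are assumptions made for this example; the text only states that padding is used.

```python
import numpy as np

def extract_patches(H, k):
    """Return all k x k x d patches of H, one per pixel (stride 1, padded)."""
    h, w, d = H.shape
    r = k // 2  # patch radius; assumes odd k so that each patch has a center pixel
    # Pad the two spatial axes only. Reflect padding is an assumption of this sketch.
    Hp = np.pad(H, ((r, r), (r, r), (0, 0)), mode="reflect")
    patches = np.empty((h * w, k, k, d), dtype=H.dtype)
    for i in range(h):
        for j in range(w):
            # The patch for pixel (i, j) is stored at row-major index i * w + j.
            patches[i * w + j] = Hp[i:i + k, j:j + k, :]
    return patches

# Toy cube with h = 4, w = 5, d = 3.
H = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)
X = extract_patches(H, k=3)
```

Here `X` has shape `(20, 3, 3, 3)`, and the center of each patch is the pixel vector whose label the patch inherits.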

B. Problem Setup

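The goal of HSI classification is to assign a label $y_i$ to the central pixel vector of each patch $\mathbf{x}_i$, $i = 1, 2, \ldots, n$. In the discriminative classification framework, the estimation of labels $\mathbf{y}$ for observations $\mathcal{X}$ can be obtained by maximizing a distribution $P(\mathbf{y} \mid \mathcal{X}, \Theta)$, where $\Theta$ are the parameters of the classifier. In our framework, we seek to combine the spatial modeling power of a Markov random field with the discriminative power of deep learning. Therefore, we let this distribution be of the form

$$
\begin{aligned}
P(\mathbf{y} \mid \mathcal{X}, \Theta) &= \sum_{\tilde{\mathbf{y}}} P(\mathbf{y}, \tilde{\mathbf{y}} \mid \mathcal{X}, \Theta) \\
&= \sum_{\tilde{\mathbf{y}}} P(\mathbf{y} \mid \tilde{\mathbf{y}})\, P(\tilde{\mathbf{y}} \mid \mathcal{X}, \Theta) \\
&= \sum_{\tilde{\mathbf{y}}} P(\mathbf{y} \mid \tilde{\mathbf{y}}) \prod_{i=1}^{n} P(\tilde{\mathbf{y}}_i \mid \mathbf{x}_i, \Theta),
\end{aligned}
\tag{1}
$$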

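where $\tilde{\mathbf{y}} = [\tilde{\mathbf{y}}_1^T; \tilde{\mathbf{y}}_2^T; \ldots; \tilde{\mathbf{y}}_n^T] \in \mathbb{R}^{n \times K}$ are intermediate latent variables and each $\tilde{\mathbf{y}}_i \in \mathbb{R}^K$ provides an initial probabilistic “pseudo-label” for each patch $\mathbf{x}_i$. In our work, we define the distribution $P(\tilde{\mathbf{y}}_i \mid \mathbf{x}_i, \Theta)$ as

$$
P(\tilde{\mathbf{y}}_i \mid \mathbf{x}_i, \Theta) =
\begin{cases}
1 & \tilde{\mathbf{y}}_i = f(\mathbf{x}_i; \Theta) \\
0 & \text{otherwise}
\end{cases}
\tag{2}
$$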

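where $f$ is a classifier. After the pseudo-labels $\tilde{\mathbf{y}}$ are obtained, a second classifier $P(\mathbf{y} \mid \tilde{\mathbf{y}})$ takes this collection of pseudo-labels $\tilde{\mathbf{y}}$ and outputs the classification labels $\hat{\mathbf{y}} = [\hat{y}_1; \hat{y}_2; \ldots; \hat{y}_n]$, which are obtained as $\hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{K}^n} P(\mathbf{y} \mid \mathcal{X}, \Theta)$.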

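However, since a sum over all $\tilde{\mathbf{y}}$ is computationally intensive, we simplify this in two ways for computational convenience. First, instead of only learning $\mathbf{y}$ by maximizing $P(\mathbf{y} \mid \mathcal{X}, \Theta)$, we learn a point estimate of the pair $(\mathbf{y}, \tilde{\mathbf{y}})$ by maximizing $P(\mathbf{y}, \tilde{\mathbf{y}} \mid \mathcal{X}, \Theta)$. The goal is then to solve

$$
(\hat{\mathbf{y}}, \tilde{\mathbf{y}}) = \arg\max_{\mathbf{y}, \tilde{\mathbf{y}}} \left\{ \log P(\mathbf{y} \mid \tilde{\mathbf{y}}) + \sum_{i=1}^{n} \log P(\tilde{\mathbf{y}}_i \mid \mathbf{x}_i, \Theta) \right\}.
\tag{3}
$$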

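Our second simplification takes the form of breaking the optimization problem in Eq. (3) into two subproblems. First, we calculate the pseudo-labels $\tilde{\mathbf{y}}$ based on the classifier $f$. Then, given the values $\tilde{\mathbf{y}}$, the class labels $\hat{\mathbf{y}}$ can be obtained by maximizing $\log P(\mathbf{y} \mid \tilde{\mathbf{y}})$. Thus Eq. (3) can be solved by the following steps:

$$
\tilde{\mathbf{y}}_i = f(\mathbf{x}_i; \Theta^*), \quad i = 1, 2, \ldots, n
\tag{4}
$$

$$
\hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{K}^n} \log P(\mathbf{y} \mid \tilde{\mathbf{y}}).
\tag{5}
$$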

Here it should be emphasized that the classifier $f$ is first trained on the training data set $\mathcal{D}_l$, and $\Theta^*$ are the learned parameters of the classifier. After obtaining the current classification labels $\hat{\mathbf{y}}$ on the whole HSI, we train the classifier again using all the patches $\mathcal{X}$ and their corresponding labels $\hat{\mathbf{y}}$. The two steps are applied iteratively until a stopping criterion is satisfied. In this way, we can make full use of the unlabeled patches, and thus more information can be incorporated in our framework.
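The control flow of the two alternating steps can be sketched as below. This is only an outline under stated assumptions: `fit`, `predict_proba` and `smooth_labels` are hypothetical caller-supplied callables standing in for classifier training, Eq. (4) and Eq. (5); the stopping rule (a fixed number of rounds, or labels that no longer change) is a choice made for this example; and the toy stand-ins at the bottom (a nearest-centroid classifier on 1D features with 0-based labels, and a plain argmax with no spatial model) are not the components used in the paper.

```python
import numpy as np

def alternate(fit, predict_proba, smooth_labels, X, train_idx, y_train, n_rounds=3):
    """Alternate between classifier training and label estimation.

    fit(X, y) -> params, predict_proba(params, X) -> (n, K) array and
    smooth_labels(probs) -> (n,) labels are placeholders supplied by the caller.
    """
    # The first training pass uses only the labeled subset.
    params = fit(X[train_idx], y_train)
    y_hat = None
    for _ in range(n_rounds):
        probs = predict_proba(params, X)   # Eq. (4): pseudo-labels for all samples
        y_new = smooth_labels(probs)       # Eq. (5): labels given the pseudo-labels
        if y_hat is not None and np.array_equal(y_new, y_hat):
            break                          # labels stopped changing
        y_hat = y_new
        params = fit(X, y_hat)             # retrain on all samples with current labels
    return params, y_hat

# Toy stand-ins (hypothetical): nearest-centroid classifier, no spatial smoothing.
def fit_centroids(X, y, K=2):
    return np.stack([X[y == c].mean(axis=0) for c in range(K)])

def proba_from_centroids(C, X):
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    e = np.exp(-(d2 - d2.min(axis=1, keepdims=True)))
    return e / e.sum(axis=1, keepdims=True)

X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [4.9]])
params, y_hat = alternate(fit_centroids, proba_from_centroids,
                          lambda p: p.argmax(axis=1),
                          X, train_idx=np.array([0, 3]), y_train=np.array([0, 1]))
```

With only two labeled samples, the loop labels all six samples and then refits the classifier on the full set, which is the sense in which unlabeled patches contribute to training.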

In this paper, in order to fully use both the spatial and spectral information of the HSI, we design a new approach based on this framework. Specifically, we first adopt a convolutional neural network (CNN) as the classifier $f$, which not only helps extract features for the HSI, but can also provide pseudo-labels $\tilde{\mathbf{y}}_i$ for each $\mathbf{x}_i$. Then, we place a smoothness prior on the final labels $\mathbf{y}$ to further enforce spatial consistency. In the following section we introduce these two aspects in detail.

III. PROPOSED APPROACH

In this section, we first introduce the convolutional neural network (CNN) classifier for HSIs. Then, we further consider spatial information by placing a smoothness prior on the labels $\mathbf{y}$ and formulate the classification task as a labeling problem in a Markov random field (MRF).

A. Discriminative CNN-Based Classifier

Convolutional neural networks (CNNs) play an important role in many computer vision tasks. They consist of combinations of convolutional layers and fully connected layers. Compared with the standard fully connected feed-forward neural network (also called a multi-layer perceptron), CNNs exploit spatial correlation by enforcing a local connectivity pattern between neurons of adjacent layers, and achieve better performance in many image processing problems, such as image denoising [62], image super-resolution [21], image deraining [24] and image classification [36].

In this paper, we adopt the CNN structure shown in Figure 1. This network contains one input layer, two pairs of convolution and max pooling layers, two fully connected layers and one output layer (the tuning of the network structure will be discussed in Section IV). The detailed parameter settings in each layer are also shown in Figure 1. For the hyperspectral image classification task, each sample is a 3D patch of size $k \times k \times d$. Next, we introduce the flow of processing each sample patch $\mathbf{x}_i$ at each layer of the CNN.

We note that there is no need to flatten the sample patch $\mathbf{x}_i$ into a 1-dimensional vector before it is fed into the input layer. Therefore, the input size of the first layer is $k \times k \times d$. The sample patch is input into the first convolutional layer, followed by a max pooling operation. The first convolutional layer filters the $k \times k \times d$ sample patch using 100 filters of size $5 \times 5 \times d$. After this convolution, we have 100 feature maps, each of size $n_1 \times n_1$, where $n_1 = k - 4$. We then perform max pooling on the combined set of features of size $n_1 \times n_1 \times 100$. The kernel size of the max pooling layer is $2 \times 2$, and thus the pooled feature maps have size $n_2 \times n_2 \times 100$, where $n_2 = \lceil n_1/2 \rceil$. This set of pooled feature maps is then passed through a second pair of convolutional and max pooling layers. The convolutional layer contains 300 filters of size $3 \times 3 \times 100$, which filter the pooled feature maps into new feature maps of size $n_3 \times n_3 \times 300$, where $n_3 = n_2 - 2$. These new feature maps are then input into the second max pooling layer with kernel size $2 \times 2$ and turned into a second set of pooled feature maps of size $n_4 \times n_4 \times 300$, where $n_4 = \lceil n_3/2 \rceil$.
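The size bookkeeping in this paragraph can be checked with a few lines of code. The sketch below is illustrative only: the function name `feature_map_sizes` is hypothetical, the rounding in the pooling steps is assumed to be a ceiling, and the patch size $k = 9$ in the usage line is an arbitrary choice for the example.

```python
import math

def feature_map_sizes(k):
    """Spatial sizes n1..n4 for a k x k input patch."""
    n1 = k - 4                # 5 x 5 convolution without padding
    n2 = math.ceil(n1 / 2)    # 2 x 2 max pooling (ceiling is an assumption)
    n3 = n2 - 2               # 3 x 3 convolution without padding
    n4 = math.ceil(n3 / 2)    # 2 x 2 max pooling
    if min(n1, n3) < 1:
        raise ValueError("patch too small for this architecture")
    return n1, n2, n3, n4

sizes = feature_map_sizes(9)
# Length of the flattened vector passed to the fully connected layers (300 maps).
flat_len = sizes[3] ** 2 * 300
```

For $k = 9$ this gives sizes of 5, 3, 1 and 1, so the flattened vector has 300 entries.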

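The second pooled feature maps are flattened into a 1-dimensional vector $\mathbf{x}_{\mathrm{pool2}}$ and input to the fully connected layers. The computation of the next three fully connected layers is as follows:

$$
f^{(5)}(\mathbf{x}_{\mathrm{pool2}}) = \sigma\big(\mathbf{W}^{(5)} \mathbf{x}_{\mathrm{pool2}} + \mathbf{b}^{(5)}\big),
\tag{6}
$$

$$
f^{(6)}(\mathbf{x}_{\mathrm{pool2}}) = \sigma\big(\mathbf{W}^{(6)} f^{(5)}(\mathbf{x}_{\mathrm{pool2}}) + \mathbf{b}^{(6)}\big),
\tag{7}
$$

$$
f^{(7)}(\mathbf{x}_{\mathrm{pool2}}) = \mathbf{W}^{(7)} f^{(6)}(\mathbf{x}_{\mathrm{pool2}}) + \mathbf{b}^{(7)},
\tag{8}
$$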

where $\mathbf{W}^{(5)}$, $\mathbf{W}^{(6)}$ and $\mathbf{W}^{(7)}$ are weight matrices, $\mathbf{b}^{(5)}$, $\mathbf{b}^{(6)}$ and $\mathbf{b}^{(7)}$ are the biases of the nodes, and $\sigma(\cdot)$ is the non-linear activation function. In our network, the activation function in the convolutional layers and fully connected layers is the rectified linear unit (ReLU). In order to simplify the notation, we denote $\mathbf{W} = \{\mathbf{W}^{(1)}, \mathbf{W}^{(3)}, \mathbf{W}^{(5)}, \mathbf{W}^{(6)}, \mathbf{W}^{(7)}\}$ and $\mathbf{b} = \{\mathbf{b}^{(1)}, \mathbf{b}^{(3)}, \mathbf{b}^{(5)}, \mathbf{b}^{(6)}, \mathbf{b}^{(7)}\}$, where $\{\mathbf{W}^{(1)}, \mathbf{W}^{(3)}\}$ and $\{\mathbf{b}^{(1)}, \mathbf{b}^{(3)}\}$ are the weight matrices and biases in the convolutional layers, respectively. Thus the aforementioned classifier parameters $\Theta$ are $\{\mathbf{W}, \mathbf{b}\}$ in the CNN.

The final vector $f^{(7)} \in \mathbb{R}^K$ is then passed to the softmax function (exponential followed by normalization), which gives a distribution on the label. As defined previously, the pseudo-label for a given sample $\mathbf{x}_i$ is $\tilde{\mathbf{y}}_i$. The pseudo-classification label can then be obtained as $\tilde{y}_i^c = \arg\max_{k \in \mathcal{K}} \tilde{y}_{ik}$. Therefore, given the training set ($\mathcal{D}_l$ in the first update and the whole $(\mathcal{X}, \hat{\mathbf{y}})$ afterwards), where we represent each scalar label $y_i \in \mathcal{K}$ of a training sample as a one-hot vector $\mathbf{y}_i \in \mathbb{R}^K$, the cross-entropy (CE) loss function can be computed as

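$$
\begin{aligned}
E(\mathbf{W}, \mathbf{b}) &= \frac{1}{l} \sum_{i} CE\big(\mathbf{y}_i, \tilde{\mathbf{y}}_i^{(\mathbf{W},\mathbf{b})}\big) \\
&= -\frac{1}{l} \sum_{i} \sum_{k=1}^{K} y_{ik} \log \tilde{y}_{ik}^{(\mathbf{W},\mathbf{b})},
\end{aligned}
\tag{9}
$$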

where $\mathbf{W}$ and $\mathbf{b}$ are the parameter sets defined above, and we write $\tilde{\mathbf{y}}_i$ as $\tilde{\mathbf{y}}_i^{(\mathbf{W},\mathbf{b})}$ to emphasize that the pseudo-labels $\tilde{\mathbf{y}}_i$ depend on the learned CNN. The loss function (9) is optimized using the stochastic gradient descent (SGD) algorithm. In the $t$-th iteration, the weights and biases are updated by

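$$
\mathbf{W}_{t+1} = \mathbf{W}_t - \alpha \left. \frac{\partial E(\mathbf{W}, \mathbf{b})}{\partial \mathbf{W}} \right|_{\mathbf{W}_t},
\tag{10}
$$

$$
\mathbf{b}_{t+1} = \mathbf{b}_t - \alpha \left. \frac{\partial E(\mathbf{W}, \mathbf{b})}{\partial \mathbf{b}} \right|_{\mathbf{b}_t},
\tag{11}
$$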

where the gradients with respect to $\mathbf{W}$ and $\mathbf{b}$ are calculated using the back-propagation algorithm [51] and $\alpha$ is the learning rate, which is set to 0.001 in our experiments. Here we note that unlike some CNN applications, where the parameters are first pre-trained on samples prepared in advance and then fine-tuned on the new dataset, our proposed CNN-MRF method is applied directly to each dataset without pre-training. In all our experiments, the CNN weight parameters $\mathbf{W}$ are initialized randomly from a standard normal distribution and the bias parameters $\mathbf{b}$ are initialized with zeros. In order to alleviate overfitting of the proposed CNN in the training phase, we use dropout [55] in the 5th and 6th fully connected layers, and we set the dropout rate to 0.5 in our experiments.
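To make Eqs. (6)-(11) concrete, the sketch below runs the three fully connected layers followed by softmax on random inputs, evaluates the cross-entropy loss of Eq. (9), and applies one gradient step with $\alpha = 0.001$. Several simplifications are assumptions of this example rather than details from the paper: the layer widths are arbitrary, the weights are scaled by 0.05 for numerical convenience, dropout is omitted, the step uses the full batch, and only the last layer is updated so that the gradient can be written in closed form instead of using back-propagation through the whole network.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)   # subtract the row maximum for stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, params):
    """Eqs. (6)-(8) followed by softmax; x has shape (batch, features)."""
    (W5, b5), (W6, b6), (W7, b7) = params
    f5 = relu(x @ W5.T + b5)
    f6 = relu(f5 @ W6.T + b6)
    f7 = f6 @ W7.T + b7
    return f6, softmax(f7)

def cross_entropy(Y, P):
    """Eq. (9): mean over samples of -sum_k y_ik log p_ik."""
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

# Hypothetical sizes: 300 flattened features, two hidden layers, K = 5 classes.
D, H1, H2, K, l = 300, 32, 16, 5, 8
params = [(0.05 * rng.standard_normal((H1, D)), np.zeros(H1)),
          (0.05 * rng.standard_normal((H2, H1)), np.zeros(H2)),
          (0.05 * rng.standard_normal((K, H2)), np.zeros(K))]
x = rng.standard_normal((l, D))
Y = np.eye(K)[rng.integers(0, K, size=l)]   # one-hot training labels

f6, P = forward(x, params)
loss_before = cross_entropy(Y, P)

# One update in the form of Eqs. (10)-(11), restricted to the last layer.
alpha = 0.001
G = (P - Y) / l                  # gradient of E with respect to f7
W7, b7 = params[2]
params[2] = (W7 - alpha * G.T @ f6, b7 - alpha * G.sum(axis=0))
loss_after = cross_entropy(Y, forward(x, params)[1])
```

Each row of `P` is a pseudo-label in the sense used above, that is, a distribution over the $K$ classes, and the single step lowers the loss slightly.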

B. Label Smoothness Prior and MRF Optimization

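After the pseudo-labels $\tilde{\mathbf{y}}$ are obtained using the currently trained CNN, we present our core model based on the simplifications from (1) to (5). Specifically, in order to obtain the classification results $\hat{\mathbf{y}}$ we need to solve the following optimization problem

$$
\begin{aligned}
\hat{\mathbf{y}} &= \arg\max_{\mathbf{y} \in \mathcal{K}^n} \log P(\mathbf{y} \mid \tilde{\mathbf{y}}) \\
&= \arg\max_{\mathbf{y} \in \mathcal{K}^n} \log P(\tilde{\mathbf{y}} \mid \mathbf{y}) + \log P(\mathbf{y}) \\
&= \arg\max_{\mathbf{y} \in \mathcal{K}^n} \sum_{i=1}^{n} \log P(\tilde{\mathbf{y}}_i \mid y_i) + \log P(\mathbf{y})
\end{aligned}
\tag{12}
$$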

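where $\tilde{\mathbf{y}}$ can be regarded as new features for this problem, $\log P(\tilde{\mathbf{y}}_i \mid y_i)$ is the log-likelihood and $P(\mathbf{y})$ is the label prior. The second line follows from Bayes' rule after dropping the term $\log P(\tilde{\mathbf{y}})$, which does not depend on $\mathbf{y}$. Specifically, in our work, the log-likelihood $\log P(\tilde{\mathbf{y}}_i \mid y_i)$ is defined as

$$
\log P(\tilde{\mathbf{y}}_i \mid y_i) = \sum_{k=1}^{K} \mathbb{1}\{y_i = k\} \log \tilde{y}_{ik},
\tag{13}
$$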

where $\mathbb{1}\{\cdot\}$ is an indicator function.

In image segmentation tasks, it is probable that adjacent pixels have the same label. The exploitation of this naive prior information can often dramatically improve the segmentation performance. In this paper, we enforce a spatially smooth prior on the labels $\mathbf{y}$ to encourage neighboring pixels to belong to the same class. This smoothness prior distribution on the labels $\mathbf{y}$ is defined as

$$
P(\mathbf{y}) = \frac{1}{Z} \exp\Big( \mu \sum_{i=1}^{n} \sum_{j \in \mathcal{N}(i)} \delta(y_i - y_j) \Big),
\tag{14}
$$

where $Z$ is a normalization constant for the distribution, $\mu$ is the label smoothness parameter, $\mathcal{N}(i)$ is the set of neighboring pixels of pixel $i$, and $\delta(\cdot)$ is a function defined as $\delta(0) = 1$ and $\delta(y) = -1$ for $y \neq 0$. We note that the pairwise interaction terms $\delta(y_i - y_j)$ yield a higher probability when neighboring labels are equal than when they are not. In this way, the smoothness prior encourages piecewise smooth segmentations.
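The effect of the prior in Eq. (14) can be seen by evaluating its exponent on two small label maps. The sketch below assumes a 4-connected neighborhood for $\mathcal{N}(i)$, which is a choice made for this example, and the function names are hypothetical. It returns only $\mu \sum_i \sum_{j \in \mathcal{N}(i)} \delta(y_i - y_j)$, since the constant $Z$ does not affect comparisons between labelings.

```python
import numpy as np

def delta(a, b):
    """delta(y_i - y_j): +1 where the labels agree, -1 where they differ."""
    return np.where(a == b, 1, -1)

def log_prior_unnormalized(labels, mu):
    """mu * sum_i sum_{j in N(i)} delta(y_i - y_j) on a 4-connected grid."""
    horiz = delta(labels[:, :-1], labels[:, 1:]).sum()
    vert = delta(labels[:-1, :], labels[1:, :]).sum()
    # Each undirected edge appears twice in the double sum, as (i, j) and (j, i).
    return mu * 2 * (horiz + vert)

smooth = np.array([[1, 1, 2],
                   [1, 1, 2],
                   [1, 1, 2]])
noisy = smooth.copy()
noisy[1, 1] = 3   # a single isolated label in the middle

lp_smooth = log_prior_unnormalized(smooth, mu=1.0)
lp_noisy = log_prior_unnormalized(noisy, mu=1.0)
```

The piecewise-constant map scores 12 and the map with the isolated label scores 0, so the prior assigns the smoother labeling a higher probability.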

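Based on Eq. (13) and Eq. (14), the final classification model (12) is thus given by

$$
\hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{K}^n} \left\{ \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}\{y_i = k\} \log \tilde{y}_{ik} + \mu \sum_{i=1}^{n} \sum_{j \in \mathcal{N}(i)} \delta(y_i - y_j) \right\}.
\tag{15}
$$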

This objective function contains many pairwise interaction terms and is a challenging combinatorial optimization problem. It can also be regarded as an MRF model in which an undirected graph $G = \langle V, E \rangle$ is defined on the whole image, where the node set $V$ corresponds to the pixels and the undirected edge set $E$ represents the neighboring relationships between pixels [9]. We define a random variable $y_i$ on each node $v_i \in V$, thus allowing the labels $\mathbf{y}$ to form a Markov random field. The objective function is its energy function, of which the first term represents the cost of assigning a pixel to each of the classes: the larger the probability of a pixel belonging to a certain class, the more probable it is that the pixel is assigned the corresponding label. The second term of the energy function encourages the labels of neighboring pixels to be the same.

Algorithm 1. CNN-MRF Classification Algorithm for HSI

Fig. 2. The flowchart of the proposed CNN-MRF algorithm.

This optimization problem of the MRF is NP-hard. Many approximate algorithms have been proposed, such as the graph cut method [8], [35], belief propagation [64] and message passing [34]. In this paper, we use the α-expansion min-cut method [8] because of its good performance and fast computation speed.
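To show what maximizing the objective in Eq. (15) does to a label map, the sketch below uses iterated conditional modes (ICM), a simple greedy coordinate-wise update. ICM is swapped in here only because it fits in a few lines; it is not the α-expansion min-cut method used in this paper and generally finds weaker local optima. The 4-connected neighborhood, the function names and the toy probabilities are assumptions of the example.

```python
import numpy as np

def objective(labels, log_probs, mu):
    """Value of the bracketed term in Eq. (15) on a 4-connected grid."""
    h, w, _ = log_probs.shape
    rows, cols = np.indices((h, w))
    unary = log_probs[rows, cols, labels].sum()
    agree_h = np.where(labels[:, :-1] == labels[:, 1:], 1, -1).sum()
    agree_v = np.where(labels[:-1, :] == labels[1:, :], 1, -1).sum()
    return unary + mu * 2 * (agree_h + agree_v)   # each edge is counted twice

def icm(log_probs, mu, n_sweeps=5):
    """Greedy coordinate ascent on Eq. (15); a stand-in, not alpha-expansion."""
    h, w, K = log_probs.shape
    labels = log_probs.argmax(axis=2)   # start from the per-pixel argmax
    for _ in range(n_sweeps):
        changed = False
        for i in range(h):
            for j in range(w):
                score = log_probs[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    a, b = i + di, j + dj
                    if 0 <= a < h and 0 <= b < w:
                        # Factor 2: the edge appears as (i, j) and as (j, i).
                        score += 2 * mu * np.where(np.arange(K) == labels[a, b], 1, -1)
                best = int(score.argmax())
                if best != labels[i, j]:
                    labels[i, j] = best
                    changed = True
        if not changed:
            break
    return labels

# Toy pseudo-labels: every pixel prefers class 0 except a weakly deviating center.
probs = np.tile([0.9, 0.1], (3, 3, 1))
probs[1, 1] = [0.4, 0.6]
log_probs = np.log(probs)
init = log_probs.argmax(axis=2)
labels = icm(log_probs, mu=0.5)
```

In this toy case the center pixel is relabeled to agree with its neighbors, and the objective value of the result is higher than that of the per-pixel argmax.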

C. CNN-MRF Classification Algorithm

Having introduced how to update the CNN's parameters and how to compute the class labels, we summarize our CNN-MRF classification algorithm in Algorithm 1. The corresponding flowchart is depicted in Figure 2.

IV. EXPERIMENTS

To test the effectiveness of our proposed CNN-MRF classification algorithm in different scenarios, we conduct experiments on one synthetic dataset and two real-world benchmark datasets.¹ For comparison, we consider several state-of-the-art HSI classification methods, including the support vector machine graph-cut method (SVM-GC), subspace multinomial logistic regression with a multilevel logistic prior (MLRsubMLL)² [37] and the support vector machine based on the 3-dimensional discrete wavelet transform (SVM-3DDWT-GC)³ [12]. It should be noted that the competing methods SVM-GC and MLRsubMLL integrate the MRF into the SVM and MLRsub methods, respectively. To distinguish methods without an MRF from those with one, we refer to the former as classification methods and to the latter as regularized classification methods. We also compare with three deep learning methods: SS-DCNN [66], SPP-DCNN [65] and DC-CNN [67]. Our proposed CNN-MRF approach is implemented in Python using the TensorFlow [1] library on a server with an Nvidia GeForce GTX 1080 and a Tesla K40c. All non-deep algorithms are run in Matlab R2014b. For comparisons with the other deep learning models, we use the best reported results for these algorithms.

All methods are compared numerically using the following three criteria [12]: overall accuracy (OA), average accuracy (AA) and the kappa coefficient (κ). OA is the number of correctly classified samples divided by the total number of test samples, AA is the average of the individual class accuracies, and κ accounts for both omission and commission errors and gives a good representation of the overall performance of the classifier. For all three criteria, a larger value indicates better classification performance.
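The three criteria can all be computed from a confusion matrix. The sketch below is illustrative rather than the evaluation code used for the reported results; it uses the standard definition of Cohen's kappa, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] = number of test samples of true class i predicted as class j.
    Assumes every class has at least one test sample.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    oa = correct / total
    # Per-class accuracy: correct predictions divided by class size.
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(k)]
    aa = sum(per_class) / k
    # Agreement expected by chance, from the row and column marginals.
    row_tot = [sum(confusion[i]) for i in range(k)]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / (total * total)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

For example, the two-class matrix `[[40, 10], [5, 45]]` gives OA = 0.85, AA = 0.85 and κ = 0.7.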

A. Synthetic HSI Data

In this section, we first generate a synthetic HSI dataset. Then, we use this dataset to evaluate in detail the sensitivity of performance to different parameter settings of our CNN structure. Finally, we compare our method with other competing methods on this dataset using the tuned CNN structure.

1) Generation of Synthetic HSI Data: To generate the synthetic HSI data, five endmembers are first extracted randomly from a real scene with 162 bands in the range 400–2,500 nm, and then 40,000 vectors are generated as a sum of Gaussian fields with constraints so as to respect the abundance-nonnegative-constraint (ANC) and abundance-sum-to-one-constraint (ASC). Finally, this dataset is generated using a Generalized Bilinear Mixing Model (GBM) [68]:

$$z = \sum_{i=1}^{K} a_i e_i + \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \gamma_{ij}\, a_i a_j\, e_i \odot e_j + n, \tag{16}$$

with class number $K = 5$. Here $z$ is the simulated pixel vector, $\gamma_{ij}$ are selected uniformly from $[0, 1]$, $e_i$, $i = 1, \ldots, 5$, are the five endmembers, $n$ is Gaussian noise with an SNR of 30 dB, $a_i \ge 0$ and $\sum_{i=1}^{K} a_i = 1$.
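To make the mixing model concrete, the following sketch evaluates (16) for a single pixel. It is an illustration under assumptions, not the generation script used for this dataset: the function name `gbm_pixel` and its argument names are hypothetical, and the noise level is passed as a standard deviation that the caller would have to choose to match the desired SNR.

```python
import numpy as np

def gbm_pixel(endmembers, abundances, gamma, noise_std=0.0, rng=None):
    """Simulate one pixel vector z under the generalized bilinear model.

    endmembers : (K, B) array, one spectrum e_i per row
    abundances : (K,) array with a_i >= 0 and sum(a_i) == 1
    gamma      : (K, K) array; only entries with i < j are used
    noise_std  : standard deviation of the additive Gaussian noise n
    """
    rng = np.random.default_rng() if rng is None else rng
    k, bands = endmembers.shape
    z = abundances @ endmembers  # linear mixing term
    for i in range(k - 1):
        for j in range(i + 1, k):
            # Bilinear interaction; '*' on arrays is the element-wise product.
            z = z + gamma[i, j] * abundances[i] * abundances[j] * endmembers[i] * endmembers[j]
    return z + noise_std * rng.standard_normal(bands)
```

With two endmembers `[1, 2, 3]` and `[4, 5, 6]`, equal abundances of 0.5, all interaction coefficients equal to 1 and no noise, the linear term is `[2.5, 3.5, 4.5]`, the bilinear term is `[1, 2.5, 4.5]`, and the pixel is `[3.5, 6, 9]`.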

¹ Code is available at: https://github.com/xiangyongcao/CNN_HSIC_MRF.

² Code is available at: https://www.lx.it.pt/~jun/demos.html

³ Code is available at: https://github.com/xiangyongcao/3DDWT-SVM-GC


TABLE I

OVERALL ACCURACY (%) WITH DIFFERENT KERNEL SIZES

TABLE II

OVERALL ACCURACY (%) WITH DIFFERENT NETWORK WIDTHS

TABLE III

OVERALL ACCURACY (%) WITH DIFFERENT NETWORK DEPTHS

2) Impact of Parameter Settings: In this section, we evaluate in detail the sensitivity of performance to different parameter settings of our CNN structure. In the following experiments, we use the synthetic dataset and randomly choose 1% of the samples from each class as training data and the remainder for testing.

a) Kernel size: First, we test the impact of different kernel sizes. We fix the kernel size of the second layer to 3 and evaluate the performance by varying the kernel size in the first layer. Table I shows these results. From this table we can conclude that larger kernel sizes obtain better results. This is because more structure and texture can be captured using a larger kernel size. Therefore, in our experiments, we set the kernel size of the first layer to 5.

b) Network width: Next, we evaluate the impact of the network width in the convolutional layers on the classification results. Fixing the network width of the first convolutional layer to 100, we test the performance by changing the width of the second convolutional layer. These results are displayed in Table II. It can be observed that the results are not very sensitive to the network width, and thus in our experiments we select 200 as the default width of the second convolutional layer.

c) Network depth: We also conduct experiments to test the impact of different network depths by removing or adding non-linear layers. We train and test 5 networks with depths 5, 7, 9, 11 and 13. The experimental results are summarized in Table III. From this table it can be seen that increasing the network depth does not always generate better results, which may be a result of the vanishing gradient problem [28] encountered by deeper neural networks. For this experiment, the best performance is achieved by setting the network depth to 7, which we use as the default network depth in our experiments.

TABLE IV

OVERALL ACCURACY (%) WITH DIFFERENT PATCH SIZES

TABLE V

OVERALL ACCURACY (%) WITH DIFFERENT SMOOTHNESS PARAMETERS

d) Patch size: We also investigate the performance of our proposed method with respect to different data patch sizes k ∈ {1, 3, 5, 9, 13}. These results are shown in Table IV. As can be seen, a larger patch size generates better results. This is because more spatial information is incorporated into the training process. However, it also takes more computation time to train the network as the patch size increases. Thus we choose 9 as the default patch size as a tradeoff between performance and running time.
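The patch-wise input can be pictured as a k × k spatial window, with all spectral bands, centered on the pixel to be classified. The sketch below shows one way to cut such a patch from an image cube. It is an assumption-laden illustration: the function name `extract_patch` is hypothetical, and reflect-padding at the image border is a choice made here, since the text does not say how border pixels are handled.

```python
import numpy as np

def extract_patch(cube, row, col, k):
    """Return the k x k x B patch centered on (row, col); k must be odd.

    cube : (H, W, B) hyperspectral array
    Border pixels are handled by reflect-padding (an assumption of this sketch).
    """
    r = k // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    # After padding, original pixel (row, col) sits at (row + r, col + r).
    return padded[row:row + k, col:col + k, :]
```

The number of values per sample grows as k², which is consistent with the longer training time observed for larger patches.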

e) Smoothness parameter: Finally, we test the impact of the smoothness parameter μ on the obtained classification results. Table V shows these results, where it can be observed that setting μ = 20 produces the best classification results. Thus we use this value in our experiments.

For the other parameters, such as the kernel size in the max pooling layer, which is set to 2, and the numbers of nodes in the two fully connected layers, which are set to 200 and 100 respectively, we observe consistent performance under variations. Therefore, we do not report their impact on performance in this section. The learning rate α is set to 0.001 and the batch size nbatch to 100 for the synthetic and Indian Pines datasets, since consistent performance is also observed under variations. For the Pavia University dataset, the learning rate is set to 0.0001 and the batch size nbatch to 150. For the following experiments, we adopt the same network structure.
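For reference, the default settings chosen in this section can be gathered in one place. The snippet below is only a summary of the values stated in the text; the dictionary and key names are illustrative and do not correspond to any released configuration file.

```python
# Settings collected from the text; the key names are illustrative only.
CNN_SETTINGS = {
    "first_layer_kernel_size": 5,
    "second_layer_kernel_size": 3,
    "first_conv_width": 100,
    "second_conv_width": 200,
    "max_pool_kernel_size": 2,
    "fully_connected_nodes": (200, 100),
    "network_depth": 7,
    "patch_size": 9,
    "smoothness_mu": 20,
}

# Dataset-specific optimizer settings.
TRAINING_SETTINGS = {
    "synthetic": {"learning_rate": 0.001, "batch_size": 100},
    "indian_pines": {"learning_rate": 0.001, "batch_size": 100},
    "pavia_university": {"learning_rate": 0.0001, "batch_size": 150},
}
```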

3) Experimental Result of the Synthetic HSI Data: Using the parameter settings above for the CNN structure, we compare our CNN-MRF method with the other competing methods. The final classification results are shown in Table VI. From Table VI, we see that the proposed CNN-MRF method achieves better performance with respect to OA, AA and κ than the other methods. We emphasize that all non-deep learning methods (namely SVM-GC, MLRsubMLL and SVM-3DDWT-GC) in this experiment use the MRF to post-process the classification map; they differ from our method in how features are extracted and in that our method iteratively updates the CNN parameters and class labels. Thus the improvement of our method here is due to the use of the CNN and its integration with the MRF. Our CNN-MRF method also outperforms the other deep learning methods. For better visualization, we also show the final classification maps in Figure 3. From Figure 3, we can see that the classification maps obtained by our method are visually closer to the ground truth map than those of the other methods, which is consistent with the quantitative metric OA.

B. Real HSI Data

We next test our method on two real HSI benchmark datasets, the Indian Pines dataset and the Pavia University dataset, in a variety of experimental settings.

TABLE VI

OVERALL ACCURACIES (%), AVERAGE ACCURACIES (%), KAPPA COEFFICIENT (κ) AND RUNNING TIME OF ALL COMPETING METHODS ON THE SYNTHETIC DATASET

Fig. 3. Classification maps obtained by all competing methods on the synthetic dataset (overall accuracies are reported in parentheses).

1) AVIRIS Indian Pines Data: This data set was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in north-western Indiana in June 1992. The original dataset contains 220 spectral reflectance bands in the wavelength range 0.4–2.5 μm, of which 20 bands cover the region of water absorption. Unlike other methods, which remove the 20 polluted bands, we keep all 220 bands in our experiments. This HSI has a spectral resolution of 10 nm and a spatial resolution of 20 m per pixel, and the spatial dimension is 145 × 145. The ground truth contains 16 land cover classes. This dataset poses a challenging problem because of the significant presence of mixed pixels in all available classes and also because of the unbalanced number of available labeled pixels per class. A sample band of this dataset and the related ground truth categorization map are shown in Figure 4.

To first evaluate our proposed CNN-MRF method in the scenario of limited training samples, we randomly choose 10% of the available labeled samples for each class from the reference data, which is an imbalanced training sample case, and use the remaining samples in each class for testing. The training and testing sets are summarized in Table VII. This experiment is repeated 20 times for each method and the best result is reported.
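A per-class random split of this kind can be sketched as follows. This is not the sampling code used for the reported experiments; the function name `split_per_class` and its arguments are hypothetical, and the rounding rule for the per-class fraction is an assumption. The same helper covers a fixed number of training samples per class, as used later for the Pavia University data.

```python
import random
from collections import defaultdict

def split_per_class(labels, fraction=None, count=None, seed=0):
    """Split labeled sample indices into train/test sets class by class.

    labels   : sequence of class labels, one per labeled sample
    fraction : share of each class used for training (e.g. 0.10), or
    count    : fixed number of training samples per class (e.g. 40)
    With `fraction`, at least one sample per class is kept for training.
    """
    if (fraction is None) == (count is None):
        raise ValueError("give exactly one of fraction or count")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        n = count if count is not None else max(1, round(fraction * len(idxs)))
        n = min(n, len(idxs))  # never request more samples than the class has
        train.extend(idxs[:n])
        test.extend(idxs[n:])
    return train, test
```

Sampling a fixed fraction keeps the class imbalance of the reference data in the training set, whereas a fixed count per class yields a balanced training set.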

Fig. 4. Indian Pines image and related ground truth categorization information. (a) The original HSI. (b) The ground truth categorization map. (This figure is better seen by zooming on a computer screen.)

Additionally, some parameters need to be set in advance in these experiments. For the competing methods, the parameters are set as their papers suggest. For SVM-based methods, the RBF kernel parameter γ and the penalty parameter C are tuned through 5-fold cross validation ($\gamma = 2^{-8}, 2^{-7}, \ldots, 2^{8}$, $C = 2^{-8}, 2^{-7}, \ldots, 2^{8}$). For SVM-GC and SVM-3DDWT-GC, the spatial smoothness parameter β is set to 0.75, as advised by [12] and [60]. For MLRsubMLL, the smoothness parameter μ and the threshold parameter τ are set following [37]. For our proposed CNN-MRF method, we adopt the same network structure as for the synthetic HSI data.
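The search grid for the SVM-based methods contains 17 candidate values per parameter. The snippet below only enumerates that grid as an illustration; the variable names are hypothetical, and the cross-validation itself is left to whichever SVM toolbox is used.

```python
from itertools import product

# Candidate values 2^-8, 2^-7, ..., 2^8 for both the RBF parameter gamma and the penalty C.
exponents = range(-8, 9)
gamma_grid = [2.0 ** e for e in exponents]
c_grid = [2.0 ** e for e in exponents]

# Every (gamma, C) pair that 5-fold cross validation would score.
param_grid = list(product(gamma_grid, c_grid))
```

This gives 17 × 17 = 289 parameter pairs, each evaluated on 5 folds.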

Fig. 5. Classification maps obtained by all methods on the Indian Pines dataset (overall accuracies are reported in parentheses).

Fig. 6. Pavia University image and related ground truth categorization information. (a) The original HSI. (b) The ground truth categorization map. (This figure is better seen by zooming on a computer screen.)

For the above experimental settings, in order to measure the performance improvement due to including spatial contextual information with the MRF, we also report the classification results of each method without the MRF prior, and call these methods SVM, MLRsub, SVM-3DDWT and CNN, respectively. Classification maps on the Indian Pines dataset are shown for all methods in Figure 5, and the accuracies (i.e., individual class accuracy, OA, AA and κ) are reported in Table VIII.

TABLE VII

STATISTICS OF THE INDIAN PINES DATA SET, INCLUDING THE NAME, THE NUMBER OF TRAINING, TEST AND TOTAL SAMPLES FOR EACH CLASS

From Table VIII, we highlight two main results. First, CNN-MRF and CNN, which are both deep learning methods, achieve the first and third best performance in terms of OA. CNN-MRF attains about 3% improvement in terms of OA in this scenario, with SVM-3DDWT-GC (94.36%) ranking second, and CNN falls behind SVM-3DDWT-GC by only 0.08%. This means that using a CNN for classification without the MRF performs better than an MRF-based model with any other classifier except the SVM classifier with 3DDWT features. The CNN therefore significantly helps for this problem. Moreover, as depicted in Figure 5, the classification maps of the CNN-based methods are noticeably closer to the ground truth map. Second, directly comparing the MRF and non-MRF based methods, we can conclude that using an MRF prior significantly improves the classification accuracy of any particular classifier because it further embeds the spatial smoothness information into the segmentation stage. Therefore, the superior performance of our proposed CNN-MRF method can be explained by using the CNN and MRF strategies simultaneously to fully exploit the spectral and spatial information in an HSI.


TABLE VIII

INDIVIDUAL CLASS, OVERALL, AVERAGE ACCURACIES (%) AND KAPPA COEFFICIENT (κ ) OF ALL METHODS ON THE INDIAN PINES IMAGE TEST SET

Fig. 7. Classification maps obtained by all competing methods on the Pavia University dataset (overall accuracies are reported in parentheses).

2) ROSIS Pavia University Data: We perform similar experiments on a second real HSI dataset. This HSI was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over the urban area of the University of Pavia in northern Italy on July 8, 2002. The original dataset consists of 115 spectral bands ranging from 0.43 to 0.86 μm, of which 12 noisy bands are removed and only 103 bands are retained in our experiments. The scene has a spatial resolution of 1.3 m per pixel, and the spatial dimension is 610 × 340. There are 9 land cover classes in this scene and the number of samples in each class is displayed in Table IX. We show a sample band and the corresponding ground truth class map in Figure 6.

Fig. 8. Overall accuracy (%) obtained by all competing methods with different proportions of training samples on the Pavia University dataset. (a) Classification results. (b) Regularized classification results.

TABLE IX

STATISTICS OF THE PAVIA UNIVERSITY DATA SET, INCLUDING THE NAME, THE NUMBER OF TRAINING, TEST AND TOTAL SAMPLES FOR EACH CLASS

To evaluate the performance of our proposed CNN-MRF method using only a small number of labeled training samples, we randomly choose 40 samples for each class from the ground truth data for training, which gives a balanced training set, and the remaining samples are used for testing. The related statistics are also summarized in Table IX. As before, we repeat this experiment 20 times for each method and report the best result. All parameters involved in the compared methods are tuned in the same way as in the previous Indian Pines experiment. The network structure settings of the CNN are also the same. Classification and segmentation maps obtained by all methods on this dataset are illustrated in Figure 7 and accuracies (i.e., individual class accuracy, OA, AA and κ) are summarized in Table X.

From Table X, we can conclude that, for this dataset, our CNN-MRF approach again achieves the best performance in terms of the three quantitative criteria. It is also worth noting that the CNN without MRF performs second best with respect to OA, which means that the CNN plays an important role in improving the classification accuracy. By using the MRF prior, CNN-MRF obtains about 2% improvement in terms of OA compared with the CNN. For the other classification algorithms, however, we observe that the MRF can boost classification accuracy much more. For example, MLRsubMLL has about 17% improvement in OA compared with MLRsub. SVM-3DDWT-GC has the third best classification OA (94.31%) due to the 3-dimensional discrete wavelet transform (3DDWT), as has been previously studied in [12]. Meanwhile, it can be seen from Figure 7 that CNN-MRF obtains a much smoother classification map than the other methods, which is consistent with the results shown in Table X. Consequently, the improvement of our CNN-MRF approach can be explained by the use of the CNN and MRF models simultaneously.

C. Limited Training Data Scenarios

To analyze the sensitivity of the proposed method to training sets consisting of limited training samples, we conduct additional experiments in which 0.1%, 0.2%, 0.3%, 0.4% and 0.5% of each class are randomly selected from the Pavia University data as training samples and the remainder are used for testing. For this experiment, we adopt the data augmentation technique [36] to help the training process of the CNN. The OA of all methods is displayed in Figure 8. From Figure 8(a), we observe that the CNN method outperforms the other methods for each training set size. Additionally, when the spatial prior is considered using the MRF, the classification results in Figure 8(b) significantly improve on the corresponding results in Figure 8(a), again indicating that the MRF is an important factor for improving the classification accuracy.

D. Study on the Interaction Between CNN and MRF

Furthermore, to illustrate the interaction between the CNN and the MRF, we also report the OA of our method on the three datasets as a function of the training epoch. The results are shown in Table XI. Specifically, in our experiments, we first train our CNN for 30 epochs using the training data Dl. Then, we update the class labels y every 10 epochs. From Table XI, it can be seen that the OA increases quickly in the first 30 epochs. Then, the OA increases slowly for about 40 epochs. Finally, the OA decreases a little or fluctuates slightly. Therefore, we conclude that the interaction between the CNN and MRF can help the final classification OA. In our experiments, we set the maximum epoch to 60 and report the output as the result of our method.
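The alternation between CNN training and label updates can be written down as a simple epoch schedule. The generator below is only a sketch of that timing: the name `training_schedule` is hypothetical, and it assumes that the first label update happens 10 epochs after the 30-epoch warm-up (the text does not say whether an update also occurs at epoch 30). The actual CNN training and MRF optimization steps are not shown.

```python
def training_schedule(warmup_epochs=30, label_update_every=10, max_epochs=60):
    """Yield (epoch, action) pairs for the alternating procedure.

    'train'       : one epoch of CNN parameter updates
    'train+label' : a CNN epoch followed by an MRF label update
    """
    for epoch in range(1, max_epochs + 1):
        past_warmup = epoch > warmup_epochs
        if past_warmup and (epoch - warmup_epochs) % label_update_every == 0:
            yield epoch, "train+label"
        else:
            yield epoch, "train"
```

Under these assumptions, label updates fall on epochs 40, 50 and 60.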


TABLE X

INDIVIDUAL CLASS, OVERALL, AVERAGE ACCURACIES (%) AND KAPPA COEFFICIENT (κ) OF ALL COMPETING METHODS ON THE PAVIA UNIVERSITY IMAGE TEST SET

E. Comparison With Other Deep Learning Methods

To further evaluate the performance of our proposed CNN-MRF method, we compare it with three recent deep learning HSI classification algorithms: SS-DCNN [66], SPP-DCNN [65] and DC-CNN [67]. We use the two previous real datasets for comparison. For a fair comparison, we choose training samples in the same proportions as in [67]. Specifically, for the Indian Pines dataset we choose 10% of the samples from each class as the training set, and for the Pavia University dataset we choose 5% of the samples from each class as training data. We also apply the data augmentation strategy in the two experiments, since all deep learning algorithms can adopt this strategy. The results for each algorithm are displayed in Table XII and Table XIII.⁴ Results for Indian Pines are shown in Table XII, where it can be seen that our proposed CNN-MRF method achieves improved performance compared with the other deep methods in terms of OA, AA and κ. We show results for the Pavia University data in Table XIII. We again observe that our method achieves the best performance compared with the other methods. From Tables XII and XIII, we can also conclude that the running time of our proposed method is a potential drawback. The running time reported in this experiment includes both training and testing time. (This could be further addressed with high-performance computing resources and techniques.)

F. Comparison With Other Label Regularization Methods

⁴ The results for SS-DCNN, SPP-DCNN and DC-CNN are taken from [67]. The running times in Table VIII and Table XII differ because the data augmentation strategy is adopted in the experiment of Table XII.

TABLE XI

THE CHANGE OF OVERALL ACCURACY (%) AS EPOCH GOES

TABLE XII

OVERALL ACCURACY, AVERAGE ACCURACY AND κ (%) OF ALL THE DEEP LEARNING METHODS ON THE INDIAN PINES DATASET

TABLE XIII

OVERALL ACCURACY, AVERAGE ACCURACY AND κ (%) OF ALL THE DEEP LEARNING METHODS ON THE PAVIA UNIVERSITY DATASET

TABLE XIV

OVERALL ACCURACY (%) OF ALL THE LABEL REGULARIZATION METHODS ON THE SYNTHETIC, INDIAN PINES AND PAVIA UNIVERSITY DATASETS

To evaluate the contribution of the MRF, we compare it with another label regularization method, namely majority voting [7]. This method is implemented simply by replacing the MRF optimization with majority voting in our algorithm framework (we call this method CNN-MV). The CNN-MV and CNN-MRF methods are compared on the above synthetic data, Indian Pines data and Pavia University data with the same experimental settings as shown earlier. For the majority voting method, we try the window sizes {3, 5, 7} and report the best result. The experimental results are shown in Table XIV and Figure 9. From Table XIV and Figure 9, we can observe that the CNN-MRF method obtains better performance than the CNN-MV method on all the datasets. Thus we adopt the MRF as the label regularization method in our algorithm.
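Window-based majority voting on a label map can be sketched as below. This is an illustrative version, not the CNN-MV code behind Table XIV: the function name `majority_vote` is hypothetical, and both the clipping of the window at the image border and the tie-breaking rule are assumptions.

```python
from collections import Counter

def majority_vote(label_map, window=3):
    """Replace each label by the most frequent label in its window x window neighborhood.

    label_map : list of equal-length lists of integer labels
    window    : odd window size; the window is clipped at the image border
    A pixel keeps its current label when that label is among the most
    frequent ones in the window (an assumption of this sketch).
    """
    h, w = len(label_map), len(label_map[0])
    r = window // 2
    out = [row[:] for row in label_map]
    for i in range(h):
        for j in range(w):
            counts = Counter(
                label_map[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            )
            best = max(counts.values())
            if counts[label_map[i][j]] < best:
                out[i][j] = max(counts, key=counts.get)
    return out
```

Unlike the MRF, this filter looks only at label counts inside a fixed window and ignores the class probabilities produced by the CNN.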


Fig. 9. Classification maps obtained by all competing methods on the Pavia University dataset (overall accuracies are reported in parentheses).

V. CONCLUSION

In this paper, we proposed a novel technique for HSI classification that incorporates both spectral and spatial information in a unified Bayesian framework. Specifically, we use a convolutional neural network (CNN) in combination with a Markov random field to classify HSI pixel vectors in a way that fully takes spatial and spectral information into account. We then efficiently learn the classification result by iteratively updating the CNN parameters and the class labels of all HSI pixel vectors. Experimental results on one synthetic HSI dataset and two real benchmark HSI datasets show that our method outperforms state-of-the-art methods, including deep and non-deep models. In the future, we will further consider the HSI classification task in unsupervised settings. We will also try to extend our model to popular deep generative models, such as the variational autoencoder (VAE) [32] and the generative adversarial network (GAN) [26]. Furthermore, we hope to design more powerful regularization regimes, extending the employed MRF one, for different application scenarios of this problem in our future research. We will also consider dimensionality reduction techniques to make the method more computationally efficient, especially on large-scale datasets.

ACKNOWLEDGEMENTS

The authors would like to thank the anonymous reviewers for their careful reading of our paper and their insightful comments and suggestions, which helped us to improve the quality of the paper.

REFERENCES

[1] M. Abadi et al. (2016). “TensorFlow: Large-scale machine learning on heterogeneous distributed systems.” [Online]. Available: https://arxiv.org/abs/1603.04467

[2] T. V. Bandos, L. Bruzzone, and G. Camps-Valls, “Classification of hyperspectral images with regularized linear discriminant analysis,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 3, pp. 862–873, Mar. 2009.

[3] J. A. Benediktsson, J. A. Palmason, and J. R. Sveinsson, “Classification of hyperspectral data from urban areas based on extended morphological profiles,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 480–491, Mar. 2005.

[4] J. Besag, “Spatial interaction and the statistical analysis of lattice systems,” J. Roy. Statist. Society. B (Methodol.), vol. 36, no. 2, pp. 192–236, 1974.

[5] M. R. Bhatt and U. B. Desai, “Robust image restoration algorithm using Markov random field model,” CVGIP, Graph. Models Image Process., vol. 56, no. 1, pp. 61–74, 1994.

[6] H. Bischof and A. Leonardis, “Finding optimal neural networks for land use classification,” IEEE Trans. Geosci. Remote Sens., vol. 36, no. 1, pp. 337–341, Jan. 1998.

[7] R. S. Boyer and J. S. Moore, MJRTY—A Fast Majority Vote Algorithm. Dordrecht, The Netherlands: Springer, 1991.


[8] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, Nov. 2001.

[9] Y. Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Jul. 2001, pp. 105–112.

[10] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 6, pp. 1351–1362, Jun. 2005.

[11] G. Camps-Valls et al., “Robust support vector method for hyperspectral data classification and knowledge discovery,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 7, pp. 1530–1542, Jul. 2004.

[12] X. Cao, L. Xu, D. Meng, Q. Zhao, and Z. Xu, “Integration of 3-dimensional discrete wavelet transform and Markov random field for hyperspectral image classification,” Neurocomputing, vol. 226, pp. 90–100, Feb. 2017.

[13] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. (2014). “Semantic image segmentation with deep convolutional nets and fully connected CRFs.” [Online]. Available: https://arxiv.org/abs/1412.7062

[14] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2094–2107, Jun. 2014.

[15] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral image classification using dictionary-based sparse representation,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3973–3985, Oct. 2011.

[16] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral image classification via kernel sparse representation,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 1, pp. 217–231, Jan. 2013.

[17] Y. Chen, X. Zhao, and X. Jia, “Spectral–spatial classification of hyperspectral data based on deep belief network,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2381–2392, Jun. 2015.

[18] D. L. Civco, “Artificial neural networks for land-cover classification and mapping,” Int. J. Geogr. Inf. Sci., vol. 7, no. 2, pp. 173–186, 1993.

[19] M. Crawford and W. Kim, “Manifold learning for multi-classifier systems via ensembles,” in Multiple Classifier Systems. Berlin, Germany: Springer, 2009, pp. 519–528.

[20] M. D. Mura, A. Villa, J. A. Benediktsson, J. Chanussot, and L. Bruzzone, “Classification of hyperspectral images by using extended morphological attribute profiles and independent component analysis,” IEEE Geosci. Remote Sens. Lett., vol. 8, no. 3, pp. 542–546, May 2011.

[21] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2015.

[22] J. M. Duarte-Carvajalino, G. Sapiro, M. Vélez-Reyes, and P. E. Castillo, “Multiscale representation and segmentation of hyperspectral imagery using geometric partial differential equations and algebraic multigrid methods,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 8, pp. 2418–2434, Aug. 2008.

[23] M. Fauvel, Y. Tarabalka, J. A. Benediktsson, J. Chanussot, and J. C. Tilton, “Advances in spectral-spatial classification of hyperspectral images,” Proc. IEEE, vol. 101, no. 3, pp. 652–675, Mar. 2013.

[24] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley, “Clearing the skies: A deep network architecture for single-image rain removal,” IEEE Trans. Image Process., vol. 26, no. 6, pp. 2944–2956, Jun. 2017.

[25] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org

[26] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.

[27] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 346–361.

[28] S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” Int. J. Uncertainty, Fuzziness Knowl.-Based Syst., vol. 6, no. 2, pp. 107–116, 1998.

[29] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, “Deep convolutional neural networks for hyperspectral image classification,” J. Sensors, vol. 2015, Jan. 2015, Art. no. 258619.

[30] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.

[31] S. Jia, X. Zhang, and Q. Li, “Spectral–spatial hyperspectral image classification using ℓ1/2 regularized low-rank representation and sparse representation-based graph cuts,” IEEE J. Sel. Topics Appl. Earth Observat. Remote Sens., vol. 8, no. 6, pp. 2473–2484, Jun. 2015.

[32] D. P. Kingma and M. Welling. (2013). “Auto-encoding variational bayes.” [Online]. Available: https://arxiv.org/abs/1312.6114

[33] K. Kitada and K. Fukuyama, “Land-use and land-cover mapping using a gradable classification method,” Remote Sens., vol. 4, no. 6, pp. 1544–1558, 2012.

[34] V. Kolmogorov, “Convergent tree-reweighted message passing for energy minimization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 10, pp. 1568–1583, Oct. 2006.

[35] V. Kolmogorov and R. Zabih, “What energy functions can be minimized via graph cuts?” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 2, pp. 147–159, Feb. 2004.

[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[37] J. Li, J. M. Bioucas-Dias, and A. Plaza, “Spectral–spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 3, pp. 809–823, Mar. 2012.

[38] J. Li, J. M. Bioucas-Dias, and A. Plaza, “Semisupervised hyperspec-tral image segmentation using multinomial logistic regression withactive learning,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 11,pp. 4085–4098, Nov. 2010.

[39] G. Licciardi, P. R. Marpu, J. Chanussot, and J. A. Benediktsson, “Linearversus nonlinear PCA for the classification of hyperspectral data basedon the extended morphological profiles,” IEEE Geosci. Remote Sens.Lett., vol. 9, no. 3, pp. 447–451, May 2012.

[40] Z. Lin, Y. Chen, X. Zhao, and G. Wang, “Spectral–spatial classificationof hyperspectral image using autoencoders,” in Proc. IEEE 9th Int. Conf.Inf., Commun. Signal Process. (ICICS), Dec. 2013, pp. 1–5.

[41] L. Ma, M. M. Crawford, and J. Tian, “Local manifold learning-basedk-nearest-neighbor for hyperspectral image classification,” IEEE Trans.Geosci. Remote Sens., vol. 48, no. 11, pp. 4099–4109, Nov. 2010.

[42] T. Matsuki, N. Yokoya, and A. Iwasaki, “Hyperspectral tree speciesclassification of Japanese complex mixed forest with the aid of lidardata,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens, vol. 8,no. 5, pp. 2177–2187, May 2015.

[43] F. Melgani and L. Bruzzone, “Classification of hyperspectral remotesensing images with support vector machines,” IEEE Trans. Geosci.Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.

[44] R. Mendez-Rial and J. Martin-Herrero, “Efficiency of semi-implicitschemes for anisotropic diffusion in the hypercube,” IEEE Trans. ImageProcess., vol. 21, no. 5, pp. 2389–2398, May 2012.

[45] L. Mou, P. Ghamisi, and X. X. Zhu, “Deep recurrent neural networks forhyperspectral image classification,” IEEE Trans. Geosci. Remote Sens.,vol. 55, no. 7, pp. 3639–3655, Jul. 2017.

[46] S. Paisitkriangkrai, J. Sherrah, P. Janney, and A. Van-Den Hengel,“Effective semantic pixel labelling with convolutional networks andconditional random fields,” in Proc. IEEE Conf. Comput. Vis. PatternRecognit. Workshops (CVPRW), Jun. 2015, pp. 36–43.

[47] G. P. Petropoulos, C. Kalaitzidis, and K. P. Vadrevu, “Support vector machines and object-based classification for obtaining land-use/cover cartography from hyperion hyperspectral imagery,” Comput. Geosci., vol. 41, pp. 99–107, Apr. 2012.

[48] A. Plaza et al., “Recent advances in techniques for hyperspectral image processing,” Remote Sens. Environ., vol. 113, pp. S110–S122, Sep. 2009.

[49] Y. Qian, M. Ye, and J. Zhou, “Hyperspectral image classification based on structured sparse logistic regression and three-dimensional wavelet texture features,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 4, pp. 2276–2291, Apr. 2013.

[50] M. G. Reyes, X. Zhao, D. L. Neuhoff, and T. N. Pappas, “Lossy compression of bilevel images based on Markov random fields,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep./Oct. 2007, pp. II-373–II-376.

[51] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” Inst. Cognit. Sci., DTIC Document, California Univ. San Diego, San Diego, CA, USA, Tech. Rep. ICS-8506, 1985.

[52] H. Z. M. Shafri, E. Taherzadeh, S. Mansor, and R. Ashurov, “Hyperspectral remote sensing of urban areas: An overview of techniques and applications,” Res. J. Appl. Sci., Eng. Technol., vol. 4, no. 11, pp. 1557–1565, 2012.

[53] L. Shen and S. Jia, “Three-dimensional Gabor wavelets for pixel-based hyperspectral imagery classification,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 12, pp. 5039–5046, Dec. 2011.

[54] A. Soltani-Farani, H. R. Rabiee, and S. A. Hosseini, “Spatial-aware dictionary learning for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 1, pp. 527–541, Jan. 2015.

[55] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.

[56] X. Sun, Q. Qu, N. M. Nasrabadi, and T. D. Tran, “Structured priors for sparse-representation-based hyperspectral image classification,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 7, pp. 1235–1239, Jul. 2014.

[57] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2553–2561.

[58] Y. Tarabalka, J. A. Benediktsson, and J. Chanussot, “Spectral–spatial classification of hyperspectral imagery based on partitional clustering techniques,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 8, pp. 2973–2987, Aug. 2009.

[59] Y. Tarabalka, M. Fauvel, J. Chanussot, and J. A. Benediktsson, “SVM- and MRF-based method for accurate classification of hyperspectral images,” IEEE Geosci. Remote Sens. Lett., vol. 7, no. 4, pp. 736–740, Oct. 2010.

[60] Y. Tarabalka and A. Rana, “Graph-cut-based model for spectral-spatial classification of hyperspectral images,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2014, pp. 3418–3421.

[61] A. Villa, J. A. Benediktsson, J. Chanussot, and C. Jutten, “Hyperspectral image classification with independent component discriminant analysis,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 12, pp. 4865–4876, Dec. 2011.

[62] J. Xie, L. Xu, and E. Chen, “Image denoising and inpainting with deep neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 341–349.

[63] Y. Xu, Z. Wu, and Z. Wei, “Spectral–spatial classification of hyperspectral image based on low-rank decomposition,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2370–2380, Jun. 2015.

[64] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free-energy approximations and generalized belief propagation algorithms,” IEEE Trans. Inf. Theory, vol. 51, no. 7, pp. 2282–2312, Jul. 2005.

[65] J. Yue, S. Mao, and M. Li, “A deep learning framework for hyperspectral image classification using spatial pyramid pooling,” Remote Sens. Lett., vol. 7, no. 9, pp. 875–884, 2016.

[66] J. Yue, W. Zhao, S. Mao, and H. Liu, “Spectral–spatial classification of hyperspectral images using deep convolutional neural networks,” Remote Sens. Lett., vol. 6, no. 6, pp. 468–477, 2015.

[67] H. Zhang, Y. Li, Y. Zhang, and Q. Shen, “Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network,” Remote Sens. Lett., vol. 8, no. 5, pp. 438–447, 2017.

[68] W. Zhu et al., “Unsupervised classification in hyperspectral imagery with nonlocal total variation and primal-dual hybrid gradient algorithm,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2786–2798, May 2017.

[69] D. Zikic et al., “Linear intensity-based image registration by Markov random fields and discrete optimization,” Med. Image Anal., vol. 14, no. 4, pp. 550–562, 2010.

Xiangyong Cao received the B.Sc. degree from Xi’an Jiaotong University, Xi’an, China, in 2012, where he is currently pursuing the Ph.D. degree. He was a Visiting Scholar with Columbia University, New York, NY, USA, from 2016 to 2017. His current research interests include Bayesian method, deep learning, and hyperspectral image processing.

Feng Zhou received the M.S. and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China, in 2004 and 2007, respectively, where he is currently a Professor with the National Laboratory of Radar Signal Processing. His research interests are radar imaging and jamming suppression. He was granted the program for New Century Excellent Talents in University in China, and was a recipient of the Young Scientist Award from the XXXI URSI GASS Committee.

Lin Xu received the Ph.D. degree from the Institute for Information and System Sciences, School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China. He is currently a Post-Doctoral Associate with the Multimedia and Visual Computing Laboratory, New York University Abu Dhabi. His research interests include neural networks, learning algorithms, and applications in computer vision.

Deyu Meng received the B.Sc., M.Sc., and Ph.D. degrees from Xi’an Jiaotong University, Xi’an, China, in 2001, 2004, and 2008, respectively. He is currently an Associate Professor with the Institute for Information and System Sciences, School of Mathematics and Statistics, Xi’an Jiaotong University. From 2012 to 2014, he took his two-year sabbatical leave in Carnegie Mellon University. His current research interests include self-paced learning, noise modeling, and tensor sparsity.

Zongben Xu received the Ph.D. degree in mathematics from Xi’an Jiaotong University, China, in 1987. He serves as the Chief Scientist of National Basic Research Program of China (973 Project), and the Director of the Institute for Information and System Sciences with the Xi’an Jiaotong University. His current research interests include intelligent information processing and applied mathematics. He was a recipient of the National Natural Science Award of China in 2007 and the CSIAM Su Buchin Applied Mathematics Prize in 2008. He delivered a talk at the International Congress of Mathematicians 2010. He was elected as a member of Chinese Academy of Science in 2011.

John Paisley received the B.S., M.S., and Ph.D. degrees in electrical engineering from Duke University. He was a Post-Doctoral Researcher with the Computer Science Department, University of California at Berkeley and Princeton University. He is currently an Assistant Professor with the Department of Electrical Engineering, Columbia University, and also a member of the Data Science Institute, Columbia University. His current research is machine learning, focusing on models and inference techniques for text and image processing applications.