
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 58, NO. 6, JUNE 2020

Spatial–Spectral Feature Extraction via Deep ConvLSTM Neural Networks for Hyperspectral Image Classification

Wen-Shuai Hu, Heng-Chao Li, Senior Member, IEEE, Lei Pan, Wei Li, Senior Member, IEEE, Ran Tao, Senior Member, IEEE, and Qian Du, Fellow, IEEE

Abstract— In recent years, deep learning has brought great advances in hyperspectral image (HSI) classification. In particular, long short-term memory (LSTM), as a special deep learning structure, has shown great ability in modeling long-term dependencies in the time dimension of video or the spectral dimension of HSIs. However, its loss of spatial information makes it quite difficult to obtain better performance. To address this problem, two novel deep models are proposed to extract more discriminative spatial–spectral features by exploiting the convolutional LSTM (ConvLSTM). By taking the data patch in a local sliding window as the input of each memory cell band by band, the 2-D extended architecture of LSTM is used to build the spatial–spectral ConvLSTM 2-D neural network (SSCL2DNN), which models long-range dependencies in the spectral domain. To better preserve the intrinsic structure information of the hyperspectral data, the spatial–spectral ConvLSTM 3-D neural network (SSCL3DNN) is proposed by extending LSTM to a 3-D version, further improving the classification performance. Experiments conducted on three commonly used HSI data sets demonstrate that the proposed deep models have competitive advantages and provide better classification performance than other state-of-the-art approaches.

Index Terms— Classification, convolutional long short-term memory (ConvLSTM), deep learning, feature extraction, hyperspectral image (HSI).

I. INTRODUCTION

THE hyperspectral remote sensing image is a 3-D data cube, which integrates the spectral information with the 2-D spatial information of land covers. With the development of remote sensing technology, together with the continuous improvement of hyperspectral sensors, hyperspectral images (HSIs) can provide more opportunities to manage and analyze the information from the Earth's surface [1]–[3].

Manuscript received July 23, 2019; revised October 20, 2019; accepted December 10, 2019. Date of publication January 15, 2020; date of current version May 21, 2020. This work was supported by the National Natural Science Foundation of China under Grant 61871335. (Corresponding author: Heng-Chao Li.)

Wen-Shuai Hu, Heng-Chao Li, and Lei Pan are with the Sichuan Provincial Key Laboratory of Information Coding and Transmission, Southwest Jiaotong University, Chengdu 610031, China (e-mail: [email protected]).

Wei Li and Ran Tao are with the School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China.

Qian Du is with the Department of Electrical and Computer Engineering, Mississippi State University, Mississippi State, MS 39762 USA.

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TGRS.2019.2961947

Correspondingly, HSIs have been widely used in many fields, such as environmental sciences [2], precision agriculture [4], [5], ecological science [6], and geological exploration [7].

As a basic and important research topic, HSI classification has attracted plenty of attention. The support vector machine (SVM) is the most widely used classifier and shows great success in HSI classification [8]. In particular, a composite kernel SVM (SVM-CK) was proposed to exploit the spatial and spectral information simultaneously [9]. Inspired by its successful application to face recognition [10], sparse representation (SR) [11] has also been applied to HSI classification and was further improved to exploit the spatial information by Chen et al. [12], resulting in a joint SR classifier (JSRC). Subsequently, more and more classifiers based on the joint sparse model were developed, such as the nonlocal weighted JSRC (NLW-JSRC) [13], the nearest regularized JSRC (NRJSRC) [14], and the correntropy-based robust JSRC (RJSRC) [15].

In recent years, deep learning has made great advances in feature extraction and classification in the field of computer vision, such as object detection [16], object tracking [17], and behavior recognition in crowd scenes [18], and has also shown its effectiveness in the HSI classification task [19]. An increasing number of deep learning-based feature extraction and classification methods have been designed for HSIs. In [20], a classification method based on the stacked autoencoder (SAE) was exploited for HSI classification for the first time by extracting spatial–spectral features. Ever since Hu et al. [21] and Chen et al. [22] introduced the convolutional neural network (CNN) into HSI classification, many new CNN-based classification models have emerged and provided satisfactory performance in HSI classification. Li et al. [23] integrated the pixel-pair model into CNN to further obtain discriminative features. Shao and Du [24] utilized CNN and balanced local discriminant embedding to fuse the spatial and spectral information. For the sake of joint learning of spatial–spectral features, many classification models [25]–[28] were built by taking the 3-D data cube as the input and deepening the network to obtain more discriminative features. In order to compute the distance between features of the same class and that between features of different classes more effectively and quickly, Fang et al. [29] integrated hashing learning into CNN to realize the transformation from high-dimensional features to low-dimensional features.


In addition, to overcome the problem of the limited availability of labeled samples, two self-taught learning frameworks based on multiscale independent component analysis (MICA) and the stacked convolutional autoencoder (SCAE) [30], a semi-supervised CNN [31], and a supervised deep feature extraction method based on the siamese CNN (S-CNN) [32] were proposed. In addition to single-channel deep network models, there are also some novel works [27], [33] based on multi-channel CNN to improve feature extraction and classification. In particular, Li et al. [34] made a systematic review of HSI classification methods, which not only comprehensively categorized the existing deep learning-based models but also reviewed the strategies used to improve performance when labeled samples are limited, which is meaningful for future research.

Obviously, CNN has provided an extremely effective and basic structure for various tasks and has become the core architecture of various models, such as VGG16 [16], ResNet [25], CapsNets [35], and DenseNet [36], among which the convolutional layer is the core backbone for feature extraction. To address the gradient vanishing problem caused by deeper and deeper structures [37], effective feature extraction and classification models [25], [35], [36] were proposed by promoting the CNN filters to improve the classification performance for HSIs.

In order to analyze sequential data, the recurrent neural network (RNN) [38], as an effective deep learning model, has attracted wide attention for modeling long-range dependencies and is also expected to extract features for HSI classification. By using a special activation function and an improved gated recurrent unit (GRU), Mou et al. [39] built an RNN model for pixel-level spectral classification. Considering that the spatial information has a positive impact on HSI classification performance, Liu et al. [40] used multiple RNNs to model long-term dependencies between central and neighborhood pixels. Zhang et al. [41] built a local spatial sequential RNN (LSS-RNN) model, in which low-level features, such as texture and different morphological profile features, were used to construct LSS features, which were then input to an RNN to extract high-level features. Based on GRU, Hang et al. [42] designed an end-to-end RNN model for extracting the useful information from adjacent and nonadjacent spectral bands and integrated convolutional layers into the RNN to improve the classification accuracy. In addition, Mou et al. [43] used an RNN to model the temporal dependence of the outputs of a convolutional subnetwork for change detection in multispectral imagery.

Long short-term memory (LSTM), as a special RNN structure, has demonstrated its stability and power in modeling long-term dependencies in various studies [44], [45]. The initially proposed structure of LSTM utilizes a special "memory cell" instead of logistic or tanh hidden units [44], and there are three significant gate mechanisms in this kind of structure: the input gate, output gate, and forget gate, which are used for implementing information protection, transmission, and control, respectively. Specifically, the input gate is applied to control when the input is allowed to be added to the memory cell, the output gate is designed to decide when the input data have an influence on the output of the memory cell, and the forget gate is mainly used for modeling long-range dependencies. It is the design of this special memory cell, with fixed weight and self-connected circular edges, that ensures the gradient can pass across many time steps without gradient vanishing or explosion problems [46]. However, LSTM has an inherent drawback: the spatial structure information is lost when the input is unfolded into 1-D form. To better model spatiotemporal relationships, Shi et al. [47] extended the data processing in LSTM to the convolution operation and proposed the convolutional LSTM (ConvLSTM), which uses convolution structures in both the input-to-state and state-to-state transitions. Due to the characteristics of LSTM, much attention has been paid to it, and the most common way is to use LSTM in combination with CNN. Wu and Prasad [48] built a hybrid model for HSI classification, in which convolutional layers followed by recurrent layers were used to extract spectrally contextual features. By utilizing convolutional recurrent layers, Rußwurm and Körner [49] built an encoder framework for land cover classification. Song et al. [50] proposed a recurrent 3-D fully convolutional network for change detection in HSIs, in which a ConvLSTM layer is used to model long-term dependencies on the outputs of a fully convolutional network in the temporal field. Moreover, Seydgar et al. [51] proposed a two-stage model, in which ConvLSTM, cascaded with 3-D CNN, was utilized to extract spatial–spectral features.

In addition to the above combinations with CNN, there are some works that build deep models using LSTM alone. Zhou et al. [52] designed a spatial–spectral LSTMs (SSLSTMs) model based on two independent LSTMs, in the form of a spatial LSTM (SaLSTM) and a spectral LSTM (SeLSTM). However, this design not only insufficiently fuses the spatial and spectral information but also loses the spatial structure. Rußwurm and Körner [53] employed multilevel cascaded LSTMs for land cover classification. Inspired by spatial similarity measurements [54], Ma et al. [55] built an LSTM-based model, in which the spatial–spectral features are extracted by these measurement strategies. Moreover, a bidirectional ConvLSTM (Bi-CLSTM) model [56] was proposed that cascades all outputs in each layer to extract spatial–spectral features. However, this simple cascade fails to fully utilize the correlation between different spectral bands.

Inspired by the above works, the main purpose of this article is to construct two novel deep ConvLSTM neural networks for HSI feature extraction and classification. The main contributions of this article are listed as follows.

1) In order to address the problem of underutilization of the correlations between different spectral bands in Bi-CLSTM and the issues of insufficient fusion of the spatial and spectral information and the loss of the spatial structure information in SSLSTMs, an effective feature extraction model, i.e., the spatial–spectral ConvLSTM 2-D neural network (SSCL2DNN), is proposed by modeling long-term dependencies in the spectral field for joint learning of spatial–spectral features, in which the local window patch is decomposed into a spectral sequence and then input to each memory cell band by band.


2) To better preserve the intrinsic structure of hyperspectral data, a 3-D structure (namely, ConvLSTM3D) is further developed from the basic ConvLSTM cell, with which a novel deep model, i.e., the spatial–spectral ConvLSTM 3-D neural network (SSCL3DNN), is constructed. Different from band-by-band processing, the local patch is directly taken as a 3-D cube at the input of SSCL3DNN, which enables it to further improve the classification performance of SSCL2DNN.

The remainder of this article is organized as follows. Section II presents CNN, LSTM, and ConvLSTM, and discusses their applications to HSI feature extraction and classification. Sections III and IV describe the two proposed deep models for spatial–spectral feature extraction and HSI classification in detail. Comprehensive quantitative analysis and evaluation of the proposed models are presented in Section V. Finally, Section VI summarizes this article.

II. RELATED WORK

A. CNN

The fundamental CNN mainly consists of the following parts: convolutional layers, pooling layers, fully connected layers, and a classification layer. Based on different convolution operations, we can construct various networks to meet a variety of practical requirements. The calculation of each convolutional layer in CNN can be expressed as

o_{r+1} = φ(W_r ∗ o_r + b_r)    (1)

where o_{r+1} is the output, with M feature maps, of the r-th convolutional layer, W_r denotes the convolution filter, b_r is the bias of the r-th convolutional layer, o_r is the output of the (r−1)-th convolutional layer, and φ(·) is the nonlinear activation function. For 2-D CNN, W_r has a size of k^(r) × k^(r). For 3-D CNN, W_r is a convolution filter of size k^(r) × k^(r) × d^(r) in the r-th convolutional layer, where k^(r) and d^(r) denote the size and depth of the convolution filter, respectively.
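To make (1) concrete, the following minimal NumPy/SciPy sketch applies a bank of 2-D filters to a single-band patch; the 27 × 27 patch, the 32 filters of size 3 × 3, and the zero biases are illustrative assumptions, not settings taken from this article.

import numpy as np
from scipy.signal import correlate2d

def conv2d_layer(o_r, W_r, b_r, phi=np.tanh):
    # One 2-D convolutional layer as in (1): o_{r+1} = phi(W_r * o_r + b_r).
    # o_r: (H, W) input map; W_r: (M, k, k) bank of M filters; b_r: (M,) biases.
    maps = [correlate2d(o_r, W_r[m], mode="valid") + b_r[m]
            for m in range(W_r.shape[0])]
    return phi(np.stack(maps, axis=-1))   # (H-k+1, W-k+1, M) output feature maps

patch = np.random.randn(27, 27)           # one band of a hypothetical local patch
out = conv2d_layer(patch, np.random.randn(32, 3, 3), np.zeros(32))
print(out.shape)                          # (25, 25, 32)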

Since Hu et al. [21] and Chen et al. [22] proposed HSI feature extraction and classification models based on CNN, more and more improved CNN-based models have been introduced, and CNN has become an extremely effective and basic structure for various tasks. It is noteworthy that a sliding window is still used to extract spatial features in CNN, which is a traditional way of exploiting spatial information. Moreover, data transmission in CNN exists only between adjacent layers, lacking information interaction inside each layer, which may make it difficult to extract more effective features.

B. LSTM

LSTM was proposed to deal with the issues that RNN is not suitable for learning long-term dependencies and is prone to gradient vanishing and exploding problems [46]. The data transmission and processing in LSTM are realized by three key gate units: the input gate, output gate, and forget gate, which are used for implementing information protection and control [44]. The calculation formulas of these three gate structures in LSTM are written as

i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + W_{ci} ◦ c_{t−1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + W_{cf} ◦ c_{t−1} + b_f)
c̃_t = tanh(W_{xc} x_t + W_{hc} h_{t−1} + b_c)
c_t = f_t ◦ c_{t−1} + i_t ◦ c̃_t
o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + W_{co} ◦ c_t + b_o)
h_t = o_t ◦ tanh(c_t)    (2)

where x_t, h_{t−1}, and c_{t−1} represent the input of the current cell and the output and state of the last cell in LSTM, respectively, and i_t, f_t, and o_t denote the input gate, forget gate, and output gate of LSTM. W_{·i} and b_i are the weight and bias of the input gate, W_{·f} and b_f are the weight and bias of the forget gate, and W_{·o} and b_o are the weight and bias of the output gate, where · indicates x, h, and c. ◦ denotes the Hadamard product, and σ is the nonlinear activation function.
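As a concrete reference for (2), the following minimal NumPy sketch performs one peephole LSTM step; the dictionary p of weight matrices and biases, and their shapes, are hypothetical placeholders.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    # One peephole LSTM step following (2); p maps names such as "Wxi" to arrays.
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    c_tilde = np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde                      # Hadamard products, as in (2)
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
    h = o * np.tanh(c)
    return h, c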

Zhou et al. [52] first attempted to apply LSTM to HSI classification and proposed the SSLSTMs model. The experimental results show that LSTM can also be used well for modeling long-range dependencies in the spectral domain. Nevertheless, it is worth noting that the two branches in SSLSTMs are independent of each other. In addition, unfolding the original HSI data into one-dimensional vectors as the input of SSLSTMs actually loses the intrinsic structure of the hyperspectral data, since spatial information is not considered in LSTM.

C. ConvLSTM

Considering the shortcomings of LSTM, a modified and extended version of LSTM, i.e., ConvLSTM [47], was developed. Different from LSTM, the input-to-state and state-to-state transitions in ConvLSTM are realized by convolution. ConvLSTM holds the same gate structure as LSTM and can be used to model long-term dependencies in the time domain or the spectral domain, and its calculation formulas can be expressed as

i_t = σ(W_{xi} ∗ X_t + W_{hi} ∗ H_{t−1} + W_{ci} ◦ C_{t−1} + b_i)
f_t = σ(W_{xf} ∗ X_t + W_{hf} ∗ H_{t−1} + W_{cf} ◦ C_{t−1} + b_f)
C̃_t = tanh(W_{xc} ∗ X_t + W_{hc} ∗ H_{t−1} + b_c)
C_t = f_t ◦ C_{t−1} + i_t ◦ C̃_t
o_t = σ(W_{xo} ∗ X_t + W_{ho} ∗ H_{t−1} + W_{co} ◦ C_t + b_o)
H_t = o_t ◦ tanh(C_t)    (3)

where X_t denotes the input of the current cell, and C_{t−1} and H_{t−1} are the state and output of the last cell, respectively. ∗ denotes the convolution operation, and W denotes a 2-D convolution filter with a k × k kernel, where k is the size of the convolution kernel. The definitions of W_{·i}, W_{·f}, W_{·o}, b_i, b_f, and b_o are similar to those in (2); however, the data dimensions and processing methods are different.
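The change from (2) to (3) is only that the input-to-state and state-to-state products become convolutions over feature maps. A minimal NumPy sketch of one ConvLSTM step follows; the 'same'-padding helper and the shapes of the entries in p are assumptions for illustration.

import numpy as np
from scipy.signal import correlate2d

def conv(X, W):
    # 'same' 2-D convolution: X is (Cin, H, W), W is (Cout, Cin, k, k).
    return np.stack([sum(correlate2d(X[c], W[o, c], mode="same")
                         for c in range(X.shape[0]))
                     for o in range(W.shape[0])])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convlstm_step(X_t, H_prev, C_prev, p):
    # One ConvLSTM step following (3); peephole terms and biases broadcast over maps.
    i = sigmoid(conv(X_t, p["Wxi"]) + conv(H_prev, p["Whi"]) + p["Wci"] * C_prev + p["bi"])
    f = sigmoid(conv(X_t, p["Wxf"]) + conv(H_prev, p["Whf"]) + p["Wcf"] * C_prev + p["bf"])
    C_tilde = np.tanh(conv(X_t, p["Wxc"]) + conv(H_prev, p["Whc"]) + p["bc"])
    C = f * C_prev + i * C_tilde
    o = sigmoid(conv(X_t, p["Wxo"]) + conv(H_prev, p["Who"]) + p["Wco"] * C + p["bo"])
    return o * np.tanh(C), C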

Different from the traditional way of extracting spatial features based on the sliding window in CNN, the ConvLSTM cell holds three special gate mechanisms to complete data transmission and processing, which makes it possible to utilize the spatial information of HSIs more effectively.


Fig. 1. Proposed SSCL2DNN, labeled by (1) in blue. In particular, SSCL2DNN can be transformed into another deep model when K equals 1 and τ equals 1, which is called SaCL2DNN, labeled by (2) with the red dotted boxes.

Simultaneously, the ConvLSTM layer, constructed with the ConvLSTM cell in (3) as the basic unit, can not only implement inter-layer data transmission and processing but also execute those of the intra-layer, which is another great difference from CNN. This special structure enables the ConvLSTM layer to extract more effective feature representations than CNN. Furthermore, compared with LSTM, the implementations of the three gate mechanisms are extended from the one-dimensional to the multidimensional convolution operation, and this change can not only capture the spatial context information of the original data, similar to the convolutional layer in CNN, but also model the long-range dependencies in the time domain of video or the spectral domain of HSIs.

III. SSCL2DNN

In SSLSTMs [52], the spatial and spectral features are not effectively fused, and the spatial structure information is not well preserved. As for Bi-CLSTM [56], the correlations between different spectral bands are not fully utilized. In order to address the above issues, a novel ConvLSTM-based spatial–spectral feature extraction model is proposed. It should be noted that the ConvLSTM given in Section II-C is actually a 2-D extension of LSTM; in what follows, to distinguish it from the 3-D extension in Section IV, the ConvLSTM involved in this section is named ConvLSTM2D.

A. SSCL2DNN

On account of the above analysis, a joint spatial–spectral feature extraction and classification model based on ConvLSTM2D is constructed, which is shown in Fig. 1, labeled by (1) in blue.

The core structure of our proposed SSCL2DNN model consists of ConvLSTM2D layers and pooling layers. Based on the basic ConvLSTM2D cell in (3), a ConvLSTM2D layer is built, and a 3-D data cube needs to be decomposed into a spectral sequence, which is then input into each memory cell of the ConvLSTM2D layer one by one. However, there is much redundant information in the original HSI data cube, and the more data used, the greater the complexity involved. Unlike [22], PCA is selected as the preprocessing method to implement dimension reduction of the data.

First of all, the size of the original HSI data cube can be denoted as W × H × D, where D indicates the number of spectral bands, and W and H are the width and height of the HSI, respectively. In order to reduce computational complexity, the first K components after PCA are selected as the spectral information of each pixel x_i. Furthermore, for each ConvLSTM2D layer in SSCL2DNN, considering that the spatial context information is beneficial to HSI classification, the data in a local spatial window of size s × s are extracted as the spatial information for extracting spatial–spectral features, which is the input of each memory cell in the ConvLSTM2D layer. After data preprocessing, a 3-D input of size s × s × K is constructed. In particular, suppose that the special dimension time_step in ConvLSTM2D is expressed as a variable τ; it needs to be fixed as K to maintain the same dimension as the input data. Concretely, the 3-D input of each pixel x_i is decomposed into K 2-D components and converted into a sequence of length K, i.e., {X_i^1, ..., X_i^k, ..., X_i^K}, k ∈ {1, 2, ..., K}, where X_i^k denotes the k-th component of the pixel x_i. Then, this sequence is fed into the ConvLSTM2D layer one by one.
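As an illustration of this preprocessing, the sketch below turns an HSI cube into per-pixel spectral sequences of 2-D patches; the reflect padding, the helper name, and the use of scikit-learn's PCA are assumptions, not the authors' implementation.

import numpy as np
from sklearn.decomposition import PCA

def hsi_to_sequences(cube, s=27, K=10):
    # cube: (W, H, D) HSI. Returns (W*H, K, s, s, 1): for each pixel, a length-K
    # sequence of s x s single-channel maps, fed to the ConvLSTM2D layer one by one.
    Wd, Hd, D = cube.shape
    pcs = PCA(n_components=K).fit_transform(cube.reshape(-1, D)).reshape(Wd, Hd, K)
    r = s // 2
    padded = np.pad(pcs, ((r, r), (r, r), (0, 0)), mode="reflect")
    seqs = []
    for i in range(Wd):
        for j in range(Hd):
            patch = padded[i:i + s, j:j + s, :]               # s x s x K local window
            seqs.append(patch.transpose(2, 0, 1)[..., None])  # K components, band by band
    return np.asarray(seqs, dtype=np.float32)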

In order to compress the feature maps generated by the ConvLSTM2D layer and reduce the computational complexity, pooling layers are also used in the proposed deep model. After cascading multiple ConvLSTM2D layers and pooling layers, the final output of the last ConvLSTM2D layer in the proposed SSCL2DNN model is the desired feature representation, which is subsequently fed into the classification layer to obtain the final result. Specifically, according to [22] and [26], small convolution kernels are efficient for yielding better classification performance. As such, the size of the convolution kernels in our proposed deep model can be set as 4 × 4 or 3 × 3, and a kernel size of 2 × 2 is adopted for the pooling layer.
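A compact sketch of such a stack is given below using the modern tf.keras API (the article itself uses Tflearn on TensorFlow 1.4); the two-layer depth, filter counts, and kernel sizes are illustrative choices, not the exact settings of Tables III–V.

import tensorflow as tf
from tensorflow.keras import layers, models

K, s, N = 10, 27, 16   # principal components, window size, classes (assumed values)

model = models.Sequential([
    layers.Input(shape=(K, s, s, 1)),          # length-K spectral sequence of patches
    layers.ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=True),
    layers.TimeDistributed(layers.MaxPooling2D(pool_size=2)),
    layers.ConvLSTM2D(64, kernel_size=3, padding="same", return_sequences=False),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(N, activation="softmax"),     # classification layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])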

Due to the exceptional inner structure of LSTM and its extended architecture, if we set τ to 1 and K to 1 in each ConvLSTM2D layer and convert the input from 3-D data to 2-D data, the proposed deep model reduces to the spatial ConvLSTM 2-D neural network (SaCL2DNN), which is shown in Fig. 1, labeled by (2) with the red dotted boxes. This transformation is similar to that from 3-D CNN to 2-D CNN; however, SaCL2DNN can utilize the spatial structure information of hyperspectral data more effectively to obtain better classification performance than 2-D CNN.

Particularly, the proposed deep models can not only implement the same data transmission between layers as CNN does but also accomplish effective data transmission and processing within each ConvLSTM2D layer.


It is known that adjacent spectral bands are highly correlated, and there may also be some correlations between nonadjacent spectral bands [48]. Therefore, by effectively modeling long-range dependencies in the spectral field, SSCL2DNN can jointly consider the spatial–spectral information, which addresses the problems of insufficient feature fusion caused by using two independent subbranches and the loss of spatial information in SSLSTMs [52], as well as the underutilization of the correlation between different spectral bands in Bi-CLSTM [56].

B. Loss Function and Optimization Method

Suppose that there are L ConvLSTM2D layers and pooling layers in SSCL2DNN. In the first ConvLSTM2D layer, each 2-D component X_i^k in the spectral sequence {X_i^1, ..., X_i^k, ..., X_i^K} is the input of the k-th memory cell, which is actually X_t in (3), and there are then two kinds of outputs, i.e., the outputs {H_{1i}^1, ..., H_{1i}^k, ..., H_{1i}^K} of the K memory cells and the state and output {C_{1i}, H_{1i}} yielded by modeling long-term dependencies, which are actually {C_t, H_t} in (3). The outputs of the K memory cells are retained as the inputs of the next layer to extract high-level features. After L ConvLSTM2D layers and pooling layers, the output H_{Li} of the last ConvLSTM2D layer, generated by modeling long-range dependencies in the spectral field, is the desired spatial–spectral feature. After reducing dimensions through the last pooling layer, the spatial–spectral features H_{Li} are converted into 1-D vectors h_{Li} and then fed into a fully connected layer that maps the feature space to the class label space. Finally, the feature vectors h_{Li} are input to a softmax function to predict the conditional probability distribution

P(y = c | h_{Li}, W, b) = exp(W_c h_{Li} + b_c) / Σ_{j=1}^{N} exp(W_j h_{Li} + b_j)

of each class c, where c ∈ {1, 2, ..., N}, and N is the number of classes in the HSI data set.

In addition, the cross entropy [52] is used as the loss function to obtain the final classification results, and it is optimized by the adaptive moment estimation (ADAM) algorithm.
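For reference, a small NumPy sketch of the softmax prediction and the cross-entropy loss for one feature vector follows; the shapes are illustrative.

import numpy as np

def softmax_cross_entropy(h, W, b, y):
    # h: feature vector h_Li; W: (N, len(h)) weights; b: (N,) biases; y: true class.
    logits = W @ h + b
    logits -= logits.max()                     # subtract max for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()  # P(y = c | h, W, b) for each class c
    return p, -np.log(p[y])                    # predicted distribution, cross entropy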

IV. SSCL3DNN

In order to better preserve the intrinsic structure information of the hyperspectral data, in this section, the 3-D extended structure (ConvLSTM3D) is further developed, and another novel deep feature extraction and classification model, i.e., SSCL3DNN, is proposed, which can yield more discriminative spatial–spectral features for further improving the classification performance.

A. ConvLSTM3D

On the basis of ConvLSTM2D, the 3-D extended version (ConvLSTM3D) is further developed, in which there are also three gate units whose calculation formulas are similar to those in ConvLSTM2D. Different from ConvLSTM2D, however, the whole 3-D data cube is taken as the input of each memory cell in ConvLSTM3D. In particular, the input X_t, the states C_{t−1} and C_t, the outputs H_{t−1} and H_t, and the gate units i_t, f_t, and o_t of ConvLSTM3D are 4-D tensors whose last three dimensions are the spectral dimension and the two spatial dimensions, and the convolution filters W_{·i}, W_{·f}, and W_{·o} are 3-D tensors. The structure of the ConvLSTM3D layer is illustrated in Fig. 3. Specifically, the equations of the ConvLSTM3D cell can be written as

i_t = σ(W_{xi} ⊛ X_t + W_{hi} ⊛ H_{t−1} + W_{ci} ◦ C_{t−1} + b_i)
f_t = σ(W_{xf} ⊛ X_t + W_{hf} ⊛ H_{t−1} + W_{cf} ◦ C_{t−1} + b_f)
C̃_t = tanh(W_{xc} ⊛ X_t + W_{hc} ⊛ H_{t−1} + b_c)
C_t = f_t ◦ C_{t−1} + i_t ◦ C̃_t
o_t = σ(W_{xo} ⊛ X_t + W_{ho} ⊛ H_{t−1} + W_{co} ◦ C_t + b_o)
H_t = o_t ◦ tanh(C_t)    (4)

where ⊛ denotes the defined 3-D convolution between a 4-D input or output and a 3-D convolution filter.

Compared with the LSTM cell in (2) and the ConvLSTM2D cell in (3), the extended ConvLSTM3D cell has a similar calculation model; the main difference is the computation of the convolution in each gate unit. Take the input gate of the ConvLSTM3D cell as an example. Suppose that X_t ∈ R^{τ_t × w_t × h_t × s_t} is the input of i_t, which is the t-th component in a sequence decomposed from the input of the ConvLSTM3D layer according to the dimension time_step, and W_{xi} ∈ R^{k1 × k2 × d}, where τ_t, w_t, h_t, s_t, k1, k2, and d are the dimension time_step, the width, the height, the number of spectral bands, the kernel sizes, and the kernel depth, respectively. We define the 3-D convolution of X_t and W_{xi} as W_{xi} ⊛ X_t, where the output u_{τ_t l_x l_y l_z} yielded by X_t at position (l_x, l_y, l_z) in the input gate is defined as

u_{τ_t l_x l_y l_z} = Σ_{j=1}^{k1} Σ_{p=1}^{k2} Σ_{q=1}^{d} W_{xi, jpq} X_{t, τ_t (l_x+j)(l_y+p)(l_z+q)}.    (5)
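Up to the index shift, (5) is a standard 'valid' 3-D cross-correlation of the input volume with the gate filter, which SciPy computes directly for N-D arrays; the sizes below are illustrative assumptions.

import numpy as np
from scipy.signal import correlate

X_t = np.random.randn(27, 27, 10)       # width x height x spectral bands (assumed)
W_xi = np.random.randn(3, 3, 3)         # k1 x k2 x d input-gate filter (assumed)
u = correlate(X_t, W_xi, mode="valid")  # one value u at each position (lx, ly, lz)
print(u.shape)                          # (25, 25, 8)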

B. SSCL3DNN

According to the extraordinary structure of ConvLSTM3D, we further extend SSCL2DNN to a new deep model, i.e., SSCL3DNN, based on ConvLSTM3D. Different from SSCL2DNN, the whole data cube is taken as the input of each memory cell in SSCL3DNN. Therefore, the intrinsic structure of HSIs can be captured well, and the spatial–spectral features can be well learned to further improve the HSI classification performance. Fig. 2 shows the structure of the proposed deep model, in which the parameter τ is fixed as 1.

In particular, PCA is also used to reduce the redundant information in this section, and similarly, the first K components are chosen. With regard to the size of the convolution kernel, small kernels are also considered, which can be set as 3 × 3 × 3 or 5 × 5 × 5 in the proposed deep model. Besides, the size of the pooling kernel in the pooling layer is set as 2 × 2 × 2. After the cascade of multiple ConvLSTM3D layers and pooling layers, the spatial–spectral features are extracted from the 3-D data cube and subsequently fed into the classification layer to obtain the classification results.
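Recent tf.keras releases provide a ConvLSTM3D layer, so a hedged sketch of such a stack (with τ = 1, i.e., a single time step holding the whole s × s × K cube) might look as follows; the depth, filter counts, and kernel sizes are again illustrative, not the article's exact settings.

import tensorflow as tf
from tensorflow.keras import layers, models

s, K, N = 27, 10, 16   # window size, principal components, classes (assumed values)

model = models.Sequential([
    layers.Input(shape=(1, s, s, K, 1)),       # time_step = 1: one whole 3-D cube
    layers.ConvLSTM3D(32, kernel_size=3, padding="same", return_sequences=True),
    layers.TimeDistributed(layers.MaxPooling3D(pool_size=2)),
    layers.ConvLSTM3D(64, kernel_size=3, padding="same", return_sequences=False),
    layers.MaxPooling3D(pool_size=2),
    layers.Flatten(),
    layers.Dense(N, activation="softmax"),
])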

Although the loss function and optimization method in this section are the same as those of SSCL2DNN in Section III, the spatial–spectral features H_{Li} and their 1-D vector form h_{Li} of each pixel x_i, as well as the data processing at each layer in SSCL3DNN, are quite different.


Fig. 2. Proposed SSCL3DNN structure.

Fig. 3. Illustration of the inner structure of ConvLSTM3D.


V. EXPERIMENTAL RESULTS

To quantitatively and qualitatively analyze the classification performance of the proposed models, SVM [57], 2-D CNN [22], 3-D CNN [22], SaLSTM [52], SSLSTMs [52], and Bi-CLSTM [56] are used as the comparative algorithms. In particular, the special case (SaCL2DNN) of SSCL2DNN is used to compare fairly with SVM, 2-D CNN, and SaLSTM. Three commonly used quantitative metrics are adopted, i.e., overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (κ). To eliminate the bias introduced by randomly choosing training samples, each experiment is repeated 10 times, and the mean values of each evaluation criterion are presented.

A. Hyperspectral Data Sets

Three common HSI data sets, i.e., Indian Pines, Salinas Valley, and the University of Pavia, are considered in our experiments, whose false-color maps, ground-truth maps, and the corresponding training sizes are shown in Fig. 4.

1) Indian Pines: The Indian Pines data set was captured in 1992 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Northwestern Indiana, USA, and is mainly composed of multiple agricultural fields. The spatial size of this data set is 145 × 145 pixels with a spatial resolution of 20 meters per pixel (mpp), and there are 224 spectral bands in the wavelength range from 0.4 to 2.5 μm. Since some bands cover the water absorption region, they, together with four null bands, are removed, generally leaving 200 bands for study. After removing the background pixels, 10 249 pixels are reserved, which contain useful ground-truth information from 16 different class labels.

2) Salinas Valley: The Salinas Valley data set was collected by the 224-band AVIRIS sensor over Salinas Valley, California. This data set is made up of 512 lines and 217 columns and contains 16 ground-truth classes. After removing the water absorption bands and noise-affected bands, 204 spectral bands are preserved.

3) University of Pavia: The University of Pavia data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, Northern Italy. After removing several noise-corrupted bands, there are 103 spectral bands in the spectral range from 0.43 to 0.86 μm, and the image has a size of 610 × 340 pixels with a spatial resolution of 1.3 mpp. Different from the Indian Pines and Salinas Valley data sets, this data set contains nine distinguishable classes.

B. Experimental Settings

In order to reduce the computational complexity, PCA is used as a preprocessing method to reduce the data dimensions. For SVM, 2-D CNN, SaLSTM, and SaCL2DNN, the first principal component is preserved as the input. As far as SSCL2DNN and SSCL3DNN are concerned, the top K components are extracted as the spectral features.

All the parameters of the compared methods are set according to [22], [52], [56], and [57] to achieve quasi-optimal performance. Specifically, for SVM, the radial basis function (RBF) is used in the Libsvm toolbox [57]. There are two key parameters in SVM, i.e., C and γ, which denote the regularization parameter and the kernel function parameter, respectively. According to [57], fivefold cross-validation is adopted to tune C and γ over the ranges {2^−5, 2^−4, ..., 2^19} and {2^−15, 2^−14, ..., 2^4}, respectively. For Bi-CLSTM [56], the kernel size and the number of feature maps in each ConvLSTM2D layer are fixed as 3 and 32, respectively. As for SeLSTM, SaLSTM, and SSLSTMs [52], the numbers of output nodes in SeLSTM and SaLSTM are set to 64 and 128, respectively, for the Indian Pines and Salinas Valley data sets, and to 128 and 256, respectively, for the University of Pavia data set.
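An equivalent grid search with scikit-learn (the article uses the Libsvm toolbox; this sketch only mirrors the stated search ranges) could be written as:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": 2.0 ** np.arange(-5, 20),       # {2^-5, ..., 2^19}
              "gamma": 2.0 ** np.arange(-15, 5)}   # {2^-15, ..., 2^4}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # fivefold CV
# search.fit(train_features, train_labels)  # hypothetical training arrays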

For SSCL2DNN and SSCL3DNN, there are four key parameters to be determined, i.e., the size (s × s) of the local window, the number (K) of principal components after PCA, the size (k × k) of the convolution kernel, and the number (M) of feature maps at each ConvLSTM layer.


Fig. 4. (Left) False-color maps and (Right) ground-truth maps of three HSI data sets. (a) Indian Pines (bands 20, 40, and 60). (b) Salinas Valley (bands 46, 27, and 10). (c) University of Pavia (bands 47, 27, and 13). (d)–(f) Number of training samples.

TABLE I

SENSITIVITY COMPARISON AND ANALYSIS UNDER DIFFERENT SIZES (s × s) OF LOCAL WINDOWS

First, K is fixed as 10. Then, s is chosen from {21, 23, 25, 27, 29, 31, 33}, k from {3, 4, 5}, and M from three given combinations {16, 32}, {32, 64}, and {64, 128}. The experimental results under different sizes of the local window are reported in Table I, from which it can be seen that for these three data sets, the optimal s in SSCL2DNN is 27, while for SSCL3DNN, it is 27, 31, and 31, respectively. Although SSCL3DNN can obtain 0.46% and 0.32% gains when s varies from 27 to 31 for the Salinas Valley and University of Pavia data sets, the computational complexity and runtime will obviously increase. Based on the above analysis, the size of the local window for the proposed models is fixed as 27 × 27, which can not only achieve satisfactory classification performance but also provide convenience for the practical application of the proposed models. In particular, the size of the local window in SaCL2DNN is the same as that in SSCL2DNN.

TABLE II

SENSITIVITY COMPARISON AND ANALYSIS UNDER DIFFERENT NUMBERS (K) OF PRINCIPAL COMPONENTS AFTER PCA

After that, experiments analyzing the influence of different K on the performance are completed, where K is searched from {5, 10, 15, 20} for the Indian Pines and University of Pavia data sets, and from {5, 10, 15} for the Salinas Valley data set because of memory limits. The results in Table II reveal that for these three data sets, the optimal K in SSCL3DNN is 10, while for SSCL2DNN, it is 5, 10, and 5, respectively.

In order to ensure the convergence of the loss function, the number of training epochs is fixed as 2000, and the learning rate is set as 0.0001 from epoch 1 to 2000 for the proposed deep models. More detailed parameter settings are summarized in Tables III–V.

For the hardware configuration, all the following experiments are completed on a desktop with an 8th Generation Intel Core i7-8700 processor running at up to 3.7 GHz, 16 GB of DDR4 RAM with a serial speed of 2400 MHz, an Nvidia GeForce GTX 1080ti GPU with 11-GB memory, and a 240-GB Intel SSD D3-S4510. For the software configuration, we adopt Ubuntu 16.04.4 x64 as our operating system for all experiments. CUDA 8.0 and cuDNN 7.0.5, Tflearn 0.3.2, TensorFlow-GPU 1.4.0, and Python 3.5.4 are the main components of the programming environment.


TABLE III

PARAMETER SETTINGS OF SACL2DNN

TABLE IV

PARAMETER SETTINGS OF THE PROPOSED SSCL2DNN

TABLE V

PARAMETER SETTINGS OF THE PROPOSED SSCL3DNN

Specifically, all methods involved in our experiments are run under Anaconda 3.4.

It should be noted that Tflearn is a modular and transparent deep learning library built on TensorFlow that provides a higher-level API than TensorFlow. The combination of Tflearn and TensorFlow constitutes the core framework of the proposed deep models.

C. Classification Performance

In order to show the superiority of the proposed deep models, we randomly select 1% of the available labeled samples for training for both the Salinas Valley and University of Pavia data sets and 10% for the Indian Pines data set. The remaining samples are used for testing.

According to the experimental settings in Section V-B, the quantitative assessments of all models are reported in Tables VI–VIII, from which it can be seen that the proposed SSCL2DNN and SSCL3DNN models provide better classification performance than the other considered models. First of all, the three special gate mechanisms allow ConvLSTM to make fuller use of both the spatial and spectral information of HSIs than CNN, which operates with the traditional sliding-window method. What is more, the implementations of the gate mechanisms in ConvLSTM are extended from the one-dimensional to the multidimensional convolution operation, which enables ConvLSTM to better preserve and capture the spatial context information of HSIs and to fuse the spatial and spectral information more effectively than LSTM. Specifically, on the one hand, compared with 2-D CNN, SaCL2DNN improves OA by 0.86%, 5.03%, and 4.80% for the Indian Pines, Salinas Valley, and University of Pavia data sets, respectively, since it can better capture spatial context information from hyperspectral data. It should be noted that it is the loss of the spatial information that leads to unsatisfactory performance for SVM and SaLSTM. In addition, SaCL2DNN is a special case of SSCL2DNN, and compared with it, SSCL2DNN obtains 0.96%, 1.62%, and 4.78% gains in OA for these three data sets. For Bi-CLSTM, although ConvLSTM2D is also the core backbone for feature extraction, simply cascading all the outputs not only fails to fully exploit the correlations between different spectral bands but also easily leads to the overfitting problem. Different from it, SSCL2DNN is built by alternately cascading ConvLSTM2D layers and pooling layers, in which the outputs yielded by modeling long-range dependencies are the final feature representations. To some extent, this approach can not only make great use of the characteristics of ConvLSTM2D but also reduce the number of features and the complexity of the model; the gains in OA yielded by SSCL2DNN are 2.41%, 0.58%, and 6.25% for these three HSI data sets, respectively, which verifies the effectiveness of SSCL2DNN and demonstrates that joint learning of spatial–spectral features by modeling long-term dependencies in the spectral field can provide higher classification performance.

On the other hand, SSCL3DNN fuses the spatial and spectral information more effectively through its special 3-D operation, which preserves the intrinsic structure of hyperspectral data to further improve the classification performance of SSCL2DNN, obtaining 0.76%, 2.99%, and 5.63% gains in OA for these three HSI data sets, respectively. Compared with 3-D CNN, the improvements in OA provided by SSCL3DNN are 0.51%, 2.12%, and 7.96% for the three data sets, respectively. In particular, SSCL3DNN produces remarkable gains for the Salinas Valley and University of Pavia data sets. From Tables VI–VIII, it can be seen that the 3-D extended architecture of LSTM makes it possible for SSCL3DNN to generate better classification performance by preserving the intrinsic structure of hyperspectral data, and the special gate structures enable SSCL3DNN to extract more discriminative spatial–spectral features. In addition, the classification performance on some classes with high correlation, such as classes 10, 11, and 12 in the Indian Pines data set, classes 8 and 15 in the Salinas Valley data set, and classes 1, 6, and 8 in the University of Pavia data set, is improved, which demonstrates the superiority of the proposed models.

Corresponding to Tables VI–VIII, similar conclusions can be drawn from the classification maps presented in Figs. 5–7, from which it is obvious that the maps provided by the proposed models are closest to the ground-truth maps for these three data sets. In addition, there are fewer misclassifications, and the boundaries of each class are better recognized, especially for classes 2, 10, and 11 in Fig. 5, classes 8 and 15 in Fig. 6, and classes 1, 6, and 8 in Fig. 7, which further verifies the effectiveness of the proposed models.


TABLE VI

CLASSIFICATION RESULTS FOR THE INDIAN PINES DATA SET USING 10% TRAINING SAMPLES

TABLE VII

CLASSIFICATION RESULTS FOR THE SALINAS VALLEY DATA SET USING 1% TRAINING SAMPLES

TABLE VIII

CLASSIFICATION RESULTS FOR THE UNIVERSITY OF PAVIA DATA SET USING 1% TRAINING SAMPLES

It is well known that obtaining labeled samples is expensive and difficult. Therefore, it is necessary to investigate the performance under a small training size.

D. Sensitivity Comparison and Analysis Under Small Samples

In order to further demonstrate the performance of the proposed deep models, we randomly select ten samples from each labeled class to construct smaller training sets.

The experimental results are reported in Tables IX and X, from which it can be observed that even with a small number of training samples, the proposed deep models show better classification performance. Compared with 2-D CNN, the improvements in OA yielded by SaCL2DNN are 4.39%, 5.34%, and 16.46%, respectively, and SSCL2DNN improves on SaCL2DNN, obtaining 3.12%, 15.43%, and 3.83% gains in OA for these three HSI data sets, respectively, which further verifies the validity of joint learning of the spatial–spectral features. However, Bi-CLSTM obtains better performance than SSCL2DNN in this setting, yielding 1.32%, 0.96%, and 1.90% gains in OA, respectively. The main reason is that the way of cascading all the outputs of each ConvLSTM2D layer allows Bi-CLSTM to provide more available features than SSCL2DNN.


Fig. 5. Classification maps for the Indian Pines data set. (a) Ground-truth map. (b) SVM. (c) 2-D CNN. (d) SaLSTM. (e) SaCL2DNN. (f) SSLSTMs. (g) Bi-CLSTM. (h) SSCL2DNN. (i) 3-D CNN. (j) SSCL3DNN.

TABLE IX

CLASSIFICATION RESULTS OF TRAINING UNDER SMALL SAMPLES (TEN SAMPLES FOR EACH CLASS)

TABLE X

CLASSIFICATION RESULTS OF TRAINING UNDER SMALL SAMPLES (TEN SAMPLES FOR EACH CLASS)

In particular, it is the defects of Bi-CLSTM and SSCL2DNN that destroy the intrinsic structure of the hyperspectral data when each component of the local patch is taken as the input of each memory cell.

SSCL3DNN overcomes this shortcoming and yields the best classification performance, obtaining 2.07%, 4.96%, and 22.61% gains in OA for these three data sets, respectively, when compared with 3-D CNN. More detailed results are reported in Tables IX and X, which further demonstrate the advantages of the proposed models.


Fig. 6. Classification maps for the Salinas Valley data set. (a) Ground-truth map. (b) SVM. (c) 2-D CNN. (d) SaLSTM. (e) SaCL2DNN. (f) SSLSTMs. (g) Bi-CLSTM. (h) SSCL2DNN. (i) 3-D CNN. (j) SSCL3DNN.

Fig. 7. Classification maps for the University of Pavia data set. (a) Ground-truth map. (b) SVM. (c) 2-D CNN. (d) SaLSTM. (e) SaCL2DNN. (f) SSLSTMs. (g) Bi-CLSTM. (h) SSCL2DNN. (i) 3-D CNN. (j) SSCL3DNN.


To further show the effectiveness of the proposed deep models, 20, 30, and 40 samples from each class are randomly extracted for training in these three data sets. In particular, the number of training samples for classes 7 and 9 in the Indian Pines data set is fixed as 10. The OA curves of all models under different numbers of training samples are provided in Fig. 8, from which it can be seen that for these three HSI data sets, SSCL3DNN provides the highest classification accuracy. It is worth noting that SSCL3DNN obtains significant performance improvements on the Salinas Valley and University of Pavia data sets. In particular, compared with Bi-CLSTM, SSCL2DNN achieves higher accuracy when the number of training samples is greater than or equal to 20, which suggests that using ConvLSTM2D to model long-range dependencies between different spectral bands may be more effective than simply cascading all the outputs.


Fig. 8. OA (%) of the considered methods with different numbers of training samples for three HSI data sets. (a) Indian Pines. (b) Salinas Valley. (c) University of Pavia.

The results in Tables IX and X and Fig. 8 further demonstrate the effectiveness of the proposed deep models; the design of the special gate structure makes it possible for them to better capture spatial information and preserve the intrinsic structure information of the original data, which is extremely important for improving the classification performance.

VI. CONCLUSION

In this article, two novel deep ConvLSTM neural networks, i.e., SSCL2DNN and SSCL3DNN, have been proposed to extract more effective and discriminative spatial–spectral features for HSI classification. In SSCL2DNN, the local patch is taken as a spectral sequence and fed into each memory cell band by band, and the outputs obtained by modeling long-range dependencies in the spectral domain are the spatial–spectral features, which can reduce the number of features to alleviate the overfitting problem to a certain extent while achieving satisfactory classification performance. By further developing the 3-D extended structure of LSTM, SSCL3DNN can better preserve the intrinsic structure of hyperspectral data, and the special gate mechanisms enable it to extract more discriminative spatial–spectral features, which can further improve the classification performance. The experimental results on three widely used HSI data sets show that the proposed deep models offer competitive advantages over state-of-the-art approaches, especially in the case of a small training size.

ACKNOWLEDGMENT

The authors would like to thank the associate editor and the anonymous reviewers for their valuable comments and suggestions that greatly helped to improve the quality and presentation of this article.

REFERENCES

[1] D. Landgrebe, “Hyperspectral image data analysis,” IEEE Signal Process. Mag., vol. 19, no. 1, pp. 17–28, Jan. 2002.

[2] J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. M. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data analysis and future challenges,” IEEE Geosci. Remote Sens. Mag., vol. 1, no. 2, pp. 6–36, Jun. 2013.

[3] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson, “Advances in hyperspectral image classification: Earth monitoring with statistical learning methods,” IEEE Signal Process. Mag., vol. 31, no. 1, pp. 45–54, Jan. 2014.

[4] X. Zhang, Y. Sun, K. Shang, L. Zhang, and S. Wang, “Crop classification based on feature band set construction and object-oriented approach using hyperspectral images,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 9, pp. 4117–4128, Sep. 2016.

[5] F. M. Lacar, M. M. Lewis, and I. T. Grierson, “Use of hyperspectral imagery for mapping grape varieties in the Barossa Valley, South Australia,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Sydney, NSW, Australia, Jul. 2001, pp. 2875–2877.

[6] A. Ghiyamat and H. Z. Shafri, “A review on hyperspectral remote sensing for homogeneous and heterogeneous forest biodiversity assessment,” Int. J. Remote Sens., vol. 31, no. 7, pp. 1837–1856, 2010.

[7] F. van der Meer, “Analysis of spectral absorption features in hyperspectral imagery,” Int. J. Appl. Earth Observ. Geoinf., vol. 5, no. 1, pp. 55–68, Feb. 2004.

[8] L. Pan, H.-C. Li, W. Li, X.-D. Chen, G.-N. Wu, and Q. Du, “Discriminant analysis of hyperspectral imagery using fast kernel sparse and low-rank graph,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 11, pp. 6085–6098, Nov. 2017.

[9] G. Camps-Valls, L. Gomez-Chova, J. Munoz-Mari, J. Vila-Frances, and J. Calpe-Maravilla, “Composite kernels for hyperspectral image classification,” IEEE Geosci. Remote Sens. Lett., vol. 3, no. 1, pp. 93–97, Jan. 2006.

[10] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[11] M. Cui and S. Prasad, “Class-dependent sparse representation classifier for robust hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2683–2695, May 2015.

[12] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral image classification using dictionary-based sparse representation,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3973–3985, Oct. 2011.

[13] H. Zhang, J. Li, Y. Huang, and L. Zhang, “A nonlocal weighted joint sparse representation classification method for hyperspectral imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2056–2065, Jun. 2014.

[14] C. Chen, N. Chen, and J. Peng, “Nearest regularized joint sparse representation for hyperspectral image classification,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 3, pp. 424–428, Mar. 2016.

[15] J. Peng and Q. Du, “Robust joint sparse representation based on maximum correntropy criterion for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 12, pp. 7152–7164, Dec. 2017.

[16] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.

[17] G. Zhu, F. Porikli, and H. Li, “Robust visual tracking with deep convolutional neural network based object proposals on PETS,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Las Vegas, NV, USA, Jun. 2016, pp. 1265–1272.

[18] A. N. Shuaibu, A. S. Malik, and I. Faye, “Adaptive feature learning CNN for behavior recognition in crowd scene,” in Proc. IEEE Int. Conf. Signal Image Process. Appl., Kuching, Malaysia, Sep. 2017, pp. 357–361.

[19] F. Ratle, G. Camps-Valls, and J. Weston, “Semisupervised neural networks for efficient hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2271–2282, May 2010.

[20] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2094–2107, Jun. 2014.

[21] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, “Deep convolutional neural networks for hyperspectral image classification,” J. Sensors, vol. 2015, pp. 1–12, 2015.

[22] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 6232–6251, Oct. 2016.

[23] W. Li, G. Wu, F. Zhang, and Q. Du, “Hyperspectral image classification using deep pixel-pair features,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 844–853, Feb. 2017.

[24] W. Shao and S. Du, “Spectral-spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4544–4554, Oct. 2016.

[25] Z. Zhong, J. Li, Z. Luo, and M. Chapman, “Spectral-spatial residual network for hyperspectral image classification: A 3-D deep learning framework,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 847–858, Feb. 2018.

[26] Y. Li, H. Zhang, and Q. Shen, “Spectral-spatial classification of hyperspectral imagery with 3D convolutional neural network,” Remote Sens., vol. 9, no. 1, p. 67, 2017.

[27] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, “A new deep convolutional neural network for fast hyperspectral image classification,” ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 120–147, Nov. 2018.

[28] W. Song, S. Li, L. Fang, and T. Lu, “Hyperspectral image classification with deep feature fusion network,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 6, pp. 3173–3184, Jun. 2018.

[29] L. Fang, Z. Liu, and W. Song, “Deep hashing neural networks for hyperspectral image feature extraction,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 9, pp. 1412–1416, Sep. 2019, doi: 10.1109/LGRS.2019.2899823.

[30] R. Kemker and C. Kanan, “Self-taught feature learning for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2693–2705, May 2017.

[31] B. Liu, X. Yu, P. Zhang, X. Tan, A. Yu, and Z. Xue, “A semi-supervised convolutional neural network for hyperspectral image classification,” Remote Sens. Lett., vol. 8, no. 9, pp. 839–848, Sep. 2017.

[32] B. Liu, X. Yu, P. Zhang, A. Yu, Q. Fu, and X. Wei, “Supervised deep feature extraction for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 1909–1921, Apr. 2018.

[33] H. Zhang, Y. Li, Y. Zhang, and Q. Shen, “Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network,” Remote Sens. Lett., vol. 8, no. 5, pp. 438–447, 2017.

[34] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, “Deep learning for hyperspectral image classification: An overview,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 6690–6709, Sep. 2019, doi: 10.1109/TGRS.2019.2907932.

[35] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3859–3869.

[36] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, “Deep & dense convolutional neural network for hyperspectral image classification,” Remote Sens., vol. 10, no. 9, p. 1454, Sep. 2018.

[37] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2377–2385.

[38] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 855–868, May 2009.

[39] L. Mou, P. Ghamisi, and X. X. Zhu, “Deep recurrent neural networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3639–3655, Jul. 2017.

[40] B. Liu, X. Yu, A. Yu, P. Zhang, and G. Wan, “Spectral-spatial classification of hyperspectral imagery based on recurrent neural networks,” Remote Sens. Lett., vol. 9, no. 12, pp. 1118–1127, 2018.

[41] X. Zhang, Y. Sun, K. Jiang, C. Li, L. Jiao, and H. Zhou, “Spatial sequential recurrent neural network for hyperspectral image classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 11, pp. 4141–4155, Nov. 2018.

[42] R. Hang, Q. Liu, D. Hong, and P. Ghamisi, “Cascaded recurrent neural networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5384–5394, Aug. 2019, doi: 10.1109/TGRS.2019.2899129.

[43] L. Mou, L. Bruzzone, and X. X. Zhu, “Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 2, pp. 924–935, Feb. 2019.

[44] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.

[45] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. Adv. Neural Inf. Process. Syst., Montreal, QC, Canada, 2014, pp. 3104–3112.

[46] Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” 2015, arXiv:1506.00019. [Online]. Available: https://arxiv.org/abs/1506.00019

[47] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Proc. Adv. Neural Inf. Process. Syst., 2015.

[48] H. Wu and S. Prasad, “Convolutional recurrent neural networks for hyperspectral data classification,” Remote Sens., vol. 9, no. 3, p. 298, 2017.

[49] M. Rußwurm and M. Körner, “Multi-temporal land cover classification with sequential recurrent encoders,” ISPRS Int. J. Geo-Inf., vol. 7, no. 4, p. 129, Mar. 2018.

[50] A. Song, J. Choi, Y. Han, and Y. Kim, “Change detection in hyperspectral images using recurrent 3D fully convolutional networks,” Remote Sens., vol. 10, no. 11, p. 1827, Nov. 2018.

[51] M. Seydgar, A. A. Naeini, M. Zhang, W. Li, and M. Satari, “3-D convolution-recurrent networks for spectral-spatial classification of hyperspectral images,” Remote Sens., vol. 11, no. 7, p. 883, Jan. 2019.

[52] F. Zhou, R. Hang, Q. Liu, and X. Yuan, “Hyperspectral image classification using spectral-spatial LSTMs,” Neurocomputing, vol. 328, no. 7, pp. 39–47, Feb. 2017.

[53] M. Rußwurm and M. Körner, “Temporal vegetation modelling using long short-term memory networks for crop identification from medium-resolution multi-spectral satellite images,” in Proc. Conf. Comput. Vis. Pattern Recognit. Workshops, Honolulu, HI, USA, Jul. 2017, pp. 1496–1504.

[54] B. Romera-Paredes and P. H. S. Torr, “Recurrent instance segmentation,” in Proc. Eur. Conf. Comput. Vis., Amsterdam, The Netherlands, 2016, pp. 312–329.

[55] A. Ma, A. M. Filippi, Z. Wang, and Z. Yin, “Hyperspectral image classification using similarity measurements-based deep recurrent neural networks,” Remote Sens., vol. 11, no. 2, p. 194, Jan. 2019.

[56] Q. Liu, F. Zhou, R. Hang, and X. Yuan, “Bidirectional-convolutional LSTM based spectral-spatial feature learning for hyperspectral image classification,” Remote Sens., vol. 9, no. 12, pp. 17–28, Dec. 2017.

[57] C. C. Chang and C. J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, 2011.

Wen-Shuai Hu received the B.Sc. degree in electronic information science and technology from Shanxi Agricultural University, Jinzhong, China, in 2016. He is currently pursuing the Ph.D. degree in information and communication engineering with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China.

His research interest includes remote sensing image processing.

Heng-Chao Li (Senior Member, IEEE) received the B.Sc. and M.Sc. degrees in information and communication engineering from Southwest Jiaotong University, Chengdu, China, in 2001 and 2004, respectively, and the Ph.D. degree in information and communication engineering from the Graduate University of Chinese Academy of Sciences, Beijing, China, in 2008.

From 2013 to 2014, he was a Visiting Scholar with Prof. W. J. Emery at the University of Colorado at Boulder, Boulder, CO, USA. He is currently a Professor with the Sichuan Provincial Key Laboratory of Information Coding and Transmission, Southwest Jiaotong University. His research interests include the statistical analysis of synthetic aperture radar images, remote sensing image processing, and signal processing in communications.

Dr. Li received several scholarships or awards, including the Special Grade of the Financial Support from the China Post-Doctoral Science Foundation in 2009 and the New Century Excellent Talents in University from the Ministry of Education of China in 2011. He serves as an Associate Editor for the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING. He has been a Reviewer for several international journals and conferences, such as the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, IET Radar, Sonar and Navigation, IET Signal Processing, IET Image Processing, Pattern Recognition, the International Journal of Remote Sensing, Remote Sensing, and the Canadian Journal of Remote Sensing.

Lei Pan received the B.Sc. degree in communication engineering from the Shandong University of Science and Technology, Qingdao, China, in 2010, and the M.Sc. degree in communication and information system and the Ph.D. degree in signal and information processing from the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, in 2013 and 2019, respectively.

His research interest includes remote sensing image analysis and processing.

Wei Li (Senior Member, IEEE) received the B.Sc. degree in telecommunications engineering from Xidian University, Xi’an, China, in 2007, the M.Sc. degree in information science and technology from Sun Yat-sen University, Guangzhou, China, in 2009, and the Ph.D. degree in electrical and computer engineering from Mississippi State University, Starkville, MS, USA, in 2012.

Subsequently, he spent one year as a Post-Doctoral Researcher at the University of California at Davis, Davis, CA, USA. He was a Professor with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China, from 2013 to 2019. He is currently a Professor with the School of Information and Electronics, Beijing Institute of Technology, Beijing. His research interests include hyperspectral image analysis, pattern recognition, and data compression.

Dr. Li has served as a Guest Editor for special issues of the Journal of Real-Time Image Processing, Remote Sensing, and the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING (JSTARS). He received the 2015 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society (GRSS) for his service for the IEEE JSTARS and the Outstanding Paper Award at the IEEE International Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) in 2019. He is also serving as an Associate Editor for the IEEE SIGNAL PROCESSING LETTERS and the IEEE JSTARS.

Ran Tao (Senior Member, IEEE) received the B.Sc. degree from the Electronic Engineering Institute of PLA, Hefei, China, in 1985, and the M.Sc. and Ph.D. degrees from the Harbin Institute of Technology, Harbin, China, in 1990 and 1993, respectively.

He was a Senior Visiting Scholar with the University of Michigan, Ann Arbor, MI, USA, and the University of Delaware, Newark, DE, USA, in 2001 and 2016, respectively. He is currently a Professor with the School of Information and Electronics, Beijing Institute of Technology, Beijing, China. He has authored three books and more than 100 peer-reviewed journal articles. His current research interests include the fractional Fourier transform and its applications, theory, and technology for radar and communication systems.

Dr. Tao is a fellow of the Institution of Engineering and Technology and the Chinese Institute of Electronics. He is also a member of the Wireless Communication and Signal Processing Commission, URSI. He was a recipient of the National Science Foundation of China for Distinguished Young Scholars in 2006 and a Distinguished Professor of the Changjiang Scholars Program in 2009. He was a recipient of the First Prize of Science and Technology Progress awarded by the Ministry of Education in 2006 and 2007, and the First Prize of Natural Science awarded by the Ministry of Education in 2013. He has been a Chief Professor of the Creative Research Groups with the National Natural Science Foundation of China since 2014. He was a Chief Professor of the Program for Changjiang Scholars and Innovative Research Team in University from 2010 to 2012. He is also the Vice-Chair of the IEEE China Council and the International Union of Radio Science (URSI) China Council.

Qian Du (Fellow, IEEE) received the Ph.D. degree in electrical engineering from the University of Maryland at Baltimore, Baltimore, MD, USA, in 2000.

She is currently the Bobby Shackouls Professor with the Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS, USA. She is also an Adjunct Professor with the College of Surveying and Geo-informatics, Tongji University, Shanghai, China. Her research interests include hyperspectral remote sensing image analysis and applications, pattern classification, data compression, and neural networks.

Dr. Du is a fellow of the SPIE-International Society for Optics and Photonics. She was a recipient of the 2010 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society (GRSS). She served as the Co-Chair for the Data Fusion Technical Committee of the IEEE GRSS from 2009 to 2013. She was the Chair of the Remote Sensing and Mapping Technical Committee, International Association for Pattern Recognition, from 2010 to 2014. She was the General Chair for the fourth IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, Shanghai, in 2012. She has served as an Associate Editor for the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING (JSTARS), the Journal of Applied Remote Sensing, and the IEEE SIGNAL PROCESSING LETTERS. Since 2016, she has been the Editor-in-Chief of the IEEE JSTARS.
