
Classification of Medical Images in the Biomedical Literature by Jointly Using Deep and Handcrafted Visual Features

Jianpeng Zhang, Yong Xia, Member, IEEE, Yutong Xie, Michael Fulham, and David Dagan Feng, Fellow, IEEE

Abstract—The classification of medical images and illustrations from the biomedical literature is important for automated literature review, retrieval and mining. Although deep learning is effective for large-scale image classification, it may not be the optimal choice for this task, as there is only a small training dataset. We propose a combined deep and handcrafted visual feature (CDHVF) based algorithm that uses features learned by three fine-tuned and pre-trained deep convolutional neural networks (DCNNs) and two handcrafted descriptors in a joint approach. We evaluated the CDHVF algorithm on the ImageCLEF 2016 Subfigure Classification dataset and it achieved an accuracy of 85.47%, which is higher than the best performance of the other purely visual approaches listed in the challenge leaderboard. Our results indicate that handcrafted features complement the image representation learned by DCNNs on small training datasets and improve accuracy in certain medical image classification problems.

Index Terms—Medical image classification, deep convolutional neural network (DCNN), back-propagation neural network (BPNN), ensemble learning.

I. Introduction

The indispensable role of digital medical imaging in modern healthcare has resulted in a proliferation of digital images in electronic biomedical publications. These images vary in the degree of noise, spatial resolution, contrast and image intensity, and there is marked anatomical variability not only across modalities but also within modalities, related to the region being examined. The large image data repository and the image quality pose major challenges for image retrieval, review and assimilation for clinical care and research.

This work was supported in part by the National Natural Science Foundation of China under Grants 61471297 and 61771397, in part by the Seed Foundation of Innovation and Creation for Graduate Students in Northwestern Polytechnical University under Grant Z2017041, and in part by the Australian Research Council (ARC) Grants. (Corresponding author: Y. Xia)

J. Zhang, Y. Xia and Y. Xie are with the Shaanxi Key Lab of Speech & Image Information Processing (SAIIP), School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China, and with the Centre for Multidisciplinary Convergence Computing (CMCC), School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]).

M. Fulham is with the Department of Molecular Imaging, Royal Prince Alfred Hospital, NSW 2050, Australia, with the Sydney Medical School, University of Sydney, NSW 2006, Australia, and with the Centre for Multidisciplinary Convergence Computing (CMCC), School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]).

D. D. Feng is with the Biomedical and Multimedia Information Technology (BMIT) Research Group, School of Information Technologies, University of Sydney, NSW 2006, Australia, and with the Centre for Multidisciplinary Convergence Computing (CMCC), School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]).

Hence there has been considerable research directed at image classification to improve data mining in this area [1]. A number of investigators have reported solutions, in which visual feature extraction plays a pivotal role [3][4][5].

Commonly used visual features include single descriptors for color, texture and shape, and combined descriptors, such as the fuzzy color texture histogram [6] and the color edge direction descriptor [7]. Pelka et al. [8] combined different visual descriptors into a more discriminatory one. Megalooikonomou et al. [9] used depth-first string and Prüfer encoding to obtain a symbolic string representation of a tree's branching topology for the classification of tree-like structures in medical images. Recently, the bag-of-features (BoF) model has been used extensively [10][11][12][13][14]. Lazebnik et al. [13] proposed an extension to the BoF, called spatial pyramid matching (SPM), which uses the spatial order of local descriptors by partitioning an image into segments at different scales and computing the BoF histogram within each segment. Yang et al. [10] further improved upon this approach by computing a spatial-pyramid image representation based on sparse coding of scale-invariant feature transform (SIFT) descriptors. Wang et al. [11] presented a simple but effective coding scheme, called locality-constrained linear coding (LLC), to replace vector quantization. Although BoF and its variants are widely used, they suffer from an inadequate ability to characterize images and a lack of robust structure upon visual words [12]. To overcome these drawbacks, Xie et al. [12] combined texture- and edge-based local features at the feature extraction level and built geometric visual phrases to model spatial context for midlevel image representation.

Despite these improvements, the classification of medical images according to the different modalities by which they were produced and the classification of illustrations according to their production attributes remain the most challenging image classification problems. As an example (see Fig. 1), consider the separation of images from computed tomography (CT) and magnetic resonance (MR) imaging scanners: (1) Both CT and MR are anatomical imaging techniques that provide information about the structure of the body parts that are imaged. Hence, these images share many visual similarities and non-professionals can have difficulty in separating them. (2) Images from the same modality will differ depending upon the anatomical location and individual variability. The intra-class variation and inter-class similarity [15] pose major challenges for this image classification problem.



Fig. 1: Transaxial images, from left to right, of a CT brain, an upper abdominal CT, a FLAIR sequence of a brain MR and an upper abdominal MR (images obtained from the literature).

(3) Image spatial resolution, dynamic range and contrast are usually degraded by the publication process. In Fig. 1, there are two adjacent CT and MR images. From the visual appearance, the CT brain is more similar to the MR brain, but if grouped by modality the brain images should be grouped with the counterpart upper abdominal scans.

Since Hinton and colleagues' breakthrough [18] in ImageNet 2012 [16], [17], deep learning techniques have won this image classification challenge every year. Deep learning is widely acknowledged as a powerful tool for image classification and its development has continued. Wu et al. [19] developed a convolutional neural network (CNN) for natural image classification and auto-annotation. Xu et al. [20] combined deep learning with multi-instance learning for medical image analysis. Deep models have distinct advantages over traditional solutions: (1) they provide a uniform feature extraction-classification framework that frees users from troublesome handcrafted feature extraction and (2) they are particularly suitable for solving large-scale learning problems. These techniques also have some drawbacks: (1) they can hardly use heuristics to guide feature extraction for each specific task, due to the automated feature learning, and (2) they may suffer from over-fitting when the training dataset is not large enough.

Due to the time-consuming nature of medical image annotation, medical image classification problems usually have a relatively small training dataset. We suggest that, when deep models are applied to small-sample learning problems, they should be regularized by the visual descriptors extracted with the guidance of heuristics. Many investigators have explored neural network-based image representation and its connection to handcrafted features. Jarrett et al. [21] studied learning filter banks in an unsupervised and supervised manner and concluded that unsupervised pre-training followed by supervised refinement produces good classification accuracy on the Caltech-101 and MNIST datasets. Richard and Gall [22] proposed a transformation of the standard BoF model into a neural network, enabling discriminative training of the visual vocabulary on large action recognition datasets. Importantly, Hu et al. and Oquab et al. reported that image representations learned from large-scale datasets, such as ImageNet, can be efficiently transferred to generic visual recognition tasks where there are limited training data [23][24][43].

In this paper, we propose a combined deep and handcrafted visual feature (CDHVF) based algorithm to classify diagnostic medical images and illustrations in the biomedical literature. We fine-tune three pre-trained deep convolutional neural networks (DCNNs) to extract deep features and estimate the BoF and local binary patterns (LBP) [25] as handcrafted visual features. We apply principal component analysis (PCA) [26] to each of the five feature groups for dimension reduction. We sample a fixed number of feature components from each type, to avoid the imbalance due to the different sizes of the feature types, and then use them to train a back-propagation neural network (BPNN) as a weak classifier. We classify each image using an ensemble classifier, in which the weight of each weak classifier is determined by its performance on the validation dataset. We compare our CDHVF algorithm to the six best-performing solutions in the ImageCLEF 2016 Subfigure Classification Challenge on the challenge dataset [2].

TABLE I: Distribution of training and testing images over 30 categories in the ImageCLEF 2016 Subfigure Classification dataset.

Acronym    D3DR  DMEL  DMFL  DMLI  DMTR  DRAN
#Training   201   208   906   696   300    17
#Testing     96    88   284   405    96    76

Acronym    DRCO  DRCT  DRMR  DRPE  DRUS  DRXR
#Training    33    61   139    14    26    51
#Testing     17    71   144    15   129    18

Acronym    DSEC  DSEE  DSEM  DVDM  DVEN  DVOR
#Training    10     8     5    29    16    55
#Testing      8     3     6     9     8    21

Acronym    GCHE  GFIG  GFLO  GGEL  GGEN  GHDR
#Training    61  2954    20   344   179   136
#Testing     14  2085    31   224   150    49

Acronym    GMAT  GNCP  GPLI  GSCR  GSYS  GTAB
#Training    15    88     1    33    91    79
#Testing      3    20     2     6    75    13

II. Data Set

The ImageCLEF 2016 Subfigure Classification dataset contains 6776 training and 4166 test images saved in JPEG format, from 18 categories of diagnostic images and 12 categories of illustrative tables and figures. The improved ad-hoc hierarchy with 30 categories, including diagnostic images and illustrations, was created based on the work of Muller et al. [27]. These images, tables and figures are derived from the biomedical literature and distributed, in their original format, by PubMed Central1 [2][28]. Examples from each category with the full name and category acronym are shown in Fig. 2. The distribution of training and testing images over all categories is given in Table I and emphasizes that the ImageCLEF 2016 dataset is highly unbalanced and diverse.

III. Method

Our CDHVF algorithm has four main steps: (1) fine-tuning three pre-trained DCNN models for deep feature extraction, (2) calculating the BoF and LBP descriptors on each image, (3) reducing the dimension of each feature type using PCA, and (4) jointly using the deep and handcrafted visual descriptors to train a BPNN-based ensemble classifier for image classification. A diagram that summarizes the algorithm is shown in Fig. 3.

1http://www.ncbi.nlm.nih.gov/pmc/


Fig. 2: Example images with the name and acronym of each image category in the ImageCLEF 2016 Subfigure Classification dataset. The categories are: gene sequence (GGEN), chemical structure (GCHE), 3D reconstructions (D3DR), electron microscopy (DMEL), fluorescence microscopy (DMFL), microscopy (DMLI), transmission microscopy (DMTR), angiography (DRAN), combined modalities in one image (DRCO), computerized tomography (DRCT), ultrasound (DRUS), X-ray / 2D radiography (DRXR), electrocardiography (DSEC), electroencephalography (DSEE), electromyography (DSEM), magnetic resonance (DRMR), PET (DRPE), dermatology / skin (DVDM), endoscopy (DVEN), other organs (DVOR), mathematics / formula (GMAT), non-clinical photos (GNCP), statistical figures / graphs / charts (GFIG), flowcharts (GFLO), chromatography / gel (GGEL), hand-drawn sketches (GHDR), program listing (GPLI), screenshots (GSCR), system overviews (GSYS), and tables and forms (GTAB).

A. Deep Feature Extraction

We use three pre-trained DCNN models - the Caffe-ref [29], VGG-f [30] and VGG-19 [31] - for feature extraction. As shown in Fig. 4, Caffe-ref, a reference implementation of AlexNet [16], contains five convolutional layers and three pooling layers, followed by three fully connected layers with 4096, 4096 and 1000 neurons, respectively. VGG-f also comprises five convolutional layers, three pooling layers and three fully connected layers, but has a different number of filters. VGG-19 is a deeper architecture, containing as many as 16 convolutional layers with small filters of size 3x3, five pooling layers and three fully connected layers with 4096, 4096 and 1000 neurons, respectively. Each model takes an input image of size 224 × 224 × 3, generates a 1000-dimensional prediction vector, and has previously been trained on the ImageNet training set, a large-scale natural image database with 1000 categories.

To adapt these models to our 30-category image classification problem, we first randomly select 30 neurons in the last fully connected layer and remove the other output neurons and the weights attached to them. Then, we resize our training images to 224 × 224 × 3 using bicubic interpolation [32] and input them to each model for fine tuning. We set the maximum iteration number to 200 and choose mini-batch stochastic gradient descent with a batch size of 50 as the optimizer. Since our training set is smaller than one two-thousandth of the ImageNet dataset, we set the learning rate as small as 0.0001 and further reduce it by one-tenth every 20 epochs, aiming to prevent the models from over-fitting. We also randomly choose 20% of the training images to form a validation set and terminate the training process before reaching the maximum iteration number if the error on the other 80% of the training images continues to decline while the error on the validation set stops decreasing. We define the output of the second-last layer of each fine-tuned DCNN model as a 4096-dimensional deep feature.
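To make this fine-tuning recipe concrete, the sketch below shows one possible implementation in PyTorch, with torchvision's ImageNet-pretrained VGG-19 standing in for the three models (the original work used MatConvNet, and Caffe-ref and VGG-f are not shipped with torchvision); the exact weight-loading and interpolation APIs vary with the torchvision version.

```python
# Minimal sketch (not the authors' code): fine-tune a pre-trained VGG-19 on 30
# classes and read out the 4096-d activation of the second-last fully connected layer.
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 30

# Bicubic resize to the 224 x 224 x 3 input size expected by the model.
preprocess = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
])

model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)   # replace the 1000-way output layer

# Small learning rate (1e-4), reduced by one-tenth every 20 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    model.train()
    for images, labels in loader:      # mini-batches (size 50 in the paper)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()

def deep_features(images):
    """Return the 4096-d output of the second-last fully connected layer."""
    model.eval()
    with torch.no_grad():
        x = torch.flatten(model.avgpool(model.features(images)), 1)
        # classifier[:5] = fc-relu-dropout-fc-relu, i.e. it stops after the second fc layer
        return model.classifier[:5](x)
```

Early stopping on a held-out 20% validation split, as described above, would wrap train_one_epoch in a loop that monitors the validation error.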

B. Handcrafted Feature Extraction

We use the BoF [14] and LBP [25] descriptors to characterize each image. The BoF, a vector of occurrence counts of local descriptors over a visual vocabulary, is calculated in three steps. First, the speeded-up robust features (SURF) algorithm [33] is used to detect key points and generate a 128-dimensional descriptor for each key point based on its neighboring gradients. Vector quantization [34] is then used to assign the SURF descriptors to KBoF clusters, referred to as a visual vocabulary. Finally, the distribution of SURF descriptors over the visual vocabulary is counted as the BoF descriptor.
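A minimal Python sketch of these three steps follows, under stated assumptions: SURF extraction is delegated to a hypothetical extract_surf() helper (OpenCV only provides SURF in its non-free contrib module), and the visual vocabulary is built with scikit-learn's k-means rather than whatever quantizer was originally used.

```python
# Minimal sketch of the BoF pipeline described above.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K_BOF = 500  # visual vocabulary size used in the paper

def extract_surf(image):
    """Hypothetical helper: detect key points and return an (n_keypoints, 128)
    array of SURF descriptors for one image."""
    raise NotImplementedError

def build_vocabulary(train_images, k=K_BOF):
    # Step 2: cluster all training descriptors into k visual words.
    descriptors = np.vstack([extract_surf(img) for img in train_images])
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(descriptors)

def bof_descriptor(image, vocab):
    # Step 3: count how often each visual word occurs in the image.
    words = vocab.predict(extract_surf(image))            # vector quantisation
    return np.bincount(words, minlength=vocab.n_clusters).astype(float)
```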

The computation of the LBP texture descriptor also consists of three steps [25]. For each pixel, its eight neighbors are first visited along a circle and the pixel's value is used to threshold each neighbor's value, resulting in eight binary numbers. These binary numbers are then concatenated to form an 8-bit integer code, which takes a value from the set {0, 1, ..., 255}. Finally, the histogram of the frequency with which each integer occurs over the entire image is counted as a 256-dimensional LBP descriptor.
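A corresponding sketch of the 256-bin LBP histogram, assuming scikit-image's local_binary_pattern with eight neighbors on a circle of radius 1 as a stand-in for the original implementation:

```python
# Minimal sketch of the LBP descriptor described above.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_descriptor(gray_image):
    # 8 neighbours, radius 1, default method: each pixel gets an 8-bit code in 0..255.
    codes = local_binary_pattern(gray_image, P=8, R=1, method="default")
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist.astype(float)  # 256-dimensional LBP descriptor
```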

C. Feature Dimension Reduction

For each image, there are three groups of deep-model-learned features, a BoF descriptor and an LBP descriptor, with dimensions ranging from 256 to 4096. We use PCA to reduce the dimension of the features on a group-by-group basis to avoid possible bias caused by the unbalanced dimensions of the feature groups. Let the $i$-th group of mean-subtracted features be denoted by $F^{(i)}_{D_i \times N}$, where $D_i$ is the feature dimension and $N$ is the number of training images.



Fig. 3: Outline of the proposed CDHVF algorithm.

We apply the singular value decomposition (SVD) [26] to the covariance matrix of $F^{(i)}_{D_i \times N}$ and obtain its eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_{D'}\}$, sorted in descending order, and the corresponding eigenvectors $\{X_1, X_2, \ldots, X_{D'}\}$. We project the features into a lower-dimensional space spanned by the first $L_i$ eigenvectors, such that the sum of the corresponding eigenvalues is above $p\%$ of the sum of all eigenvalues. The parameter $p$ controls the dimension of the obtained features.
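A minimal sketch of this group-wise reduction, assuming scikit-learn's PCA (which performs the mean subtraction and SVD internally) and the 97% variance threshold selected later in the parameter-setting experiments:

```python
# Minimal sketch: one PCA per feature group, keeping enough components to
# explain p% of the total variance of that group.
from sklearn.decomposition import PCA

def reduce_groups(feature_groups, p=0.97):
    """feature_groups: list of (N, D_i) arrays, one per feature type."""
    reduced, pcas = [], []
    for F in feature_groups:
        pca = PCA(n_components=p, svd_solver="full").fit(F)  # mean subtraction is built in
        reduced.append(pca.transform(F))   # shape (N, L_i), with L_i set by the threshold
        pcas.append(pca)                   # eigenvalues are reused as importance weights later
    return reduced, pcas
```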

D. Classification via Ensemble Learning

We construct an ensemble classifier to use the five groups of features jointly. For the $i$-th group of dimension-reduced features $F'^{(i)}_{L_i \times N}$, the importance of each feature component is defined as

$$\rho_{ij} = \frac{\lambda_j}{\sum_{k=1}^{L_i} \lambda_k}, \quad i = 1, 2, \ldots, 5; \; j = 1, 2, \ldots, L_i \qquad (1)$$

We sample $m$ feature components from each feature type using roulette wheel selection based on the importance $\rho_{ij}$. Next, we concatenate the sampled feature components to form a $5m$-dimensional combined feature and use the combined features extracted on a randomly selected 80% of the training data to train a one-hidden-layer BPNN [35] as a weak classifier $h_t$. In the weak classifier, the activation function of the hidden and output neurons is the logistic sigmoid function, and the number of hidden-layer neurons is set to

$$N_H = \sqrt{N_I \cdot N_O} \qquad (2)$$

where $N_I = 5m$ and $N_O = 30$ are the numbers of input and output neurons. The trained BPNN is then tested on the other 20% of the training data, the validation set, and achieves a classification error $\varepsilon_t$. We repeat this process $T$ times to generate $T$ BPNN classifiers. Thus, the ensemble classifier can be formally defined as

$$H(\cdot) = \sum_{t=1}^{T} \alpha_t h_t \qquad (3)$$

where the weight $\alpha_t$ is calculated as

$$\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right) \qquad (4)$$
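The sketch below illustrates one way to realize this ensemble in Python, under stated assumptions: scikit-learn's MLPClassifier stands in for the one-hidden-layer BPNN, the roulette-wheel sampling uses the normalized eigenvalues of each group's PCA as the importance of Eq. (1), Eq. (3) is read as a weighted sum of class probabilities, and every category is assumed to appear in each random 80% training split.

```python
# Minimal sketch of the BPNN ensemble with importance-based feature sampling.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_ensemble(groups, pcas, y, m=50, T=50, n_classes=30, seed=0):
    """groups: list of 5 dimension-reduced (N, L_i) arrays; pcas: fitted PCA per group."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    weak, alphas, picks = [], [], []
    for _ in range(T):
        cols, parts = [], []
        for F, pca in zip(groups, pcas):
            rho = pca.explained_variance_ / pca.explained_variance_.sum()   # Eq. (1)
            idx = rng.choice(F.shape[1], size=m, replace=True, p=rho)       # roulette wheel
            cols.append(idx)
            parts.append(F[:, idx])
        X = np.hstack(parts)                          # 5m-dimensional combined feature
        tr = rng.random(len(y)) < 0.8                 # random 80/20 split
        n_hidden = int(np.sqrt(X.shape[1] * n_classes))                     # Eq. (2)
        clf = MLPClassifier(hidden_layer_sizes=(n_hidden,), activation="logistic",
                            max_iter=500).fit(X[tr], y[tr])
        eps = 1.0 - clf.score(X[~tr], y[~tr])         # validation error
        alphas.append(0.5 * np.log((1.0 - eps) / max(eps, 1e-12)))          # Eq. (4)
        weak.append(clf)
        picks.append(cols)
    return weak, alphas, picks

def predict(groups, weak, alphas, picks, n_classes=30):
    # Weighted sum of the weak classifiers' class probabilities, Eq. (3);
    # assumes each weak classifier saw all n_classes categories during training.
    votes = np.zeros((groups[0].shape[0], n_classes))
    for clf, a, cols in zip(weak, alphas, picks):
        X = np.hstack([F[:, idx] for F, idx in zip(groups, cols)])
        votes += a * clf.predict_proba(X)
    return votes.argmax(axis=1)
```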

E. Parameter Settings

The parameters in our CDHVF algorithm can be roughly categorized as DCNN parameters and parameters related to feature dimension reduction and ensemble learning. Given that we choose pre-trained DCNN models, we can only fine-tune their kernels and weights, not their architecture and other parameters. Thus we focus on using the validation dataset to empirically determine the settings of the second group of parameters.

The dimension of the BoF is determined by KBoF, the size of the visual vocabulary. We set KBoF to different values, keep the other parameters unchanged and plot the classification accuracy on the validation dataset in Fig. 5. Our CDHVF algorithm achieves the highest accuracy of 88.04% when the visual vocabulary size is 500; we therefore empirically set the size of the visual vocabulary to 500 (KBoF = 500).

The number of principal components in PCA-based dimension reduction is usually determined by a threshold on the accumulated eigenvalues. In Fig. 6, we show the classification accuracy on the validation set against this threshold when the other parameters are unchanged. Setting the threshold to 97% of the sum of all eigenvalues leads to the highest accuracy, and using fewer principal components is worse than not using PCA-based dimension reduction at all.

The next parameter, m, is the number of feature components sampled from each feature group. Table II shows the classification accuracy when m is set to different values; sampling 50 feature components (m = 50) from each feature group gives the highest accuracy, and changing the number of sampled handcrafted and/or deep features reduces the accuracy.

The next critical parameter, T, is the number of weak BPNN classifiers used in ensemble learning. We search this value from 1 to 75 and plot the variation of the classification accuracy on the validation set in Fig. 7. Although the accuracy fluctuates, it generally improves as T increases. We empirically use 50 weak classifiers (T = 50), since this setting has the highest accuracy; further increases offer little gain in performance but a marked increase in computational complexity.


Fig. 4: Architectures of the Caffe-ref (top-left), VGG-f (bottom-left) and VGG-19 (right) models. Each rectangle reflects a learnable layer. For a convolutional layer, the first item represents the size of the kernel, the second item is the number of channels, and the third item (if it exists) means there is a pooling layer following it. For a fully connected layer, the integer represents the number of neurons. (The VGG-f column, for example, reads: 11x11 conv, 64, pool/2; 5x5 conv, 256, pool/2; 3x3 conv, 256; 3x3 conv, 256; 3x3 conv, 256, pool/2; Fc, 4096; Fc, 4096; Fc, 1000.)

Fig. 5: Classification accuracy of the validation set over the dimension of the BoF descriptor when keeping other parameters constant.

Fig. 6: Classification accuracy of the validation set over the threshold for eigenvectors in PCA.

TABLE II: Accuracy of the proposed algorithm on the validation set with different numbers of features sampled to train the ensemble classifier.

Number of Sampled Features                          Classification
BoF    LBP    Caffe-ref    VGG-f    VGG-19          Accuracy (%)
20     20     20           20       20              87.34
50     50     50           50       50              88.04
100    100    100          100      100             87.81
20     20     50           50       50              87.90
50     50     20           20       20              87.28

Fig. 7: Classification accuracy with the number of weak classifiers.

IV. Experiments and Results

We assessed the performance improvement from our joint use of deep and handcrafted visual features across three groups of image descriptors. The first group had two handcrafted features - the BoF and LBP. The second group had the image representation learned by a fully-trained DCNN and the handcrafted features. The third group comprised the image representations learned by three fine-tuned and pre-trained DCNNs and the joint combination of the three pre-trained DCNN features and the two handcrafted features.

We constructed the fully-trained DCNN based on the LeNet-5 model (see Fig. 8) [41]. This DCNN has 11 learnable layers - 4 convolutional layers, 4 pooling layers and 3 fully connected layers.


TABLE III: Performance of the proposed algorithm when using different visual features.

Features                                                Classification Accuracy (%)
Handcrafted methods    BoF                              70.86
                       LBP                              71.29
Fully-trained DCNN     Deep Feature                     67.02
                       Deep Feature + BoF               72.39
                       Deep Feature + LBP               73.98
                       Deep Feature + BoF + LBP         75.37
Pre-trained DCNNs      Deep Feature (Caffe-ref)         79.81
                       Deep Feature (VGG-f)             80.77
                       Deep Feature (VGG-19)            81.71
                       3 Deep Features + BoF + LBP      85.47
                       (Proposed)

We set the initial learning rate to 0.01 and reduced it by one-tenth every 30 epochs; set the maximum iteration number to 600; chose the batch training style with a batch size of 200; and adopted the default settings of Vedaldi and Lenc [42] for the other parameters (momentum and weight decay). For each input image patch of size 128 × 128 × 3, we concatenated the output of the tenth layer into a 128-dimensional deep feature and used it to replace the features learned by the three pre-trained DCNNs in our CDHVF algorithm.

The classification accuracy on the 4166 test images is shown in Table III. The image representations learned by the pre-trained DCNNs produced more accurate classification than the handcrafted features. The fully-trained DCNN, however, performed worse than the handcrafted features. The features learned by the fully-trained DCNN, when used alone or jointly with the handcrafted features, performed worse than the features learned by each pre-trained DCNN, although the Caffe-ref and VGG-f models have only eight learnable layers. These findings show that the training dataset is too small to train an 11-layer DCNN and that pre-trained DCNNs can transfer the knowledge of image representation learned from large-scale natural image datasets to the diagnostic images and illustrations used in this study. In addition, the joint use of three types of deep features and two handcrafted features had a substantially better classification accuracy.

The confusion matrix of the CDHVF algorithm on the test dataset is shown in Fig. 9. In this matrix, an element (row i, column j) represents the percentage of images in true category i that were classified into category j, and the diagonal elements give the accuracy for each category. The CDHVF algorithm performed well on major categories, such as DMFL, DMLI, DVDM and GFIG, but is prone to misclassifying images from minor categories, such as DSEC, DSEM and GMAT, into major categories, such as GFIG. This misclassification can be ascribed to two factors: (1) The training dataset is unbalanced. The category GFIG contains 2954 images, which account for almost 43.6% of all 6776 training images across the 30 categories, whereas minor categories, such as DSEC, DSEE, DSEM, GMAT and GPLI, have fewer than 20 training samples. (2) There are inter-class similarities in the dataset. The categories DSEC, DSEM and GMAT have many visual similarities with the major category GFIG, as shown in Fig. 10. Thus it is problematic to correctly classify these minor categories with handcrafted features, deep learning methods or our method.

TABLE IV: Accuracy of end-to-end DCNN models and the proposed algorithm with different features in each category.

CLASS   BoF     LBP     Caffe-ref (E2E)   VGG-f (E2E)   VGG-19 (E2E)   Proposed
D3DR    51.04   58.33   72.92             69.79         83.33          81.25
DMEL    10.23   7.95    28.41             34.09         31.82          27.27
DMFL    90.14   88.73   88.38             89.44         91.26          92.25
DMLI    78.77   78.52   90.86             89.38         91.11          96.30
DMTR    13.54   15.03   43.75             57.29         48.96          68.75
DRAN    0       0       17.11             28.95         30.26          36.84
DRCO    0       0       41.18             23.53         23.53          29.41
DRCT    45.07   23.94   73.24             71.83         90.14          88.73
DRMR    62.50   54.17   84.03             83.33         81.25          88.89
DRPE    0       0       40.00             20.00         13.33          40.00
DRUS    19.38   8.53    65.12             57.36         69.77          75.97
DRXR    33.33   0       33.33             44.44         38.89          33.33
DSEC    0       0       0                 0             12.50          0
DSEE    0       0       33.33             33.33         0              33.33
DSEM    0       0       0                 0             0              0
DVDM    22.22   0       100               100           66.67          100
DVEN    0       0       37.50             12.50         25.00          12.50
DVOR    0       0       33.33             52.38         42.86          80.95
GCHE    21.43   0       85.71             85.71         92.86          85.71
GFIG    98.71   98.66   97.94             98.27         96.50          99.28
GFLO    0       0       12.90             12.96         3.23           41.96
GGEL    30.36   60.71   66.96             71.88         76.79          79.02
GGEN    8.00    12.00   39.33             39.33         45.33          38.00
GHDR    10.20   0       22.45             36.73         24.49          32.65
GMAT    0       0       0                 0             0              0
GNCP    5.00    15.00   45.00             35.00         30.00          55.00
GPLI    0       0       0                 0             0              50.00
GSCR    0       0       0                 0             33.33          16.67
GSYS    0       0       21.33             10.67         44.00          24.00
GTAB    30.77   15.38   53.83             53.85         61.54          46.15
#Top1   0       0       5                 5             8              15

In Table IV, we list the classification accuracy for each category of images using handcrafted features, end-to-end DCNN models and the CDHVF algorithm. "#Top1" in the bottom row represents the number of categories in which an algorithm ranked first. Our CDHVF algorithm performed the best or tied for the best in 15 of the 30 categories, compared with 5, 5 and 8 of 30 for the three end-to-end DCNN models.

We also compared the accuracy of the CDHVF algorithm to the competition results of the ImageCLEF 2016 Subfigure Classification Challenge. In this challenge there were three sub-problems, which allowed participants to use visual, textual and mixed information. Since we did not use any textual information, we compared our CDHVF algorithm to the best-performing solutions that used visual information only. The classification accuracy is shown in Table V. The top two performances in this challenge were obtained by the same team using a pre-trained 152-layer ResNet and 11 types of handcrafted visual descriptors with feature engineering, respectively [36]. Our CDHVF algorithm, in comparison, achieved a slightly higher accuracy using three DCNN models with fewer than 20 learnable layers and two types of handcrafted features.


Fig. 8: Architecture of the fully-trained DCNN used for feature extraction.

Fig. 9: Confusion matrix of the test set obtained using the CDHVF algorithm.

Fig. 10: Samples from GFIG (1st row), DSEC (2nd row), DSEM (3rd row) and GMAT (4th row).

V. Discussion

A. Stability of the Proposed Algorithm

Suppose the percentage of test data correctly classified by the CDHVF algorithm, denoted by a random variable X, follows a Gaussian distribution N(µ, σ²). Let the highest accuracy of the other purely visual approaches shown in the challenge leaderboard be a reference, which is regarded as fixed, since only the best performance of the corresponding method was announced in the challenge leaderboard.

TABLE V: Accuracy of the proposed algorithm and competition results of the ImageCLEF 2016 Subfigure Classification Challenge.

Algorithm                                       Run Type   Classification Accuracy (%)
Proposed (deep + handcrafted features)          Visual     85.47 ± 0.003
Koitka et al. (ResNet) [36]                     Visual     85.38
Koitka et al. (11 handcrafted features) [36]    Visual     84.46
Valavanis et al. [37]                           Visual     84.01
Kumar et al. [38]                               Visual     77.55
Li et al. [39]                                  Visual     72.46
Semedo et al. [40]                              Visual     65.31

We performed the proposed algorithm 20 times independently and obtained 20 observations of X, whose sample mean is µ = 85.4735 and whose sample standard deviation is σ = 0.003. The probability of our algorithm achieving a classification accuracy higher than the reference accuracy (85.38%) can then be calculated as

$$P\{X > 85.38\%\} = 1 - \Phi\!\left(\frac{85.38\% - \mu}{\sigma}\right) = \Phi\!\left(\frac{\mu - 85.38\%}{\sigma}\right) \qquad (5)$$

where

$$\Phi(\lambda) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\lambda} e^{-\frac{x^2}{2}}\, dx, \quad \lambda \ge 0 \qquad (6)$$

Applying the values of µ and σ to Eqs. (5) and (6), we have P{X > 85.38%} = 95.6%. Hence, although its performance improvement is minor, our CDHVF algorithm is more accurate than the best-performing purely visual approach in the challenge with a probability of 95.6%.
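For reference, Eqs. (5)-(6) amount to evaluating the standard normal survival function, as in this small sketch using scipy.stats (σ must be expressed in the same units as µ and the reference accuracy):

```python
# Minimal sketch of Eq. (5): P{X > reference} for X ~ N(mu, sigma^2).
from scipy.stats import norm

def prob_exceeds_reference(mu, sigma, reference):
    return norm.sf((reference - mu) / sigma)   # 1 - Phi((reference - mu) / sigma)
```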

B. Performance on Smaller-sample Learning Problems

We randomly selected 80%, 60% and 40% of the training data to determine the effectiveness of the joint use of deep and handcrafted visual descriptors when the training dataset is even smaller. Since there are limited data, it is unwise to fine-tune a very deep pre-trained DCNN; hence, we chose a fully-trained DCNN. Fig. 11 shows the classification accuracy obtained when using the BoF, LBP, deep features and our CDHVF algorithm. As expected, reducing the number of training images leads deep models to more severe over-fitting, which results in a substantial drop in accuracy. The reduction in accuracy is smaller for the handcrafted features, as they are less dependent on the number of training data. The joint use of deep and handcrafted features showed a 4.5% decline in accuracy, compared to the 9.2% reduction with a fully-trained DCNN, when the training dataset is 40% of the total (see Fig. 11). These results show that the joint approach helps alleviate deep-model over-fitting when the learning dataset is small.


Fig. 11: Comparison of performances of four algorithms on smaller samples.

Fig. 12: Comparison of any combination of deep and handcrafted features in the test set.

C. Combination of Deep and Handcrafted Features

In Fig. 12 we show that, regardless of the pair of deep and handcrafted visual features, the joint approach always improves the image classification accuracy.

D. Dimension of Deep Features

In each pre-trained DCNN, the second-last layer consists of 4096 neurons, which results in 4096-dimensional deep features. Such a high dimension increases the complexity and difficulty of classification. We reduced the number of neurons in this layer to 512 in an attempt to generate low-dimensional deep features. Reducing the number of neurons, however, impairs the image representation ability of the pre-trained DCNN models and results in lower accuracy (see Fig. 13). We therefore suggest that the pre-trained models not be changed.

Fig. 13: Accuracy of the proposed algorithm when using different dimensions of deep features (512 vs. 4096 dimensions: Caffe-ref 74.41 vs. 79.84, VGG-f 75.83 vs. 80.77, VGG-19 76.54 vs. 81.71).

Fig. 14: Samples of the training, validation and test sets from the classes DRPE and DRXR.

E. Performance Gap Between Validation and Test Sets

In this study, the test dataset contained many images that were not visually similar to the images in the training dataset; examples from DRPE and DRXR are shown in Fig. 14. To adopt the early-stop strategy and empirically determine the parameter settings, we randomly sampled 20% of the training images as validation data, which meant that the validation dataset was similar to the training dataset rather than to the test dataset. Therefore, on the validation dataset we achieved an 88.04% classification accuracy, which is higher than that achieved on the test set.

F. Computational Complexity

Although we used pre-trained DCNN models, the time taken to fine-tune the models and perform visual feature extraction, dimension reduction and ensemble learning was not trivial. We used a computer with two Intel Xeon E5-2678 V3 2.50 GHz CPUs, an NVIDIA Tesla K40c GPU, 128 GB of memory, a 120 GB SSD and Matlab 2014b. In Table VI we outline the time cost of the various steps. The bulk of the time is consumed during off-line training; however, using the trained model to classify a test image is relatively fast, taking about 1.2 seconds on average.


TABLE VI: Time cost of the major modules of the proposed algorithm.

Steps      Deep Feature Extraction        Visual Feature Extraction   PCA      Ensemble         Total Time
           Caffe-ref   VGG-f   VGG-19     BoF      LBP                         Classification
Train (h)  2           2       4          3        <0.001             0.5      0.5              12
Test (s)   0.100       0.100   0.500      0.500    0.030              <0.001   0.002            1.233

VI. Conclusions

In this paper, we presented findings using our CDHVF algorithm, which jointly uses deep and handcrafted features to classify diagnostic images and illustrations in the biomedical literature. The algorithm had a classification accuracy of 85.47% on the ImageCLEF 2016 Subfigure Classification dataset, which is higher than the best performance of the other purely visual approaches listed in the challenge leaderboard. We found that the BoF and LBP complemented the image representation learned by DCNNs and that using two handcrafted features was better than using a single one. The CDHVF algorithm performed slightly better than a pre-trained 152-layer ResNet [36] and substantially better than the method reported by Koitka and Friedrich with 11 handcrafted visual descriptors and feature engineering [36]. The pre-trained DCNNs were able to transfer the knowledge of image representation learned from large-scale natural image datasets to specific image classification tasks and outperformed fully-trained DCNN models. The main limitation of the pre-trained DCNN models was the high dimension of the deep features; however, we addressed this with PCA-based dimension reduction and ensemble learning. In future work, we will explore further improvements to our CDHVF algorithm through data augmentation methods to reduce the inaccuracy caused by unbalanced data, the addition of more visual features and better ways to reduce the dimensions of the features.

References

[1] P. Ghosh, S. Antani, L. R. Long and G. R. Thoma, "Review of medical image retrieval systems and future directions," in 24th International Symposium on Computer-Based Medical Systems (CBMS), 2011, pp. 1-6.
[2] A. G. S. de Herrera, R. Schaer, S. Bromuri and H. Muller, "Overview of the ImageCLEF 2016 Medical Classification Task," in CLEF2016 Working Notes, CEUR Workshop Proceedings, 2016.
[3] P. Perner, "Image mining: issues, framework, a generic tool and its application to medical-image diagnosis," Engineering Applications of Artificial Intelligence, vol. 15, no. 2, pp. 205-216, 2002.
[4] M. R. Ogiela and R. Tadeusiewicz, "Artificial intelligence structural imaging techniques in visual pattern analysis and medical data understanding," Pattern Recognition, vol. 36, no. 10, pp. 2441-2452, 2003.
[5] A. Rueda, F. A. Gonzalez and E. Romero, "Extracting Salient Brain Patterns for Imaging-Based Classification of Neurodegenerative Diseases," IEEE Transactions on Medical Imaging, vol. 33, no. 6, pp. 1262-1274, 2014.
[6] S. A. Chatzichristofis and Y. S. Boutalis, "FCTH: Fuzzy Color and Texture Histogram - A Low Level Feature for Accurate Image Retrieval," in Ninth International Workshop on Image Analysis for Multimedia Interactive Services, 2008, pp. 191-196.
[7] S. Mustapha and H. A. Jalab, "Compact Composite Descriptors for Content Based Image Retrieval," in International Conference on Advanced Computer Science Applications and Technologies, 2012, pp. 37-42.
[8] O. Pelka and C. M. Friedrich, "FHDO Biomedical Computer Science Group at Medical Classification Task of ImageCLEF 2015," in CLEF2015 Working Notes, CEUR Workshop Proceedings, 2015.
[9] V. Megalooikonomou, M. Barnathan, D. Kontos, P. R. Bakic and A. D. A. Maidment, "A Representation and Classification Scheme for Tree-Like Structures in Medical Images: Analyzing the Branching Pattern of Ductal Trees in X-ray Galactograms," IEEE Transactions on Medical Imaging, vol. 28, no. 4, pp. 487-493, 2009.
[10] J. Yang, K. Yu, Y. Gong and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1794-1801.
[11] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang and Y. Gong, "Locality-constrained Linear Coding for image classification," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3360-3367.
[12] L. Xie, Q. Tian, M. Wang and B. Zhang, "Spatial Pooling of Heterogeneous Features for Image Classification," IEEE Transactions on Image Processing, vol. 23, no. 5, pp. 1994-2008, 2014.
[13] S. Lazebnik, C. Schmid and J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp. 2169-2178.
[14] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, vol. 2, pp. 524-531.
[15] Y. Song, W. Cai, H. Huang, Y. Zhou, D. D. Feng, Y. Wang, M. J. Fulham and M. Chen, "Large Margin Local Estimate With Applications to Medical Image Classification," IEEE Transactions on Medical Imaging, vol. 34, no. 6, pp. 1362-1377, 2015.
[16] A. Krizhevsky, I. Sutskever and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, 2012, pp. 1106-1114.
[17] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li and F.-F. Li, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.
[18] G. E. Hinton and R. R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[19] J. Wu, Y. Yu, C. Huang and K. Yu, "Deep multiple instance learning for image classification and auto-annotation," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3460-3469.
[20] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai and E. I. C. Chang, "Deep learning of feature representation with multiple instance learning for medical image analysis," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 1626-1630.
[21] K. Jarrett, K. Kavukcuoglu, M. Ranzato and Y. LeCun, "What is the best multi-stage architecture for object recognition?," in IEEE International Conference on Computer Vision, 2009, pp. 2146-2153.
[22] A. Richard and J. Gall, "A BoW-equivalent Recurrent Neural Network for Action Recognition," in British Machine Vision Conference, 2015.
[23] J. Hu, J. Lu, Y. P. Tan and J. Zhou, "Deep Transfer Metric Learning," IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5576-5588, 2016.
[24] M. Oquab, L. Bottou, I. Laptev and J. Sivic, "Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717-1724.
[25] T. Ojala, M. Pietikainen and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions," Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.
[26] I. T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[27] H. Muller, J. Kalpathy-Cramer, D. Demner-Fushman and S. Antani, "Creating a classification of image types in the medical literature for visual categorization," in SPIE Medical Imaging, 2012.
[28] J. Kalpathy-Cramer, A. G. S. de Herrera, D. Demner-Fushman, S. Antani, S. Bedrick and H. Muller, "Evaluating Performance of Biomedical Image Retrieval Systems - an Overview of the Medical Image Retrieval Task at ImageCLEF 2004-2013," Computerized Medical Imaging and Graphics, 2014.
[29] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," arXiv preprint arXiv:1408.5093, 2014.
[30] K. Chatfield, K. Simonyan, A. Vedaldi and A. Zisserman, "Return of the Devil in the Details: Delving Deep into Convolutional Nets," in British Machine Vision Conference, 2014.
[31] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[32] R. Keys, "Cubic convolution interpolation for digital image processing," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 6, pp. 1153-1160, 1981.
[33] H. Bay, A. Ess, T. Tuytelaars and L. Van Gool, "Speeded-Up Robust Features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[34] R. M. Gray, "Vector quantization," IEEE ASSP Magazine, vol. 1, no. 2, pp. 4-29, 1984.



[35] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[36] S. Koitka and C. M. Friedrich, "Traditional Feature Engineering and Deep Learning Approaches at Medical Classification Task of ImageCLEF 2016," in CLEF2016 Working Notes, CEUR Workshop Proceedings, 2016.
[37] L. Valavanis, S. Stathopoulos and T. Kalamboukis, "IPL at CLEF 2016 Medical Task," in CLEF2016 Working Notes, CEUR Workshop Proceedings, 2016.
[38] A. Kumar, D. Lyndon, J. Kim and D. Feng, "Subfigure and Multi-Label Classification using a Fine-Tuned Convolutional Neural Network," in CLEF2016 Working Notes, CEUR Workshop Proceedings, 2016.
[39] P. Li, S. Sorensen, A. Kolagunda, X. Jiang, X. Wang, C. Kambhamettu and H. Shatkay, "UDEL CIS at ImageCLEF Medical Task 2016," in CLEF2016 Working Notes, CEUR Workshop Proceedings, 2016.
[40] D. Semedo and J. Magalhaes, "NovaSearch at ImageCLEFmed 2016 Subfigure Classification Task," in CLEF2016 Working Notes, CEUR Workshop Proceedings, 2016.
[41] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[42] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional Neural Networks for MATLAB," in Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 2015, pp. 689-692.
[43] H. Chen, D. Ni, J. Qin, S. Li, X. Yang, T. Wang and P. A. Heng, "Standard Plane Localization in Fetal Ultrasound via Domain Transferred Deep Neural Networks," IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 5, pp. 1627-1636, 2015.
[44] D. Ravi, C. Wong, F. Deligianni, M. Berthelot, J. Andreu-Perez, B. Lo and G. Z. Yang, "Deep Learning for Health Informatics," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 4-21, 2017.

Jianpeng Zhang received his B.E. in Computer Science and Technology from Northwestern Polytechnical University (NPU), Xi'an, China, in 2016. He is currently a Master's student at the School of Computer Science and Engineering, NPU. His research areas include medical imaging, image processing and machine learning.

Yong Xia (S'05-M'08) received his B.E., M.E. and Ph.D. in Computer Science and Technology from Northwestern Polytechnical University (NPU), Xi'an, China, in 2001, 2004 and 2007, respectively. He is currently a Professor at the School of Computer Science and Engineering, NPU. He is also with the Shaanxi Key Lab of Speech & Image Information Processing (SAIIP) and the Centre for Multidisciplinary Convergence Computing (CMCC) at the School. His research interests include medical image analysis, computer-aided diagnosis, pattern recognition, machine learning, and data mining.

Yutong Xie received her B.E. in Computer Science and Technology from Northwestern Polytechnical University (NPU), Xi'an, China, in 2016. She is currently a PhD student at the School of Computer Science and Engineering, NPU. Her research areas include medical imaging, computer-aided diagnosis and machine learning.

Michael Fulham received the M.B.B.S. degree (Hons.) from the University of New South Wales, Sydney, Australia, in 1979. He is the Director of the Department of Positron Emission Tomography (PET) and Nuclear Medicine and a Senior Neurologist and Staff Specialist with the Royal Prince Alfred Hospital, Sydney. He is the Clinical Director of medical imaging for the Sydney South West Area Health Service, Sydney. He is an Adjunct Professor with the School of Information Technologies and a Clinical Professor with the Sydney Medical School, University of Sydney, Sydney. His current research interests include neuroimaging, PET-computerized tomography imaging, and the application of information technologies for patient management.

David Dagan Feng (S'88-M'88-SM'94-F'03) received the M.E. degree in electrical engineering and computer science from Shanghai Jiao Tong University, Shanghai, China, in 1982, and the M.Sc. degree in biocybernetics and the Ph.D. degree in computer science from the University of California, Los Angeles, CA, USA, in 1985 and 1988, respectively, where he received the Crump Prize for Excellence in Medical Engineering. He is currently the Head of the School of Information Technologies and the Director of the Institute of Biomedical Engineering and Technology, University of Sydney, Sydney, Australia, a Guest Professor at a number of universities, and a Chair Professor of Hong Kong Polytechnic University, Hong Kong. He is a fellow of ACS, HKIE, IET, and the Australian Academy of Technological Sciences and Engineering.
